snowflake.ml.modeling.metrics.covariance

snowflake.ml.modeling.metrics.covariance(*, df: DataFrame, columns: Optional[Collection[str]] = None, ddof: int = 1) → DataFrame

Covariance matrix for the columns in a Snowpark DataFrame. NaNs and Nulls are not ignored: computing covariance on columns containing NaN or Null produces NaN covariance values. Returns a pandas DataFrame containing the covariance matrix.
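A minimal usage sketch is shown below; the example table, its values, and the pre-existing Snowpark Session object named session are assumptions made for illustration.

    # Minimal usage sketch. Assumes an already-created Snowpark Session named `session`;
    # the example columns and values are illustrative only.
    from snowflake.ml.modeling.metrics import covariance

    df = session.create_dataframe(
        [[1.0, 2.0], [2.0, 4.5], [3.0, 6.0], [4.0, 8.5]],
        schema=["X", "Y"],
    )

    # Returns the covariance matrix as a pandas.DataFrame.
    cov_df = covariance(df=df, columns=["X", "Y"], ddof=1)
    print(cov_df)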

The steps below explain how the covariance matrix is computed in a distributed way. Let n be the number of rows in the dataframe, ddof the delta degrees of freedom, and X, Y two columns in the dataframe. Then

  Covariance(X, Y) = term1 - term2 - term3 + term4

where

  term1 = dot(X/sqrt(n-ddof), Y/sqrt(n-ddof))
  term2 = sum(Y/n) * sum(X/(n-ddof))
  term3 = sum(X/n) * sum(Y/(n-ddof))
  term4 = (n/(n-ddof)) * sum(X/n) * sum(Y/n)
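To see that this decomposition matches the usual sample covariance, here is a small local NumPy check (illustrative only; it is not the library's implementation):

    import numpy as np

    rng = np.random.default_rng(0)
    X, Y = rng.normal(size=50), rng.normal(size=50)
    n, ddof = len(X), 1

    term1 = np.dot(X / np.sqrt(n - ddof), Y / np.sqrt(n - ddof))
    term2 = np.sum(Y / n) * np.sum(X / (n - ddof))
    term3 = np.sum(X / n) * np.sum(Y / (n - ddof))
    term4 = (n / (n - ddof)) * np.sum(X / n) * np.sum(Y / n)

    # Agrees with the standard sample covariance.
    assert np.isclose(term1 - term2 - term3 + term4, np.cov(X, Y, ddof=ddof)[0, 1])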

Note that the formula is written entirely in terms of dot products and sums over columns. The first UDTF computes the dots and sums for different shards of the data in parallel; the second UDTF accumulates the dots and sums from all shards. The final covariance matrix is then computed on the client side as a post-processing step.
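The sketch below illustrates the shard-then-accumulate idea in plain NumPy; the actual library performs these steps inside Snowflake UDTFs, so this code is only a conceptual stand-in.

    import numpy as np

    def shard_stats(shard: np.ndarray):
        # Per-shard partial results: column sums, pairwise dot products (Gram matrix),
        # and the shard's row count.
        return shard.sum(axis=0), shard.T @ shard, shard.shape[0]

    def combine(partials, ddof: int = 1):
        # Accumulate sums, dots, and row counts across shards, then finish the
        # covariance computation as a client-side post-processing step.
        sums = sum(p[0] for p in partials)
        dots = sum(p[1] for p in partials)
        n = sum(p[2] for p in partials)
        means = sums / n
        return (dots - n * np.outer(means, means)) / (n - ddof)

    data = np.random.default_rng(1).normal(size=(100, 3))
    shards = np.array_split(data, 4)  # stand-in for rows handled by parallel UDTF instances
    cov = combine([shard_stats(s) for s in shards])
    assert np.allclose(cov, np.cov(data, rowvar=False, ddof=1))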

Parameters:
  • df (snowpark.DataFrame) – Snowpark DataFrame for which the covariance matrix is to be computed.

  • columns (list of str, default=None) – List of column names for which the covariance matrix is to be computed. If None, the covariance matrix is computed for all numeric columns in the Snowpark DataFrame.

  • ddof (int, default=1) – Delta degrees of freedom. The divisor used in calculations is N - ddof, where N is the number of rows (see the note at the end of this page).

Returns:

Covariance matrix in pandas.DataFrame format.
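A short note on ddof, referenced from the parameter list above: with ddof=1 the divisor is N - 1 (sample covariance), and with ddof=0 it is N (population covariance). A small NumPy illustration of the relationship, using assumed example values:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0])
    y = np.array([2.0, 4.0, 5.0, 9.0])

    sample_cov = np.cov(x, y, ddof=1)[0, 1]      # divisor N - 1
    population_cov = np.cov(x, y, ddof=0)[0, 1]  # divisor N

    # The two estimates differ only by the factor (N - 1) / N.
    assert np.isclose(population_cov, sample_cov * (len(x) - 1) / len(x))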