*, df: DataFrame, columns: Optional[Collection[str]] = None) → DataFrame

Computes the Pearson correlation matrix for the columns of a Snowpark DataFrame. NaNs and nulls are not ignored: a column containing NaN or null produces NaN correlation values. Returns a pandas DataFrame containing the correlation matrix.

The steps below explain how the correlation matrix is computed in a distributed way. Let n be the number of rows in the dataframe, sqrt_n = sqrt(n), and X, Y be two columns in the dataframe. Then:
    Correlation(X, Y) = numerator / denominator
    numerator   = dot(X/sqrt_n, Y/sqrt_n) - sum(X/n)*sum(Y/n)
    denominator = std_dev(X)*std_dev(Y)
    std_dev(X)  = sqrt(dot(X/sqrt_n, X/sqrt_n) - sum(X/n)*sum(X/n))
Here dot(X/sqrt_n, Y/sqrt_n) is the mean of X*Y and sum(X/n) is the mean of X, so the numerator is the covariance of X and Y.
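As a rough check of the formula, the sketch below evaluates it for two small columns in plain Python. It is illustrative only and is not the library's implementation; the function name `correlation_from_dot_and_sum` and the sample values are assumptions made for this example. The mean-product term is taken as sum(X/n)*sum(Y/n), the usual covariance identity. The last call shows how a single NaN propagates through dot and sum and yields a NaN correlation.

```python
import math


def correlation_from_dot_and_sum(x, y):
    """Pearson correlation of two equal-length columns using only dot and sum."""
    n = len(x)
    sqrt_n = math.sqrt(n)

    def dot(a, b):
        # dot(A/sqrt_n, B/sqrt_n), i.e. the mean of the elementwise product
        return sum((ai / sqrt_n) * (bi / sqrt_n) for ai, bi in zip(a, b))

    def mean(a):
        # sum(A/n), i.e. the mean of the column
        return sum(ai / n for ai in a)

    numerator = dot(x, y) - mean(x) * mean(y)
    std_x = math.sqrt(dot(x, x) - mean(x) ** 2)
    std_y = math.sqrt(dot(y, y) - mean(y) ** 2)
    return numerator / (std_x * std_y)


# Hypothetical sample columns
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.5]
r = correlation_from_dot_and_sum(x, y)

# A NaN in either column makes every intermediate value NaN
r_nan = correlation_from_dot_and_sum([1.0, float("nan"), 3.0], [2.0, 4.0, 6.0])
```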

Note that the formula is written entirely in terms of dot and sum operations on columns. A first UDTF computes the dot products and sums of the columns for each shard in parallel. A second UDTF accumulates the dot products and sums from all shards. The final computation of the numerator, the denominator, and their quotient is done on the client side as a post-processing step.
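The three stages can be mimicked locally with NumPy to see why the approach shards cleanly. This is a minimal sketch, not the actual UDTF code: the helper names `shard_stats`, `accumulate`, and `finalize`, the random test data, and the choice of four shards are all assumptions for illustration, and the total row count n is assumed to be known before the per-shard stage runs.

```python
import numpy as np


def shard_stats(shard, n):
    """Stage 1 (per shard): scaled dot-product matrix and scaled column sums."""
    scaled = shard / np.sqrt(n)
    return scaled.T @ scaled, shard.sum(axis=0) / n


def accumulate(stats):
    """Stage 2: add up the per-shard dot products and sums."""
    dots, sums = zip(*stats)
    return sum(dots), sum(sums)


def finalize(dot, mean):
    """Client-side post-processing: numerator, denominator, and division."""
    numerator = dot - np.outer(mean, mean)
    std = np.sqrt(np.diag(dot) - mean ** 2)
    return numerator / np.outer(std, std)


# Hypothetical data: 1000 rows, 3 columns, with column 2 correlated to column 0
rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 3))
data[:, 2] += 0.5 * data[:, 0]

shards = np.array_split(data, 4)  # stand-in for distributed partitions
dot, mean = accumulate([shard_stats(s, len(data)) for s in shards])
corr = finalize(dot, mean)
```

Because dot products and sums are additive across disjoint row sets, the result does not depend on how the rows are partitioned; in this sketch `corr` can be compared against `np.corrcoef(data, rowvar=False)`.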

  • df – snowpark.DataFrame Snowpark DataFrame for which the correlation matrix is to be computed.

  • columns – List of strings List of column names for which the correlation matrix is to be computed. If None, the correlation matrix is computed for all numeric columns in the Snowpark DataFrame.


Correlation matrix as a pandas.DataFrame.