Class DataFrameStatFunctions


  • public class DataFrameStatFunctions
    extends Object
    Provides eagerly computed statistical functions for DataFrames.

    To access an object of this class, use DataFrame.stat().

    Since:
    1.1.0
    • Method Detail

      • corr

        public Optional<Double> corr​(String col1,
                                     String col2)
        Calculates the correlation coefficient for non-null pairs in two numeric columns.
        Parameters:
        col1 - The name of the first numeric column to use.
        col2 - The name of the second numeric column to use.
        Returns:
        The correlation of the two numeric columns. If there is not enough data to generate the correlation, the method returns an empty Optional.
        Since:
        1.1.0
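To make the "non-null pairs" and "not enough data" behavior concrete, here is a minimal plain-Java sketch of a Pearson correlation over in-memory arrays. It is not the library's implementation: the class name `CorrSketch`, the threshold of two pairs, and the empty result for a constant column are assumptions made for illustration.

```java
import java.util.Optional;

// Illustration only: a plain-Java Pearson correlation, not the library's code.
class CorrSketch {
    // Assumes xs and ys have the same length. Pairs with a null on either side are skipped.
    static Optional<Double> corr(Double[] xs, Double[] ys) {
        int n = 0;
        double sumX = 0.0, sumY = 0.0;
        for (int i = 0; i < xs.length; i++) {
            if (xs[i] != null && ys[i] != null) {
                n++;
                sumX += xs[i];
                sumY += ys[i];
            }
        }
        // Assumption: fewer than two complete pairs counts as "not enough data".
        if (n < 2) {
            return Optional.empty();
        }
        double meanX = sumX / n, meanY = sumY / n;
        double sxy = 0.0, sxx = 0.0, syy = 0.0;
        for (int i = 0; i < xs.length; i++) {
            if (xs[i] != null && ys[i] != null) {
                double dx = xs[i] - meanX, dy = ys[i] - meanY;
                sxy += dx * dy;
                sxx += dx * dx;
                syy += dy * dy;
            }
        }
        // Assumption: the coefficient is treated as undefined when either column is constant.
        if (sxx == 0.0 || syy == 0.0) {
            return Optional.empty();
        }
        return Optional.of(sxy / Math.sqrt(sxx * syy));
    }
}
```

Callers would typically branch on `isPresent()` (or use `orElse`) instead of assuming a value is available.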
      • cov

        public Optional<Double> cov​(String col1,
                                    String col2)
        Calculates the sample covariance for non-null pairs in two numeric columns.
        Parameters:
        col1 - The name of the first numeric column to use.
        col2 - The name of the second numeric column to use.
        Returns:
        The sample covariance of the two numeric columns. If there is not enough data to generate the covariance, the method returns an empty Optional.
        Since:
        1.1.0
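As a rough illustration of what "sample covariance for non-null pairs" means, the following plain-Java sketch divides the sum of cross-deviations by n - 1. The class name `CovSketch` and the choice to return an empty Optional for fewer than two complete pairs are assumptions; the sketch works on in-memory arrays and is not the library's implementation.

```java
import java.util.Optional;

// Illustration only: sample covariance over in-memory arrays.
class CovSketch {
    // Assumes xs and ys have the same length. Pairs with a null on either side are skipped.
    static Optional<Double> cov(Double[] xs, Double[] ys) {
        int n = 0;
        double sumX = 0.0, sumY = 0.0;
        for (int i = 0; i < xs.length; i++) {
            if (xs[i] != null && ys[i] != null) {
                n++;
                sumX += xs[i];
                sumY += ys[i];
            }
        }
        // Assumption: the n - 1 divisor needs at least two complete pairs.
        if (n < 2) {
            return Optional.empty();
        }
        double meanX = sumX / n, meanY = sumY / n;
        double sxy = 0.0;
        for (int i = 0; i < xs.length; i++) {
            if (xs[i] != null && ys[i] != null) {
                sxy += (xs[i] - meanX) * (ys[i] - meanY);
            }
        }
        return Optional.of(sxy / (n - 1));
    }
}
```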
      • approxQuantile

        public Optional<Double>[] approxQuantile​(String col,
                                                 double[] percentile)
        For a specified numeric column and an array of desired quantiles, returns an approximate value for the column at each of the desired quantiles.

        This function uses the t-Digest algorithm.

        Parameters:
        col - The name of the numeric column.
        percentile - An array of double values greater than or equal to 0.0 and less than 1.0.
        Returns:
        An array of approximate percentile values. If there is not enough data to calculate the quantile, the method returns an empty Optional in its place.
        Since:
        1.1.0
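When checking approximate results, it can help to have an exact reference to compare against. The sketch below is not t-Digest and is not the library's code: it computes exact quantiles by sorting the non-null values and interpolating linearly between neighbors. The class name `ExactQuantileSketch`, the interpolation rule, and the use of a `List` instead of an array are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Objects;
import java.util.Optional;

// Illustration only: exact quantiles by sorting, as a reference for approximate results.
class ExactQuantileSketch {
    // Each entry of percentile is expected to be >= 0.0 and < 1.0.
    static List<Optional<Double>> quantiles(Double[] values, double[] percentile) {
        // Drop nulls and sort the remaining values in ascending order.
        double[] sorted = Arrays.stream(values)
            .filter(Objects::nonNull)
            .mapToDouble(Double::doubleValue)
            .sorted()
            .toArray();
        List<Optional<Double>> result = new ArrayList<>();
        for (double q : percentile) {
            if (sorted.length == 0) {
                // No data: there is nothing to report for this quantile.
                result.add(Optional.empty());
                continue;
            }
            // Fractional position in the sorted array, then linear interpolation.
            double pos = q * (sorted.length - 1);
            int lo = (int) Math.floor(pos);
            int hi = (int) Math.ceil(pos);
            result.add(Optional.of(sorted[lo] + (pos - lo) * (sorted[hi] - sorted[lo])));
        }
        return result;
    }
}
```

An approximate algorithm can return values that differ slightly from this reference, so comparisons should allow a tolerance.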
      • approxQuantile

        public Optional<Double>[][] approxQuantile​(String[] cols,
                                                   double[] percentile)
        For an array of numeric columns and an array of desired quantiles, returns a matrix of approximate values for each column at each of the desired quantiles. For example, `result[0][1]` contains the approximate value for column `cols[0]` at quantile `percentile[1]`.

        This function uses the t-Digest algorithm.

        Parameters:
        cols - An array of column names.
        percentile - An array of double values greater than or equal to 0.0 and less than 1.0.
        Returns:
        A matrix with the dimensions `cols.length` x `percentile.length` containing the approximate percentile values. If there is not enough data to calculate the quantile, the method returns an empty Optional in its place.
        Since:
        1.1.0
      • crosstab

        public DataFrame crosstab​(String col1,
                                  String col2)
        Computes a pair-wise frequency table (a "contingency table") for the specified columns. The method returns a DataFrame containing this table.

        In the returned contingency table:

        - The first column of each row contains the distinct values of col1.
        - The name of the first column is the name of col1.
        - The rest of the column names are the distinct values of col2.
        - The counts are returned as Longs.
        - For pairs that have no occurrences, the contingency table contains 0 as the count.

        Note: The number of distinct values in col2 should not exceed 1000.

        Parameters:
        col1 - The name of the first column to use.
        col2 - The name of the second column to use.
        Returns:
        A DataFrame containing the contingency table.
        Since:
        1.1.0
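The shape of the contingency table can be pictured with a small plain-Java sketch over two in-memory arrays. This is not the library's implementation: the class name `CrosstabSketch` and the nested-map representation (row key = value of col1, inner key = value of col2) are assumptions made for illustration, and the inputs are assumed to contain no nulls.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Illustration only: a contingency table built from two equally long arrays.
class CrosstabSketch {
    static Map<String, Map<String, Long>> crosstab(String[] col1, String[] col2) {
        // The distinct values of col2 become the column names of the table.
        Set<String> columns = new TreeSet<>(Arrays.asList(col2));
        Map<String, Map<String, Long>> table = new TreeMap<>();
        for (int i = 0; i < col1.length; i++) {
            // Create the row on first sight, pre-filled with 0 for every col2 value.
            Map<String, Long> row = table.computeIfAbsent(col1[i], key -> {
                Map<String, Long> zeros = new TreeMap<>();
                for (String c : columns) {
                    zeros.put(c, 0L);
                }
                return zeros;
            });
            // Count this (col1, col2) pair.
            row.merge(col2[i], 1L, Long::sum);
        }
        return table;
    }
}
```

Pairs that never occur keep their pre-filled count of 0, which mirrors the zero-fill rule described above.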
      • sampleBy

        public DataFrame sampleBy​(Column col,
                                  Map<?,​Double> fractions)
        Returns a DataFrame containing a stratified sample without replacement, based on a Map that specifies the fraction for each stratum.
        Parameters:
        col - An expression for the column that defines the strata.
        fractions - A Map that specifies the fraction to use for the sample for each stratum. If a stratum is not specified in the Map, the method uses 0 as the fraction.
        Returns:
        A new DataFrame that contains the stratified sample.
        Since:
        1.1.0
      • sampleBy

        public DataFrame sampleBy​(String colName,
                                  Map<?,​Double> fractions)
        Returns a DataFrame containing a stratified sample without replacement, based on a Map that specifies the fraction for each stratum.
        Parameters:
        colName - The name of the column that defines the strata.
        fractions - A Map that specifies the fraction to use for the sample for each stratum. If a stratum is not specified in the Map, the method uses 0 as the fraction.
        Returns:
        A new DataFrame that contains the stratified sample.
        Since:
        1.1.0
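One way to picture stratified sampling without replacement is a per-row coin flip whose probability depends on the row's stratum. The sketch below works on in-memory rows and is not the library's implementation: the class name `SampleBySketch`, the `strataIndex` parameter, the explicit `Random` argument, and the per-row Bernoulli strategy are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Illustration only: stratified sampling over in-memory rows.
class SampleBySketch {
    // Each row is kept at most once, so the sample is drawn without replacement.
    static List<String[]> sampleBy(List<String[]> rows, int strataIndex,
                                   Map<String, Double> fractions, Random rng) {
        List<String[]> sample = new ArrayList<>();
        for (String[] row : rows) {
            // A stratum that is missing from the map gets a fraction of 0.
            double fraction = fractions.getOrDefault(row[strataIndex], 0.0);
            // nextDouble() is in [0, 1), so 1.0 keeps every row and 0.0 keeps none.
            if (rng.nextDouble() < fraction) {
                sample.add(row);
            }
        }
        return sample;
    }
}
```

With this strategy the fraction is an expected proportion per stratum, so the sample size for intermediate fractions varies from run to run.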