Class DataFrameStatFunctions


  • public class DataFrameStatFunctions
    extends java.lang.Object
    Provides eagerly computed statistical functions for DataFrames.

    To access an object of this class, use DataFrame.stat().

    Since:
    1.1.0
    • Method Summary

      Modifier and Type Method Description
      java.util.Optional<java.lang.Double>[][] approxQuantile​(java.lang.String[] cols, double[] percentile)
      For an array of numeric columns and an array of desired quantiles, returns a matrix of approximate values for each column at each of the desired quantiles.
      java.util.Optional<java.lang.Double>[] approxQuantile​(java.lang.String col, double[] percentile)
      For a specified numeric column and an array of desired quantiles, returns an approximate value for the column at each of the desired quantiles.
      java.util.Optional<java.lang.Double> corr​(java.lang.String col1, java.lang.String col2)
      Calculates the correlation coefficient for non-null pairs in two numeric columns.
      java.util.Optional<java.lang.Double> cov​(java.lang.String col1, java.lang.String col2)
      Calculates the sample covariance for non-null pairs in two numeric columns.
      DataFrame crosstab​(java.lang.String col1, java.lang.String col2)
      Computes a pair-wise frequency table (a "contingency table") for the specified columns.
      DataFrame sampleBy​(Column col, java.util.Map<?,​java.lang.Double> fractions)
      Returns a DataFrame containing a stratified sample without replacement, based on a Map that specifies the fraction for each stratum.
      DataFrame sampleBy​(java.lang.String colName, java.util.Map<?,​java.lang.Double> fractions)
      Returns a DataFrame containing a stratified sample without replacement, based on a Map that specifies the fraction for each stratum.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Method Detail

      • corr

        public java.util.Optional<java.lang.Double> corr​(java.lang.String col1,
                                                         java.lang.String col2)
        Calculates the correlation coefficient for non-null pairs in two numeric columns.
        Parameters:
        col1 - The name of the first numeric column to use.
        col2 - The name of the second numeric column to use.
        Returns:
        The correlation of the two numeric columns. If there is not enough data to generate the correlation, the method returns an empty Optional.
        Since:
        1.1.0
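
As a rough illustration of what "correlation for non-null pairs" means, here is a minimal plain-Java sketch of a Pearson correlation that skips any pair with a null on either side. It is not the library's implementation. The class name `CorrSketch`, the `Double[]` inputs, and the choice to return an empty Optional for fewer than two pairs or for zero variance are assumptions made for this example. Against a real DataFrame `df` (hypothetical), the call would be `df.stat().corr("a", "b")`.

```java
import java.util.Optional;

final class CorrSketch {
    // Pearson correlation over the pairs where both values are non-null.
    // Assumption: fewer than two usable pairs, or zero variance in either
    // column, counts as "not enough data" and yields an empty Optional.
    static Optional<Double> corr(Double[] xs, Double[] ys) {
        if (xs.length != ys.length) {
            throw new IllegalArgumentException("columns must have equal length");
        }
        int n = 0;
        double sumX = 0.0, sumY = 0.0;
        for (int i = 0; i < xs.length; i++) {
            if (xs[i] == null || ys[i] == null) continue; // skip incomplete pairs
            n++;
            sumX += xs[i];
            sumY += ys[i];
        }
        if (n < 2) return Optional.empty();

        double meanX = sumX / n, meanY = sumY / n;
        double sxy = 0.0, sxx = 0.0, syy = 0.0;
        for (int i = 0; i < xs.length; i++) {
            if (xs[i] == null || ys[i] == null) continue;
            double dx = xs[i] - meanX, dy = ys[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        double denom = Math.sqrt(sxx * syy);
        return denom == 0.0 ? Optional.empty() : Optional.of(sxy / denom);
    }

    public static void main(String[] args) {
        Double[] a = {1.0, 2.0, null, 4.0};
        Double[] b = {2.0, 4.0, 6.0, 8.0};
        // The third pair is ignored because a[2] is null.
        System.out.println(corr(a, b).map(String::valueOf).orElse("not enough data"));
    }
}
```

Callers of the real method can handle the result the same way, for example with `orElse` or `ifPresent`, rather than calling `get()` unconditionally.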
      • cov

        public java.util.Optional<java.lang.Double> cov​(java.lang.String col1,
                                                        java.lang.String col2)
        Calculates the sample covariance for non-null pairs in two numeric columns.
        Parameters:
        col1 - The name of the first numeric column to use.
        col2 - The name of the second numeric column to use.
        Returns:
        The sample covariance of the two numeric columns. If there is not enough data to generate the covariance, the method returns an empty Optional.
        Since:
        1.1.0
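
The sample covariance divides the summed co-deviations by n - 1 rather than n. The sketch below shows that calculation in plain Java using a single-pass running update, again skipping pairs that contain a null. It is an illustration only: `CovSketch` and its `Double[]` inputs are hypothetical, and treating fewer than two pairs as "not enough data" is an assumption of this example.

```java
import java.util.Optional;

final class CovSketch {
    // Single-pass sample covariance over non-null pairs.
    // Assumption: fewer than two usable pairs yields an empty Optional.
    static Optional<Double> cov(Double[] xs, Double[] ys) {
        if (xs.length != ys.length) {
            throw new IllegalArgumentException("columns must have equal length");
        }
        int n = 0;
        double meanX = 0.0, meanY = 0.0, coMoment = 0.0;
        for (int i = 0; i < xs.length; i++) {
            if (xs[i] == null || ys[i] == null) continue; // skip incomplete pairs
            n++;
            double dx = xs[i] - meanX;          // deviation from the old mean of x
            meanX += dx / n;
            meanY += (ys[i] - meanY) / n;
            coMoment += dx * (ys[i] - meanY);   // uses the updated mean of y
        }
        return n < 2 ? Optional.empty() : Optional.of(coMoment / (n - 1));
    }

    public static void main(String[] args) {
        Double[] a = {1.0, 2.0, 3.0, null};
        Double[] b = {2.0, 4.0, 6.0, 9.0};
        System.out.println(cov(a, b).map(String::valueOf).orElse("not enough data"));
    }
}
```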
      • approxQuantile

        public java.util.Optional<java.lang.Double>[] approxQuantile​(java.lang.String col,
                                                                     double[] percentile)
        For a specified numeric column and an array of desired quantiles, returns an approximate value for the column at each of the desired quantiles.

        This function uses the t-Digest algorithm.

        Parameters:
        col - The name of the numeric column.
        percentile - An array of double values greater than or equal to 0.0 and less than 1.0.
        Returns:
        An array of approximate percentile values. If there is not enough data to calculate a quantile, the corresponding element is an empty Optional.
        Since:
        1.1.0
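
When checking approximate results on a small data set, an exact quantile can serve as a baseline. The following plain-Java sketch computes one by sorting and linearly interpolating. It is deliberately not t-Digest and not the library's method. `ExactQuantileSketch` is a hypothetical name, and rejecting values outside [0.0, 1.0) with an exception is a choice made for this example that mirrors the documented range of `percentile`.

```java
import java.util.Arrays;
import java.util.Optional;

final class ExactQuantileSketch {
    // Exact quantile by sorting plus linear interpolation (not t-Digest).
    // Returns an empty Optional when there are no values to rank.
    static Optional<Double> quantile(double[] values, double q) {
        if (q < 0.0 || q >= 1.0) {
            throw new IllegalArgumentException("quantile must be in [0.0, 1.0)");
        }
        if (values.length == 0) return Optional.empty();

        double[] sorted = values.clone();   // leave the caller's array untouched
        Arrays.sort(sorted);
        double pos = q * (sorted.length - 1);
        int lo = (int) Math.floor(pos);
        int hi = Math.min(lo + 1, sorted.length - 1);
        double frac = pos - lo;
        return Optional.of(sorted[lo] + frac * (sorted[hi] - sorted[lo]));
    }

    public static void main(String[] args) {
        double[] data = {4.0, 1.0, 3.0, 2.0, 5.0};
        for (double q : new double[] {0.0, 0.25, 0.5}) {
            System.out.println(q + " -> " + quantile(data, q).orElse(Double.NaN));
        }
    }
}
```

An approximate value from `approxQuantile` is not expected to match this baseline exactly, so any comparison should allow a tolerance.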
      • approxQuantile

        public java.util.Optional<java.lang.Double>[][] approxQuantile​(java.lang.String[] cols,
                                                                       double[] percentile)
        For an array of numeric columns and an array of desired quantiles, returns a matrix of approximate values for each column at each of the desired quantiles. For example, `result[0][1]` contains the approximate value for column `cols[0]` at quantile `percentile[1]`.

        This function uses the t-Digest algorithm.

        Parameters:
        cols - An array of column names.
        percentile - An array of double values greater than or equal to 0.0 and less than 1.0.
        Returns:
        A matrix with the dimensions `cols.length x percentile.length` containing the approximate percentile values. If there is not enough data to calculate a quantile, the corresponding element is an empty Optional.
        Since:
        1.1.0
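
The row and column indexing of the returned matrix is easy to get backwards, so here is a small plain-Java sketch that walks a matrix of that shape and labels each cell with its column name and quantile. `QuantileMatrixSketch`, `label`, and `demoMatrix` are hypothetical helpers, and the demo values are made up for illustration. With a real DataFrame `df` (hypothetical), the matrix would come from `df.stat().approxQuantile(cols, percentile)`.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

final class QuantileMatrixSketch {
    // Flattens result[i][j] into "column@quantile" -> value.
    // Row i corresponds to cols[i]; column j corresponds to percentile[j].
    // Cells holding an empty Optional are skipped.
    static Map<String, Double> label(String[] cols, double[] percentile,
                                     Optional<Double>[][] result) {
        Map<String, Double> out = new LinkedHashMap<>();
        for (int i = 0; i < cols.length; i++) {
            for (int j = 0; j < percentile.length; j++) {
                final String key = cols[i] + "@" + percentile[j];
                result[i][j].ifPresent(v -> out.put(key, v));
            }
        }
        return out;
    }

    // Made-up 2 x 2 matrix standing in for a real approxQuantile result.
    @SuppressWarnings("unchecked")
    static Optional<Double>[][] demoMatrix() {
        Optional<Double>[][] m = new Optional[2][2];
        m[0][0] = Optional.of(10.0);
        m[0][1] = Optional.of(20.0);
        m[1][0] = Optional.empty();   // e.g. not enough data for this cell
        m[1][1] = Optional.of(40.0);
        return m;
    }

    public static void main(String[] args) {
        String[] cols = {"price", "qty"};
        double[] percentile = {0.1, 0.5};
        label(cols, percentile, demoMatrix())
            .forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```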
      • crosstab

        public DataFrame crosstab​(java.lang.String col1,
                                  java.lang.String col2)
        Computes a pair-wise frequency table (a "contingency table") for the specified columns. The method returns a DataFrame containing this table.

        In the returned contingency table:

        - The first column of each row contains the distinct values of col1.
        - The name of the first column is the name of col1.
        - The rest of the column names are the distinct values of col2.
        - The counts are returned as Longs.
        - For pairs that have no occurrences, the contingency table contains 0 as the count.

        Note: The number of distinct values in col2 should not exceed 1000.

        Parameters:
        col1 - The name of the first column to use.
        col2 - The name of the second column to use.
        Returns:
        A DataFrame containing the contingency table.
        Since:
        1.1.0
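
To make the shape of the contingency table concrete, the sketch below builds one in plain Java from a list of (col1, col2) value pairs. Each outer key plays the role of a row labelled by a distinct col1 value, each inner key a column named after a distinct col2 value, and missing combinations are filled with 0. `CrosstabSketch` is a hypothetical name, values are assumed to be non-null strings, and the sorted ordering is a choice of this example rather than documented behavior.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

final class CrosstabSketch {
    // rows: each element is {col1Value, col2Value}; values assumed non-null.
    // Result: col1 value -> (col2 value -> count), with 0 for absent pairs.
    static Map<String, Map<String, Long>> crosstab(List<String[]> rows) {
        TreeSet<String> col2Values = new TreeSet<>();
        Map<String, Map<String, Long>> counts = new TreeMap<>();
        for (String[] r : rows) {
            col2Values.add(r[1]);
            counts.computeIfAbsent(r[0], k -> new TreeMap<>())
                  .merge(r[1], 1L, Long::sum);
        }
        // Fill in a zero count for every pair that never occurred.
        for (Map<String, Long> row : counts.values()) {
            for (String v : col2Values) {
                row.putIfAbsent(v, 0L);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
            new String[] {"a", "x"},
            new String[] {"a", "x"},
            new String[] {"a", "y"},
            new String[] {"b", "y"});
        crosstab(rows).forEach((k, v) -> System.out.println(k + " " + v));
    }
}
```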
      • sampleBy

        public DataFrame sampleBy​(Column col,
                                  java.util.Map<?,​java.lang.Double> fractions)
        Returns a DataFrame containing a stratified sample without replacement, based on a Map that specifies the fraction for each stratum.
        Parameters:
        col - An expression for the column that defines the strata.
        fractions - A Map that specifies the fraction to use for the sample for each stratum. If a stratum is not specified in the Map, the method uses 0 as the fraction.
        Returns:
        A new DataFrame that contains the stratified sample.
        Since:
        1.1.0
      • sampleBy

        public DataFrame sampleBy​(java.lang.String colName,
                                  java.util.Map<?,​java.lang.Double> fractions)
        Returns a DataFrame containing a stratified sample without replacement, based on a Map that specifies the fraction for each stratum.
        Parameters:
        colName - The name of the column that defines the strata.
        fractions - A Map that specifies the fraction to use for the sample for each stratum. If a stratum is not specified in the Map, the method uses 0 as the fraction.
        Returns:
        A new DataFrame that contains the stratified sample.
        Since:
        1.1.0
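
The role of the `fractions` Map in both `sampleBy` overloads can be sketched in plain Java as follows. Each row is considered once, so no row can appear twice in the output, and a stratum missing from the Map is treated as fraction 0. Keeping each row independently with probability equal to its stratum's fraction is an assumption of this sketch, not a statement about how the library samples. `SampleBySketch` and the `stratumOf` function are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.function.Function;

final class SampleBySketch {
    // Keeps each row with probability fractions[stratum], or 0 if the
    // stratum has no entry. Rows are visited once: no replacement.
    static <T> List<T> sampleBy(List<T> rows, Function<T, ?> stratumOf,
                                Map<?, Double> fractions, Random rng) {
        List<T> out = new ArrayList<>();
        for (T row : rows) {
            Double f = fractions.get(stratumOf.apply(row));
            double fraction = (f == null) ? 0.0 : f; // unspecified stratum -> 0
            if (rng.nextDouble() < fraction) {
                out.add(row);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("a1", "a2", "b1", "b2", "c1");
        Map<String, Double> fractions = new HashMap<>();
        fractions.put("a", 1.0);  // keep every row in stratum "a"
        fractions.put("b", 0.5);  // keep roughly half of stratum "b"
        // Stratum "c" is not listed, so none of its rows are kept.
        System.out.println(
            sampleBy(rows, s -> s.substring(0, 1), fractions, new Random(7)));
    }
}
```

Because the selection is random, the number of rows kept for a fraction strictly between 0 and 1 varies from run to run unless the random source is seeded.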