final class DataFrameStatFunctions extends Logging
Provides eagerly computed statistical functions for DataFrames.
To access an object of this class, use DataFrame.stat.
- Since
-
0.2.0
Value Members
-
final
def
!=
(
arg0:
Any
)
:
Boolean
- Definition Classes
- AnyRef → Any
-
final
def
##
()
:
Int
- Definition Classes
- AnyRef → Any
-
final
def
==
(
arg0:
Any
)
:
Boolean
- Definition Classes
- AnyRef → Any
-
def
action
[
T
]
(
funcName:
String
)
(
func: ⇒
T
)
:
T
- Attributes
- protected
- Annotations
- @inline ()
-
def
approxQuantile
(
cols:
Array
[
String
]
,
percentile:
Array
[
Double
]
)
:
Array
[
Array
[
Option
[
Double
]]]
For an array of numeric columns and an array of desired quantiles, returns a matrix of approximate values for each column at each of the desired quantiles. For example, result(0)(1) contains the approximate value for column cols(0) at quantile percentile(1).
This function uses the t-Digest algorithm.
For example, the following code:
import session.implicits._
val df = Seq((0.1, 0.5), (0.2, 0.6), (0.3, 0.7)).toDF("a", "b")
val res = df.stat.approxQuantile(Array("a", "b"), Array(0, 0.1, 0.6))
prints out the following result:
res: Array(Array(Some(0.05), Some(0.15000000000000002), Some(0.25)), Array(Some(0.45), Some(0.55), Some(0.6499999999999999)))
- cols
-
An array of column names.
- percentile
-
An array of double values greater than or equal to 0.0 and less than 1.0.
- returns
-
A matrix with the dimensions (cols.size * percentile.size) containing the approximate percentile values. If there is not enough data to calculate the quantile, the method returns None.
- Since
-
0.2.0
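The indexing rule above (result(i)(j) belongs to cols(i) and percentile(j)) can be sketched in plain Scala. This is only an illustration: the values are copied from the documented example output, no Snowflake session is involved, and the object name QuantileMatrix and the helper lookup are hypothetical names that are not part of the library.

```scala
object QuantileMatrix {
  // Inputs and result copied from the documented example above.
  val cols: Array[String] = Array("a", "b")
  val percentile: Array[Double] = Array(0.0, 0.1, 0.6)
  val res: Array[Array[Option[Double]]] = Array(
    Array(Some(0.05), Some(0.15000000000000002), Some(0.25)),
    Array(Some(0.45), Some(0.55), Some(0.6499999999999999))
  )

  // Hypothetical helper: finds the row for a column name and the position of
  // a requested quantile, then reads res(row)(position).
  // Returns None for an unknown column or a quantile that was not requested.
  def lookup(col: String, q: Double): Option[Double] = {
    val i = cols.indexOf(col)
    val j = percentile.indexOf(q)
    if (i < 0 || j < 0) None else res(i)(j)
  }
}
```

For instance, QuantileMatrix.lookup("a", 0.1) reads res(0)(1), the entry for the first column at the second requested quantile.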
-
def
approxQuantile
(
col:
String
,
percentile:
Array
[
Double
]
)
:
Array
[
Option
[
Double
]]
For a specified numeric column and an array of desired quantiles, returns an approximate value for the column at each of the desired quantiles.
This function uses the t-Digest algorithm.
For example, the following code:
import session.implicits._
val df = Seq(1, 2, 3, 4, 5, 6, 7, 8, 9, 0).toDF("a")
val res = df.stat.approxQuantile("a", Array(0, 0.1, 0.4, 0.6, 1))
prints out the following result:
res: Array(Some(-0.5), Some(0.5), Some(3.5), Some(5.5), Some(9.5))
- col
-
The name of the numeric column.
- percentile
-
An array of double values greater than or equal to 0.0 and less than 1.0.
- returns
-
An array of approximate percentile values. If there is not enough data to calculate the quantile, the method returns None.
- Since
-
0.2.0
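Because each element of the returned array is an Option, calling .get on it fails when a quantile could not be computed. A minimal sketch of pairing each requested quantile with its result while tolerating None follows; the object name QuantileReport and the helper describe are hypothetical and exist only for this illustration.

```scala
object QuantileReport {
  // Hypothetical helper: pairs each requested quantile with its result and
  // renders a missing value (None) as "n/a" instead of calling .get on it.
  def describe(percentile: Array[Double], res: Array[Option[Double]]): Seq[String] =
    percentile.zip(res).map {
      case (q, Some(v)) => s"q=$q -> $v"
      case (q, None)    => s"q=$q -> n/a"
    }.toSeq
}
```

The same pattern match works on the array returned by approxQuantile for a single column.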
-
final
def
asInstanceOf
[
T0
]
:
T0
- Definition Classes
- Any
-
def
clone
()
:
AnyRef
- Attributes
- protected[ lang ]
- Definition Classes
- AnyRef
- Annotations
- @throws ( ... ) @native () @HotSpotIntrinsicCandidate ()
-
def
corr
(
col1:
String
,
col2:
String
)
:
Option
[
Double
]
Calculates the correlation coefficient for non-null pairs in two numeric columns.
For example, the following code:
import session.implicits._
val df = Seq((0.1, 0.5), (0.2, 0.6), (0.3, 0.7)).toDF("a", "b")
val res = df.stat.corr("a", "b").get
prints out the following result:
res: 0.9999999999999991
- col1
-
The name of the first numeric column to use.
- col2
-
The name of the second numeric column to use.
- returns
-
The correlation of the two numeric columns. If there is not enough data to generate the correlation, the method returns None.
- Since
-
0.2.0
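To make "correlation coefficient for non-null pairs" concrete, here is a plain-Scala sketch of a Pearson correlation that skips pairs with a missing value. It is not the library's implementation, which evaluates the statistic in Snowflake. The object name PearsonSketch is hypothetical, and the exact conditions under which this sketch returns None (fewer than two complete pairs, or zero variance) are assumptions made for the illustration.

```scala
object PearsonSketch {
  // Pearson correlation over pairs where both values are present.
  // Assumption: returns None when fewer than two complete pairs remain
  // or when either column has zero variance.
  def corr(pairs: Seq[(Option[Double], Option[Double])]): Option[Double] = {
    // Keep only the pairs where neither side is missing.
    val xs = pairs.collect { case (Some(a), Some(b)) => (a, b) }
    if (xs.size < 2) None
    else {
      val n = xs.size.toDouble
      val meanA = xs.map(_._1).sum / n
      val meanB = xs.map(_._2).sum / n
      // Sums of cross products and squared deviations from the means.
      val sab = xs.map { case (a, b) => (a - meanA) * (b - meanB) }.sum
      val saa = xs.map { case (a, _) => (a - meanA) * (a - meanA) }.sum
      val sbb = xs.map { case (_, b) => (b - meanB) * (b - meanB) }.sum
      if (saa == 0.0 || sbb == 0.0) None else Some(sab / math.sqrt(saa * sbb))
    }
  }
}
```

Applied to the three pairs from the example, the sketch yields a value that differs from 1.0 only by floating-point rounding, in line with the documented result.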
-
def
cov
(
col1:
String
,
col2:
String
)
:
Option
[
Double
]
Calculates the sample covariance for non-null pairs in two numeric columns.
For example, the following code:
import session.implicits._
val df = Seq((0.1, 0.5), (0.2, 0.6), (0.3, 0.7)).toDF("a", "b")
val res = df.stat.cov("a", "b").get
prints out the following result:
res: 0.010000000000000037
- col1
-
The name of the first numeric column to use.
- col2
-
The name of the second numeric column to use.
- returns
-
The sample covariance of the two numeric columns. If there is not enough data to generate the covariance, the method returns None.
- Since
-
0.2.0
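As a check on the example above, the result can be reproduced by hand, assuming the usual definition of sample covariance with an n - 1 denominator. The column means are 0.2 for a and 0.6 for b, so:

```latex
\operatorname{cov}(a, b)
  = \frac{1}{n-1} \sum_{i=1}^{n} (a_i - \bar{a})(b_i - \bar{b})
  = \frac{(-0.1)(-0.1) + (0)(0) + (0.1)(0.1)}{3 - 1}
  = \frac{0.02}{2}
  = 0.01
```

The documented output 0.010000000000000037 differs from 0.01 only by floating-point rounding.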
-
def
crosstab
(
col1:
String
,
col2:
String
)
:
DataFrame
Computes a pair-wise frequency table (a contingency table) for the specified columns. The method returns a DataFrame containing this table.
In the returned contingency table:
- The first column of each row contains the distinct values of col1.
- The name of the first column is the name of col1.
- The rest of the column names are the distinct values of col2.
- The counts are returned as Longs.
- For pairs that have no occurrences, the contingency table contains 0 as the count.
Note: The number of distinct values in col2 should not exceed 1000.
For example, the following code:
import session.implicits._
val df = Seq((1, 1), (1, 2), (2, 1), (2, 1), (2, 3), (3, 2), (3, 3)).toDF("key", "value")
val ct = df.stat.crosstab("key", "value")
ct.show()
prints out the following result:
---------------------------------------------------------------------------------------------
|"KEY"  |"CAST(1 AS NUMBER(38,0))"  |"CAST(2 AS NUMBER(38,0))"  |"CAST(3 AS NUMBER(38,0))"  |
---------------------------------------------------------------------------------------------
|1      |1                          |1                          |0                          |
|2      |2                          |0                          |1                          |
|3      |0                          |1                          |1                          |
---------------------------------------------------------------------------------------------
- col1
-
The name of the first column to use.
- col2
-
The name of the second column to use.
- returns
-
A DataFrame containing the contingency table.
- Since
-
0.2.0
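The shape of the contingency table described above can be sketched with ordinary Scala collections. This is an in-memory illustration only, not how the library computes the table. The object name CrosstabSketch is hypothetical, and sorting the row and column keys is an assumption made here to get a stable layout; the documentation does not state an ordering.

```scala
object CrosstabSketch {
  // Returns the distinct col2 values (the extra column names) and, per
  // distinct col1 value, the counts for each of those col2 values.
  def crosstab(pairs: Seq[(Int, Int)]): (Seq[Int], Seq[(Int, Seq[Long])]) = {
    // Count how often each (col1, col2) pair occurs.
    val counts: Map[(Int, Int), Long] =
      pairs.groupBy(identity).map { case (k, v) => k -> v.size.toLong }
    // Assumption: keys are sorted for a deterministic layout.
    val rowKeys = pairs.map(_._1).distinct.sorted
    val colKeys = pairs.map(_._2).distinct.sorted
    // Pairs that never occur get a count of 0.
    val rows = rowKeys.map(r => r -> colKeys.map(c => counts.getOrElse((r, c), 0L)))
    (colKeys, rows)
  }
}
```

With the seven pairs from the example, the rows for keys 1, 2 and 3 hold the counts (1, 1, 0), (2, 0, 1) and (0, 1, 1), matching the table shown above.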
-
final
def
eq
(
arg0:
AnyRef
)
:
Boolean
- Definition Classes
- AnyRef
-
def
equals
(
arg0:
Any
)
:
Boolean
- Definition Classes
- AnyRef → Any
-
final
def
getClass
()
:
Class
[_]
- Definition Classes
- AnyRef → Any
- Annotations
- @native () @HotSpotIntrinsicCandidate ()
-
def
hashCode
()
:
Int
- Definition Classes
- AnyRef → Any
- Annotations
- @native () @HotSpotIntrinsicCandidate ()
-
final
def
isInstanceOf
[
T0
]
:
Boolean
- Definition Classes
- Any
-
def
log
()
:
Logger
- Attributes
- protected[ internal ]
- Definition Classes
- Logging
-
def
logDebug
(
msg:
String
,
throwable:
Throwable
)
:
Unit
- Attributes
- protected[ internal ]
- Definition Classes
- Logging
-
def
logDebug
(
msg:
String
)
:
Unit
- Attributes
- protected[ internal ]
- Definition Classes
- Logging
-
def
logError
(
msg:
String
,
throwable:
Throwable
)
:
Unit
- Attributes
- protected[ internal ]
- Definition Classes
- Logging
-
def
logError
(
msg:
String
)
:
Unit
- Attributes
- protected[ internal ]
- Definition Classes
- Logging
-
def
logInfo
(
msg:
String
,
throwable:
Throwable
)
:
Unit
- Attributes
- protected[ internal ]
- Definition Classes
- Logging
-
def
logInfo
(
msg:
String
)
:
Unit
- Attributes
- protected[ internal ]
- Definition Classes
- Logging
-
def
logTrace
(
msg:
String
,
throwable:
Throwable
)
:
Unit
- Attributes
- protected[ internal ]
- Definition Classes
- Logging
-
def
logTrace
(
msg:
String
)
:
Unit
- Attributes
- protected[ internal ]
- Definition Classes
- Logging
-
def
logWarning
(
msg:
String
,
throwable:
Throwable
)
:
Unit
- Attributes
- protected[ internal ]
- Definition Classes
- Logging
-
def
logWarning
(
msg:
String
)
:
Unit
- Attributes
- protected[ internal ]
- Definition Classes
- Logging
-
final
def
ne
(
arg0:
AnyRef
)
:
Boolean
- Definition Classes
- AnyRef
-
final
def
notify
()
:
Unit
- Definition Classes
- AnyRef
- Annotations
- @native () @HotSpotIntrinsicCandidate ()
-
final
def
notifyAll
()
:
Unit
- Definition Classes
- AnyRef
- Annotations
- @native () @HotSpotIntrinsicCandidate ()
-
def
sampleBy
[
T
]
(
col:
String
,
fractions:
Map
[
T
,
Double
]
)
:
DataFrame
Returns a DataFrame containing a stratified sample without replacement, based on a Map that specifies the fraction for each stratum.
For example, the following code:
import session.implicits._
val df = Seq(("Bob", 17), ("Alice", 10), ("Nico", 8), ("Bob", 12)).toDF("name", "age")
val fractions = Map("Bob" -> 0.5, "Nico" -> 1.0)
df.stat.sampleBy("name", fractions).show()
prints out the following result:
------------------
|"NAME"  |"AGE"  |
------------------
|Bob     |17     |
|Nico    |8      |
------------------
- T
-
The type of the stratum.
- col
-
The name of the column that defines the strata.
- fractions
-
A Map that specifies the fraction to use for the sample for each stratum. If a stratum is not specified in the Map, the method uses 0 as the fraction.
- returns
-
A new DataFrame that contains the stratified sample.
- Since
-
0.2.0
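The role of the fractions Map, including the default of 0 for strata that are not listed, can be sketched in plain Scala. This sketch keeps each row independently with the probability given for its stratum, which is an assumption about the sampling scheme made for illustration; the library performs the sampling in Snowflake and may select rows differently. The object name SampleBySketch and the stratum and rng parameters are hypothetical.

```scala
import scala.util.Random

object SampleBySketch {
  // Keeps each row independently with the probability listed for its stratum.
  // Strata missing from `fractions` use 0.0, so their rows are always dropped.
  // A fraction of 1.0 always keeps the row because nextDouble() is below 1.0.
  def sampleBy[T, R](
      rows: Seq[R],
      stratum: R => T,
      fractions: Map[T, Double],
      rng: Random): Seq[R] =
    rows.filter(r => rng.nextDouble() < fractions.getOrElse(stratum(r), 0.0))
}
```

With the fractions from the example, the "Nico" row is always kept, the "Alice" row is never kept, and each "Bob" row is kept with probability 0.5, so the output varies between runs.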
-
def
sampleBy
[
T
]
(
col:
Column
,
fractions:
Map
[
T
,
Double
]
)
:
DataFrame
Returns a DataFrame containing a stratified sample without replacement, based on a Map that specifies the fraction for each stratum.
For example, the following code:
import com.snowflake.snowpark.functions.col
import session.implicits._
val df = Seq(("Bob", 17), ("Alice", 10), ("Nico", 8), ("Bob", 12)).toDF("name", "age")
val fractions = Map("Bob" -> 0.5, "Nico" -> 1.0)
df.stat.sampleBy(col("name"), fractions).show()
prints out the following result:
------------------
|"NAME"  |"AGE"  |
------------------
|Bob     |17     |
|Nico    |8      |
------------------
- T
-
The type of the stratum.
- col
-
An expression for the column that defines the strata.
- fractions
-
A Map that specifies the fraction to use for the sample for each stratum. If a stratum is not specified in the Map, the method uses 0 as the fraction.
- returns
-
A new DataFrame that contains the stratified sample.
- Since
-
0.2.0
-
final
def
synchronized
[
T0
]
(
arg0: ⇒
T0
)
:
T0
- Definition Classes
- AnyRef
-
def
toString
()
:
String
- Definition Classes
- AnyRef → Any
-
def
transformation
(
funcName:
String
)
(
func: ⇒
DataFrame
)
:
DataFrame
- Attributes
- protected
- Annotations
- @inline ()
-
final
def
wait
(
arg0:
Long
,
arg1:
Int
)
:
Unit
- Definition Classes
- AnyRef
- Annotations
- @throws ( ... )
-
final
def
wait
(
arg0:
Long
)
:
Unit
- Definition Classes
- AnyRef
- Annotations
- @throws ( ... ) @native ()
-
final
def
wait
()
:
Unit
- Definition Classes
- AnyRef
- Annotations
- @throws ( ... )
Deprecated Value Members
-
def
finalize
()
:
Unit
- Attributes
- protected[ lang ]
- Definition Classes
- AnyRef
- Annotations
- @throws ( classOf[java.lang.Throwable] ) @Deprecated
- Deprecated