class CopyableDataFrame extends DataFrame
DataFrame for loading data from files in a stage to a table. Objects of this type are returned by the DataFrameReader methods that load data from files (e.g. csv).
To save the data from the staged files to a table, call one of the copyInto() methods. These methods use the COPY INTO <table_name> command to copy the data to a specified table.
- Since
-
0.9.0
Value Members
-
final def !=(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
-
final def ##(): Int
- Definition Classes
- AnyRef → Any
-
final def ==(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
-
def agg(exprs: Array[Column]): DataFrame
Aggregate the data in the DataFrame. Use this method if you don't need to group the data (groupBy).
For the input value, pass in expressions that apply aggregation functions to columns (functions that are defined in the functions object).
The following example calculates the maximum value of the num_sales column and the mean value of the price column:
import com.snowflake.snowpark.functions._
val dfAgg = df.agg(Array(max($"num_sales"), mean($"price")))
- exprs
-
An array of expressions on columns.
- returns
- Definition Classes
- DataFrame
- Since
-
0.7.0
-
def agg[T](exprs: Seq[Column])(implicit arg0: ClassTag[T]): DataFrame
Aggregate the data in the DataFrame. Use this method if you don't need to group the data (groupBy).
For the input value, pass in expressions that apply aggregation functions to columns (functions that are defined in the functions object).
The following example calculates the maximum value of the num_sales column and the mean value of the price column:
import com.snowflake.snowpark.functions._
val dfAgg = df.agg(Seq(max($"num_sales"), mean($"price")))
- exprs
-
A list of expressions on columns.
- returns
- Definition Classes
- DataFrame
- Since
-
0.2.0
-
def agg(expr: Column, exprs: Column*): DataFrame
Aggregate the data in the DataFrame. Use this method if you don't need to group the data (groupBy).
For the input value, pass in expressions that apply aggregation functions to columns (functions that are defined in the functions object).
The following example calculates the maximum value of the num_sales column and the mean value of the price column:
import com.snowflake.snowpark.functions._
val dfAgg = df.agg(max($"num_sales"), mean($"price"))
- expr
-
A list of expressions on columns.
- returns
- Definition Classes
- DataFrame
- Since
-
0.1.0
-
def agg(exprs: Seq[(String, String)]): DataFrame
Aggregate the data in the DataFrame. Use this method if you don't need to group the data (groupBy).
For the input, pass in a Seq of pairs that specify the column names and aggregation functions. For each pair:
- Set the key to the name of the column to aggregate.
- Set the value to the name of the aggregation function to use on that column.
The following example calculates the maximum value of the num_sales column and the average value of the price column:
val dfAgg = df.agg(Seq("num_sales" -> "max", "price" -> "mean"))
This is equivalent to calling agg after calling groupBy without a column name:
val dfAgg = df.groupBy().agg(Seq(df("num_sales") -> "max", df("price") -> "mean"))
- exprs
-
A map of column names and aggregate functions.
- returns
- Definition Classes
- DataFrame
- Since
-
0.2.0
-
def agg(expr: (String, String), exprs: (String, String)*): DataFrame
Aggregate the data in the DataFrame. Use this method if you don't need to group the data (groupBy).
For the input, pass in pairs that specify the column names and aggregation functions. For each pair:
- Set the key to the name of the column to aggregate.
- Set the value to the name of the aggregation function to use on that column.
The following example calculates the maximum value of the num_sales column and the average value of the price column:
val dfAgg = df.agg("num_sales" -> "max", "price" -> "mean")
This is equivalent to calling agg after calling groupBy without a column name:
val dfAgg = df.groupBy().agg(df("num_sales") -> "max", df("price") -> "mean")
- expr
-
A map of column names and aggregate functions.
- returns
- Definition Classes
- DataFrame
- Since
-
0.1.0
-
def alias(alias: String): DataFrame
Returns the current DataFrame aliased as the input alias name.
-
def apply(colName: String): Column
Returns a reference to a column in the DataFrame. This method is identical to DataFrame.col.
- colName
-
The name of the column.
- returns
-
A Column
- Definition Classes
- DataFrame
- Since
-
0.1.0
-
final def asInstanceOf[T0]: T0
- Definition Classes
- Any
-
def async: CopyableDataFrameAsyncActor
Returns a CopyableDataFrameAsyncActor object that can be used to execute CopyableDataFrame actions asynchronously.
Example:
val asyncJob = session.read.schema(userSchema).csv(testFileOnStage).async.collect()
// At this point, the thread is not blocked. You can perform additional work before
// calling asyncJob.getResult() to retrieve the results of the action.
// NOTE: getResult() is a blocking call.
asyncJob.getResult()
- returns
-
A CopyableDataFrameAsyncActor object
- Definition Classes
- CopyableDataFrame → DataFrame
- Since
-
0.11.0
-
def cacheResult(): HasCachedResult
Caches the content of this DataFrame to create a new cached DataFrame.
All subsequent operations on the returned cached DataFrame are performed on the cached data and have no effect on the original DataFrame.
- returns
- Definition Classes
- DataFrame
- Since
-
0.4.0
-
def clone(): CopyableDataFrame
Returns a clone of this CopyableDataFrame.
- returns
- Definition Classes
- CopyableDataFrame → DataFrame → AnyRef
- Since
-
0.10.0
-
def col(colName: String): Column
Returns a reference to a column in the DataFrame.
-
def collect(): Array[Row]
Executes the query representing this DataFrame and returns the result as an Array of Row objects.
-
def copyInto(tableName: String, targetColumnNames: Seq[String], transformations: Seq[Column], options: Map[String, Any]): Unit
Executes a COPY INTO <table_name> command with the specified transformations and options to load data from files in a stage into a specified table.
copyInto is an action method (like the collect method), so calling the method executes the SQL statement to copy the data.
In addition, you can specify format type options or copy options that determine how the copy operation should be performed.
When copying the data into the table, you can apply transformations to the data from the files to:
- Rename the columns
- Change the order of the columns
- Omit or insert columns
- Cast the value in a column to a specific type
You can use the same techniques described in Transforming Data During Load, expressed as a Seq of Column expressions that correspond to the SELECT statement parameters in the COPY INTO <table_name> command.
You can specify a subset of the table columns to copy into. The number of provided column names must match the number of transformations.
For example, suppose the target table T has 3 columns: "ID", "A" and "A_LEN". "ID" is an AUTOINCREMENT column, which should be excluded from this copy into action. The following code loads data from the path specified by myFileStage to the table T. The example transforms the data from the file by inserting the value of the first column into the column A and inserting the length of that value into the column A_LEN. The example also uses a Map to set the FORCE and skip_header options for the copy operation.
import com.snowflake.snowpark.functions._
val df = session.read.schema(userSchema).option("skip_header", 1).csv(myFileStage)
val transformations = Seq(col("$1"), length(col("$1")))
val targetColumnNames = Seq("A", "A_LEN")
val extraOptions = Map("FORCE" -> "true", "skip_header" -> 2)
df.copyInto("T", targetColumnNames, transformations, extraOptions)
- tableName
-
Name of the table where the data should be saved.
- targetColumnNames
-
Names of the columns in the table where the data should be saved.
- transformations
-
Seq of Column expressions that specify the transformations to apply (similar to transformation parameters).
- options
-
Map of the names of options (e.g. compression, skip_header, etc.) and their corresponding values.
NOTE: By default, the CopyableDataFrame object uses the options set in the DataFrameReader used to create that object. You can use this options parameter to override the default options or set additional options.
- Since
-
0.11.0
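The NOTE on the options parameter says that options passed to copyInto override or extend the options set on the DataFrameReader. The following is a minimal sketch of that precedence using plain Scala maps, assuming a simple key-by-key merge. effectiveOptions is a hypothetical helper for illustration, not part of the Snowpark API, and the library's actual handling (for example, of option-name case) may differ.

```scala
object CopyOptionsSketch {
  // Hypothetical helper (not a Snowpark method): entries in copyOptions win
  // over entries with the same key in readerOptions; new keys are added.
  def effectiveOptions(
      readerOptions: Map[String, Any],
      copyOptions: Map[String, Any]): Map[String, Any] =
    readerOptions ++ copyOptions

  def main(args: Array[String]): Unit = {
    // Mirrors the example above: the reader sets skip_header to 1,
    // and the options passed to copyInto set FORCE and skip_header.
    val readerOptions = Map[String, Any]("skip_header" -> 1)
    val extraOptions = Map[String, Any]("FORCE" -> "true", "skip_header" -> 2)
    effectiveOptions(readerOptions, extraOptions).foreach(println)
  }
}
```

Under this assumption, the copy operation in the example above would run with skip_header set to 2 rather than 1.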
-
def copyInto(tableName: String, transformations: Seq[Column], options: Map[String, Any]): Unit
Executes a COPY INTO <table_name> command with the specified transformations and options to load data from files in a stage into a specified table.
copyInto is an action method (like the collect method), so calling the method executes the SQL statement to copy the data.
In addition, you can specify format type options or copy options that determine how the copy operation should be performed.
When copying the data into the table, you can apply transformations to the data from the files to:
- Rename the columns
- Change the order of the columns
- Omit or insert columns
- Cast the value in a column to a specific type
You can use the same techniques described in Transforming Data During Load, expressed as a Seq of Column expressions that correspond to the SELECT statement parameters in the COPY INTO <table_name> command.
For example, the following code loads data from the path specified by myFileStage to the table T. The example transforms the data from the file by inserting the value of the first column into the first column of table T and inserting the length of that value into the second column of table T. The example also uses a Map to set the FORCE and skip_header options for the copy operation.
import com.snowflake.snowpark.functions._
val df = session.read.schema(userSchema).option("skip_header", 1).csv(myFileStage)
val transformations = Seq(col("$1"), length(col("$1")))
val extraOptions = Map("FORCE" -> "true", "skip_header" -> 2)
df.copyInto("T", transformations, extraOptions)
- tableName
-
Name of the table where the data should be saved.
- transformations
-
Seq of Column expressions that specify the transformations to apply (similar to transformation parameters).
- options
-
Map of the names of options (e.g. compression, skip_header, etc.) and their corresponding values.
NOTE: By default, the CopyableDataFrame object uses the options set in the DataFrameReader used to create that object. You can use this options parameter to override the default options or set additional options.
- Since
-
0.9.0
-
def copyInto(tableName: String, transformations: Seq[Column]): Unit
Executes a COPY INTO <table_name> command with the specified transformations to load data from files in a stage into a specified table.
copyInto is an action method (like the collect method), so calling the method executes the SQL statement to copy the data.
When copying the data into the table, you can apply transformations to the data from the files to:
- Rename the columns
- Change the order of the columns
- Omit or insert columns
- Cast the value in a column to a specific type
You can use the same techniques described in Transforming Data During Load, expressed as a Seq of Column expressions that correspond to the SELECT statement parameters in the COPY INTO <table_name> command.
For example, the following code loads data from the path specified by myFileStage to the table T. The example transforms the data from the file by inserting the value of the first column into the first column of table T and inserting the length of that value into the second column of table T.
import com.snowflake.snowpark.functions._
val df = session.read.schema(userSchema).csv(myFileStage)
val transformations = Seq(col("$1"), length(col("$1")))
df.copyInto("T", transformations)
- tableName
-
Name of the table where the data should be saved.
- transformations
-
Seq of Column expressions that specify the transformations to apply (similar to transformation parameters).
- Since
-
0.9.0
-
def copyInto(tableName: String): Unit
Executes a COPY INTO <table_name> command to load data from files in a stage into a specified table.
copyInto is an action method (like the collect method), so calling the method executes the SQL statement to copy the data.
For example, the following code loads data from the path specified by myFileStage to the table T:
val df = session.read.schema(userSchema).csv(myFileStage)
df.copyInto("T")
- tableName
-
Name of the table where the data should be saved.
- Since
-
0.9.0
-
def count(): Long
Executes the query representing this DataFrame and returns the number of rows in the result (similar to the COUNT function in SQL).
- returns
-
The number of rows.
- Definition Classes
- DataFrame
- Since
-
0.1.0
-
def createOrReplaceTempView(multipartIdentifier: List[String]): Unit
Creates a temporary view that returns the same results as this DataFrame.
You can use the view in subsequent SQL queries and statements during the current session. The temporary view is only available in the session in which it is created.
In multipartIdentifier, you can include the database and schema name to specify a fully-qualified name. If no database name or schema name is specified, the view will be created in the current database or schema.
The view name must be a valid Snowflake identifier.
- multipartIdentifier
-
A list of strings that specify the database name, schema name, and view name.
- Definition Classes
- DataFrame
- Since
-
0.5.0
-
def createOrReplaceTempView(multipartIdentifier: Seq[String]): Unit
Creates a temporary view that returns the same results as this DataFrame.
You can use the view in subsequent SQL queries and statements during the current session. The temporary view is only available in the session in which it is created.
In multipartIdentifier, you can include the database and schema name to specify a fully-qualified name. If no database name or schema name is specified, the view will be created in the current database or schema.
The view name must be a valid Snowflake identifier.
- multipartIdentifier
-
A sequence of strings that specify the database name, schema name, and view name.
- Definition Classes
- DataFrame
- Since
-
0.5.0
-
def createOrReplaceTempView(viewName: String): Unit
Creates a temporary view that returns the same results as this DataFrame.
You can use the view in subsequent SQL queries and statements during the current session. The temporary view is only available in the session in which it is created.
For viewName, you can include the database and schema name (i.e. specify a fully-qualified name). If no database name or schema name is specified, the view will be created in the current database or schema.
viewName must be a valid Snowflake identifier.
- viewName
-
The name of the view to create or replace.
- Definition Classes
- DataFrame
- Since
-
0.4.0
-
def createOrReplaceView(multipartIdentifier: List[String]): Unit
Creates a view that captures the computation expressed by this DataFrame.
In multipartIdentifier, you can include the database and schema name to specify a fully-qualified name. If no database name or schema name is specified, the view will be created in the current database or schema.
The view name must be a valid Snowflake identifier.
- multipartIdentifier
-
A list of strings that specifies the database name, schema name, and view name.
- Definition Classes
- DataFrame
- Since
-
0.5.0
-
def createOrReplaceView(multipartIdentifier: Seq[String]): Unit
Creates a view that captures the computation expressed by this DataFrame.
In multipartIdentifier, you can include the database and schema name to specify a fully-qualified name. If no database name or schema name is specified, the view will be created in the current database or schema.
The view name must be a valid Snowflake identifier.
- multipartIdentifier
-
A sequence of strings that specifies the database name, schema name, and view name.
- Definition Classes
- DataFrame
- Since
-
0.5.0
-
def createOrReplaceView(viewName: String): Unit
Creates a view that captures the computation expressed by this DataFrame.
For viewName, you can include the database and schema name (i.e. specify a fully-qualified name). If no database name or schema name is specified, the view will be created in the current database or schema.
viewName must be a valid Snowflake identifier.
- viewName
-
The name of the view to create or replace.
- Definition Classes
- DataFrame
- Since
-
0.1.0
-
def crossJoin(right: DataFrame): DataFrame
Performs a cross join, which returns the cartesian product of the current DataFrame and another DataFrame (right).
If the current and right DataFrames have columns with the same name, and you need to refer to one of these columns in the returned DataFrame, use the apply or col function on the current or right DataFrame to disambiguate references to these columns.
For example:
val dfCrossJoin = left.crossJoin(right)
val project = dfCrossJoin.select(left("common_col") + right("common_col"))
- Definition Classes
- DataFrame
- Since
-
0.1.0
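To make "cartesian product" concrete, the following sketch models a cross join on plain Scala sequences rather than on Snowpark DataFrames. The crossJoin helper here is a stand-in written for illustration only. It pairs every row on the left with every row on the right, so the result has left.size * right.size rows.

```scala
object CrossJoinSketch {
  // Plain-collections stand-in for a cross join: every left row is paired
  // with every right row.
  def crossJoin[A, B](left: Seq[A], right: Seq[B]): Seq[(A, B)] =
    for (l <- left; r <- right) yield (l, r)

  def main(args: Array[String]): Unit = {
    val left = Seq(1, 2)
    val right = Seq("x", "y", "z")
    val joined = crossJoin(left, right)
    // The row count is the product of the two input sizes.
    println(joined.size)
    joined.foreach(println)
  }
}
```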
-
def cube(cols: Seq[String]): RelationalGroupedDataFrame
Performs an SQL GROUP BY CUBE on the DataFrame.
- cols
-
A list of the names of columns to use.
- returns
- Definition Classes
- DataFrame
- Since
-
0.2.0
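In SQL, GROUP BY CUBE groups by every subset of the listed columns, including the empty set (the grand total). The following sketch enumerates those grouping sets in plain Scala. cubeGroupingSets is a hypothetical helper for illustration, not a Snowpark method, and it assumes the column names are distinct.

```scala
object CubeSketch {
  // Hypothetical helper: lists the grouping sets that GROUP BY CUBE(cols)
  // expands to, from the full column list down to the empty set.
  def cubeGroupingSets(cols: Seq[String]): Seq[Seq[String]] =
    (cols.size to 0 by -1).flatMap(k => cols.combinations(k)).toSeq

  def main(args: Array[String]): Unit = {
    // Two columns give 2^2 = 4 grouping sets: (a, b), (a), (b), ().
    cubeGroupingSets(Seq("a", "b")).foreach(s => println(s.mkString("(", ", ", ")")))
  }
}
```

With n columns the number of grouping sets is 2^n, which is why cubes over many columns can produce large results.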
-
def cube(first: String, remaining: String*): RelationalGroupedDataFrame
Performs an SQL GROUP BY CUBE on the DataFrame.
- first
-
The name of the first column to use.
- remaining
-
A list of the names of additional columns to use.
- returns
- Definition Classes
- DataFrame
- Since
-
0.1.0
-
def cube(cols: Array[Column]): RelationalGroupedDataFrame
Performs an SQL GROUP BY CUBE on the DataFrame.
- cols
-
A list of expressions for columns to use.
- returns
- Definition Classes
- DataFrame
- Since
-
0.9.0
-
def cube[T](cols: Seq[Column])(implicit arg0: ClassTag[T]): RelationalGroupedDataFrame
Performs an SQL GROUP BY CUBE on the DataFrame.
- cols
-
A list of expressions for columns to use.
- returns
- Definition Classes
- DataFrame
- Since
-
0.2.0
-
def cube(first: Column, remaining: Column*): RelationalGroupedDataFrame
Performs an SQL GROUP BY CUBE on the DataFrame.
- first
-
The expression for the first column to use.
- remaining
-
A list of expressions for additional columns to use.
- returns
- Definition Classes
- DataFrame
- Since
-
0.1.0
-
def disambiguate(lhs: DataFrame, rhs: DataFrame, joinType: JoinType, usingColumns: Seq[String]): (DataFrame, DataFrame)
- Attributes
- protected
- Definition Classes
- DataFrame
-
def distinct(): DataFrame
Returns a new DataFrame that contains only the rows with distinct values from the current DataFrame.
-
def drop(cols: Array[Column]): DataFrame
Returns a new DataFrame that excludes the specified column expressions from the output.
This is functionally equivalent to calling select and passing in all columns except the ones to exclude.
This method throws a SnowparkClientException if:
- A specified column does not have a name, or
- The resulting DataFrame has no output columns.
- cols
-
An array of the names of the columns to exclude.
- returns
- Definition Classes
- DataFrame
- Since
-
0.7.0
-
def drop[T](cols: Seq[Column])(implicit arg0: ClassTag[T]): DataFrame
Returns a new DataFrame that excludes the specified column expressions from the output.
This is functionally equivalent to calling select and passing in all columns except the ones to exclude.
This method throws a SnowparkClientException if:
- A specified column does not have a name, or
- The resulting DataFrame has no output columns.
- cols
-
A list of the names of the columns to exclude.
- returns
- Definition Classes
- DataFrame
- Since
-
0.2.0
-
def drop(first: Column, remaining: Column*): DataFrame
Returns a new DataFrame that excludes the columns specified by the expressions from the output.
This is functionally equivalent to calling select and passing in all columns except the ones to exclude.
This method throws a SnowparkClientException if:
- A specified column does not have a name, or
- The resulting DataFrame has no output columns.
- first
-
The expression for the first column to exclude.
- remaining
-
A list of expressions for additional columns to exclude.
- returns
- Definition Classes
- DataFrame
- Since
-
0.1.0
-
def drop(colNames: Array[String]): DataFrame
Returns a new DataFrame that excludes the columns with the specified names from the output.
This is functionally equivalent to calling select and passing in all columns except the ones to exclude.
Throws SnowparkClientException if the resulting DataFrame contains no output columns.
- colNames
-
An array of the names of columns to exclude.
- returns
- Definition Classes
- DataFrame
- Since
-
0.7.0
-
def drop(colNames: Seq[String]): DataFrame
Returns a new DataFrame that excludes the columns with the specified names from the output.
This is functionally equivalent to calling select and passing in all columns except the ones to exclude.
Throws SnowparkClientException if the resulting DataFrame contains no output columns.
- colNames
-
A list of the names of columns to exclude.
- returns
- Definition Classes
- DataFrame
- Since
-
0.2.0
-
def drop(first: String, remaining: String*): DataFrame
Returns a new DataFrame that excludes the columns with the specified names from the output.
This is functionally equivalent to calling select and passing in all columns except the ones to exclude.
Throws SnowparkClientException if the resulting DataFrame contains no output columns.
- first
-
The name of the first column to exclude.
- remaining
-
A list of the names of additional columns to exclude.
- returns
- Definition Classes
- DataFrame
- Since
-
0.1.0
-
def dropDuplicates(colNames: String*): DataFrame
Creates a new DataFrame by removing duplicated rows on a given subset of columns. If no subset of columns is specified, this function is the same as the distinct() function. The result is non-deterministic when removing duplicated rows based on a subset of the columns rather than all columns.
For example, suppose we have a DataFrame df that contains three rows with the columns (a, b, c): (1, 1, 1), (1, 1, 2), (1, 2, 3). The result of df.dropDuplicates("a", "b") can be either (1, 1, 1), (1, 2, 3) or (1, 1, 2), (1, 2, 3).
- returns
- Definition Classes
- DataFrame
- Since
-
0.10.0
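The non-determinism described for dropDuplicates can be reproduced on plain Scala collections. The sketch below is not Snowpark code. It keeps one row per distinct (a, b) pair and shows that which row survives depends on the order in which rows are seen, which is the one thing a query engine does not guarantee. dropDuplicatesOnAB is a hypothetical helper for illustration.

```scala
object DropDuplicatesSketch {
  type Row = (Int, Int, Int) // columns a, b, c

  // Hypothetical stand-in: keep the first row seen for each (a, b) pair.
  // Rows inside each group keep their input order, so the survivor is the
  // first matching row in the input.
  def dropDuplicatesOnAB(rows: Seq[Row]): Set[Row] =
    rows.groupBy(r => (r._1, r._2)).values.map(_.head).toSet

  def main(args: Array[String]): Unit = {
    val rows: Seq[Row] = Seq((1, 1, 1), (1, 1, 2), (1, 2, 3))
    // Same data, two arrival orders, two different (both valid) results.
    println(dropDuplicatesOnAB(rows))
    println(dropDuplicatesOnAB(rows.reverse))
  }
}
```

Both outputs match the two possible results listed in the description, so code that depends on a specific surviving row should not rely on this method alone.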
-
final def eq(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
-
def equals(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
-
def except(other: DataFrame): DataFrame
Returns a new DataFrame that contains all the rows from the current DataFrame except for the rows that also appear in another DataFrame (other). Duplicate rows are eliminated.
For example:
val df1except2 = df1.except(df2)
- Definition Classes
- DataFrame
- Since
-
0.1.0
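The two parts of the except description (rows that also appear in other are dropped, and duplicates are eliminated) can be sketched on plain Scala sequences. This models the semantics only and is not how Snowpark evaluates the operation. The except helper below is a stand-in written for illustration.

```scala
object ExceptSketch {
  // Stand-in for the described semantics: remove duplicates from the left
  // side, then drop every row that also appears on the right side.
  def except[A](left: Seq[A], right: Seq[A]): Seq[A] = {
    val exclude = right.toSet
    left.distinct.filterNot(row => exclude.contains(row))
  }

  def main(args: Array[String]): Unit = {
    val df1 = Seq(1, 1, 2, 3)
    val df2 = Seq(3, 4)
    // 3 is removed because it appears in df2; the duplicate 1 collapses.
    println(except(df1, df2))
  }
}
```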
-
def explain(): Unit
Prints the list of queries that will be executed to evaluate this DataFrame.
-
def filter(condition: Column): DataFrame
Filters rows based on the specified conditional expression (similar to WHERE in SQL).
-
def first(n: Int): Array[Row]
Executes the query representing this DataFrame and returns the first n rows of the results.
- n
-
The number of rows to return.
- returns
-
An Array of the first n Row objects. If n is negative or larger than the number of rows in the results, returns all rows in the results.
- Definition Classes
- DataFrame
- Since
-
0.2.0
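The boundary behavior described for first(n) (a negative or oversized n returns every row) is easy to overlook. The following sketch models it on a plain Scala sequence. The first helper is a stand-in for illustration and does not execute a query.

```scala
object FirstSketch {
  // Stand-in for the documented contract: a negative n means "all rows",
  // and take(n) already returns all rows when n exceeds the size.
  def first[A](rows: Seq[A], n: Int): Seq[A] =
    if (n < 0) rows else rows.take(n)

  def main(args: Array[String]): Unit = {
    val rows = Seq("r1", "r2", "r3")
    println(first(rows, 2))   // the first two rows
    println(first(rows, -1))  // all rows
    println(first(rows, 10))  // all rows
  }
}
```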
-
def first(): Option[Row]
Executes the query representing this DataFrame and returns the first row of results.
-
def flatten(input: Column, path: String, outer: Boolean, recursive: Boolean, mode: String): DataFrame
Flattens (explodes) compound values into multiple rows (similar to the SQL FLATTEN function).
The flatten method adds the following columns to the returned DataFrame:
- SEQ
- KEY
- PATH
- INDEX
- VALUE
- THIS
If this DataFrame also has columns with the names above, you can disambiguate the columns by using the this("value") syntax.
For example, if the current DataFrame has a column named value:
val table1 = session.sql(
  "select parse_json(value) as value from values('[1,2]') as T(value)")
val flattened = table1.flatten(table1("value"), "", outer = false, recursive = false, "both")
flattened.select(table1("value"), flattened("value").as("newValue")).show()
- input
-
The expression that will be unnested into rows. The expression must be of data type VARIANT, OBJECT, or ARRAY.
- path
-
The path to the element within a VARIANT data structure which needs to be flattened. Can be a zero-length string (i.e. empty path) if the outermost element is to be flattened.
- outer
-
If FALSE, any input rows that cannot be expanded, either because they cannot be accessed in the path or because they have zero fields or entries, are completely omitted from the output. Otherwise, exactly one row is generated for zero-row expansions (with NULL in the KEY, INDEX, and VALUE columns).
- recursive
-
If FALSE, only the element referenced by PATH is expanded. Otherwise, the expansion is performed for all sub-elements recursively.
- mode
-
Specifies whether only OBJECT, ARRAY, or BOTH should be flattened.
- returns
-
A DataFrame containing the flattened values.
- Definition Classes
- DataFrame
- Since
-
0.2.0
-
def flatten(input: Column): DataFrame
Flattens (explodes) compound values into multiple rows (similar to the SQL FLATTEN function).
The flatten method adds the following columns to the returned DataFrame:
- SEQ
- KEY
- PATH
- INDEX
- VALUE
- THIS
If this DataFrame also has columns with the names above, you can disambiguate the columns by using the this("value") syntax.
For example, if the current DataFrame has a column named value:
val table1 = session.sql(
  "select parse_json(value) as value from values('[1,2]') as T(value)")
val flattened = table1.flatten(table1("value"))
flattened.select(table1("value"), flattened("value").as("newValue")).show()
- input
-
The expression that will be unnested into rows. The expression must be of data type VARIANT, OBJECT, or ARRAY.
- returns
-
A DataFrame containing the flattened values.
- Definition Classes
- DataFrame
- Since
-
0.2.0
-
final def getClass(): Class[_]
- Definition Classes
- AnyRef → Any
- Annotations
- @native () @HotSpotIntrinsicCandidate ()
-
def groupBy(cols: Array[String]): RelationalGroupedDataFrame
Groups rows by the columns specified by name (similar to GROUP BY in SQL).
This method returns a RelationalGroupedDataFrame that you can use to perform aggregations on each group of data.
- cols
-
An array of the names of columns to group by.
- returns
- Definition Classes
- DataFrame
- Since
-
0.7.0
-
def groupBy(cols: Seq[String]): RelationalGroupedDataFrame
Groups rows by the columns specified by name (similar to GROUP BY in SQL).
This method returns a RelationalGroupedDataFrame that you can use to perform aggregations on each group of data.
- cols
-
A list of the names of columns to group by.
- returns
- Definition Classes
- DataFrame
- Since
-
0.2.0
-
def groupBy(first: String, remaining: String*): RelationalGroupedDataFrame
Groups rows by the columns specified by name (similar to GROUP BY in SQL).
This method returns a RelationalGroupedDataFrame that you can use to perform aggregations on each group of data.
- first
-
The name of the first column to group by.
- remaining
-
A list of the names of additional columns to group by.
- returns
- Definition Classes
- DataFrame
- Since
-
0.1.0
-
def groupBy(cols: Array[Column]): RelationalGroupedDataFrame
Groups rows by the columns specified by expressions (similar to GROUP BY in SQL).
This method returns a RelationalGroupedDataFrame that you can use to perform aggregations on each group of data.
- cols
-
An array of expressions on columns.
- returns
- Definition Classes
- DataFrame
- Since
-
0.7.0
-
def groupBy[T](cols: Seq[Column])(implicit arg0: ClassTag[T]): RelationalGroupedDataFrame
Groups rows by the columns specified by expressions (similar to GROUP BY in SQL).
This method returns a RelationalGroupedDataFrame that you can use to perform aggregations on each group of data.
- cols
-
A list of expressions on columns.
- returns
- Definition Classes
- DataFrame
- Since
-
0.2.0
-
def
groupBy
()
:
RelationalGroupedDataFrame
Returns a RelationalGroupedDataFrame that you can use to perform aggregations on the underlying DataFrame.
- returns
- Definition Classes
- DataFrame
- Since
-
0.1.0
-
def
groupBy
(
first:
Column
,
remaining:
Column
*
)
:
RelationalGroupedDataFrame
Groups rows by the columns specified by expressions (similar to GROUP BY in SQL).
This method returns a RelationalGroupedDataFrame that you can use to perform aggregations on each group of data.
- first
-
The expression for the first column to group by.
- remaining
-
A list of expressions for additional columns to group by.
- returns
- Definition Classes
- DataFrame
- Since
-
0.1.0
-
def
groupByGroupingSets
(
groupingSets:
Seq
[
GroupingSets
]
)
:
RelationalGroupedDataFrame
Performs an SQL GROUP BY GROUPING SETS on the DataFrame.
GROUP BY GROUPING SETS is an extension of the GROUP BY clause that allows computing multiple GROUP BY clauses in a single statement. The group set is a set of dimension columns.
GROUP BY GROUPING SETS is equivalent to the UNION of two or more GROUP BY operations in the same result set:
df.groupByGroupingSets(GroupingSets(Set(col("a"))))
is equivalent to
df.groupBy("a")
and
df.groupByGroupingSets(GroupingSets(Set(col("a")), Set(col("b"))))
is equivalent to
df.groupBy("a")
union
df.groupBy("b")
- groupingSets
-
A list of GroupingSets objects.
- Definition Classes
- DataFrame
- Since
-
0.4.0
-
def
groupByGroupingSets
(
first:
GroupingSets
,
remaining:
GroupingSets
*
)
:
RelationalGroupedDataFrame
Performs an SQL GROUP BY GROUPING SETS on the DataFrame.
GROUP BY GROUPING SETS is an extension of the GROUP BY clause that allows computing multiple GROUP BY clauses in a single statement. The group set is a set of dimension columns.
GROUP BY GROUPING SETS is equivalent to the UNION of two or more GROUP BY operations in the same result set:
df.groupByGroupingSets(GroupingSets(Set(col("a"))))
is equivalent to
df.groupBy("a")
and
df.groupByGroupingSets(GroupingSets(Set(col("a")), Set(col("b"))))
is equivalent to
df.groupBy("a")
union
df.groupBy("b")
- first
-
A GroupingSets object.
- remaining
-
A list of additional GroupingSets objects.
- Definition Classes
- DataFrame
- Since
-
0.4.0
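The equivalence described for groupByGroupingSets can be illustrated without a Snowflake session. The following is a minimal in-memory sketch in plain Scala, not Snowpark code: the names GroupingSetsSketch, Sale, and Grouped are hypothetical, and a dimension that is not part of a grouping set is modeled as None where SQL would return NULL.

```scala
object GroupingSetsSketch {
  // Hypothetical input row: two dimension columns (a, b) and one measure.
  final case class Sale(a: String, b: String, amount: Int)

  // Output row. A dimension outside the current grouping set is None,
  // standing in for the NULL that SQL would return in that column.
  final case class Grouped(a: Option[String], b: Option[String], total: Int)

  // Models GROUP BY a with SUM(amount); keys are sorted for a stable order.
  def groupByA(rows: Seq[Sale]): Seq[Grouped] =
    rows.groupBy(_.a).toSeq.sortBy(_._1).map { case (key, group) =>
      Grouped(Some(key), None, group.map(_.amount).sum)
    }

  // Models GROUP BY b with SUM(amount).
  def groupByB(rows: Seq[Sale]): Seq[Grouped] =
    rows.groupBy(_.b).toSeq.sortBy(_._1).map { case (key, group) =>
      Grouped(None, Some(key), group.map(_.amount).sum)
    }

  // Models GROUPING SETS ((a), (b)) as both grouped results in one result set.
  def groupingSetsAB(rows: Seq[Sale]): Seq[Grouped] =
    groupByA(rows) ++ groupByB(rows)
}
```

Each input row contributes to one output row per grouping set, which is why the combined result has as many rows as the two separate GROUP BY results together.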
-
def
hashCode
()
:
Int
- Definition Classes
- AnyRef → Any
- Annotations
- @native () @HotSpotIntrinsicCandidate ()
-
def
intersect
(
other:
DataFrame
)
:
DataFrame
Returns a new DataFrame that contains the intersection of rows from the current DataFrame and another DataFrame (other). Duplicate rows are eliminated.
For example:
val dfIntersectionOf1and2 = df1.intersect(df2)
- Definition Classes
- DataFrame
- Since
-
0.1.0
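To make the "duplicate rows are eliminated" behavior of intersect concrete, here is a small in-memory model in plain Scala. It is only a sketch of the set semantics, not Snowpark code; IntersectSketch and intersectDistinct are hypothetical names, and the ordering shown is a property of this model, not a guarantee about query results.

```scala
object IntersectSketch {
  // Keeps each row of `left` that also occurs in `right`, at most once.
  def intersectDistinct[A](left: Seq[A], right: Seq[A]): Seq[A] = {
    val inRight = right.toSet        // membership test for the other side
    left.distinct.filter(inRight)    // drop duplicates, then keep shared rows
  }
}
```

For instance, intersecting the rows 1, 1, 2, 3 with 1, 3, 3, 4 yields 1 and 3 once each, even though both values are repeated in the inputs.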
-
final
def
isInstanceOf
[
T0
]
:
Boolean
- Definition Classes
- Any
-
def
join
(
func:
Column
,
partitionBy:
Seq
[
Column
]
,
orderBy:
Seq
[
Column
]
)
:
DataFrame
Joins the current DataFrame with the output of the specified user-defined table function (UDTF) func.
To specify a PARTITION BY or ORDER BY clause, use the partitionBy and orderBy arguments.
For example:
val tf = session.udtf.registerTemporary(TableFunc1)
df.join(tf(Map("arg1" -> df("col1"))), Seq(df("col2")), Seq(df("col1")))
- func
-
TableFunction object that represents a user-defined table function.
- partitionBy
-
A list of columns partitioned by.
- orderBy
-
A list of columns ordered by.
- Definition Classes
- DataFrame
- Since
-
1.10.0
-
def
join
(
func:
Column
)
:
DataFrame
Joins the current DataFrame with the output of the specified table function func.
For example:
// The following example uses the flatten function to explode compound values from
// column 'a' in this DataFrame into multiple columns.
import com.snowflake.snowpark.functions._
import com.snowflake.snowpark.tableFunctions._
df.join(
  tableFunctions.flatten(parse_json(df("a")))
)
- func
-
TableFunction object, which can be one of the values in the tableFunctions object or an object that you create from TableFunction.apply().
- Definition Classes
- DataFrame
- Since
-
1.10.0
-
def
join
(
func:
TableFunction
,
args:
Map
[
String
,
Column
]
,
partitionBy:
Seq
[
Column
]
,
orderBy:
Seq
[
Column
]
)
:
DataFrame
Joins the current DataFrame with the output of the specified user-defined table function (UDTF) func.
To pass arguments to the table function, use the args argument of this method. Pass in a Map of parameter names and values. In these values, you can include references to columns in this DataFrame.
To specify a PARTITION BY or ORDER BY clause, use the partitionBy and orderBy arguments.
For example:
// The following example passes the values in the column `col1` to the
// user-defined tabular function (UDTF) `udtf`, partitioning the
// data by `col2` and sorting the data by `col1`. The example returns
// a new DataFrame that joins the contents of the current DataFrame with
// the output of the UDTF.
df.join(
  TableFunction("udtf"),
  Map("arg1" -> df("col1")),
  Seq(df("col2")),
  Seq(df("col1"))
)
- func
-
TableFunction object that represents a user-defined table function (UDTF).
- args
-
Map of arguments to pass to the specified table function. Some functions, like flatten, have named parameters. Use this map to specify the parameter names and their corresponding values.
- partitionBy
-
A list of columns partitioned by.
- orderBy
-
A list of columns ordered by.
- Definition Classes
- DataFrame
- Since
-
1.7.0
-
def
join
(
func:
TableFunction
,
args:
Map
[
String
,
Column
]
)
:
DataFrame
Joins the current DataFrame with the output of the specified table function func that takes named parameters (e.g. flatten).
To pass arguments to the table function, use the args argument of this method. Pass in a Map of parameter names and values. In these values, you can include references to columns in this DataFrame.
For example:
// The following example uses the flatten function to explode compound values from
// column 'a' in this DataFrame into multiple columns.
import com.snowflake.snowpark.functions._
import com.snowflake.snowpark.tableFunctions._
df.join(
  TableFunction("flatten"),
  Map("input" -> parse_json(df("a")))
)
- func
-
TableFunction object, which can be one of the values in the tableFunctions object or an object that you create from the TableFunction class.
- args
-
Map of arguments to pass to the specified table function. Some functions, like
flatten
, have named parameters. Use this map to specify the parameter names and their corresponding values.
- Definition Classes
- DataFrame
- Since
-
0.4.0
-
def
join
(
func:
TableFunction
,
args:
Seq
[
Column
]
,
partitionBy:
Seq
[
Column
]
,
orderBy:
Seq
[
Column
]
)
:
DataFrame
Joins the current DataFrame with the output of the specified user-defined table function (UDTF) func.
To pass arguments to the table function, use the args argument of this method. In the table function arguments, you can include references to columns in this DataFrame.
To specify a PARTITION BY or ORDER BY clause, use the partitionBy and orderBy arguments.
For example:
// The following example passes the values in the column `col1` to the
// user-defined tabular function (UDTF) `udtf`, partitioning the
// data by `col2` and sorting the data by `col1`. The example returns
// a new DataFrame that joins the contents of the current DataFrame with
// the output of the UDTF.
df.join(TableFunction("udtf"), Seq(df("col1")), Seq(df("col2")), Seq(df("col1")))
- func
-
TableFunction object that represents a user-defined table function (UDTF).
- args
-
A list of arguments to pass to the specified table function.
- partitionBy
-
A list of columns partitioned by.
- orderBy
-
A list of columns ordered by.
- Definition Classes
- DataFrame
- Since
-
1.7.0
-
def
join
(
func:
TableFunction
,
args:
Seq
[
Column
]
)
:
DataFrame
Joins the current DataFrame with the output of the specified table function func.
To pass arguments to the table function, use the args argument of this method. In the table function arguments, you can include references to columns in this DataFrame.
For example:
// The following example uses the split_to_table function to split
// column 'a' in this DataFrame on the character ','.
// Each row in this DataFrame will produce N rows in the resulting DataFrame,
// where N is the number of tokens in the column 'a'.
import com.snowflake.snowpark.functions._
import com.snowflake.snowpark.tableFunctions._
df.join(split_to_table, Seq(df("a"), lit(",")))
- func
-
TableFunction object, which can be one of the values in the tableFunctions object or an object that you create from the TableFunction class.
- args
-
A list of arguments to pass to the specified table function.
- Definition Classes
- DataFrame
- Since
-
0.4.0
-
def
join
(
func:
TableFunction
,
firstArg:
Column
,
remaining:
Column
*
)
:
DataFrame
Joins the current DataFrame with the output of the specified table function func.
To pass arguments to the table function, use the firstArg and remaining arguments of this method. In the table function arguments, you can include references to columns in this DataFrame.
For example:
// The following example uses the split_to_table function to split
// column 'a' in this DataFrame on the character ','.
// Each row in the current DataFrame will produce N rows in the resulting DataFrame,
// where N is the number of tokens in the column 'a'.
import com.snowflake.snowpark.functions._
import com.snowflake.snowpark.tableFunctions._
df.join(split_to_table, df("a"), lit(","))
- func
-
TableFunction object, which can be one of the values in the tableFunctions object or an object that you create from the TableFunction class.
- firstArg
-
The first argument to pass to the specified table function.
- remaining
-
A list of any additional arguments for the specified table function.
- Definition Classes
- DataFrame
- Since
-
0.4.0
-
def
join
(
right:
DataFrame
,
joinExprs:
Column
,
joinType:
String
)
:
DataFrame
Performs a join of the specified type (joinType) with the current DataFrame and another DataFrame (right) using the join condition specified in an expression (joinExprs).
To disambiguate columns with the same name in the left DataFrame and right DataFrame, use the apply or col method of each DataFrame (df("col") or df.col("col")). You can use this approach to disambiguate columns in the joinExprs parameter and to refer to columns in the returned DataFrame.
For example:
val dfJoin = df1.join(df2, df1("a") === df2("b"), "left")
val dfJoin2 = df1.join(df2, df1("a") === df2("b") && df1("c") === df2("d"), "outer")
val dfJoin3 = df1.join(df2, df1("a") === df2("a") && df1("b") === df2("b"), "outer")
// If both df1 and df2 contain column 'c'
val project = dfJoin3.select(df1("c") + df2("c"))
If you need to join a DataFrame with itself, keep in mind that there is no way to distinguish between columns on the left and right sides in a join expression. For example:
val dfJoined = df.join(df, df("a") === df("b"), joinType) // Column references are ambiguous
To do a self-join, you can either clone the DataFrame (clone) as follows:
val clonedDf = df.clone
val dfJoined = df.join(clonedDf, df("a") === clonedDf("b"), joinType)
or call a join method that accepts a usingColumns parameter.
- right
-
The other DataFrame to join.
- joinExprs
-
Expression that specifies the join condition.
- joinType
-
The type of join (e.g. "right", "outer", etc.).
- returns
- Definition Classes
- DataFrame
- Since
-
0.1.0
-
def
join
(
right:
DataFrame
,
joinExprs:
Column
)
:
DataFrame
Performs a default inner join of the current DataFrame and another DataFrame (right) using the join condition specified in an expression (joinExprs).
To disambiguate columns with the same name in the left DataFrame and right DataFrame, use the apply or col method of each DataFrame (df("col") or df.col("col")). You can use this approach to disambiguate columns in the joinExprs parameter and to refer to columns in the returned DataFrame.
For example:
val dfJoin = df1.join(df2, df1("a") === df2("b"))
val dfJoin2 = df1.join(df2, df1("a") === df2("b") && df1("c") === df2("d"))
val dfJoin3 = df1.join(df2, df1("a") === df2("a") && df1("b") === df2("b"))
// If both df1 and df2 contain column 'c'
val project = dfJoin3.select(df1("c") + df2("c"))
If you need to join a DataFrame with itself, keep in mind that there is no way to distinguish between columns on the left and right sides in a join expression. For example:
val dfJoined = df.join(df, df("a") === df("b")) // Column references are ambiguous
As a workaround, you can either construct the left and right DataFrames separately, or call a join method that accepts a usingColumns parameter.
- right
-
The other DataFrame to join.
- joinExprs
-
Expression that specifies the join condition.
- returns
- Definition Classes
- DataFrame
- Since
-
0.1.0
-
def
join
(
right:
DataFrame
,
usingColumns:
Seq
[
String
]
,
joinType:
String
)
:
DataFrame
Performs a join of the specified type (joinType) with the current DataFrame and another DataFrame (right) on a list of columns (usingColumns).
The method assumes that the columns in usingColumns have the same meaning in the left and right DataFrames.
For example:
val dfLeftJoin = df1.join(df2, Seq("a"), "left")
val dfOuterJoin = df1.join(df2, Seq("a", "b"), "outer")
- right
-
The other DataFrame to join.
- usingColumns
-
A list of the names of the columns to use for the join.
- joinType
-
The type of join (e.g. "right", "outer", etc.).
- returns
- Definition Classes
- DataFrame
- Since
-
0.1.0
-
def
join
(
right:
DataFrame
,
usingColumns:
Seq
[
String
]
)
:
DataFrame
Performs a default inner join of the current DataFrame and another DataFrame (right) on a list of columns (usingColumns).
The method assumes that the columns in usingColumns have the same meaning in the left and right DataFrames.
For example:
val dfJoinOnColA = df.join(df2, Seq("a"))
val dfJoinOnColAAndColB = df.join(df2, Seq("a", "b"))
- right
-
The other DataFrame to join.
- usingColumns
-
A list of the names of the columns to use for the join.
- returns
- Definition Classes
- DataFrame
- Since
-
0.1.0
-
def
join
(
right:
DataFrame
,
usingColumn:
String
)
:
DataFrame
Performs a default inner join of the current DataFrame and another DataFrame (right) on a column (usingColumn).
The method assumes that the usingColumn column has the same meaning in the left and right DataFrames.
For example:
val result = left.join(right, "a")
- right
-
The other DataFrame to join.
- usingColumn
-
The name of the column to use for the join.
- returns
- Definition Classes
- DataFrame
- Since
-
0.1.0
-
def
join
(
right:
DataFrame
)
:
DataFrame
Performs a default inner join of the current DataFrame and another DataFrame (right).
Because this method does not specify a join condition, the returned DataFrame is a cartesian product of the two DataFrames.
If the current and right DataFrames have columns with the same name, and you need to refer to one of these columns in the returned DataFrame, use the apply or col function on the current or right DataFrame to disambiguate references to these columns.
For example:
val result = left.join(right)
val project = result.select(left("common_col") + right("common_col"))
- Definition Classes
- DataFrame
- Since
-
0.1.0
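Because join(right) has no join condition, the size of the result grows as the product of the two row counts. The sketch below models that cartesian product over in-memory sequences in plain Scala; it is an illustration only, and CrossJoinSketch and crossJoin are hypothetical names, not part of Snowpark.

```scala
object CrossJoinSketch {
  // Pairs every row of `left` with every row of `right`.
  def crossJoin[L, R](left: Seq[L], right: Seq[R]): Seq[(L, R)] =
    for {
      l <- left
      r <- right
    } yield (l, r)
}
```

With 2 rows on the left and 3 on the right, the model produces 6 pairs, which is worth keeping in mind before joining two large DataFrames without a condition.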
-
def
limit
(
n:
Int
)
:
DataFrame
Returns a new DataFrame that contains at most n rows from the current DataFrame (similar to LIMIT in SQL).
-
def
log
()
:
Logger
- Attributes
- protected[ internal ]
- Definition Classes
- Logging
-
def
logDebug
(
msg:
String
,
throwable:
Throwable
)
:
Unit
- Attributes
- protected[ internal ]
- Definition Classes
- Logging
-
def
logDebug
(
msg:
String
)
:
Unit
- Attributes
- protected[ internal ]
- Definition Classes
- Logging
-
def
logError
(
msg:
String
,
throwable:
Throwable
)
:
Unit
- Attributes
- protected[ internal ]
- Definition Classes
- Logging
-
def
logError
(
msg:
String
)
:
Unit
- Attributes
- protected[ internal ]
- Definition Classes
- Logging
-
def
logInfo
(
msg:
String
,
throwable:
Throwable
)
:
Unit
- Attributes
- protected[ internal ]
- Definition Classes
- Logging
-
def
logInfo
(
msg:
String
)
:
Unit
- Attributes
- protected[ internal ]
- Definition Classes
- Logging
-
def
logTrace
(
msg:
String
,
throwable:
Throwable
)
:
Unit
- Attributes
- protected[ internal ]
- Definition Classes
- Logging
-
def
logTrace
(
msg:
String
)
:
Unit
- Attributes
- protected[ internal ]
- Definition Classes
- Logging
-
def
logWarning
(
msg:
String
,
throwable:
Throwable
)
:
Unit
- Attributes
- protected[ internal ]
- Definition Classes
- Logging
-
def
logWarning
(
msg:
String
)
:
Unit
- Attributes
- protected[ internal ]
- Definition Classes
- Logging
-
lazy val
na
:
DataFrameNaFunctions
Returns a DataFrameNaFunctions object that provides functions for handling missing values in the DataFrame.
- Definition Classes
- DataFrame
- Since
-
0.2.0
-
def
naturalJoin
(
right:
DataFrame
,
joinType:
String
)
:
DataFrame
Performs a natural join of the specified type (joinType) with the current DataFrame and another DataFrame (right).
For example:
val dfNaturalJoin = df.naturalJoin(df2, "left")
- right
-
The other DataFrame to join.
- joinType
-
The type of join (e.g. "right", "outer", etc.).
- returns
- Definition Classes
- DataFrame
- Since
-
0.1.0
-
def
naturalJoin
(
right:
DataFrame
)
:
DataFrame
Performs a natural join (a default inner join) of the current DataFrame and another DataFrame (right).
For example:
val dfNaturalJoin = df.naturalJoin(df2)
Note that this is equivalent to:
val dfNaturalJoin = df.naturalJoin(df2, "inner")
- Definition Classes
- DataFrame
- Since
-
0.1.0
-
final
def
ne
(
arg0:
AnyRef
)
:
Boolean
- Definition Classes
- AnyRef
-
final
def
notify
()
:
Unit
- Definition Classes
- AnyRef
- Annotations
- @native () @HotSpotIntrinsicCandidate ()
-
final
def
notifyAll
()
:
Unit
- Definition Classes
- AnyRef
- Annotations
- @native () @HotSpotIntrinsicCandidate ()
-
def
pivot
(
pivotColumn:
Column
,
values:
Seq
[
Any
]
)
:
RelationalGroupedDataFrame
Rotates this DataFrame by turning the unique values from one column in the input expression into multiple columns and aggregating results where required on any remaining column values.
Only one aggregate is supported with pivot.
For example:
val dfPivoted = df.pivot(col("col_1"), Seq(1,2,3)).agg(sum(col("col_2")))
- pivotColumn
-
Expression for the column that you want to use.
- values
-
A list of values in the column.
- returns
- Definition Classes
- DataFrame
- Since
-
0.1.0
-
def
pivot
(
pivotColumn:
String
,
values:
Seq
[
Any
]
)
:
RelationalGroupedDataFrame
Rotates this DataFrame by turning the unique values from one column in the input expression into multiple columns and aggregating results where required on any remaining column values.
Only one aggregate is supported with pivot.
For example:
val dfPivoted = df.pivot("col_1", Seq(1,2,3)).agg(sum(col("col_2")))
- pivotColumn
-
The name of the column to use.
- values
-
A list of values in the column.
- returns
- Definition Classes
- DataFrame
- Since
-
0.1.0
-
def
randomSplit
(
weights:
Array
[
Double
]
)
:
Array
[
DataFrame
]
Randomly splits the current DataFrame into separate DataFrames, using the specified weights.
NOTE:
- If only one weight is specified, the returned DataFrame array only includes the current DataFrame.
- If multiple weights are specified, the current DataFrame will be cached before being split.
- weights
-
Weights to use for splitting the DataFrame. If the weights don't add up to 1, the weights will be normalized.
- returns
-
A list of DataFrame objects
- Definition Classes
- DataFrame
- Since
-
0.2.0
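The weights parameter is described as being normalized when the values do not add up to 1. The page does not say how Snowpark does this internally; the sketch below assumes the usual meaning, dividing each weight by the sum of all weights. It is plain Scala, and WeightNormalizationSketch and normalize are hypothetical names. The positivity check is also an assumption of this sketch.

```scala
object WeightNormalizationSketch {
  // Scales the weights so that they sum to 1 while keeping their ratios.
  // Assumption: every weight is strictly positive.
  def normalize(weights: Array[Double]): Array[Double] = {
    require(weights.nonEmpty && weights.forall(_ > 0.0), "weights must be positive")
    val total = weights.sum
    weights.map(_ / total)
  }
}
```

Under this assumption, weights of 2.0 and 6.0 behave like 0.25 and 0.75, so a call such as df.randomSplit(Array(2.0, 6.0)) would be expected to put roughly a quarter of the rows in the first DataFrame.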
-
def
rename
(
newName:
String
,
col:
Column
)
:
DataFrame
Returns a DataFrame with the specified column col renamed as newName.
This example renames the column A as NEW_A in the DataFrame.
val df = session.sql("select 1 as A, 2 as B")
val dfRenamed = df.rename("NEW_A", col("A"))
- Definition Classes
- DataFrame
- Since
-
0.9.0
-
def
rollup
(
cols:
Array
[
String
]
)
:
RelationalGroupedDataFrame
Performs an SQL GROUP BY ROLLUP on the DataFrame.
- cols
-
An array of column names.
- returns
- Definition Classes
- DataFrame
- Since
-
0.7.0
-
def
rollup
(
cols:
Seq
[
String
]
)
:
RelationalGroupedDataFrame
Performs an SQL GROUP BY ROLLUP on the DataFrame.
- cols
-
A list of column names.
- returns
- Definition Classes
- DataFrame
- Since
-
0.2.0
-
def
rollup
(
first:
String
,
remaining:
String
*
)
:
RelationalGroupedDataFrame
Performs an SQL GROUP BY ROLLUP on the DataFrame.
- first
-
The name of the first column.
- remaining
-
A list of the names of additional columns.
- returns
- Definition Classes
- DataFrame
- Since
-
0.1.0
-
def
rollup
(
cols:
Array
[
Column
]
)
:
RelationalGroupedDataFrame
Performs an SQL GROUP BY ROLLUP on the DataFrame.
- cols
-
An array of expressions on columns.
- returns
- Definition Classes
- DataFrame
- Since
-
0.7.0
-
def
rollup
[
T
]
(
cols:
Seq
[
Column
]
)
(
implicit
arg0:
ClassTag
[
T
]
)
:
RelationalGroupedDataFrame
Performs an SQL GROUP BY ROLLUP on the DataFrame.
- cols
-
A list of expressions on columns.
- returns
- Definition Classes
- DataFrame
- Since
-
0.2.0
-
def
rollup
(
first:
Column
,
remaining:
Column
*
)
:
RelationalGroupedDataFrame
Performs an SQL GROUP BY ROLLUP on the DataFrame.
- first
-
The expression for the first column.
- remaining
-
A list of expressions for additional columns.
- returns
- Definition Classes
- DataFrame
- Since
-
0.1.0
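The rollup entries above only name the SQL clause. In standard SQL, GROUP BY ROLLUP over a list of columns groups by every prefix of that list, from the full list down to the empty list (the grand total). The following plain-Scala sketch computes those prefixes; RollupSketch and rollupGroupingSets are hypothetical names used for illustration, not Snowpark API.

```scala
object RollupSketch {
  // Lists the grouping sets that ROLLUP(cols) stands for:
  // all prefixes of `cols`, longest first, ending with the empty set.
  def rollupGroupingSets(cols: Seq[String]): List[Seq[String]] =
    cols.inits.toList
}
```

For the columns "a" and "b" this yields the grouping sets (a, b), (a), and (), so a rollup over n columns produces n + 1 levels of aggregation.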
-
def
sample
(
probabilityFraction:
Double
)
:
DataFrame
Returns a new DataFrame that contains a sampling of rows from the current DataFrame.
NOTE:
- The number of rows returned may be close to (but not exactly equal to) (probabilityFraction * totalRowCount).
- The Snowflake SAMPLE function supports specifying 'probability' as a percentage number. The range of 'probability' is [0.0, 100.0]. The conversion formula is probability = probabilityFraction * 100.
- probabilityFraction
-
The fraction of rows to sample. This must be in the range of 0.0 to 1.0.
- returns
-
A DataFrame containing the sample of rows.
- Definition Classes
- DataFrame
- Since
-
0.2.0
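The conversion between probabilityFraction and the SAMPLE percentage, and the approximate row count mentioned in the note, can be written out as follows. This is a plain-Scala sketch of the arithmetic only; SampleProbabilitySketch and its two functions are hypothetical names, and the range check simply mirrors the documented range of 0.0 to 1.0.

```scala
object SampleProbabilitySketch {
  // probability = probabilityFraction * 100, as stated in the note above.
  def toSamplePercentage(probabilityFraction: Double): Double = {
    require(probabilityFraction >= 0.0 && probabilityFraction <= 1.0,
      "probabilityFraction must be in the range 0.0 to 1.0")
    probabilityFraction * 100
  }

  // The row count that the sample size is expected to be close to.
  def expectedRowCount(probabilityFraction: Double, totalRowCount: Long): Double =
    probabilityFraction * totalRowCount
}
```

A fraction of 0.25 corresponds to a SAMPLE percentage of 25.0, and on a 1,000-row DataFrame the result should contain roughly 250 rows, though not necessarily exactly that many.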
-
-
def
sample
(
num:
Long
)
:
DataFrame
Returns a new DataFrame with a sample of N rows from the underlying DataFrame.
NOTE:
- If the row count in the DataFrame is larger than the requested number of rows, the method returns a DataFrame containing the number of requested rows.
- If the row count in the DataFrame is smaller than the requested number of rows, the method returns a DataFrame containing all rows.
- num
-
The number of rows to sample in the range of 0 to 1,000,000.
- returns
-
A DataFrame containing the sample of
num
rows.
- Definition Classes
- DataFrame
- Since
-
0.2.0
-
lazy val
schema
:
StructType
Returns the definition of the columns in this DataFrame (the "relational schema" for the DataFrame).
- Definition Classes
- DataFrame
- Since
-
0.1.0
-
def
select
(
columns:
Array
[
String
]
)
:
DataFrame
Returns a new DataFrame with a subset of named columns (similar to SELECT in SQL).
-
def
select
(
columns:
Seq
[
String
]
)
:
DataFrame
Returns a new DataFrame with a subset of named columns (similar to SELECT in SQL).
-
def
select
(
first:
String
,
remaining:
String
*
)
:
DataFrame
Returns a new DataFrame with a subset of named columns (similar to SELECT in SQL).
Returns a new DataFrame with a subset of named columns (similar to SELECT in SQL).
For example:
val dfSelected = df.select("col1", "col2", "col3")
- first
-
The name of the first column to return.
- remaining
-
A list of the names of the additional columns to return.
- returns
- Definition Classes
- DataFrame
- Since
-
0.1.0
-
def
select
(
columns:
Array
[
Column
]
)
:
DataFrame
Returns a new DataFrame with the specified Column expressions as output (similar to SELECT in SQL). Only the Columns specified as arguments will be present in the resulting DataFrame.
You can use any Column expression.
For example:
val dfSelected = df.select(Array(df.col("col1"), lit("abc"), df.col("col1") + df.col("col2")))
- columns
-
An array of expressions for the columns to return.
- returns
- Definition Classes
- DataFrame
- Since
-
0.7.0
-
def
select
[
T
]
(
columns:
Seq
[
Column
]
)
(
implicit
arg0:
ClassTag
[
T
]
)
:
DataFrame
Returns a new DataFrame with the specified Column expressions as output (similar to SELECT in SQL). Only the Columns specified as arguments will be present in the resulting DataFrame.
You can use any Column expression.
For example:
val dfSelected = df.select(Seq($"col1", substring($"col2", 0, 10), df("col3") + df("col4")))
- columns
-
A list of expressions for the columns to return.
- returns
- Definition Classes
- DataFrame
- Since
-
0.2.0
-
def
select
(
first:
Column
,
remaining:
Column
*
)
:
DataFrame
Returns a new DataFrame with the specified Column expressions as output (similar to SELECT in SQL). Only the Columns specified as arguments will be present in the resulting DataFrame.
You can use any Column expression.
For example:
val dfSelected = df.select($"col1", substring($"col2", 0, 10), df("col3") + df("col4"))
- first
-
The expression for the first column to return.
- remaining
-
A list of expressions for the additional columns to return.
- returns
- Definition Classes
- DataFrame
- Since
-
0.1.0
-
def
show
(
n:
Int
,
maxWidth:
Int
)
:
Unit
Evaluates this DataFrame and prints out the first n rows with the specified maximum number of characters per column.
- n
-
The number of rows to print out.
- maxWidth
-
The maximum number of characters to print out for each column. If the number of characters exceeds the maximum, the method prints out an ellipsis (...) at the end of the column.
- Definition Classes
- DataFrame
- Since
-
0.5.0
-
def
show
(
n:
Int
)
:
Unit
Evaluates this DataFrame and prints out the first n rows.
- n
-
The number of rows to print out.
- Definition Classes
- DataFrame
- Since
-
0.1.0
-
def
show
()
:
Unit
Evaluates this DataFrame and prints out the first ten rows.
- Definition Classes
- DataFrame
- Since
-
0.1.0
-
def
sort
(
sortExprs:
Array
[
Column
]
)
:
DataFrame
Sorts a DataFrame by the specified expressions (similar to ORDER BY in SQL).
-
def
sort
(
sortExprs:
Seq
[
Column
]
)
:
DataFrame
Sorts a DataFrame by the specified expressions (similar to ORDER BY in SQL).
-
def
sort
(
first:
Column
,
remaining:
Column
*
)
:
DataFrame
Sorts a DataFrame by the specified expressions (similar to ORDER BY in SQL).
For example:
val dfSorted = df.sort($"colA", $"colB".asc)
- first
-
The first Column expression for sorting the DataFrame.
- remaining
-
Additional Column expressions for sorting the DataFrame.
- returns
- Definition Classes
- DataFrame
- Since
-
0.1.0
-
lazy val
stat
:
DataFrameStatFunctions
Returns a DataFrameStatFunctions object that provides statistic functions.
- Definition Classes
- DataFrame
- Since
-
0.2.0
-
final
def
synchronized
[
T0
]
(
arg0: ⇒
T0
)
:
T0
- Definition Classes
- AnyRef
-
def
toDF
(
colNames:
Array
[
String
]
)
:
DataFrame
Creates a new DataFrame containing the data in the current DataFrame but in columns with the specified names.
You can use this method to assign column names when constructing a DataFrame. For example:
val df = session.createDataFrame(Seq((1, 2), (3, 4))).toDF(Array("a", "b"))
This returns a DataFrame containing the following:
-------------
|"A"  |"B"  |
-------------
|1    |2    |
|3    |4    |
-------------
If you imported <session_var>.implicits._, you can use the following syntax to create the DataFrame from a Seq and call toDF to assign column names to the returned DataFrame:
import mysession.implicits._
var df = Seq((1, 2), (3, 4)).toDF(Array("a", "b"))
The number of column names that you pass in must match the number of columns in the current DataFrame.
- colNames
-
An array of column names.
- returns
- Definition Classes
- DataFrame
- Since
-
0.7.0
-
def
toDF
(
colNames:
Seq
[
String
]
)
:
DataFrame
Creates a new DataFrame containing the data in the current DataFrame but in columns with the specified names.
You can use this method to assign column names when constructing a DataFrame. For example:
var df = session.createDataFrame(Seq((1, 2), (3, 4))).toDF(Seq("a", "b"))
This returns a DataFrame containing the following:
-------------
|"A"  |"B"  |
-------------
|1    |2    |
|3    |4    |
-------------
If you imported <session_var>.implicits._, you can use the following syntax to create the DataFrame from a Seq and call toDF to assign column names to the returned DataFrame:
import mysession.implicits._
var df = Seq((1, 2), (3, 4)).toDF(Seq("a", "b"))
The number of column names that you pass in must match the number of columns in the current DataFrame.
- colNames
-
A list of column names.
- returns
- Definition Classes
- DataFrame
- Since
-
0.2.0
-
def
toDF
(
first:
String
,
remaining:
String
*
)
:
DataFrame
Creates a new DataFrame containing the columns with the specified names.
You can use this method to assign column names when constructing a DataFrame. For example:
var df = session.createDataFrame(Seq((1, 2), (3, 4))).toDF("a", "b")
This returns a DataFrame containing the following:
-------------
|"A"  |"B"  |
-------------
|1    |2    |
|3    |4    |
-------------
If you imported <session_var>.implicits._ , you can use the following syntax to create the DataFrame from a Seq and call toDF to assign column names to the returned DataFrame:
import mysession.implicits._
var df = Seq((1, 2), (3, 4)).toDF("a", "b")
The number of column names that you pass in must match the number of columns in the current DataFrame.
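The three toDF overloads documented here can be compared side by side. This is a sketch only: it assumes an already-configured Session is passed in, and the helper name namedFrames is hypothetical.

```scala
import com.snowflake.snowpark.{DataFrame, Session}

// Hypothetical helper: `session` is assumed to be an already-configured Session.
def namedFrames(session: Session): Seq[DataFrame] = {
  val data = Seq((1, 2), (3, 4))
  Seq(
    // Varargs overload: the first name, then the remaining names.
    session.createDataFrame(data).toDF("a", "b"),
    // Seq overload.
    session.createDataFrame(data).toDF(Seq("a", "b")),
    // Array overload.
    session.createDataFrame(data).toDF(Array("a", "b"))
  )
}
```

In each call, two names are passed for a two-column DataFrame, matching the rule that the number of names must equal the number of columns.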
- first
-
The name of the first column.
- remaining
-
A list of the rest of the column names.
- returns
- Definition Classes
- DataFrame
- Since
-
0.1.0
-
def
toLocalIterator
:
Iterator
[
Row
]
Executes the query representing this DataFrame and returns an iterator of Row objects that you can use to retrieve the results.
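As a sketch of consuming the iterator, the hypothetical helper below reads at most n rows on the client by using the standard Scala Iterator methods. It only limits how many rows are read from the iterator and makes no claim about how much work the query itself performs.

```scala
import com.snowflake.snowpark.{DataFrame, Row}

// Hypothetical helper: returns up to `n` rows from the result of `df`.
def firstRows(df: DataFrame, n: Int): List[Row] =
  // take(n) stops pulling rows from the iterator once n rows have been read.
  df.toLocalIterator.take(n).toList
```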
-
def
toString
()
:
String
- Definition Classes
- AnyRef → Any
-
def
union
(
other:
DataFrame
)
:
DataFrame
Returns a new DataFrame that contains all the rows in the current DataFrame and another DataFrame (other), excluding any duplicate rows. Both input DataFrames must contain the same number of columns.
For example:
val df1and2 = df1.union(df2)
- Definition Classes
- DataFrame
- Since
-
0.1.0
-
def
unionAll
(
other:
DataFrame
)
:
DataFrame
Returns a new DataFrame that contains all the rows in the current DataFrame and another DataFrame (other), including any duplicate rows. Both input DataFrames must contain the same number of columns.
For example:
val df1and2 = df1.unionAll(df2)
- Definition Classes
- DataFrame
- Since
-
0.1.0
-
def
unionAllByName
(
other:
DataFrame
)
:
DataFrame
Returns a new DataFrame that contains all the rows in the current DataFrame and another DataFrame (other), including any duplicate rows.
This method matches the columns in the two DataFrames by their names, not by their positions. The columns in the other DataFrame are rearranged to match the order of columns in the current DataFrame.
For example:
val df1and2 = df1.unionAllByName(df2)
- Definition Classes
- DataFrame
- Since
-
0.9.0
-
def
unionByName
(
other:
DataFrame
)
:
DataFrame
Returns a new DataFrame that contains all the rows in the current DataFrame and another DataFrame (other), excluding any duplicate rows.
This method matches the columns in the two DataFrames by their names, not by their positions. The columns in the other DataFrame are rearranged to match the order of columns in the current DataFrame.
For example:
val df1and2 = df1.unionByName(df2)
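The difference between positional and name-based matching is easiest to see when the two DataFrames declare the same columns in a different order. The sketch below assumes an already-configured Session. The helper name unionVariants and the sample values are illustrative only.

```scala
import com.snowflake.snowpark.{DataFrame, Session}

// Hypothetical helper: `session` is assumed to be an already-configured Session.
def unionVariants(session: Session): (DataFrame, DataFrame) = {
  // Same column names, declared in a different order.
  val df1 = session.createDataFrame(Seq((1, 2))).toDF("a", "b")
  val df2 = session.createDataFrame(Seq((2, 1))).toDF("b", "a")

  // Positional: the first column of df2 ("b") lines up with "a" of df1.
  val byPosition = df1.unionAll(df2)
  // By name: df2 is rearranged to (a, b) before the rows are combined.
  val byName = df1.unionAllByName(df2)
  (byPosition, byName)
}
```

Under these assumptions, byPosition should hold the rows (1, 2) and (2, 1), while byName should hold (1, 2) twice. Replacing unionAllByName with unionByName would then be expected to drop the duplicate.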
- Definition Classes
- DataFrame
- Since
-
0.1.0
-
final
def
wait
(
arg0:
Long
,
arg1:
Int
)
:
Unit
- Definition Classes
- AnyRef
- Annotations
- @throws ( ... )
-
final
def
wait
(
arg0:
Long
)
:
Unit
- Definition Classes
- AnyRef
- Annotations
- @throws ( ... ) @native ()
-
final
def
wait
()
:
Unit
- Definition Classes
- AnyRef
- Annotations
- @throws ( ... )
-
def
where
(
condition:
Column
)
:
DataFrame
Filters rows based on the specified conditional expression (similar to WHERE in SQL). This is equivalent to calling filter.
For example:
// The following two result in the same SQL query:
pricesDF.filter($"price" > 100)
pricesDF.where($"price" > 100)
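Conditions can also be combined into a single Column expression before they are passed to where. The following sketch assumes a DataFrame with a numeric price column. The helper name priceBand and the bounds 100 and 500 are hypothetical.

```scala
import com.snowflake.snowpark.DataFrame
import com.snowflake.snowpark.functions.{col, lit}

// Hypothetical helper: keeps rows whose price is above 100 and at most 500.
def priceBand(pricesDF: DataFrame): DataFrame =
  pricesDF.where((col("price") > lit(100)) && (col("price") <= lit(500)))
```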
- condition
-
Filter condition defined as an expression on columns.
- returns
-
A filtered DataFrame
- Definition Classes
- DataFrame
- Since
-
0.1.0
-
def
withColumn
(
colName:
String
,
col:
Column
)
:
DataFrame
Returns a DataFrame with an additional column with the specified name (colName). The column is computed by using the specified expression (col).
If a column with the same name already exists in the DataFrame, that column is replaced by the new column.
This example adds a new column named mean_price that contains the mean of the existing price column in the DataFrame.
val dfWithMeanPriceCol = df.withColumn("mean_price", mean($"price"))
- colName
-
The name of the column to add or replace.
- col
-
The Column to add or replace.
- returns
- Definition Classes
- DataFrame
- Since
-
0.1.0
-
def
withColumns
(
colNames:
Seq
[
String
]
,
values:
Seq
[
Column
]
)
:
DataFrame
Returns a DataFrame with additional columns with the specified names (colNames). The columns are computed by using the specified expressions (values).
If columns with the same names already exist in the DataFrame, those columns are replaced by the new columns.
This example adds new columns named mean_price and avg_price that contain the mean and average of the existing price column.
val dfWithAddedColumns = df.withColumns(
  Seq("mean_price", "avg_price"),
  Seq(mean($"price"), avg($"price")))
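Because colNames and values are paired by position, one way to keep them aligned is to declare each name next to its expression and split the pairs afterwards. This is a sketch under assumptions: the price column, the new column names, and the 1.08 multiplier are all illustrative, and addPriceColumns is a hypothetical helper.

```scala
import com.snowflake.snowpark.DataFrame
import com.snowflake.snowpark.functions.{col, lit}

// Hypothetical helper: `df` is assumed to have a numeric column named price.
def addPriceColumns(df: DataFrame): DataFrame = {
  // Each name sits beside the expression that computes it.
  val newColumns = Seq(
    "price_with_tax" -> (col("price") * lit(1.08)),
    "is_expensive" -> (col("price") > lit(100))
  )
  // unzip yields two sequences of equal length, in matching order.
  val (names, values) = newColumns.unzip
  df.withColumns(names, values)
}
```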
- colNames
-
A list of the names of the columns to add or replace.
- values
-
A list of the Column objects to add or replace.
- returns
- Definition Classes
- DataFrame
- Since
-
0.1.0
-
def
withPlan
(
plan:
LogicalPlan
)
:
DataFrame
- Attributes
- protected
- Definition Classes
- DataFrame
- Annotations
- @inline ()
-
def
write
:
DataFrameWriter
Returns a DataFrameWriter object that you can use to write the data in the DataFrame to any supported destination. The default SaveMode for the returned DataFrameWriter is Append.
Example:
df.write.saveAsTable("table1")
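Since the writer defaults to Append, replacing the contents of an existing table requires setting the mode explicitly. A minimal sketch, assuming the caller supplies the table name (replaceTable is a hypothetical helper):

```scala
import com.snowflake.snowpark.{DataFrame, SaveMode}

// Hypothetical helper: overwrites `tableName` with the rows of `df`
// instead of appending to it.
def replaceTable(df: DataFrame, tableName: String): Unit =
  df.write.mode(SaveMode.Overwrite).saveAsTable(tableName)
```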
- returns
- Definition Classes
- DataFrame
- Since
-
0.1.0
Deprecated Value Members
-
def
finalize
()
:
Unit
- Attributes
- protected[ lang ]
- Definition Classes
- AnyRef
- Annotations
- @throws ( classOf[java.lang.Throwable] ) @Deprecated
- Deprecated