snowflake.snowpark.DataFrame¶
- class snowflake.snowpark.DataFrame(session: Session | None = None, plan: LogicalPlan | None = None, is_cached: bool = False)[source]¶
Bases:
object
Represents a lazily-evaluated relational dataset that contains a collection of
Row
objects with columns defined by a schema (column name and type).A DataFrame is considered lazy because it encapsulates the computation or query required to produce a relational dataset. The computation is not performed until you call a method that performs an action (e.g.
collect()
).Creating a DataFrame
You can create a DataFrame in a number of different ways, as shown in the examples below.
- Creating tables and data to run the sample code:
>>> session.sql("create or replace temp table prices(product_id varchar, amount number(10, 2))").collect() [Row(status='Table PRICES successfully created.')] >>> session.sql("insert into prices values ('id1', 10.0), ('id2', 20.0)").collect() [Row(number of rows inserted=2)] >>> # Create a CSV file to demo load >>> import tempfile >>> with tempfile.NamedTemporaryFile(mode="w+t") as t: ... t.writelines(["id1, Product A", "\n" "id2, Product B"]) ... t.flush() ... create_stage_result = session.sql("create temp stage test_stage").collect() ... put_result = session.file.put(t.name, "@test_stage/test_dir")
- Example 1
Creating a DataFrame by reading a table in Snowflake:
>>> df_prices = session.table("prices")
- Example 2
Creating a DataFrame by reading files from a stage:
>>> from snowflake.snowpark.types import StructType, StructField, IntegerType, StringType >>> df_catalog = session.read.schema(StructType([StructField("id", StringType()), StructField("name", StringType())])).csv("@test_stage/test_dir") >>> df_catalog.show() --------------------- |"ID" |"NAME" | --------------------- |id1 | Product A | |id2 | Product B | ---------------------
- Example 3
Creating a DataFrame by specifying a sequence or a range:
>>> session.create_dataframe([(1, "one"), (2, "two")], schema=["col_a", "col_b"]).show() --------------------- |"COL_A" |"COL_B" | --------------------- |1 |one | |2 |two | --------------------- >>> session.range(1, 10, 2).to_df("col1").show() ---------- |"COL1" | ---------- |1 | |3 | |5 | |7 | |9 | ----------
- Example 4
Create a new DataFrame by applying transformations to other existing DataFrames:
>>> df_merged_data = df_catalog.join(df_prices, df_catalog["id"] == df_prices["product_id"])
Performing operations on a DataFrame
Broadly, the operations on DataFrame can be divided into two types:
Transformations produce a new DataFrame from one or more existing DataFrames. Note that transformations are lazy and don’t cause the DataFrame to be evaluated. If the API does not provide a method to express the SQL that you want to use, you can use
functions.sqlExpr()
as a workaround.Actions cause the DataFrame to be evaluated. When you call a method that performs an action, Snowpark sends the SQL query for the DataFrame to the server for evaluation.
Transforming a DataFrame
The following examples demonstrate how you can transform a DataFrame.
- Example 5
Using the
select()
method to select the columns that should be in the DataFrame (similar to adding aSELECT
clause):>>> # Return a new DataFrame containing the product_id and amount columns of the prices table. >>> # This is equivalent to: SELECT PRODUCT_ID, AMOUNT FROM PRICES; >>> df_price_ids_and_amounts = df_prices.select(col("product_id"), col("amount"))
- Example 6
Using the
Column.as_()
method to rename a column in a DataFrame (similar to usingSELECT col AS alias
):>>> # Return a new DataFrame containing the product_id column of the prices table as a column named >>> # item_id. This is equivalent to: SELECT PRODUCT_ID AS ITEM_ID FROM PRICES; >>> df_price_item_ids = df_prices.select(col("product_id").as_("item_id"))
- Example 7
Using the
filter()
method to filter data (similar to adding aWHERE
clause):>>> # Return a new DataFrame containing the row from the prices table with the ID 1. >>> # This is equivalent to: >>> # SELECT * FROM PRICES WHERE PRODUCT_ID = 1; >>> df_price1 = df_prices.filter((col("product_id") == 1))
- Example 8
Using the
sort()
method to specify the sort order of the data (similar to adding anORDER BY
clause):>>> # Return a new DataFrame for the prices table with the rows sorted by product_id. >>> # This is equivalent to: SELECT * FROM PRICES ORDER BY PRODUCT_ID; >>> df_sorted_prices = df_prices.sort(col("product_id"))
- Example 9
Using
agg()
method to aggregate results.>>> import snowflake.snowpark.functions as f >>> df_prices.agg(("amount", "sum")).collect() [Row(SUM(AMOUNT)=Decimal('30.00'))] >>> df_prices.agg(f.sum("amount")).collect() [Row(SUM(AMOUNT)=Decimal('30.00'))] >>> # rename the aggregation column name >>> df_prices.agg(f.sum("amount").alias("total_amount"), f.max("amount").alias("max_amount")).collect() [Row(TOTAL_AMOUNT=Decimal('30.00'), MAX_AMOUNT=Decimal('20.00'))]
- Example 10
Using the
group_by()
method to return aRelationalGroupedDataFrame
that you can use to group and aggregate results (similar to adding aGROUP BY
clause).RelationalGroupedDataFrame
provides methods for aggregating results, including:RelationalGroupedDataFrame.avg()
(equivalent to AVG(column))RelationalGroupedDataFrame.count()
(equivalent to COUNT())RelationalGroupedDataFrame.max()
(equivalent to MAX(column))RelationalGroupedDataFrame.median()
(equivalent to MEDIAN(column))RelationalGroupedDataFrame.min()
(equivalent to MIN(column))RelationalGroupedDataFrame.sum()
(equivalent to SUM(column))
>>> # Return a new DataFrame for the prices table that computes the sum of the prices by >>> # category. This is equivalent to: >>> # SELECT CATEGORY, SUM(AMOUNT) FROM PRICES GROUP BY CATEGORY >>> df_total_price_per_category = df_prices.group_by(col("product_id")).sum(col("amount")) >>> # Have multiple aggregation values with the group by >>> import snowflake.snowpark.functions as f >>> df_summary = df_prices.group_by(col("product_id")).agg(f.sum(col("amount")).alias("total_amount"), f.avg("amount")) >>> df_summary.show() ------------------------------------------------- |"PRODUCT_ID" |"TOTAL_AMOUNT" |"AVG(AMOUNT)" | ------------------------------------------------- |id1 |10.00 |10.00000000 | |id2 |20.00 |20.00000000 | -------------------------------------------------
- Example 11
Using windowing functions. Refer to
Window
for more details.>>> from snowflake.snowpark import Window >>> from snowflake.snowpark.functions import row_number >>> df_prices.with_column("price_rank", row_number().over(Window.order_by(col("amount").desc()))).show() ------------------------------------------ |"PRODUCT_ID" |"AMOUNT" |"PRICE_RANK" | ------------------------------------------ |id2 |20.00 |1 | |id1 |10.00 |2 | ------------------------------------------
- Example 12
Handling missing values. Refer to
DataFrameNaFunctions
for more details.>>> df = session.create_dataframe([[1, None, 3], [4, 5, None]], schema=["a", "b", "c"]) >>> df.na.fill({"b": 2, "c": 6}).show() ------------------- |"A" |"B" |"C" | ------------------- |1 |2 |3 | |4 |5 |6 | -------------------
Performing an action on a DataFrame
The following examples demonstrate how you can perform an action on a DataFrame.
- Example 13
Performing a query and returning an array of Rows:
>>> df_prices.collect() [Row(PRODUCT_ID='id1', AMOUNT=Decimal('10.00')), Row(PRODUCT_ID='id2', AMOUNT=Decimal('20.00'))]
- Example 14
Performing a query and print the results:
>>> df_prices.show() --------------------------- |"PRODUCT_ID" |"AMOUNT" | --------------------------- |id1 |10.00 | |id2 |20.00 | ---------------------------
- Example 15
Calculating statistics values. Refer to
DataFrameStatFunctions
for more details.>>> df = session.create_dataframe([[1, 2], [3, 4], [5, -1]], schema=["a", "b"]) >>> df.stat.corr("a", "b") -0.5960395606792697
- Example 16
Performing a query asynchronously and returning a list of
Row
objects:>>> df = session.create_dataframe([[float(4), 3, 5], [2.0, -4, 7], [3.0, 5, 6], [4.0, 6, 8]], schema=["a", "b", "c"]) >>> async_job = df.collect_nowait() >>> async_job.result() [Row(A=4.0, B=3, C=5), Row(A=2.0, B=-4, C=7), Row(A=3.0, B=5, C=6), Row(A=4.0, B=6, C=8)]
- Example 17
Performing a query and transforming it into
pandas.DataFrame
asynchronously:>>> async_job = df.to_pandas(block=False) >>> async_job.result() A B C 0 4.0 3 5 1 2.0 -4 7 2 3.0 5 6 3 4.0 6 8
Methods
agg
(*exprs)Aggregate the data in the DataFrame.
approxQuantile
(col, percentile, *[, ...])For a specified numeric column and a list of desired quantiles, returns an approximate value for the column at each of the desired quantiles.
approx_quantile
(col, percentile, *[, ...])For a specified numeric column and a list of desired quantiles, returns an approximate value for the column at each of the desired quantiles.
cache_result
(*[, statement_params])Caches the content of this DataFrame to create a new cached Table DataFrame.
col
(col_name)Returns a reference to a column in the DataFrame.
collect
()Executes the query representing this DataFrame and returns the result as a list of
Row
objects.collect_nowait
(*[, statement_params, ...])Executes the query representing this DataFrame asynchronously and returns: class:AsyncJob.
copy_into_table
(table_name, *[, files, ...])Executes a COPY INTO <table> command to load data from files in a stage location into a specified table.
corr
(col1, col2, *[, statement_params])Calculates the correlation coefficient for non-null pairs in two numeric columns.
count
()Executes the query representing this DataFrame and returns the number of rows in the result (similar to the COUNT function in SQL).
cov
(col1, col2, *[, statement_params])Calculates the sample covariance for non-null pairs in two numeric columns.
createOrReplaceTempView
(name, *[, ...])Creates a temporary view that returns the same results as this DataFrame.
createOrReplaceView
(name, *[, statement_params])Creates a view that captures the computation expressed by this DataFrame.
create_or_replace_dynamic_table
(name, *, ...)Creates a dynamic table that captures the computation expressed by this DataFrame.
create_or_replace_temp_view
(name, *[, ...])Creates a temporary view that returns the same results as this DataFrame.
create_or_replace_view
(name, *[, ...])Creates a view that captures the computation expressed by this DataFrame.
crossJoin
(right, *[, lsuffix, rsuffix])Performs a cross join, which returns the Cartesian product of the current
DataFrame
and anotherDataFrame
(right
).cross_join
(right, *[, lsuffix, rsuffix])Performs a cross join, which returns the Cartesian product of the current
DataFrame
and anotherDataFrame
(right
).crosstab
(col1, col2, *[, statement_params])Computes a pair-wise frequency table (a
contingency table
) for the specified columns.cube
(*cols)Performs a SQL GROUP BY CUBE.
describe
(*cols)Computes basic statistics for numeric columns, which includes
count
,mean
,stddev
,min
, andmax
.distinct
()Returns a new DataFrame that contains only the rows with distinct values from the current DataFrame.
drop
(*cols)Returns a new DataFrame that excludes the columns with the specified names from the output.
dropDuplicates
(*subset)Creates a new DataFrame by removing duplicated rows on given subset of columns.
drop_duplicates
(*subset)Creates a new DataFrame by removing duplicated rows on given subset of columns.
dropna
([how, thresh, subset])Returns a new DataFrame that excludes all rows containing fewer than a specified number of non-null and non-NaN values in the specified columns.
except_
(other)Returns a new DataFrame that contains all the rows from the current DataFrame except for the rows that also appear in the
other
DataFrame.explain
()Prints the list of queries that will be executed to evaluate this DataFrame.
fillna
(value[, subset])Returns a new DataFrame that replaces all null and NaN values in the specified columns with the values provided.
filter
(expr)Filters rows based on the specified conditional expression (similar to WHERE in SQL).
first
()Executes the query representing this DataFrame and returns the first
n
rows of the results.flatten
(input[, path, outer, recursive, mode])Flattens (explodes) compound values into multiple rows.
groupBy
(*cols)Groups rows by the columns specified by expressions (similar to GROUP BY in SQL).
group_by
(*cols)Groups rows by the columns specified by expressions (similar to GROUP BY in SQL).
group_by_grouping_sets
(*grouping_sets)Performs a SQL GROUP BY GROUPING SETS.
intersect
(other)Returns a new DataFrame that contains the intersection of rows from the current DataFrame and another DataFrame (
other
).join
(right[, on, how, lsuffix, rsuffix])Performs a join of the specified type (
how
) with the current DataFrame and another DataFrame (right
) on a list of columns (on
).join_table_function
(func, *func_arguments, ...)Lateral joins the current DataFrame with the output of the specified table function.
limit
(n[, offset])Returns a new DataFrame that contains at most
n
rows from the current DataFrame, skippingoffset
rows from the beginning (similar to LIMIT and OFFSET in SQL).minus
(other)Returns a new DataFrame that contains all the rows from the current DataFrame except for the rows that also appear in the
other
DataFrame.natural_join
(right[, how])Performs a natural join of the specified type (
how
) with the current DataFrame and another DataFrame (right
).orderBy
(*cols[, ascending])Sorts a DataFrame by the specified expressions (similar to ORDER BY in SQL).
order_by
(*cols[, ascending])Sorts a DataFrame by the specified expressions (similar to ORDER BY in SQL).
pivot
(pivot_col, values)Rotates this DataFrame by turning the unique values from one column in the input expression into multiple columns and aggregating results where required on any remaining column values.
randomSplit
(weights[, seed, statement_params])Randomly splits the current DataFrame into separate DataFrames, using the specified weights.
random_split
(weights[, seed, statement_params])Randomly splits the current DataFrame into separate DataFrames, using the specified weights.
rename
(existing, new)Returns a DataFrame with the specified column
existing
renamed asnew
.replace
(to_replace[, value, subset])Returns a new DataFrame that replaces values in the specified columns.
rollup
(*cols)Performs a SQL GROUP BY ROLLUP.
sample
([frac, n])Samples rows based on either the number of rows to be returned or a percentage of rows to be returned.
sampleBy
(col, fractions)Returns a DataFrame containing a stratified sample without replacement, based on a
dict
that specifies the fraction for each stratum.sample_by
(col, fractions)Returns a DataFrame containing a stratified sample without replacement, based on a
dict
that specifies the fraction for each stratum.select
(*cols)Returns a new DataFrame with the specified Column expressions as output (similar to SELECT in SQL).
selectExpr
(*exprs)Projects a set of SQL expressions and returns a new
DataFrame
.select_expr
(*exprs)Projects a set of SQL expressions and returns a new
DataFrame
.show
([n, max_width, statement_params])Evaluates this DataFrame and prints out the first
n
rows with the specified maximum number of characters per column.sort
(*cols[, ascending])Sorts a DataFrame by the specified expressions (similar to ORDER BY in SQL).
subtract
(other)Returns a new DataFrame that contains all the rows from the current DataFrame except for the rows that also appear in the
other
DataFrame.take
([n, statement_params, block])Executes the query representing this DataFrame and returns the first
n
rows of the results.toDF
(*names)Creates a new DataFrame containing columns with the specified names.
toLocalIterator
(*[, statement_params, block])Executes the query representing this DataFrame and returns an iterator of
Row
objects that you can use to retrieve the results.toPandas
(*[, statement_params, block])Executes the query representing this DataFrame and returns the result as a Pandas DataFrame.
to_df
(*names)Creates a new DataFrame containing columns with the specified names.
Executes the query representing this DataFrame and returns an iterator of
Row
objects that you can use to retrieve the results.Executes the query representing this DataFrame and returns the result as a Pandas DataFrame.
Executes the query representing this DataFrame and returns an iterator of Pandas dataframes (containing a subset of rows) that you can use to retrieve the results.
union
(other)Returns a new DataFrame that contains all the rows in the current DataFrame and another DataFrame (
other
), excluding any duplicate rows.unionAll
(other)Returns a new DataFrame that contains all the rows in the current DataFrame and another DataFrame (
other
), including any duplicate rows.unionAllByName
(other)Returns a new DataFrame that contains all the rows in the current DataFrame and another DataFrame (
other
), including any duplicate rows.unionByName
(other)Returns a new DataFrame that contains all the rows in the current DataFrame and another DataFrame (
other
), excluding any duplicate rows.union_all
(other)Returns a new DataFrame that contains all the rows in the current DataFrame and another DataFrame (
other
), including any duplicate rows.union_all_by_name
(other)Returns a new DataFrame that contains all the rows in the current DataFrame and another DataFrame (
other
), including any duplicate rows.union_by_name
(other)Returns a new DataFrame that contains all the rows in the current DataFrame and another DataFrame (
other
), excluding any duplicate rows.unpivot
(value_column, name_column, column_list)Rotates a table by transforming columns into rows.
where
(expr)Filters rows based on the specified conditional expression (similar to WHERE in SQL).
withColumn
(col_name, col)Returns a DataFrame with an additional column with the specified name
col_name
.withColumnRenamed
(existing, new)Returns a DataFrame with the specified column
existing
renamed asnew
.with_column
(col_name, col)Returns a DataFrame with an additional column with the specified name
col_name
.with_column_renamed
(existing, new)Returns a DataFrame with the specified column
existing
renamed asnew
.with_columns
(col_names, values)Returns a DataFrame with additional columns with the specified names
col_names
.Attributes
Returns all column names as a list.
dtypes
Returns a
DataFrameNaFunctions
object that provides functions for handling missing values in the DataFrame.Returns a
dict
that contains a list of queries that will be executed to evaluate this DataFrame with the key queries, and a list of post-execution actions (e.g., queries to clean up temporary objects) with the key post_actions.The definition of the columns in this DataFrame (the "relational schema" for the DataFrame).
Returns a new
DataFrameWriter
object that you can use to write the data in theDataFrame
to a Snowflake database or a stage locationWhether the dataframe is cached.