snowflake.snowpark.DataFrame¶

class snowflake.snowpark.DataFrame(session: Optional[Session] = None, plan: Optional[LogicalPlan] = None, is_cached: bool = False)[source]¶

Bases: object

Represents a lazily-evaluated relational dataset that contains a collection of Row objects with columns defined by a schema (column name and type).

A DataFrame is considered lazy because it encapsulates the computation or query required to produce a relational dataset. The computation is not performed until you call a method that performs an action (e.g. collect()).

Creating a DataFrame

You can create a DataFrame in a number of different ways, as shown in the examples below.

Creating tables and data to run the sample code:

>>> session.sql("create or replace temp table prices(product_id varchar, amount number(10, 2))").collect()
[Row(status='Table PRICES successfully created.')]
>>> session.sql("insert into prices values ('id1', 10.0), ('id2', 20.0)").collect()
[Row(number of rows inserted=2)]
>>> # Create a CSV file to demo load
>>> import tempfile
>>> with tempfile.NamedTemporaryFile(mode="w+t") as t:
...     t.writelines(["id1, Product A", "\n" "id2, Product B"])
...     t.flush()
...     create_stage_result = session.sql("create temp stage test_stage").collect()
...     put_result = session.file.put(t.name, "@test_stage/test_dir")

Example 1

Creating a DataFrame by reading a table in Snowflake:

>>> df_prices = session.table("prices")

Example 2

Creating a DataFrame by reading files from a stage:

>>> from snowflake.snowpark.types import StructType, StructField, IntegerType, StringType
>>> df_catalog = session.read.schema(StructType([StructField("id", StringType()), StructField("name", StringType())])).csv("@test_stage/test_dir")
>>> df_catalog.show()
---------------------
|"ID"  |"NAME"      |
---------------------
|id1   | Product A  |
|id2   | Product B  |
---------------------

Example 3

Creating a DataFrame by specifying a sequence or a range:

>>> session.create_dataframe([(1, "one"), (2, "two")], schema=["col_a", "col_b"]).show()
---------------------
|"COL_A"  |"COL_B"  |
---------------------
|1        |one      |
|2        |two      |
---------------------

>>> session.range(1, 10, 2).to_df("col1").show()
----------
|"COL1"  |
----------
|1       |
|3       |
|5       |
|7       |
|9       |
----------

Example 4

Create a new DataFrame by applying transformations to other existing DataFrames:

>>> df_merged_data = df_catalog.join(df_prices, df_catalog["id"] == df_prices["product_id"])

Performing operations on a DataFrame

Broadly, the operations on DataFrame can be divided into two types:

Transformations produce a new DataFrame from one or more existing DataFrames. Note that transformations are lazy and don’t cause the DataFrame to be evaluated. If the API does not provide a method to express the SQL that you want to use, you can use functions.sqlExpr() as a workaround.
Actions cause the DataFrame to be evaluated. When you call a method that performs an action, Snowpark sends the SQL query for the DataFrame to the server for evaluation.

Transforming a DataFrame

The following examples demonstrate how you can transform a DataFrame.

Example 5

Using the select() method to select the columns that should be in the DataFrame (similar to adding a SELECT clause):

>>> # Return a new DataFrame containing the product_id and amount columns of the prices table.
>>> # This is equivalent to: SELECT PRODUCT_ID, AMOUNT FROM PRICES;
>>> df_price_ids_and_amounts = df_prices.select(col("product_id"), col("amount"))

Example 6

Using the Column.as_() method to rename a column in a DataFrame (similar to using SELECT col AS alias):

>>> # Return a new DataFrame containing the product_id column of the prices table as a column named
>>> # item_id. This is equivalent to: SELECT PRODUCT_ID AS ITEM_ID FROM PRICES;
>>> df_price_item_ids = df_prices.select(col("product_id").as_("item_id"))

Example 7

Using the filter() method to filter data (similar to adding a WHERE clause):

>>> # Return a new DataFrame containing the row from the prices table with the ID 1.
>>> # This is equivalent to:
>>> # SELECT * FROM PRICES WHERE PRODUCT_ID = 1;
>>> df_price1 = df_prices.filter((col("product_id") == 1))

Example 8

Using the sort() method to specify the sort order of the data (similar to adding an ORDER BY clause):

>>> # Return a new DataFrame for the prices table with the rows sorted by product_id.
>>> # This is equivalent to: SELECT * FROM PRICES ORDER BY PRODUCT_ID;
>>> df_sorted_prices = df_prices.sort(col("product_id"))

Example 9

Using agg() method to aggregate results.

>>> import snowflake.snowpark.functions as f
>>> df_prices.agg(("amount", "sum")).collect()
[Row(SUM(AMOUNT)=Decimal('30.00'))]
>>> df_prices.agg(f.sum("amount")).collect()
[Row(SUM(AMOUNT)=Decimal('30.00'))]
>>> # rename the aggregation column name
>>> df_prices.agg(f.sum("amount").alias("total_amount"), f.max("amount").alias("max_amount")).collect()
[Row(TOTAL_AMOUNT=Decimal('30.00'), MAX_AMOUNT=Decimal('20.00'))]

Example 10

Using the group_by() method to return a RelationalGroupedDataFrame that you can use to group and aggregate results (similar to adding a GROUP BY clause).

RelationalGroupedDataFrame provides methods for aggregating results, including:

RelationalGroupedDataFrame.avg() (equivalent to AVG(column))
RelationalGroupedDataFrame.count() (equivalent to COUNT())
RelationalGroupedDataFrame.max() (equivalent to MAX(column))
RelationalGroupedDataFrame.median() (equivalent to MEDIAN(column))
RelationalGroupedDataFrame.min() (equivalent to MIN(column))
RelationalGroupedDataFrame.sum() (equivalent to SUM(column))

>>> # Return a new DataFrame for the prices table that computes the sum of the prices by
>>> # category. This is equivalent to:
>>> #  SELECT CATEGORY, SUM(AMOUNT) FROM PRICES GROUP BY CATEGORY
>>> df_total_price_per_category = df_prices.group_by(col("product_id")).sum(col("amount"))
>>> # Have multiple aggregation values with the group by
>>> import snowflake.snowpark.functions as f
>>> df_summary = df_prices.group_by(col("product_id")).agg(f.sum(col("amount")).alias("total_amount"), f.avg("amount"))
>>> df_summary.show()
-------------------------------------------------
|"PRODUCT_ID"  |"TOTAL_AMOUNT"  |"AVG(AMOUNT)"  |
-------------------------------------------------
|id1           |10.00           |10.00000000    |
|id2           |20.00           |20.00000000    |
-------------------------------------------------

Example 11

Using windowing functions. Refer to Window for more details.

>>> from snowflake.snowpark import Window
>>> from snowflake.snowpark.functions import row_number
>>> df_prices.with_column("price_rank",  row_number().over(Window.order_by(col("amount").desc()))).show()
------------------------------------------
|"PRODUCT_ID"  |"AMOUNT"  |"PRICE_RANK"  |
------------------------------------------
|id2           |20.00     |1             |
|id1           |10.00     |2             |
------------------------------------------

Example 12

Handling missing values. Refer to DataFrameNaFunctions for more details.

>>> df = session.create_dataframe([[1, None, 3], [4, 5, None]], schema=["a", "b", "c"])
>>> df.na.fill({"b": 2, "c": 6}).show()
-------------------
|"A"  |"B"  |"C"  |
-------------------
|1    |2    |3    |
|4    |5    |6    |
-------------------

Performing an action on a DataFrame

The following examples demonstrate how you can perform an action on a DataFrame.

Example 13

Performing a query and returning an array of Rows:

>>> df_prices.collect()
[Row(PRODUCT_ID='id1', AMOUNT=Decimal('10.00')), Row(PRODUCT_ID='id2', AMOUNT=Decimal('20.00'))]

Example 14

Performing a query and print the results:

>>> df_prices.show()
---------------------------
|"PRODUCT_ID"  |"AMOUNT"  |
---------------------------
|id1           |10.00     |
|id2           |20.00     |
---------------------------

Example 15

Calculating statistics values. Refer to DataFrameStatFunctions for more details.

>>> df = session.create_dataframe([[1, 2], [3, 4], [5, -1]], schema=["a", "b"])
>>> df.stat.corr("a", "b")
-0.5960395606792697

Example 16

Performing a query asynchronously and returning a list of Row objects:

>>> df = session.create_dataframe([[float(4), 3, 5], [2.0, -4, 7], [3.0, 5, 6], [4.0, 6, 8]], schema=["a", "b", "c"])
>>> async_job = df.collect_nowait()
>>> async_job.result()
[Row(A=4.0, B=3, C=5), Row(A=2.0, B=-4, C=7), Row(A=3.0, B=5, C=6), Row(A=4.0, B=6, C=8)]

Example 17

Performing a query and transforming it into pandas.DataFrame asynchronously:

>>> async_job = df.to_pandas(block=False)
>>> async_job.result()
     A  B  C
0  4.0  3  5
1  2.0 -4  7
2  3.0  5  6
3  4.0  6  8

Methods


`agg`(*exprs)	Aggregate the data in the DataFrame.
`alias`(name)	Returns an aliased dataframe in which the columns can now be referenced to using col(<df alias>, <column name>).
`approxQuantile`(col, percentile, *[, ...])	For a specified numeric column and a list of desired quantiles, returns an approximate value for the column at each of the desired quantiles.
`approx_quantile`(col, percentile, *[, ...])	For a specified numeric column and a list of desired quantiles, returns an approximate value for the column at each of the desired quantiles.
`cache_result`(*[, statement_params])	Caches the content of this DataFrame to create a new cached Table DataFrame.
`col`(col_name)	Returns a reference to a column in the DataFrame.
`collect`()	Executes the query representing this DataFrame and returns the result as a list of `Row` objects.
`collect_nowait`(*[, statement_params, ...])	Executes the query representing this DataFrame asynchronously and returns: class:AsyncJob.
`copy_into_table`(table_name, *[, files, ...])	Executes a COPY INTO <table> command to load data from files in a stage location into a specified table.
`corr`(col1, col2, *[, statement_params])	Calculates the correlation coefficient for non-null pairs in two numeric columns.
`count`()	Executes the query representing this DataFrame and returns the number of rows in the result (similar to the COUNT function in SQL).
`cov`(col1, col2, *[, statement_params])	Calculates the sample covariance for non-null pairs in two numeric columns.
`createOrReplaceTempView`(name, *[, ...])	Creates a temporary view that returns the same results as this DataFrame.
`createOrReplaceView`(name, *[, statement_params])	Creates a view that captures the computation expressed by this DataFrame.
`create_or_replace_dynamic_table`(name, *, ...)	Creates a dynamic table that captures the computation expressed by this DataFrame.
`create_or_replace_temp_view`(name, *[, ...])	Creates a temporary view that returns the same results as this DataFrame.
`create_or_replace_view`(name, *[, ...])	Creates a view that captures the computation expressed by this DataFrame.
`crossJoin`(right, *[, lsuffix, rsuffix])	Performs a cross join, which returns the Cartesian product of the current `DataFrame` and another `DataFrame` (`right`).
`cross_join`(right, *[, lsuffix, rsuffix])	Performs a cross join, which returns the Cartesian product of the current `DataFrame` and another `DataFrame` (`right`).
`crosstab`(col1, col2, *[, statement_params])	Computes a pair-wise frequency table (a `contingency table`) for the specified columns.
`cube`(*cols)	Performs a SQL GROUP BY CUBE.
`describe`(*cols)	Computes basic statistics for numeric columns, which includes `count`, `mean`, `stddev`, `min`, and `max`.
`distinct`()	Returns a new DataFrame that contains only the rows with distinct values from the current DataFrame.
`drop`(*cols)	Returns a new DataFrame that excludes the columns with the specified names from the output.
`dropDuplicates`(*subset)	Creates a new DataFrame by removing duplicated rows on given subset of columns.
`drop_duplicates`(*subset)	Creates a new DataFrame by removing duplicated rows on given subset of columns.
`dropna`([how, thresh, subset])	Returns a new DataFrame that excludes all rows containing fewer than a specified number of non-null and non-NaN values in the specified columns.
`except_`(other)	Returns a new DataFrame that contains all the rows from the current DataFrame except for the rows that also appear in the `other` DataFrame.
`explain`()	Prints the list of queries that will be executed to evaluate this DataFrame.
`fillna`(value[, subset])	Returns a new DataFrame that replaces all null and NaN values in the specified columns with the values provided.
`filter`(expr)	Filters rows based on the specified conditional expression (similar to WHERE in SQL).
`first`()	Executes the query representing this DataFrame and returns the first `n` rows of the results.
`flatten`(input[, path, outer, recursive, mode])	Flattens (explodes) compound values into multiple rows.
`groupBy`(*cols)	Groups rows by the columns specified by expressions (similar to GROUP BY in SQL).
`group_by`(*cols)	Groups rows by the columns specified by expressions (similar to GROUP BY in SQL).
`group_by_grouping_sets`(*grouping_sets)	Performs a SQL GROUP BY GROUPING SETS.
`intersect`(other)	Returns a new DataFrame that contains the intersection of rows from the current DataFrame and another DataFrame (`other`).
`join`(right[, on, how, lsuffix, rsuffix])	Performs a join of the specified type (`how`) with the current DataFrame and another DataFrame (`right`) on a list of columns (`on`).
`join_table_function`(func, *func_arguments, ...)	Lateral joins the current DataFrame with the output of the specified table function.
`limit`(n[, offset])	Returns a new DataFrame that contains at most `n` rows from the current DataFrame, skipping `offset` rows from the beginning (similar to LIMIT and OFFSET in SQL).
`minus`(other)	Returns a new DataFrame that contains all the rows from the current DataFrame except for the rows that also appear in the `other` DataFrame.
`natural_join`(right[, how])	Performs a natural join of the specified type (`how`) with the current DataFrame and another DataFrame (`right`).
`orderBy`(*cols[, ascending])	Sorts a DataFrame by the specified expressions (similar to ORDER BY in SQL).
`order_by`(*cols[, ascending])	Sorts a DataFrame by the specified expressions (similar to ORDER BY in SQL).
`pivot`(pivot_col, values)	Rotates this DataFrame by turning the unique values from one column in the input expression into multiple columns and aggregating results where required on any remaining column values.
`printSchema`()
`print_schema`()
`randomSplit`(weights[, seed, statement_params])	Randomly splits the current DataFrame into separate DataFrames, using the specified weights.
`random_split`(weights[, seed, statement_params])	Randomly splits the current DataFrame into separate DataFrames, using the specified weights.
`rename`(col_or_mapper[, new_column])	Returns a DataFrame with the specified column `col_or_mapper` renamed as `new_column`.
`replace`(to_replace[, value, subset])	Returns a new DataFrame that replaces values in the specified columns.
`rollup`(*cols)	Performs a SQL GROUP BY ROLLUP.
`sample`([frac, n])	Samples rows based on either the number of rows to be returned or a percentage of rows to be returned.
`sampleBy`(col, fractions)	Returns a DataFrame containing a stratified sample without replacement, based on a `dict` that specifies the fraction for each stratum.
`sample_by`(col, fractions)	Returns a DataFrame containing a stratified sample without replacement, based on a `dict` that specifies the fraction for each stratum.
`select`(*cols)	Returns a new DataFrame with the specified Column expressions as output (similar to SELECT in SQL).
`selectExpr`(*exprs)	Projects a set of SQL expressions and returns a new `DataFrame`.
`select_expr`(*exprs)	Projects a set of SQL expressions and returns a new `DataFrame`.
`show`([n, max_width, statement_params])	Evaluates this DataFrame and prints out the first `n` rows with the specified maximum number of characters per column.
`sort`(*cols[, ascending])	Sorts a DataFrame by the specified expressions (similar to ORDER BY in SQL).
`subtract`(other)	Returns a new DataFrame that contains all the rows from the current DataFrame except for the rows that also appear in the `other` DataFrame.
`take`([n, statement_params, block])	Executes the query representing this DataFrame and returns the first `n` rows of the results.
`toDF`(*names)	Creates a new DataFrame containing columns with the specified names.
`toLocalIterator`(*[, statement_params, ...])	Executes the query representing this DataFrame and returns an iterator of `Row` objects that you can use to retrieve the results.
`toPandas`(*[, statement_params, block])	Executes the query representing this DataFrame and returns the result as a Pandas DataFrame.
`to_df`(*names)	Creates a new DataFrame containing columns with the specified names.
`to_local_iterator`()	Executes the query representing this DataFrame and returns an iterator of `Row` objects that you can use to retrieve the results.
`to_pandas`()	Executes the query representing this DataFrame and returns the result as a Pandas DataFrame.
`to_pandas_batches`()	Executes the query representing this DataFrame and returns an iterator of Pandas dataframes (containing a subset of rows) that you can use to retrieve the results.
`union`(other)	Returns a new DataFrame that contains all the rows in the current DataFrame and another DataFrame (`other`), excluding any duplicate rows.
`unionAll`(other)	Returns a new DataFrame that contains all the rows in the current DataFrame and another DataFrame (`other`), including any duplicate rows.
`unionAllByName`(other)	Returns a new DataFrame that contains all the rows in the current DataFrame and another DataFrame (`other`), including any duplicate rows.
`unionByName`(other)	Returns a new DataFrame that contains all the rows in the current DataFrame and another DataFrame (`other`), excluding any duplicate rows.
`union_all`(other)	Returns a new DataFrame that contains all the rows in the current DataFrame and another DataFrame (`other`), including any duplicate rows.
`union_all_by_name`(other)	Returns a new DataFrame that contains all the rows in the current DataFrame and another DataFrame (`other`), including any duplicate rows.
`union_by_name`(other)	Returns a new DataFrame that contains all the rows in the current DataFrame and another DataFrame (`other`), excluding any duplicate rows.
`unpivot`(value_column, name_column, column_list)	Rotates a table by transforming columns into rows.
`where`(expr)	Filters rows based on the specified conditional expression (similar to WHERE in SQL).
`withColumn`(col_name, col)	Returns a DataFrame with an additional column with the specified name `col_name`.
`withColumnRenamed`(existing, new)	Returns a DataFrame with the specified column `existing` renamed as `new`.
`with_column`(col_name, col)	Returns a DataFrame with an additional column with the specified name `col_name`.
`with_column_renamed`(existing, new)	Returns a DataFrame with the specified column `existing` renamed as `new`.
`with_columns`(col_names, values)	Returns a DataFrame with additional columns with the specified names `col_names`.

Attributes


`columns`	Returns all column names as a list.
`dtypes`
`na`	Returns a `DataFrameNaFunctions` object that provides functions for handling missing values in the DataFrame.
`queries`	Returns a `dict` that contains a list of queries that will be executed to evaluate this DataFrame with the key queries, and a list of post-execution actions (e.g., queries to clean up temporary objects) with the key post_actions.
`schema`	The definition of the columns in this DataFrame (the "relational schema" for the DataFrame).
`session`	Returns a `snowflake.snowpark.Session` object that provides access to the session the current DataFrame is relying on.
`stat`
`write`	Returns a new `DataFrameWriter` object that you can use to write the data in the `DataFrame` to a Snowflake database or a stage location
`is_cached`	Whether the dataframe is cached.