You are viewing documentation about an older version (1.2.0). View latest version

snowflake.snowpark.DataFrame¶

class snowflake.snowpark.DataFrame(session: Session | None = None, plan: LogicalPlan | None = None, is_cached: bool = False)[source]¶

Bases: object

Represents a lazily-evaluated relational dataset that contains a collection of Row objects with columns defined by a schema (column name and type).

A DataFrame is considered lazy because it encapsulates the computation or query required to produce a relational dataset. The computation is not performed until you call a method that performs an action (e.g. collect()).

Creating a DataFrame

You can create a DataFrame in a number of different ways, as shown in the examples below.

Creating tables and data to run the sample code:
>>> session.sql("create or replace temp table prices(product_id varchar, amount number(10, 2))").collect()
[Row(status='Table PRICES successfully created.')]
>>> session.sql("insert into prices values ('id1', 10.0), ('id2', 20.0)").collect()
[Row(number of rows inserted=2)]
>>> # Create a CSV file to demo load
>>> import tempfile
>>> with tempfile.NamedTemporaryFile(mode="w+t") as t:
...     t.writelines(["id1, Product A", "\n" "id2, Product B"])
...     t.flush()
...     create_stage_result = session.sql("create temp stage test_stage").collect()
...     put_result = session.file.put(t.name, "@test_stage/test_dir")
Copy
Example 1

Creating a DataFrame by reading a table in Snowflake:

>>> df_prices = session.table("prices")
Copy
Example 2

Creating a DataFrame by reading files from a stage:

>>> from snowflake.snowpark.types import StructType, StructField, IntegerType, StringType
>>> df_catalog = session.read.schema(StructType([StructField("id", StringType()), StructField("name", StringType())])).csv("@test_stage/test_dir")
>>> df_catalog.show()
---------------------
|"ID"  |"NAME"      |
---------------------
|id1   | Product A  |
|id2   | Product B  |
---------------------
Copy
Example 3

Creating a DataFrame by specifying a sequence or a range:

>>> session.create_dataframe([(1, "one"), (2, "two")], schema=["col_a", "col_b"]).show()
---------------------
|"COL_A"  |"COL_B"  |
---------------------
|1        |one      |
|2        |two      |
---------------------

>>> session.range(1, 10, 2).to_df("col1").show()
----------
|"COL1"  |
----------
|1       |
|3       |
|5       |
|7       |
|9       |
----------
Copy
Example 4

Create a new DataFrame by applying transformations to other existing DataFrames:

>>> df_merged_data = df_catalog.join(df_prices, df_catalog["id"] == df_prices["product_id"])
Copy

Performing operations on a DataFrame

Broadly, the operations on DataFrame can be divided into two types:

  • Transformations produce a new DataFrame from one or more existing DataFrames. Note that transformations are lazy and don’t cause the DataFrame to be evaluated. If the API does not provide a method to express the SQL that you want to use, you can use functions.sqlExpr() as a workaround.

  • Actions cause the DataFrame to be evaluated. When you call a method that performs an action, Snowpark sends the SQL query for the DataFrame to the server for evaluation.

Transforming a DataFrame

The following examples demonstrate how you can transform a DataFrame.

Example 5

Using the select() method to select the columns that should be in the DataFrame (similar to adding a SELECT clause):

>>> # Return a new DataFrame containing the product_id and amount columns of the prices table.
>>> # This is equivalent to: SELECT PRODUCT_ID, AMOUNT FROM PRICES;
>>> df_price_ids_and_amounts = df_prices.select(col("product_id"), col("amount"))
Copy
Example 6

Using the Column.as_() method to rename a column in a DataFrame (similar to using SELECT col AS alias):

>>> # Return a new DataFrame containing the product_id column of the prices table as a column named
>>> # item_id. This is equivalent to: SELECT PRODUCT_ID AS ITEM_ID FROM PRICES;
>>> df_price_item_ids = df_prices.select(col("product_id").as_("item_id"))
Copy
Example 7

Using the filter() method to filter data (similar to adding a WHERE clause):

>>> # Return a new DataFrame containing the row from the prices table with the ID 1.
>>> # This is equivalent to:
>>> # SELECT * FROM PRICES WHERE PRODUCT_ID = 1;
>>> df_price1 = df_prices.filter((col("product_id") == 1))
Copy
Example 8

Using the sort() method to specify the sort order of the data (similar to adding an ORDER BY clause):

>>> # Return a new DataFrame for the prices table with the rows sorted by product_id.
>>> # This is equivalent to: SELECT * FROM PRICES ORDER BY PRODUCT_ID;
>>> df_sorted_prices = df_prices.sort(col("product_id"))
Copy
Example 9

Using agg() method to aggregate results.

>>> import snowflake.snowpark.functions as f
>>> df_prices.agg(("amount", "sum")).collect()
[Row(SUM(AMOUNT)=Decimal('30.00'))]
>>> df_prices.agg(f.sum("amount")).collect()
[Row(SUM(AMOUNT)=Decimal('30.00'))]
>>> # rename the aggregation column name
>>> df_prices.agg(f.sum("amount").alias("total_amount"), f.max("amount").alias("max_amount")).collect()
[Row(TOTAL_AMOUNT=Decimal('30.00'), MAX_AMOUNT=Decimal('20.00'))]
Copy
Example 10

Using the group_by() method to return a RelationalGroupedDataFrame that you can use to group and aggregate results (similar to adding a GROUP BY clause).

RelationalGroupedDataFrame provides methods for aggregating results, including:

>>> # Return a new DataFrame for the prices table that computes the sum of the prices by
>>> # category. This is equivalent to:
>>> #  SELECT CATEGORY, SUM(AMOUNT) FROM PRICES GROUP BY CATEGORY
>>> df_total_price_per_category = df_prices.group_by(col("product_id")).sum(col("amount"))
>>> # Have multiple aggregation values with the group by
>>> import snowflake.snowpark.functions as f
>>> df_summary = df_prices.group_by(col("product_id")).agg(f.sum(col("amount")).alias("total_amount"), f.avg("amount"))
>>> df_summary.show()
-------------------------------------------------
|"PRODUCT_ID"  |"TOTAL_AMOUNT"  |"AVG(AMOUNT)"  |
-------------------------------------------------
|id1           |10.00           |10.00000000    |
|id2           |20.00           |20.00000000    |
-------------------------------------------------
Copy
Example 11

Using windowing functions. Refer to Window for more details.

>>> from snowflake.snowpark import Window
>>> from snowflake.snowpark.functions import row_number
>>> df_prices.with_column("price_rank",  row_number().over(Window.order_by(col("amount").desc()))).show()
------------------------------------------
|"PRODUCT_ID"  |"AMOUNT"  |"PRICE_RANK"  |
------------------------------------------
|id2           |20.00     |1             |
|id1           |10.00     |2             |
------------------------------------------
Copy
Example 12

Handling missing values. Refer to DataFrameNaFunctions for more details.

>>> df = session.create_dataframe([[1, None, 3], [4, 5, None]], schema=["a", "b", "c"])
>>> df.na.fill({"b": 2, "c": 6}).show()
-------------------
|"A"  |"B"  |"C"  |
-------------------
|1    |2    |3    |
|4    |5    |6    |
-------------------
Copy

Performing an action on a DataFrame

The following examples demonstrate how you can perform an action on a DataFrame.

Example 13

Performing a query and returning an array of Rows:

>>> df_prices.collect()
[Row(PRODUCT_ID='id1', AMOUNT=Decimal('10.00')), Row(PRODUCT_ID='id2', AMOUNT=Decimal('20.00'))]
Copy
Example 14

Performing a query and print the results:

>>> df_prices.show()
---------------------------
|"PRODUCT_ID"  |"AMOUNT"  |
---------------------------
|id1           |10.00     |
|id2           |20.00     |
---------------------------
Copy
Example 15

Calculating statistics values. Refer to DataFrameStatFunctions for more details.

>>> df = session.create_dataframe([[1, 2], [3, 4], [5, -1]], schema=["a", "b"])
>>> df.stat.corr("a", "b")
-0.5960395606792697
Copy
Example 16

Performing a query asynchronously and returning a list of Row objects:

>>> df = session.create_dataframe([[float(4), 3, 5], [2.0, -4, 7], [3.0, 5, 6], [4.0, 6, 8]], schema=["a", "b", "c"])
>>> async_job = df.collect_nowait()
>>> async_job.result()
[Row(A=4.0, B=3, C=5), Row(A=2.0, B=-4, C=7), Row(A=3.0, B=5, C=6), Row(A=4.0, B=6, C=8)]
Copy
Example 17

Performing a query and transforming it into pandas.DataFrame asynchronously:

>>> async_job = df.to_pandas(block=False)
>>> async_job.result()
     A  B  C
0  4.0  3  5
1  2.0 -4  7
2  3.0  5  6
3  4.0  6  8
Copy

Methods

agg(*exprs)

Aggregate the data in the DataFrame.

approxQuantile(col, percentile, *[, ...])

For a specified numeric column and a list of desired quantiles, returns an approximate value for the column at each of the desired quantiles.

approx_quantile(col, percentile, *[, ...])

For a specified numeric column and a list of desired quantiles, returns an approximate value for the column at each of the desired quantiles.

cache_result(*[, statement_params])

Caches the content of this DataFrame to create a new cached Table DataFrame.

col(col_name)

Returns a reference to a column in the DataFrame.

collect()

Executes the query representing this DataFrame and returns the result as a list of Row objects.

collect_nowait(*[, statement_params])

Executes the query representing this DataFrame asynchronously and returns: class:AsyncJob.

copy_into_table(table_name, *[, files, ...])

Executes a COPY INTO <table> command to load data from files in a stage location into a specified table.

corr(col1, col2, *[, statement_params])

Calculates the correlation coefficient for non-null pairs in two numeric columns.

count()

Executes the query representing this DataFrame and returns the number of rows in the result (similar to the COUNT function in SQL).

cov(col1, col2, *[, statement_params])

Calculates the sample covariance for non-null pairs in two numeric columns.

createOrReplaceTempView(name, *[, ...])

Creates a temporary view that returns the same results as this DataFrame.

createOrReplaceView(name, *[, statement_params])

Creates a view that captures the computation expressed by this DataFrame.

create_or_replace_temp_view(name, *[, ...])

Creates a temporary view that returns the same results as this DataFrame.

create_or_replace_view(name, *[, ...])

Creates a view that captures the computation expressed by this DataFrame.

crossJoin(right, *[, lsuffix, rsuffix])

Performs a cross join, which returns the Cartesian product of the current DataFrame and another DataFrame (right).

cross_join(right, *[, lsuffix, rsuffix])

Performs a cross join, which returns the Cartesian product of the current DataFrame and another DataFrame (right).

crosstab(col1, col2, *[, statement_params])

Computes a pair-wise frequency table (a contingency table) for the specified columns.

cube(*cols)

Performs a SQL GROUP BY CUBE.

describe(*cols)

Computes basic statistics for numeric columns, which includes count, mean, stddev, min, and max.

distinct()

Returns a new DataFrame that contains only the rows with distinct values from the current DataFrame.

drop(*cols)

Returns a new DataFrame that excludes the columns with the specified names from the output.

dropDuplicates(*subset)

Creates a new DataFrame by removing duplicated rows on given subset of columns.

drop_duplicates(*subset)

Creates a new DataFrame by removing duplicated rows on given subset of columns.

dropna([how, thresh, subset])

Returns a new DataFrame that excludes all rows containing fewer than a specified number of non-null and non-NaN values in the specified columns.

except_(other)

Returns a new DataFrame that contains all the rows from the current DataFrame except for the rows that also appear in the other DataFrame.

explain()

Prints the list of queries that will be executed to evaluate this DataFrame.

fillna(value[, subset])

Returns a new DataFrame that replaces all null and NaN values in the specified columns with the values provided.

filter(expr)

Filters rows based on the specified conditional expression (similar to WHERE in SQL).

first()

Executes the query representing this DataFrame and returns the first n rows of the results.

flatten(input[, path, outer, recursive, mode])

Flattens (explodes) compound values into multiple rows.

groupBy(*cols)

Groups rows by the columns specified by expressions (similar to GROUP BY in SQL).

group_by(*cols)

Groups rows by the columns specified by expressions (similar to GROUP BY in SQL).

group_by_grouping_sets(*grouping_sets)

Performs a SQL GROUP BY GROUPING SETS.

intersect(other)

Returns a new DataFrame that contains the intersection of rows from the current DataFrame and another DataFrame (other).

join(right[, on, how, lsuffix, rsuffix])

Performs a join of the specified type (how) with the current DataFrame and another DataFrame (right) on a list of columns (on).

join_table_function(func, *func_arguments, ...)

Lateral joins the current DataFrame with the output of the specified table function.

limit(n[, offset])

Returns a new DataFrame that contains at most n rows from the current DataFrame, skipping offset rows from the beginning (similar to LIMIT and OFFSET in SQL).

minus(other)

Returns a new DataFrame that contains all the rows from the current DataFrame except for the rows that also appear in the other DataFrame.

natural_join(right[, how])

Performs a natural join of the specified type (how) with the current DataFrame and another DataFrame (right).

orderBy(*cols[, ascending])

Sorts a DataFrame by the specified expressions (similar to ORDER BY in SQL).

order_by(*cols[, ascending])

Sorts a DataFrame by the specified expressions (similar to ORDER BY in SQL).

pivot(pivot_col, values)

Rotates this DataFrame by turning the unique values from one column in the input expression into multiple columns and aggregating results where required on any remaining column values.

randomSplit(weights[, seed, statement_params])

Randomly splits the current DataFrame into separate DataFrames, using the specified weights.

random_split(weights[, seed, statement_params])

Randomly splits the current DataFrame into separate DataFrames, using the specified weights.

rename(existing, new)

Returns a DataFrame with the specified column existing renamed as new.

replace(to_replace[, value, subset])

Returns a new DataFrame that replaces values in the specified columns.

rollup(*cols)

Performs a SQL GROUP BY ROLLUP.

sample([frac, n])

Samples rows based on either the number of rows to be returned or a percentage of rows to be returned.

sampleBy(col, fractions)

Returns a DataFrame containing a stratified sample without replacement, based on a dict that specifies the fraction for each stratum.

sample_by(col, fractions)

Returns a DataFrame containing a stratified sample without replacement, based on a dict that specifies the fraction for each stratum.

select(*cols)

Returns a new DataFrame with the specified Column expressions as output (similar to SELECT in SQL).

selectExpr(*exprs)

Projects a set of SQL expressions and returns a new DataFrame.

select_expr(*exprs)

Projects a set of SQL expressions and returns a new DataFrame.

show([n, max_width, statement_params])

Evaluates this DataFrame and prints out the first n rows with the specified maximum number of characters per column.

sort(*cols[, ascending])

Sorts a DataFrame by the specified expressions (similar to ORDER BY in SQL).

subtract(other)

Returns a new DataFrame that contains all the rows from the current DataFrame except for the rows that also appear in the other DataFrame.

take([n, statement_params, block])

Executes the query representing this DataFrame and returns the first n rows of the results.

toDF(*names)

Creates a new DataFrame containing columns with the specified names.

toLocalIterator(*[, statement_params, block])

Executes the query representing this DataFrame and returns an iterator of Row objects that you can use to retrieve the results.

toPandas(*[, statement_params, block])

Executes the query representing this DataFrame and returns the result as a Pandas DataFrame.

to_df(*names)

Creates a new DataFrame containing columns with the specified names.

to_local_iterator()

Executes the query representing this DataFrame and returns an iterator of Row objects that you can use to retrieve the results.

to_pandas()

Executes the query representing this DataFrame and returns the result as a Pandas DataFrame.

to_pandas_batches()

Executes the query representing this DataFrame and returns an iterator of Pandas dataframes (containing a subset of rows) that you can use to retrieve the results.

union(other)

Returns a new DataFrame that contains all the rows in the current DataFrame and another DataFrame (other), excluding any duplicate rows.

unionAll(other)

Returns a new DataFrame that contains all the rows in the current DataFrame and another DataFrame (other), including any duplicate rows.

unionAllByName(other)

Returns a new DataFrame that contains all the rows in the current DataFrame and another DataFrame (other), including any duplicate rows.

unionByName(other)

Returns a new DataFrame that contains all the rows in the current DataFrame and another DataFrame (other), excluding any duplicate rows.

union_all(other)

Returns a new DataFrame that contains all the rows in the current DataFrame and another DataFrame (other), including any duplicate rows.

union_all_by_name(other)

Returns a new DataFrame that contains all the rows in the current DataFrame and another DataFrame (other), including any duplicate rows.

union_by_name(other)

Returns a new DataFrame that contains all the rows in the current DataFrame and another DataFrame (other), excluding any duplicate rows.

unpivot(value_column, name_column, column_list)

Rotates a table by transforming columns into rows.

where(expr)

Filters rows based on the specified conditional expression (similar to WHERE in SQL).

withColumn(col_name, col)

Returns a DataFrame with an additional column with the specified name col_name.

withColumnRenamed(existing, new)

Returns a DataFrame with the specified column existing renamed as new.

with_column(col_name, col)

Returns a DataFrame with an additional column with the specified name col_name.

with_column_renamed(existing, new)

Returns a DataFrame with the specified column existing renamed as new.

with_columns(col_names, values)

Returns a DataFrame with additional columns with the specified names col_names.

Attributes

columns

Returns all column names as a list.

dtypes

na

Returns a DataFrameNaFunctions object that provides functions for handling missing values in the DataFrame.

queries

Returns a dict that contains a list of queries that will be executed to evaluate this DataFrame with the key queries, and a list of post-execution actions (e.g., queries to clean up temporary objects) with the key post_actions.

schema

The definition of the columns in this DataFrame (the "relational schema" for the DataFrame).

stat

write

Returns a new DataFrameWriter object that you can use to write the data in the DataFrame to a Snowflake database or a stage location

is_cached

Whether the dataframe is cached.