snowflake.snowpark.DataFrame
- class snowflake.snowpark.DataFrame(session: Optional[Session] = None, plan: Optional[LogicalPlan] = None, is_cached: bool = False)
Bases: object

Represents a lazily-evaluated relational dataset that contains a collection of Row objects with columns defined by a schema (column name and type).

A DataFrame is considered lazy because it encapsulates the computation or query required to produce a relational dataset. The computation is not performed until you call a method that performs an action (e.g. collect()).
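For instance (a minimal sketch, assuming an active session and the prices table created in the sample code below), chaining transformations only builds up a query; the SQL is sent to Snowflake when an action such as collect() is called:

>>> from snowflake.snowpark.functions import col
>>> df = session.table("prices").filter(col("amount") > 0)  # no query is executed yet
>>> rows = df.collect()  # the query runs here and returns a list of Row objects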
Creating a DataFrame
You can create a DataFrame in a number of different ways, as shown in the examples below.
- Creating tables and data to run the sample code:
>>> session.sql("create or replace temp table prices(product_id varchar, amount number(10, 2))").collect() [Row(status='Table PRICES successfully created.')] >>> session.sql("insert into prices values ('id1', 10.0), ('id2', 20.0)").collect() [Row(number of rows inserted=2)] >>> # Create a CSV file to demo load >>> import tempfile >>> with tempfile.NamedTemporaryFile(mode="w+t") as t: ... t.writelines(["id1, Product A", "\n" "id2, Product B"]) ... t.flush() ... create_stage_result = session.sql("create temp stage test_stage").collect() ... put_result = session.file.put(t.name, "@test_stage/test_dir")
- Example 1
Creating a DataFrame by reading a table in Snowflake:
>>> df_prices = session.table("prices")
- Example 2
Creating a DataFrame by reading files from a stage:
>>> from snowflake.snowpark.types import StructType, StructField, IntegerType, StringType
>>> df_catalog = session.read.schema(StructType([StructField("id", StringType()), StructField("name", StringType())])).csv("@test_stage/test_dir")
>>> df_catalog.show()
---------------------
|"ID"  |"NAME"      |
---------------------
|id1   | Product A  |
|id2   | Product B  |
---------------------
- Example 3
Creating a DataFrame by specifying a sequence or a range:
>>> session.create_dataframe([(1, "one"), (2, "two")], schema=["col_a", "col_b"]).show()
---------------------
|"COL_A"  |"COL_B"  |
---------------------
|1        |one      |
|2        |two      |
---------------------
>>> session.range(1, 10, 2).to_df("col1").show()
----------
|"COL1"  |
----------
|1       |
|3       |
|5       |
|7       |
|9       |
----------
- Example 4
Creating a new DataFrame by applying transformations to other existing DataFrames:
>>> df_merged_data = df_catalog.join(df_prices, df_catalog["id"] == df_prices["product_id"])
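The joined DataFrame can be transformed further like any other DataFrame. For instance (a minimal sketch using the DataFrames defined above), you can select columns from either side of the join; the result stays lazy until an action runs:

>>> df_product_amounts = df_merged_data.select(df_catalog["name"], df_prices["amount"])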
Performing operations on a DataFrame
Broadly, the operations on a DataFrame can be divided into two types:

Transformations produce a new DataFrame from one or more existing DataFrames. Note that transformations are lazy and don't cause the DataFrame to be evaluated. If the API does not provide a method to express the SQL that you want to use, you can use functions.sql_expr() as a workaround.

Actions cause the DataFrame to be evaluated. When you call a method that performs an action, Snowpark sends the SQL query for the DataFrame to the server for evaluation.
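For example (a minimal sketch, assuming the prices table created above), sql_expr() lets you embed raw SQL text inside a transformation:

>>> from snowflake.snowpark.functions import sql_expr
>>> df_prices.filter(sql_expr("amount > 15")).collect()
[Row(PRODUCT_ID='id2', AMOUNT=Decimal('20.00'))]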
Transforming a DataFrame
The following examples demonstrate how you can transform a DataFrame.
- Example 5
Using the select() method to select the columns that should be in the DataFrame (similar to adding a SELECT clause):

>>> # Return a new DataFrame containing the product_id and amount columns of the prices table.
>>> # This is equivalent to: SELECT PRODUCT_ID, AMOUNT FROM PRICES;
>>> df_price_ids_and_amounts = df_prices.select(col("product_id"), col("amount"))
- Example 6
Using the Column.as_() method to rename a column in a DataFrame (similar to using SELECT col AS alias):

>>> # Return a new DataFrame containing the product_id column of the prices table as a column named
>>> # item_id. This is equivalent to: SELECT PRODUCT_ID AS ITEM_ID FROM PRICES;
>>> df_price_item_ids = df_prices.select(col("product_id").as_("item_id"))
- Example 7
Using the filter() method to filter data (similar to adding a WHERE clause):

>>> # Return a new DataFrame containing the row from the prices table with the ID 1.
>>> # This is equivalent to:
>>> # SELECT * FROM PRICES WHERE PRODUCT_ID = 1;
>>> df_price1 = df_prices.filter((col("product_id") == 1))
- Example 8
Using the sort() method to specify the sort order of the data (similar to adding an ORDER BY clause):

>>> # Return a new DataFrame for the prices table with the rows sorted by product_id.
>>> # This is equivalent to: SELECT * FROM PRICES ORDER BY PRODUCT_ID;
>>> df_sorted_prices = df_prices.sort(col("product_id"))
- Example 9
Using the agg() method to aggregate results:

>>> import snowflake.snowpark.functions as f
>>> df_prices.agg(("amount", "sum")).collect()
[Row(SUM(AMOUNT)=Decimal('30.00'))]
>>> df_prices.agg(f.sum("amount")).collect()
[Row(SUM(AMOUNT)=Decimal('30.00'))]
>>> # rename the aggregation column name
>>> df_prices.agg(f.sum("amount").alias("total_amount"), f.max("amount").alias("max_amount")).collect()
[Row(TOTAL_AMOUNT=Decimal('30.00'), MAX_AMOUNT=Decimal('20.00'))]
- Example 10
Using the group_by() method to return a RelationalGroupedDataFrame that you can use to group and aggregate results (similar to adding a GROUP BY clause).

RelationalGroupedDataFrame provides methods for aggregating results, including:

- RelationalGroupedDataFrame.avg() (equivalent to AVG(column))
- RelationalGroupedDataFrame.count() (equivalent to COUNT())
- RelationalGroupedDataFrame.max() (equivalent to MAX(column))
- RelationalGroupedDataFrame.median() (equivalent to MEDIAN(column))
- RelationalGroupedDataFrame.min() (equivalent to MIN(column))
- RelationalGroupedDataFrame.sum() (equivalent to SUM(column))
>>> # Return a new DataFrame for the prices table that computes the sum of the prices by
>>> # product_id. This is equivalent to:
>>> # SELECT PRODUCT_ID, SUM(AMOUNT) FROM PRICES GROUP BY PRODUCT_ID
>>> df_total_price_per_product = df_prices.group_by(col("product_id")).sum(col("amount"))
>>> # Have multiple aggregation values with the group by
>>> import snowflake.snowpark.functions as f
>>> df_summary = df_prices.group_by(col("product_id")).agg(f.sum(col("amount")).alias("total_amount"), f.avg("amount")).sort(col("product_id"))
>>> df_summary.show()
-------------------------------------------------
|"PRODUCT_ID"  |"TOTAL_AMOUNT"  |"AVG(AMOUNT)"  |
-------------------------------------------------
|id1           |10.00           |10.00000000    |
|id2           |20.00           |20.00000000    |
-------------------------------------------------
- Example 11
Using windowing functions. Refer to Window for more details.

>>> from snowflake.snowpark import Window
>>> from snowflake.snowpark.functions import row_number
>>> df_prices.with_column("price_rank", row_number().over(Window.order_by(col("amount").desc()))).show()
------------------------------------------
|"PRODUCT_ID"  |"AMOUNT"  |"PRICE_RANK"  |
------------------------------------------
|id2           |20.00     |1             |
|id1           |10.00     |2             |
------------------------------------------
- Example 12
Handling missing values. Refer to DataFrameNaFunctions for more details.

>>> df = session.create_dataframe([[1, None, 3], [4, 5, None]], schema=["a", "b", "c"])
>>> df.na.fill({"b": 2, "c": 6}).show()
-------------------
|"A"  |"B"  |"C"  |
-------------------
|1    |2    |3    |
|4    |5    |6    |
-------------------
Performing an action on a DataFrame
The following examples demonstrate how you can perform an action on a DataFrame.
- Example 13
Performing a query and returning a list of Row objects:

>>> df_prices.collect()
[Row(PRODUCT_ID='id1', AMOUNT=Decimal('10.00')), Row(PRODUCT_ID='id2', AMOUNT=Decimal('20.00'))]
- Example 14
Performing a query and printing the results:

>>> df_prices.show()
---------------------------
|"PRODUCT_ID"  |"AMOUNT"  |
---------------------------
|id1           |10.00     |
|id2           |20.00     |
---------------------------
- Example 15
Calculating statistical values. Refer to DataFrameStatFunctions for more details.

>>> df = session.create_dataframe([[1, 2], [3, 4], [5, -1]], schema=["a", "b"])
>>> df.stat.corr("a", "b")
-0.5960395606792697
- Example 16
Performing a query asynchronously and returning a list of Row objects:

>>> df = session.create_dataframe([[float(4), 3, 5], [2.0, -4, 7], [3.0, 5, 6], [4.0, 6, 8]], schema=["a", "b", "c"])
>>> async_job = df.collect_nowait()
>>> async_job.result()
[Row(A=4.0, B=3, C=5), Row(A=2.0, B=-4, C=7), Row(A=3.0, B=5, C=6), Row(A=4.0, B=6, C=8)]
- Example 17
Performing a query and transforming the result into a pandas.DataFrame asynchronously:

>>> async_job = df.to_pandas(block=False)
>>> async_job.result()
     A  B  C
0  4.0  3  5
1  2.0 -4  7
2  3.0  5  6
3  4.0  6  8
Methods
- agg(*exprs): Aggregate the data in the DataFrame.
- alias(name): Returns an aliased dataframe in which the columns can now be referenced using col(<df alias>, <column name>).
- approxQuantile(col, percentile, *[, ...]): For a specified numeric column and a list of desired quantiles, returns an approximate value for the column at each of the desired quantiles.
- approx_quantile(col, percentile, *[, ...]): For a specified numeric column and a list of desired quantiles, returns an approximate value for the column at each of the desired quantiles.
- cache_result(*[, statement_params]): Caches the content of this DataFrame to create a new cached Table DataFrame.
- col(col_name): Returns a reference to a column in the DataFrame.
- col_ilike(pattern): Returns a new DataFrame with only the columns whose names match the specified pattern using case-insensitive ILIKE matching (similar to SELECT * ILIKE 'pattern' in SQL).
- collect(): Executes the query representing this DataFrame and returns the result as a list of Row objects.
- collect_nowait(*[, statement_params, ...]): Executes the query representing this DataFrame asynchronously and returns an AsyncJob.
- copy_into_table(table_name, *[, files, ...]): Executes a COPY INTO <table> command to load data from files in a stage location into a specified table.
- corr(col1, col2, *[, statement_params]): Calculates the correlation coefficient for non-null pairs in two numeric columns.
- count(): Executes the query representing this DataFrame and returns the number of rows in the result (similar to the COUNT function in SQL).
- cov(col1, col2, *[, statement_params]): Calculates the sample covariance for non-null pairs in two numeric columns.
- createOrReplaceTempView(name, *[, comment, ...]): Creates or replaces a temporary view that returns the same results as this DataFrame.
- createOrReplaceView(name, *[, comment, ...]): Creates a view that captures the computation expressed by this DataFrame.
- createTempView(name, *[, comment, ...]): Creates a temporary view that returns the same results as this DataFrame.
- create_or_replace_dynamic_table(name, *, ...): Creates a dynamic table that captures the computation expressed by this DataFrame.
- create_or_replace_temp_view(name, *[, ...]): Creates or replaces a temporary view that returns the same results as this DataFrame.
- create_or_replace_view(name, *[, comment, ...]): Creates a view that captures the computation expressed by this DataFrame.
- create_temp_view(name, *[, comment, ...]): Creates a temporary view that returns the same results as this DataFrame.
- crossJoin(right, *[, lsuffix, rsuffix]): Performs a cross join, which returns the Cartesian product of the current DataFrame and another DataFrame (right).
- cross_join(right, *[, lsuffix, rsuffix]): Performs a cross join, which returns the Cartesian product of the current DataFrame and another DataFrame (right).
- crosstab(col1, col2, *[, statement_params]): Computes a pair-wise frequency table (a contingency table) for the specified columns.
- cube(*cols): Performs a SQL GROUP BY CUBE.
- describe(*cols[, strings_include_math_stats]): Computes basic statistics for numeric columns, which includes count, mean, stddev, min, and max.
- distinct(): Returns a new DataFrame that contains only the rows with distinct values from the current DataFrame.
- drop(*cols): Returns a new DataFrame that excludes the columns with the specified names from the output.
- dropDuplicates(*subset): Creates a new DataFrame by removing duplicated rows on the given subset of columns.
- drop_duplicates(*subset): Creates a new DataFrame by removing duplicated rows on the given subset of columns.
- dropna([how, thresh, subset]): Returns a new DataFrame that excludes all rows containing fewer than a specified number of non-null and non-NaN values in the specified columns.
- except_(other): Returns a new DataFrame that contains all the rows from the current DataFrame except for the rows that also appear in the other DataFrame.
- explain(): Prints the list of queries that will be executed to evaluate this DataFrame.
- fillna(value[, subset, include_decimal]): Returns a new DataFrame that replaces all null and NaN values in the specified columns with the values provided.
- filter(expr): Filters rows based on the specified conditional expression (similar to WHERE in SQL).
- first(): Executes the query representing this DataFrame and returns the first n rows of the results.
- flatten(input[, path, outer, recursive, mode]): Flattens (explodes) compound values into multiple rows.
- get_execution_profile([output_file, verbose]): Gets the execution profile of the DataFrame.
- groupBy(*cols): Groups rows by the columns specified by expressions (similar to GROUP BY in SQL).
- group_by(*cols): Groups rows by the columns specified by expressions (similar to GROUP BY in SQL).
- group_by_grouping_sets(*grouping_sets): Performs a SQL GROUP BY GROUPING SETS.
- intersect(other): Returns a new DataFrame that contains the intersection of rows from the current DataFrame and another DataFrame (other).
- join(right[, on, how, lsuffix, rsuffix, ...]): Performs a join of the specified type (how) with the current DataFrame and another DataFrame (right) on a list of columns (on).
- join_table_function(func, *func_arguments, ...): Lateral joins the current DataFrame with the output of the specified table function.
- limit(n[, offset]): Returns a new DataFrame that contains at most n rows from the current DataFrame, skipping offset rows from the beginning (similar to LIMIT and OFFSET in SQL).
- minus(other): Returns a new DataFrame that contains all the rows from the current DataFrame except for the rows that also appear in the other DataFrame.
- natural_join(right[, how]): Performs a natural join of the specified type (how) with the current DataFrame and another DataFrame (right).
- orderBy(*cols[, ascending]): Sorts a DataFrame by the specified expressions (similar to ORDER BY in SQL).
- order_by(*cols[, ascending]): Sorts a DataFrame by the specified expressions (similar to ORDER BY in SQL).
- pivot(pivot_col[, values, default_on_null]): Rotates this DataFrame by turning the unique values from one column in the input expression into multiple columns and aggregating results where required on any remaining column values.
- printSchema([level]): Prints the schema of a DataFrame in tree format.
- print_schema([level]): Prints the schema of a DataFrame in tree format.
- randomSplit(weights[, seed, statement_params]): Randomly splits the current DataFrame into separate DataFrames, using the specified weights.
- random_split(weights[, seed, statement_params]): Randomly splits the current DataFrame into separate DataFrames, using the specified weights.
- rename(col_or_mapper[, new_column]): Returns a DataFrame with the specified column col_or_mapper renamed as new_column.
- replace(to_replace[, value, subset, ...]): Returns a new DataFrame that replaces values in the specified columns.
- rollup(*cols): Performs a SQL GROUP BY ROLLUP.
- sample([frac, n]): Samples rows based on either the number of rows to be returned or a percentage of rows to be returned.
- sampleBy(col, fractions[, seed]): Returns a DataFrame containing a stratified sample without replacement, based on a dict that specifies the fraction for each stratum.
- sample_by(col, fractions[, seed]): Returns a DataFrame containing a stratified sample without replacement, based on a dict that specifies the fraction for each stratum.
- select(*cols): Returns a new DataFrame with the specified Column expressions as output (similar to SELECT in SQL).
- selectExpr(*exprs): Projects a set of SQL expressions and returns a new DataFrame.
- select_expr(*exprs): Projects a set of SQL expressions and returns a new DataFrame.
- show([n, max_width, statement_params]): Evaluates this DataFrame and prints out the first n rows with the specified maximum number of characters per column.
- sort(*cols[, ascending]): Sorts a DataFrame by the specified expressions (similar to ORDER BY in SQL).
- subtract(other): Returns a new DataFrame that contains all the rows from the current DataFrame except for the rows that also appear in the other DataFrame.
- take([n, statement_params, block]): Executes the query representing this DataFrame and returns the first n rows of the results.
- toDF(*names): Creates a new DataFrame containing columns with the specified names.
- toLocalIterator(*[, statement_params, ...]): Executes the query representing this DataFrame and returns an iterator of Row objects that you can use to retrieve the results.
- toPandas(*[, statement_params, block]): Executes the query representing this DataFrame and returns the result as a pandas DataFrame.
- to_arrow(*[, statement_params, block]): Executes the query representing this DataFrame and returns the result as a pyarrow Table (https://arrow.apache.org/docs/python/generated/pyarrow.Table.html).
- to_arrow_batches(*[, statement_params, block]): Executes the query representing this DataFrame and returns an iterator of pyarrow Tables (containing a subset of rows) that you can use to retrieve the results.
- to_df(*names): Creates a new DataFrame containing columns with the specified names.
- to_local_iterator(*[, statement_params, ...]): Executes the query representing this DataFrame and returns an iterator of Row objects that you can use to retrieve the results.
- to_pandas(*[, statement_params, block]): Executes the query representing this DataFrame and returns the result as a pandas DataFrame.
- to_pandas_batches(*[, statement_params, block]): Executes the query representing this DataFrame and returns an iterator of pandas dataframes (containing a subset of rows) that you can use to retrieve the results.
- to_snowpark_pandas([index_col, columns, ...]): Converts the Snowpark DataFrame to a Snowpark pandas DataFrame.
- union(other): Returns a new DataFrame that contains all the rows in the current DataFrame and another DataFrame (other), excluding any duplicate rows.
- unionAll(other): Returns a new DataFrame that contains all the rows in the current DataFrame and another DataFrame (other), including any duplicate rows.
- unionAllByName(other[, allow_missing_columns]): Returns a new DataFrame that contains all the rows in the current DataFrame and another DataFrame (other), including any duplicate rows.
- unionByName(other[, allow_missing_columns]): Returns a new DataFrame that contains all the rows in the current DataFrame and another DataFrame (other), excluding any duplicate rows.
- union_all(other): Returns a new DataFrame that contains all the rows in the current DataFrame and another DataFrame (other), including any duplicate rows.
- union_all_by_name(other[, allow_missing_columns]): Returns a new DataFrame that contains all the rows in the current DataFrame and another DataFrame (other), including any duplicate rows.
- union_by_name(other[, allow_missing_columns]): Returns a new DataFrame that contains all the rows in the current DataFrame and another DataFrame (other), excluding any duplicate rows.
- unpivot(value_column, name_column, column_list): Rotates a table by transforming columns into rows.
- where(expr): Filters rows based on the specified conditional expression (similar to WHERE in SQL).
- withColumn(col_name, col, *[, ...]): Returns a DataFrame with an additional column with the specified name col_name.
- withColumnRenamed(existing, new): Returns a DataFrame with the specified column existing renamed as new.
- with_column(col_name, col, *[, ...]): Returns a DataFrame with an additional column with the specified name col_name.
- with_column_renamed(existing, new): Returns a DataFrame with the specified column existing renamed as new.
- with_columns(col_names, values, *[, ...]): Returns a DataFrame with additional columns with the specified names col_names.

Attributes
- ai: Returns a DataFrameAIFunctions object that provides AI-powered functions for the DataFrame.
- analytics: Returns a DataFrameAnalyticsFunctions object that provides analytics functions for the DataFrame.
- columns: Returns all column names as a list.
- dtypes: Returns a list of (column name, data type) pairs for the columns in this DataFrame.
- na: Returns a DataFrameNaFunctions object that provides functions for handling missing values in the DataFrame.
- queries: Returns a dict that contains a list of queries that will be executed to evaluate this DataFrame with the key queries, and a list of post-execution actions (e.g., queries to clean up temporary objects) with the key post_actions.
- schema: The definition of the columns in this DataFrame (the "relational schema" for the DataFrame).
- session: Returns a snowflake.snowpark.Session object that provides access to the session the current DataFrame is relying on.
- write: Returns a new DataFrameWriter object that you can use to write the data in the DataFrame to a Snowflake database or a stage location.
- is_cached: Whether the dataframe is cached.
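For instance (a minimal sketch, assuming an active session and the prices table from the examples above), the queries attribute lets you inspect the SQL that an action would run, without executing anything:

>>> df = session.table("prices")
>>> sql_statements = df.queries["queries"]           # SQL statements that evaluate this DataFrame
>>> cleanup_statements = df.queries["post_actions"]  # e.g., queries that clean up temporary objects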