snowflake.snowpark.RelationalGroupedDataFrame.apply_in_pandas¶

RelationalGroupedDataFrame.apply_in_pandas(func: Callable, output_schema: StructType, **kwargs) → DataFrame[source]¶

Maps each grouped dataframe in to a pandas.DataFrame, applies the given function on data of each grouped dataframe, and returns a pandas.DataFrame. Internally, a vectorized UDTF with input func argument as the end_partition is registered and called. Additional kwargs are accepted to specify arguments to register the UDTF. Group by clause used must be column reference, not a general expression.

Depends on pandas being installed in the environment and declared as a dependency using add_packages() or via kwargs["packages"].

Parameters:

func – A Python native function that accepts a single input argument - a pandas.DataFrame object and returns a pandas.Dataframe. It is used as input to end_partition in a vectorized UDTF.
output_schema – A StructType instance that represents the table function’s output columns.
kwargs – Additional arguments to register the vectorized UDTF. See register() for all options.

Examples::

Call apply_in_pandas using temporary UDTF:

>>> import pandas as pd
>>> from snowflake.snowpark.types import StructType, StructField, StringType, FloatType
>>> def convert(pandas_df):
...     pandas_df.columns = ['location', 'temp_c']
...     return pandas_df.assign(temp_f = lambda x: x.temp_c * 9 / 5 + 32)
...
>>> df = session.createDataFrame([('SF', 21.0), ('SF', 17.5), ('SF', 24.0), ('NY', 30.9), ('NY', 33.6)],
...         schema=['location', 'temp_c'])
>>> df.group_by("location").apply_in_pandas(convert,
...     output_schema=StructType([StructField("location", StringType()),
...                               StructField("temp_c", FloatType()),
...                               StructField("temp_f", FloatType())])).order_by("temp_c").show()
---------------------------------------------
|"LOCATION"  |"TEMP_C"  |"TEMP_F"           |
---------------------------------------------
|SF          |17.5      |63.5               |
|SF          |21.0      |69.8               |
|SF          |24.0      |75.2               |
|NY          |30.9      |87.61999999999999  |
|NY          |33.6      |92.48              |
---------------------------------------------

Call apply_in_pandas using permanent UDTF with replacing original UDTF:

>>> from snowflake.snowpark.types import IntegerType, DoubleType
>>> _ = session.sql("create or replace temp stage mystage").collect()
>>> def group_sum(pdf):
...     pdf.columns = ['grade', 'division', 'value']
...     return pd.DataFrame([(pdf.grade.iloc[0], pdf.division.iloc[0], pdf.value.sum(), )])
...
>>> df = session.createDataFrame([('A', 2, 11.0), ('A', 2, 13.9), ('B', 5, 5.0), ('B', 2, 12.1)],
...                              schema=["grade", "division", "value"])
>>> df.group_by([df.grade, df.division] ).applyInPandas(
...     group_sum, output_schema=StructType([StructField("grade", StringType()),
...                                        StructField("division", IntegerType()),
...                                        StructField("sum", DoubleType())]),
...                is_permanent=True, stage_location="@mystage", name="group_sum_in_pandas", replace=True
...            ).order_by("sum").show()
--------------------------------
|"GRADE"  |"DIVISION"  |"SUM"  |
--------------------------------
|B        |5           |5.0    |
|B        |2           |12.1   |
|A        |2           |24.9   |
--------------------------------