You are viewing documentation about an older version (1.44.0).

snowflake.snowpark.dataframe.map

snowflake.snowpark.dataframe.map(dataframe: DataFrame, func: Callable, output_types: List[StructType], *, output_column_names: Optional[List[str]] = None, imports: Optional[List[Union[str, Tuple[str, str]]]] = None, packages: Optional[List[Union[str, module]]] = None, immutable: bool = False, partition_by: Optional[Union[Column, str, List[Union[Column, str]]]] = None, vectorized: bool = False, max_batch_size: Optional[int] = None)

Returns a new DataFrame with the result of applying func to each of the rows of the specified DataFrame.

To do so, this function registers a temporary UDTF that wraps func and uses it to produce the new DataFrame.

Parameters:

dataframe – The DataFrame instance.
func – A function to be applied to every row of the DataFrame.
output_types – A list of data types, one for each value that func returns.
output_column_names – A list of names to be assigned to the resulting columns.
imports – A list of imports that are required to run the function. This argument is passed on when registering the UDTF.
packages – A list of packages that are required to run the function. This argument is passed on when registering the UDTF.
immutable – A flag that specifies whether func is deterministic, that is, whether it always returns the same result for the same input.
partition_by – The column or columns by which to partition the input for the UDTF.
vectorized – A flag that determines whether the UDTF should be vectorized. See vectorized UDTFs.
max_batch_size – The maximum number of rows in each input pandas DataFrame when the vectorized option is used.
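
As a rough mental model of the row-wise (non-vectorized) behavior, the sketch below emulates it locally in plain Python. The helper local_map and its num_outputs parameter are hypothetical and are not part of Snowpark; the default column names C_1, C_2, ... are inferred from the example output on this page. The sketch assumes func returns a tuple and does not model UDTF registration, types, imports, or packages.

```python
from typing import Any, Callable, List, Optional, Sequence, Tuple


def local_map(
    rows: Sequence[Sequence[Any]],
    func: Callable[[Sequence[Any]], Tuple[Any, ...]],
    num_outputs: int,
    output_column_names: Optional[List[str]] = None,
) -> Tuple[List[str], List[Tuple[Any, ...]]]:
    """Hypothetical local stand-in for row-wise map; not Snowpark code."""
    if output_column_names:
        names = list(output_column_names)
    else:
        # Assumption: defaults follow the C_1, C_2, ... pattern seen in the examples.
        names = [f"C_{i}" for i in range(1, num_outputs + 1)]
    # Apply func to every input row and collect one output row per input row.
    return names, [tuple(func(row)) for row in rows]


rows = [(10, "a", 22), (20, "b", 22)]
names, out = local_map(rows, lambda row: (row[1], row[0] * 3), num_outputs=2)
print(names)  # ['C_1', 'C_2']
print(out)    # [('a', 30), ('b', 60)]
```

Passing output_column_names=["col1", "col2"] to the helper replaces the generated names, which mirrors what the output_column_names argument does for the real function.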

Example 1:

>>> from snowflake.snowpark.types import IntegerType, StringType
>>> from snowflake.snowpark.dataframe import map
>>> import pandas as pd
>>> df = session.create_dataframe([[10, "a", 22], [20, "b", 22]], schema=["col1", "col2", "col3"])
>>> new_df = map(df, lambda row: row[0] * row[0], output_types=[IntegerType()])
>>> new_df.order_by("c_1").show()
---------
|"C_1"  |
---------
|100    |
|400    |
---------

Example 2:

>>> new_df = map(df, lambda row: (row[1], row[0] * 3), output_types=[StringType(), IntegerType()])
>>> new_df.order_by("c_1").show()
-----------------
|"C_1"  |"C_2"  |
-----------------
|a      |30     |
|b      |60     |
-----------------

Example 3:

>>> new_df = map(
...     df,
...     lambda row: (row[1], row[0] * 3),
...     output_types=[StringType(), IntegerType()],
...     output_column_names=['col1', 'col2']
... )
>>> new_df.order_by("col1").show()
-------------------
|"COL1"  |"COL2"  |
-------------------
|a       |30      |
|b       |60      |
-------------------

Example 4:

>>> new_df = map(df, lambda pdf: pdf['COL1']*3, output_types=[IntegerType()], vectorized=True, packages=["pandas"])
>>> new_df.order_by("c_1").show()
---------
|"C_1"  |
---------
|30     |
|60     |
---------

Example 5:

>>> new_df = map(
...     df,
...     lambda pdf: (pdf['COL1']*3, pdf['COL2']+"b"),
...     output_types=[IntegerType(), StringType()],
...     output_column_names=['A', 'B'],
...     vectorized=True,
...     packages=["pandas"],
... )
>>> new_df.order_by("A").show()
-------------
|"A"  |"B"  |
-------------
|30   |ab   |
|60   |bb   |
-------------

Example 6:

>>> new_df = map(
...     df,
...     lambda pdf: ((pdf.shape[0],) * len(pdf), (pdf.shape[1],) * len(pdf)),
...     output_types=[IntegerType(), IntegerType()],
...     output_column_names=['rows', 'cols'],
...     partition_by="col3",
...     vectorized=True,
...     packages=["pandas"],
... )
>>> new_df.show()
-------------------
|"ROWS"  |"COLS"  |
-------------------
|2       |3       |
|2       |3       |
-------------------
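
The effect of partition_by in Example 6 can be pictured with a plain-Python sketch. Here lists of tuples stand in for the pandas DataFrames that a vectorized func receives, and the helper names apply_per_partition and shape_of are hypothetical. The sketch assumes each partition reaches func as a single batch; it does not model max_batch_size or how Snowflake actually schedules the work.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List, Sequence, Tuple

Row = Tuple[Any, ...]


def apply_per_partition(
    rows: Sequence[Row],
    key_index: int,
    func: Callable[[List[Row]], List[Row]],
) -> List[Row]:
    """Group rows by one column, then call func once per group (illustrative only)."""
    partitions: Dict[Any, List[Row]] = defaultdict(list)
    for row in rows:
        partitions[row[key_index]].append(row)
    output: List[Row] = []
    for batch in partitions.values():
        output.extend(func(batch))
    return output


def shape_of(batch: List[Row]) -> List[Row]:
    # Emit (row count, column count) once for every row in the batch.
    n_rows, n_cols = len(batch), len(batch[0])
    return [(n_rows, n_cols)] * n_rows


rows = [(10, "a", 22), (20, "b", 22)]
shapes = apply_per_partition(rows, key_index=2, func=shape_of)
print(shapes)  # [(2, 3), (2, 3)]
```

Both input rows share the value 22 in the third column, so they fall into one partition of two rows and three columns, which is why every output row reads (2, 3).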

Note

1. The result of the func function must be either a scalar value or a tuple containing the same number of elements as specified in the output_types argument.

2. When using the vectorized option, the func function must accept a pandas DataFrame as input and return either a pandas DataFrame, or a tuple of pandas Series/arrays.
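
The first note can be expressed as a small check. The helper normalize_result below is hypothetical and only illustrates the rule for the non-vectorized case; it is not how Snowpark validates results, and the error it raises is an assumption made for the sketch.

```python
from typing import Any, Tuple


def normalize_result(result: Any, num_outputs: int) -> Tuple[Any, ...]:
    """Wrap a scalar in a 1-tuple and check the length against num_outputs."""
    values = result if isinstance(result, tuple) else (result,)
    if len(values) != num_outputs:
        raise ValueError(
            f"func returned {len(values)} value(s) but output_types has {num_outputs}"
        )
    return values


print(normalize_result(100, 1))        # (100,)
print(normalize_result(("a", 30), 2))  # ('a', 30)
```

With this rule, a func that returns ("a", 30) while output_types lists a single type would be rejected, because the tuple length and the number of output types disagree.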