snowflake.snowpark.dataframe.map

snowflake.snowpark.dataframe.map(dataframe: DataFrame, func: Callable, output_types: List[StructType], *, output_column_names: Optional[List[str]] = None, imports: Optional[List[Union[str, Tuple[str, str]]]] = None, packages: Optional[List[Union[str, module]]] = None, immutable: bool = False, partition_by: Optional[Union[Column, str, List[Union[Column, str]]]] = None, vectorized: bool = False, max_batch_size: Optional[int] = None)

Returns a new DataFrame with the result of applying func to each of the rows of the specified DataFrame.

This function registers a temporary UDTF and uses it to apply func to every row of the given DataFrame, returning the result as a new DataFrame.

Parameters:
  • dataframe – The DataFrame instance.

  • func – A function to be applied to every row of the DataFrame.

  • output_types – A list of types for the values generated by func.

  • output_column_names – A list of names to be assigned to the resulting columns.

  • imports – A list of imports that are required to run the function. This argument is passed on when registering the UDTF.

  • packages – A list of packages that are required to run the function. This argument is passed on when registering the UDTF.

  • immutable – A flag to specify whether the result of func is deterministic for the same input.

  • partition_by – The partitioning column(s) for the UDTF.

  • vectorized – A flag to specify whether the UDTF should be vectorized. See vectorized UDTFs.

  • max_batch_size – The maximum number of rows per input pandas DataFrame when the vectorized option is used (a brief sketch follows this list).
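
The immutable and max_batch_size arguments are not exercised in the examples below. The following is a minimal sketch of how they might be passed alongside vectorized, assuming an active session object as in the examples; the DataFrame name small_df is illustrative only, and the output is omitted here:

>>> from snowflake.snowpark.types import IntegerType
>>> from snowflake.snowpark.dataframe import map
>>> small_df = session.create_dataframe([[1], [2], [3]], schema=["col1"])
>>> # Cap each input pandas DataFrame at 2 rows and declare func deterministic.
>>> doubled = map(
...     small_df,
...     lambda pdf: pdf['COL1'] * 2,
...     output_types=[IntegerType()],
...     vectorized=True,
...     max_batch_size=2,
...     immutable=True,
...     packages=["pandas"],
... )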

Example 1:

>>> from snowflake.snowpark.types import IntegerType, StringType
>>> from snowflake.snowpark.dataframe import map
>>> import pandas as pd
>>> df = session.create_dataframe([[10, "a", 22], [20, "b", 22]], schema=["col1", "col2", "col3"])
>>> new_df = map(df, lambda row: row[0] * row[0], output_types=[IntegerType()])
>>> new_df.order_by("c_1").show()
---------
|"C_1"  |
---------
|100    |
|400    |
---------

Example 2:

>>> new_df = map(df, lambda row: (row[1], row[0] * 3), output_types=[StringType(), IntegerType()])
>>> new_df.order_by("c_1").show()
-----------------
|"C_1"  |"C_2"  |
-----------------
|a      |30     |
|b      |60     |
-----------------

Example 3:

>>> new_df = map(
...     df,
...     lambda row: (row[1], row[0] * 3),
...     output_types=[StringType(), IntegerType()],
...     output_column_names=['col1', 'col2']
... )
>>> new_df.order_by("col1").show()
-------------------
|"COL1"  |"COL2"  |
-------------------
|a       |30      |
|b       |60      |
-------------------

Example 4:

>>> new_df = map(df, lambda pdf: pdf['COL1']*3, output_types=[IntegerType()], vectorized=True, packages=["pandas"])
>>> new_df.order_by("c_1").show()
---------
|"C_1"  |
---------
|30     |
|60     |
---------

Example 5:

>>> new_df = map(
...     df,
...     lambda pdf: (pdf['COL1']*3, pdf['COL2']+"b"),
...     output_types=[IntegerType(), StringType()],
...     output_column_names=['A', 'B'],
...     vectorized=True,
...     packages=["pandas"],
... )
>>> new_df.order_by("A").show()
-------------
|"A"  |"B"  |
-------------
|30   |ab   |
|60   |bb   |
-------------

Example 6:

>>> new_df = map(
...     df,
...     lambda pdf: ((pdf.shape[0],) * len(pdf), (pdf.shape[1],) * len(pdf)),
...     output_types=[IntegerType(), IntegerType()],
...     output_column_names=['rows', 'cols'],
...     partition_by="col3",
...     vectorized=True,
...     packages=["pandas"],
... )
>>> new_df.show()
-------------------
|"ROWS"  |"COLS"  |
-------------------
|2       |3       |
|2       |3       |
-------------------

Note

1. The result of func must be either a scalar value or a tuple containing the same number of elements as specified in the output_types argument.

2. When using the vectorized option, func must accept a pandas DataFrame as input and return either a pandas DataFrame or a tuple of pandas Series/arrays.
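
None of the examples above returns a pandas DataFrame from a vectorized func. The following is a sketch of that variant of Example 5, based on point 2 of this note, assuming the same df and session as in the examples and that the resulting column names come from output_column_names as described above; the output is omitted here:

>>> import pandas as pd
>>> new_df = map(
...     df,
...     lambda pdf: pd.DataFrame({"A": pdf['COL1'] * 3, "B": pdf['COL2'] + "b"}),
...     output_types=[IntegerType(), StringType()],
...     output_column_names=['A', 'B'],
...     vectorized=True,
...     packages=["pandas"],
... )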