snowflake.snowpark.dataframe.map
- snowflake.snowpark.dataframe.map(dataframe: DataFrame, func: Callable, output_types: List[StructType], *, output_column_names: Optional[List[str]] = None, imports: Optional[List[Union[str, Tuple[str, str]]]] = None, packages: Optional[List[Union[str, module]]] = None, immutable: bool = False, partition_by: Optional[Union[Column, str, List[Union[Column, str]]]] = None, vectorized: bool = False, max_batch_size: Optional[int] = None)
Returns a new DataFrame containing the result of applying func to each row of the specified DataFrame.
To do this, this function registers a temporary UDTF and applies it to every row of the given DataFrame.
- Parameters:
dataframe – The DataFrame instance.
func – A function to be applied to every row of the DataFrame.
output_types – A list of the types of the values generated by func. The length of this list determines the number of output columns.
output_column_names – A list of names to assign to the resulting columns. If not provided, the columns are named c_1, c_2, and so on.
imports – A list of imports that are required to run the function. This argument is passed on when registering the UDTF.
packages – A list of packages that are required to run the function. This argument is passed on when registering the UDTF.
immutable – A flag specifying whether func is deterministic, i.e., always returns the same result for the same input.
partition_by – Specify the partitioning column(s) for the UDTF.
vectorized – A flag to determine if the UDTF process should be vectorized. See vectorized UDTFs.
max_batch_size – The maximum number of rows per input pandas DataFrame when using the vectorized option.
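The row-wise semantics described above can be sketched in plain Python without a Snowflake connection. The helper name simulate_map below is hypothetical (not part of the Snowpark API); it only illustrates how func is applied per row and how a scalar result is treated as a single output column.

```python
# Hypothetical sketch of map()'s row-wise semantics; simulate_map is
# illustrative only, not a Snowpark function.
def simulate_map(rows, func, n_output_cols):
    """Apply func to each row; normalize scalar results to 1-tuples."""
    out = []
    for row in rows:
        result = func(row)
        if not isinstance(result, tuple):
            # A scalar result maps to a single output column.
            result = (result,)
        # The result must match the number of entries in output_types.
        assert len(result) == n_output_cols, "result must match output_types"
        out.append(result)
    return out

rows = [(10, "a", 22), (20, "b", 22)]
# Mirrors Example 2 below: return (col2, col1 * 3) per row.
print(simulate_map(rows, lambda r: (r[1], r[0] * 3), 2))
# [('a', 30), ('b', 60)]
```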
Example 1:
>>> from snowflake.snowpark.types import IntegerType
>>> from snowflake.snowpark.dataframe import map
>>> import pandas as pd
>>> df = session.create_dataframe([[10, "a", 22], [20, "b", 22]], schema=["col1", "col2", "col3"])
>>> new_df = map(df, lambda row: row[0] * row[0], output_types=[IntegerType()])
>>> new_df.order_by("c_1").show()
---------
|"C_1"  |
---------
|100    |
|400    |
---------
Example 2:
>>> from snowflake.snowpark.types import StringType
>>> new_df = map(df, lambda row: (row[1], row[0] * 3), output_types=[StringType(), IntegerType()])
>>> new_df.order_by("c_1").show()
-----------------
|"C_1"  |"C_2"  |
-----------------
|a      |30     |
|b      |60     |
-----------------
Example 3:
>>> new_df = map(
...     df,
...     lambda row: (row[1], row[0] * 3),
...     output_types=[StringType(), IntegerType()],
...     output_column_names=['col1', 'col2']
... )
>>> new_df.order_by("col1").show()
-------------------
|"COL1"  |"COL2"  |
-------------------
|a       |30      |
|b       |60      |
-------------------
Example 4:
>>> new_df = map(df, lambda pdf: pdf['COL1']*3, output_types=[IntegerType()], vectorized=True, packages=["pandas"])
>>> new_df.order_by("c_1").show()
---------
|"C_1"  |
---------
|30     |
|60     |
---------
Example 5:
>>> new_df = map(
...     df,
...     lambda pdf: (pdf['COL1']*3, pdf['COL2']+"b"),
...     output_types=[IntegerType(), StringType()],
...     output_column_names=['A', 'B'],
...     vectorized=True,
...     packages=["pandas"],
... )
>>> new_df.order_by("A").show()
-------------
|"A"  |"B"  |
-------------
|30   |ab   |
|60   |bb   |
-------------
Example 6:
>>> new_df = map(
...     df,
...     lambda pdf: ((pdf.shape[0],) * len(pdf), (pdf.shape[1],) * len(pdf)),
...     output_types=[IntegerType(), IntegerType()],
...     output_column_names=['rows', 'cols'],
...     partition_by="col3",
...     vectorized=True,
...     packages=["pandas"],
... )
>>> new_df.show()
-------------------
|"ROWS"  |"COLS"  |
-------------------
|2       |3       |
|2       |3       |
-------------------
Note
1. The result of the func function must be either a scalar value or a tuple containing the same number of elements as specified in the output_types argument.
2. When using the vectorized option, the func function must accept a pandas DataFrame as input and return either a pandas DataFrame, or a tuple of pandas Series/arrays.
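The vectorized, partitioned behavior in Note 2 (and in Example 6) can also be sketched without Snowflake or pandas, using plain lists in place of pandas DataFrames. The helper name simulate_vectorized_map is hypothetical; it shows only the batching shape: rows are grouped by the partition key, func is called once per batch, and the returned column tuples are fanned back out into rows.

```python
# Hypothetical sketch of vectorized + partition_by semantics; lists stand
# in for pandas DataFrames, and simulate_vectorized_map is not a Snowpark API.
from collections import defaultdict

def simulate_vectorized_map(rows, key_index, func):
    """Group rows by a partition key, then call func once per batch."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[key_index]].append(row)
    out = []
    for batch in partitions.values():
        n_rows, n_cols = len(batch), len(batch[0])
        # func returns one tuple of column values per batch;
        # zip(*cols) turns the columns back into per-row tuples.
        cols = func(n_rows, n_cols)
        out.extend(zip(*cols))
    return out

rows = [(10, "a", 22), (20, "b", 22)]
# Like Example 6: emit (row_count, col_count) for every row in the batch.
print(simulate_vectorized_map(rows, 2, lambda r, c: ((r,) * r, (c,) * r)))
# [(2, 3), (2, 3)]
```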