snowflake.snowpark.functions.pandas_udtf¶
- snowflake.snowpark.functions.pandas_udtf(handler: Optional[Callable] = None, *, output_schema: Union[StructType, List[str], PandasDataFrameType], input_types: Optional[List[DataType]] = None, name: Optional[Union[str, Iterable[str]]] = None, is_permanent: bool = False, stage_location: Optional[str] = None, imports: Optional[List[Union[str, Tuple[str, str]]]] = None, packages: Optional[List[Union[str, module]]] = None, replace: bool = False, if_not_exists: bool = False, session: Optional[Session] = None, parallel: int = 4, statement_params: Optional[Dict[str, str]] = None, strict: bool = False, secure: bool = False, external_access_integrations: Optional[List[str]] = None, secrets: Optional[Dict[str, str]] = None, immutable: bool = False) Union[UserDefinedTableFunction, partial] [source]¶
Registers a Python class as a vectorized Python UDTF and returns the UDTF.
The arguments, return value and usage of this function are exactly the same as
udtf()
, but this function can only be used for registering vectorized UDTFs. See examples inUDTFRegistration
.See also
UDTFRegistration.register()
Compared to the default row-by-row processing pattern of a normal UDTF, which sometimes is inefficient, vectorized Python UDTFs (user-defined table functions) enable seamless partition-by-partition processing by operating on partitions as Pandas DataFrames and returning results as Pandas DataFrames or lists of Pandas arrays or Pandas Series.
In addition, vectorized Python UDTFs allow for easy integration with libraries that operate on pandas DataFrames or pandas arrays.
A vectorized UDTF handler class: - defines an
end_partition
method that takes in a DataFrame argument and returns apandas.DataFrame
or a tuple ofpandas.Series
orpandas.arrays
where each array is a column. - does NOT define aprocess
method. - optionally defines a handler class with an__init__
method which will be invoked before processing each partition.You can use
udtf()
,register()
orpandas_udtf()
to create a vectorized UDTF by providing appropriate return and input types. If you would like to useregister_from_file()
to create a vectorized UDTF, you need to explicitly mark the handler method as vectorized using either the decorator @vectorized(input=pandas.DataFrame) or setting <class>.end_partition._sf_vectorized_input = pandas.DataFrameNote: A vectorized UDTF must be called with ~snowflake.snowpark.Window.partition_by to build the partitions.
- Example::
>>> from snowflake.snowpark.types import PandasSeriesType, PandasDataFrameType, IntegerType >>> class multiply: ... def __init__(self): ... self.multiplier = 10 ... def end_partition(self, df): ... df.columns = ['id', 'col1', 'col2'] ... df.col1 = df.col1*self.multiplier ... df.col2 = df.col2*self.multiplier ... yield df >>> multiply_udtf = pandas_udtf( ... multiply, ... output_schema=PandasDataFrameType([StringType(), IntegerType(), FloatType()], ["id_", "col1_", "col2_"]), ... input_types=[PandasDataFrameType([StringType(), IntegerType(), FloatType()])] ... ) >>> df = session.create_dataframe([['x', 3, 35.9],['x', 9, 20.5]], schema=["id", "col1", "col2"]) >>> df.select(multiply_udtf("id", "col1", "col2").over(partition_by=["id"])).sort("col1_").show() ----------------------------- |"ID_" |"COL1_" |"COL2_" | ----------------------------- |x |30 |359.0 | |x |90 |205.0 | -----------------------------
Example:
>>> @pandas_udtf(output_schema=PandasDataFrameType([StringType(), IntegerType(), FloatType()], ["id_", "col1_", "col2_"]), input_types=[PandasDataFrameType([StringType(), IntegerType(), FloatType()])]) ... class _multiply: ... def __init__(self): ... self.multiplier = 10 ... def end_partition(self, df): ... df.columns = ['id', 'col1', 'col2'] ... df.col1 = df.col1*self.multiplier ... df.col2 = df.col2*self.multiplier ... yield df >>> df.select(multiply_udtf("id", "col1", "col2").over(partition_by=["id"])).sort("col1_").show() ----------------------------- |"ID_" |"COL1_" |"COL2_" | ----------------------------- |x |30 |359.0 | |x |90 |205.0 | -----------------------------