Python UDF Batch API¶
This topic introduces the Python UDF Batch API.
In this Topic:
Overview¶
The Python UDF batch API enables defining Python functions that receive batches of input rows as Pandas DataFrames and return batches of results as Pandas arrays or Series. You call Python UDFs that use the batch API the same way you call other Python UDFs.
Advantages of using the batch API compared to the default row-by-row processing pattern include:
The potential for better performance if your Python code operates efficiently on batches of rows.
Less transformation logic required if you are calling into libraries that operate on Pandas DataFrames or Pandas arrays.
When you use the batch API:
You do not need to change how you write queries using Python UDFs. All batching is handled by the UDF framework rather than your own code.
As with the non-batch API, there is no guarantee of which instances of your handler code will see which batches of input.
Getting Started with the Batch API¶
To create a Python UDF that uses the batch API, use one of the supported mechanisms for annotating your handler function.
Using the vectorized
Decorator¶
The _snowflake
module is exposed to Python UDFs that execute within Snowflake. In your Python code, import the _snowflake
module,
and use the vectorized
decorator to specify that your handler expects to receive a Pandas DataFrame by setting the input
parameter to pandas.DataFrame
.
create function add_one_to_inputs(x number(10, 0), y number(10, 0))
returns number(10, 0)
language python
runtime_version = 3.8
packages = ('pandas')
handler = 'add_one_to_inputs'
as $$
import pandas
from _snowflake import vectorized
@vectorized(input=pandas.DataFrame)
def add_one_to_inputs(df):
return df[0] + df[1] + 1
$$;
Using a Function Attribute¶
Rather than importing the _snowflake module and using the vectorized
decorator, you can set the special _sf_vectorized_input
attribute on your handler function to indicate that you want to use the batch API.
create function add_one_to_inputs(x number(10, 0), y number(10, 0))
returns number(10, 0)
language python
runtime_version = 3.8
packages = ('pandas')
handler = 'add_one_to_inputs'
as $$
import pandas
def add_one_to_inputs(df):
return df[0] + df[1] + 1
add_one_to_inputs._sf_vectorized_input = pandas.DataFrame
$$;
Setting a Target Batch Size¶
Calls to the Python handler function that implement the batch API must execute within a time limit,
which is 180 seconds, and each DataFrame passed as input to the handler function may currently contain
up to a few thousand rows. In order to stay within the time limit, you may want to set the target batch
size for your handler function, which imposes a maximum number of rows per input DataFrame.
Note that setting a larger value does not guarantee that Snowflake will encode batches with the specified number of rows.
You can set the target batch size using either the vectorized
decorator or an attribute on the function.
Note
Using max_batch_size
is only meant as a mechanism to limit the number of rows that UDF can handle per single batch.
For example, if the UDF is written in a way that can only process at most 100 rows at a time, then max_batch_size
should be set to 100.
Setting max_batch_size
is not meant to be used as a mechanism to specify arbitrary large batch sizes.
If the UDF is able to process batches of any size, it is recommended to leave this parameter unset.
Using the vectorized
Decorator¶
To set the target batch size using the vectorized
decorator, pass a positive integer value for the argument named max_batch_size
.
As an example, this statement creates a Python UDF that uses the batch API and limits each Dataframe to a maximum of 100 rows:
create function add_one_to_inputs(x number(10, 0), y number(10, 0))
returns number(10, 0)
language python
runtime_version = 3.8
packages = ('pandas')
handler = 'add_one_to_inputs'
as $$
import pandas
from _snowflake import vectorized
@vectorized(input=pandas.DataFrame, max_batch_size=100)
def add_one_to_inputs(df):
return df[0] + df[1] + 1
$$;
Using a Function Attribute¶
To set the target batch size using a function attribute, set a positive integer value for the _sf_max_batch_size
attribute on your handler function.
As an example, this statement creates a Python UDF that uses the batch API and limits each DataFrame to a maximum of 100 rows:
create function add_one_to_inputs(x number(10, 0), y number(10, 0))
returns number(10, 0)
language python
runtime_version = 3.8
packages = ('pandas')
handler = 'add_one_to_inputs'
as $$
import pandas
def add_one_to_inputs(df):
return df[0] + df[1] + 1
add_one_to_inputs._sf_vectorized_input = pandas.DataFrame
add_one_to_inputs._sf_max_batch_size = 100
$$;
DataFrame Encoding¶
Batches of arguments to the UDF are encoded as arrays in the input Pandas DataFrames, and the number of rows in each DataFrame may vary. For more information, see Setting a target batch size. Arguments can be accessed in the DataFrame by their index, i.e. the first argument has an index of 0, the second has an index of 1, and so on. The Pandas array or Series that the UDF handler returns must have the same length as that of the input DataFrame.
To illustrate, suppose that you define a Python UDF using the batch API as follows:
create or replace function add_inputs(x int, y float)
returns float
language python
runtime_version = 3.8
packages = ('pandas')
handler = 'add_inputs'
as $$
import pandas
from _snowflake import vectorized
@vectorized(input=pandas.DataFrame)
def add_inputs(df):
return df[0] + df[1]
$$;
This UDF uses df[0]
to access the Pandas array for the first argument, and df[1]
for the second. df[0] + df[1]
results in a Pandas array with the pairwise sums of corresponding elements from the two arrays. After creating the UDF, you might call it with some input rows:
select add_inputs(x, y)
from (
select 1 as x, 3.14::float as y union all
select 2, 1.59 union all
select 3, -0.5
);
+------------------+
| ADD_INPUTS(X, Y) |
|------------------|
| 4.14 |
| 3.59 |
| 2.5 |
+------------------+
Here the add_inputs
Python function receives a DataFrame analogous to one created with the following Python code:
>>> import pandas
>>> df = pandas.DataFrame({0: pandas.array([1, 2, 3]), 1: pandas.array([3.14, 1.59, -0.5])})
>>> df
0 1
0 1 3.14
1 2 1.59
2 3 -0.50
The line return df[0] + df[1]
in the handler function results in an array similar to the following Python code:
>>> df[0] + df[1]
0 4.14
1 3.59
2 2.50
dtype: float64
Type Support¶
The batch API supports the following SQL types for arguments and return values. The table reflects how each SQL argument is encoded as a Pandas array of a particular dtype.
SQL Type |
Pandas dtype |
Notes |
---|---|---|
NUMBER |
|
To ensure that an input argument to a UDF is interpreted as not nullable, pass a column from a table created using the |
FLOAT |
|
NULL values are encoded as NaN values. In the output, NaN values are interpreted as NULLs. |
BOOLEAN |
|
|
VARCHAR |
|
Both Snowflake SQL and Pandas represent strings using UTF-8 encoding. |
BINARY |
|
|
DATE |
|
Each value is encoded as a |
VARIANT |
|
Each variant row is converted to a Python type dynamically for arguments and vice versa for return values. The following types are converted to strings rather than native Python types: |
OBJECT |
|
|
ARRAY |
|
|
TIME |
|
Each value is encoded as an offset from midnight. NULL values are encoded as |
TIMESTAMP_LTZ |
|
Uses the local time zone to encode each value as a nanosecond-scale |
TIMESTAMP_NTZ |
|
Encodes each value as a nanosecond-scale |
TIMESTAMP_TZ |
|
Encodes each value as a nanosecond-scale |
GEOGRAPHY |
|
Formats each value as GeoJSON and then converts it to a Python |
The following types are accepted as output: Pandas Series
or array
, NumPy array
, regular
Python list
, and any iterable sequence that contains the expected types described
in Type Support. It is efficient to use Pandas Series
and array
and NumPy array
where the dtype is bool
, boolean
,
int16
, int32
, int64
, Int16
, Int32
, Int64
, or float64
because they expose their contents as memoryviews
. This means that the contents can be copied rather than each value
being read sequentially.