Designing Python UDFs

This topic helps you design Python UDFs.

In this Topic:

Note

There is a Python UDF batch API, which enables defining Python functions that receive batches of input rows as Pandas DataFrames and return batches of results as Pandas arrays or Series. The batch interface results in much better performance with machine learning inference scenarios. For more information, see Python UDF Batch API.

Choosing Your Data Types

Before you write your code:

  • Choose the data types your function should accept as arguments and the data type your function should return.

  • Take into account time-zone related issues.

  • Decide how to handle NULL values.

SQL-Python Data Type Mappings for Parameters and Return Types

The table below shows the type mappings between SQL and Python. These mappings generally apply to both the arguments passed to the Python UDF and the values returned from the UDF.

SQL Type

Python Type

Notes

NUMBER

int or decimal.Decimal

If the scale of the NUMBER type is 0 then the int Python type is used. Otherwise decimal.Decimal type is used.

FLOAT

float

Floating point operations can have small rounding errors, which can accumulate, especially when aggregate functions process large numbers of rows. Rounding errors can vary each time a query is executed if the rows are processed in a different order. For more information, see Numeric Data Types: Float.

VARCHAR

str

BINARY

bytes

BOOLEAN

bool

DATE

datetime.date

TIME

datetime.time

Although Snowflake can store time values with nanosecond precision, the Python datetime.time type maintains only millisecond precision. Conversion between Snowflake and Python data types can reduce effective precision to milliseconds.

TIMESTAMP_LTZ

datetime.datetime

Use local timezone to convert internal UTC time to local “naive” datetime. Requires “naive” datetime as return type.

TIMESTAMP_NTZ

datetime.datetime

Directly convert to “naive” datetime. Requires “naive” datetime as return type.

TIMESTAMP_TZ

datetime.datetime

Convert to “aware” datetime with timezone information. Requires “aware” datetime as return type.

VARIANT

dict, list, int, float, str, or bool

Each variant row is converted to a Python type dynamically for arguments and vice versa for return values. The following types are converted to strings rather than native Python types: decimal, binary, date, time, timestamp_ltz, timestamp_ntz, timestamp_tz. When a Python data type is converted to VARIANT, if there is any embedded Python decimal data, the embedded Python decimal will be converted to a String in the VARIANT.

OBJECT

dict

When a Python data type is converted to OBJECT, if there is any embedded Python decimal data, the embedded Python decimal will be converted to a String in the OBJECT.

ARRAY

list

When a Python data type is converted to ARRAY, if there is any embedded Python decimal data, the embedded Python decimal will be converted to a String in the ARRAY.

GEOGRAPHY

dict

Formats the geography as GeoJSON and then it is converted to a Python dict.

TIMESTAMP_LTZ Values and Time Zones

A Python UDF is largely isolated from the environment in which it is called. However, the timezone is inherited from the calling environment. If the caller’s session set a default time zone before calling the Python UDF, then the Python UDF has the same default time zone. For more information about timezones, see TIMEZONE.

NULL Values

For all Snowflake types except Variant, a SQL NULL argument to a Python UDF translates to the Python None value and a returned Python None value translates back to SQL NULL.

A Variant type value can be: SQL NULL or a VARIANT JSON null. For information about Snowflake VARIANT NULL, see NULL Values.

  • A VARIANT JSON null is translated to Python None.

  • A SQL NULL is translated to a Python object, which has the is_sql_null attribute.

For an example, see NULL Handling in Python UDFs.

Designing Python UDFs that Stay Within Snowflake-Imposed Constraints

To ensure stability within the Snowflake environment, Snowflake places the following constraints on Python UDFs. Unless stated otherwise, these limitations are enforced when the UDF is executed, not when it is created.

Training machine learning (ML) models can sometimes be very resource intensive. Snowpark-optimized warehouses are a type of Snowflake virtual warehouse that can be used for workloads that require a large amount of memory and compute resources. For information on machine learning models and Snowpark Python, see Training Machine Learning Models with Snowpark Python.

Memory

Avoid consuming too much memory.

  • Large data values can consume a large amount of memory.

  • Excessive stack depth can consume a large amount of memory.

UDFs return an error if they consume too much memory. The specific limit is subject to change.

If UDFs fail due to consuming too much memory, consider using Snowpark-optimized Warehouses.

Time

Avoid algorithms that take a large amount of time per call.

If a UDF takes too long to complete, Snowflake kills the SQL statement and returns an error to the user. This limits the impact and cost of errors such as infinite loops.

Designing the Module

When a SQL statement calls your Python UDF, Snowflake calls a Python function you have written. Your Python function is called a “handler function”, or “handler” for short. The handler is a function implemented inside a user-supplied module.

As with any Python function, your function must be declared as part of a module.

The handler is called once for each row passed to the Python UDF. The module that contains the function is not re-imported for each row. Snowflake can call the same module’s handler function more than once.

To optimize execution of your code, Snowflake assumes that initialization might be slow, while execution of the handler function is fast. Snowflake sets a longer timeout for executing initialization (including the time to load your UDF and the time to initialize the module) than for executing the handler (the time to call your handler with one row of input).

Additional information about designing the module is in Creating Python UDFs.

Optimizing Initialization and Controlling Global State in Scalar UDFs

Most scalar UDFs should follow the guidelines below:

  • If you need to initialize shared state that does not change across rows, initialize it in the module instead of the handler function.

  • Write your handler function to be thread safe.

  • Avoid storing and sharing dynamic state across rows.

If your UDF cannot follow these guidelines, be aware that Snowflake expects scalar UDFs to be processed independently. Relying on state shared between invocations can result in unexpected behavior, as the system can process rows in any order and spread those invocations across several instances. In addition, there can be multiple executions of the same handler function within the same Python interpreter on multiple threads.

UDFs should avoid relying on shared state across calls to the handler function. However, there are two situations in which you might want a UDF to store shared state:

  • Code that contains expensive initialization logic that you do not want to repeat for each row.

  • Code that leverages shared state across rows, such as a cache.

When it’s necessary to maintain global state that will be shared across handler invocations, you must protect global state against data races by using the synchronization primitives described in threading - Thread-based parallelism.

Optimizing for Scale and Performance

Use the Python Batch API with Data Science Libraries

When your code will use machine learning or data science libraries, use the Python UDF batch API. With the batch API, you can define Python functions that receive input rows in batches on which these libraries are optimized to operate.

For more information, see Python UDF Batch API.

Write Single-Threaded UDF Handlers

Write UDF handlers that are single-threaded. Snowflake will handle partitioning the data and scaling the UDF across the virtual warehouse compute resources.

Put Expensive Initialization in the Module

Put expensive initialization code into the module scope. There, it will be performed once when the UDF is initialized. Avoid rerunning the expensive initialization code on every UDF handler invocation.

Handling Errors

A Python function used as a UDF can use the normal Python exception-handling techniques to catch errors within the function.

If an exception occurs inside the function and is not caught by the function, Snowflake raises an error that includes the stack trace for the exception.

You can explicitly throw an exception without catching it in order to end the query and produce a SQL error. For example:

if (x < 0):
  raise ValueError("x must be non-negative.");

When debugging, you can include values in the SQL error message text. To do so, place an entire Python function body in a try-catch block; append argument values to the caught error’s message; and throw an exception with the extended message. To avoid revealing sensitive data, remove argument values prior to deploying to a production environment.

Following Good Security Practices

  • Your function (and any library functions that you call) must act as a pure function, acting only on the data it receives and returning a value based on that data, without causing side-effects. Your code should not attempt to affect the state of the underlying system, other than consuming a reasonable amount of memory and processor time.

  • Python UDFs are executed within a restricted engine. Neither your code nor the code in library functions that you use should employ any prohibited system calls, including:

    • Process control. For example, you cannot fork a process. (However, you can use multiple threads.)

    • File system access.

      With the following exceptions, Python UDFs should not read or write files:

      • Python UDFs can read files specified in the imports clause of the CREATE FUNCTION command.

        For more information, see Reading Files with a UDF Handler. For an example, see Loading a File from a Stage into a Python UDF.

      • Python UDFs can write files, such as log files, to the /tmp directory.

        Each query gets its own memory-backed file system in which its own /tmp is stored, so different queries cannot have file name conflicts. However, conflicts within a query are possible if a single query calls more than one UDF, and those UDFs try to write to the same file name. Also, because Python UDFs may execute in separate worker processes in parallel, you should be careful about writing into the /tmp directory.

        For more on writing files, see Writing Files with a UDF Handler. For an example, see Unzipping a Staged File.

    • Network access.

      Note

      Because your code cannot access the network directly or indirectly, you cannot use the code in the Snowflake Python Connector to access the database. Your UDF cannot itself act as a client of Snowflake.

Back to top