Designing Java UDFs

This topic helps you design Java UDFs.

In this Topic:

Choosing Your Data Types

Before you write your code:

  • Choose the data types your function should accept as arguments and the data type your function should return.

  • Take into account time-zone related issues.

  • Decide how to handle NULL values.

SQL-Java Data Type Mappings for Parameters and Return Types

The table below shows the type mappings between SQL and Java. These mappings generally apply to both the arguments passed to the Java UDF and the values returned from the UDF. However, there are some exceptions, which are listed in footnotes.

Some SQL data types (e.g. NUMBER) are compatible with multiple Java data types (e.g. int, long, etc.). In these cases, you can use any Java data type that has enough capacity to hold the actual values that will be passed. If you pass a SQL value to an incompatible Java data type (or vice versa), Snowflake throws an error.

SQL Type

Java Type

Notes

NUMBER

short

Cannot be null. Must fit in the range of short (no fractional part, and integer part cannot exceed the max/min short values).

NUMBER

Short

Must fit in the range of short (no fractional part, and integer part cannot exceed the max/min short values).

NUMBER

int

Cannot be null. Must fit in the range of int (no fractional part, and integer part cannot exceed the max/min int values).

NUMBER

Integer

Must fit in the range of int (no fractional part, and integer part cannot exceed the max/min int values).

NUMBER

long

Cannot be null. Must fit in the range of long (no fractional part, and integer part cannot exceed the max/min long values).

NUMBER

Long

Must fit in the range of long (no fractional part, and integer part cannot exceed the max/min long values).

NUMBER

java.math.BigDecimal

NUMBER

java.math.BigInteger

Must fit into the range of BigInteger (no fractional part).

NUMBER

String

FLOAT

double

Cannot be null.

FLOAT

Double

FLOAT

float

Cannot be null. Might result in precision loss.

FLOAT

Float

Might result in precision loss.

FLOAT

String

Might result in precision loss (float -> string conversion is lossy).

VARCHAR

String

BINARY

byte[]

BINARY

String

Encodes the binary string in hexadecimal. 4

BOOLEAN

boolean

Cannot be null.

BOOLEAN

Boolean

BOOLEAN

String

4

DATE

java.sql.Date

DATE

String

Formats the date as YYYY-MM-DD. 4

TIME

java.sql.Time

3

TIME

String

Formats the time as HH:MI:SS.SSSSSSSSS where the fractional seconds part depends on the precision of the time. 3

TIMESTAMP_LTZ

java.sql.Timestamp

Must fit in the range of java.sql.Timestamp. 3

TIMESTAMP_LTZ

String

The output format is DY, DD MON YYYY HH24:MI:SS TZHTZM. 1 , 3 , 4

TIMESTAMP_NTZ

java.sql.Timestamp

Must fit in the range of java.sql.Timestamp. Treats the wallclock time as an offset from the Unix epoch (imposing a UTC time zone, effectively). 3

TIMESTAMP_NTZ

String

Treats the wallclock time as an offset from the Unix epoch (imposing a UTC time zone, effectively). The output format is DY, DD MON YYYY HH:MI:SS. 2 , 3 , 4

TIMESTAMP_TZ

java.sql.Timestamp

Must fit in the range of java.sql.Timestamp. 3

TIMESTAMP_TZ

String

The output format is DY, DD MON YYYY HH24:MI:SS TZHTZM. 1 , 3 , 4

VARIANT

String

Formats the value depending on the type that is represented. Variant null is formatted as the string “null”.

OBJECT

Map<String, String>

The map’s keys are the object’s keys, and the values are formatted as strings.

OBJECT

String

Formats the object as a JSON string (e.g. {"x": 3, "y": true}).

ARRAY

String[]

Formats the elements of the array as strings.

ARRAY

String

Formats the array as a JSON string (e.g. [1, "foo", null]).

GEOGRAPHY

String

Formats the geography as GeoJSON .

1(1,2)

The format matches the Internet (RFC) Timestamp Format DY, DD MON YYYY HH24:MI:SS TZHTZM as described in Timestamp Formats. If a timezone offset (the TZHTZM component) is present, it is typically digits (e.g. -0700 indicates 7 hours behind UTC). If the timezone offset is Z (for “Zulu”) rather than digits, that is synonymous with “+0000” (UTC).

2

The format matches the Internet (RFC) Timestamp Format DY, DD MON YYYY HH24:MI:SS as described in Timestamp Formats. If the string is followed by a space and Z (for “Zulu”), that explicitly indicates that the offset is “+0000” (UTC).

3(1,2,3,4,5,6,7,8)

Although Snowflake can store time values with nanosecond precision, the java.sql.time library maintains only millisecond precision. Conversion between Snowflake and Java data types can reduce effective precision to milliseconds.

4(1,2,3,4,5,6)

This type mapping is supported when converting SQL arguments to Java, but not when converting Java return types to SQL types.

TIMESTAMP_LTZ Values and Time Zones

A Java UDF is largely isolated from the environment in which it is called. However, the timezone is inherited from the calling environment. If the caller’s session set a default time zone before calling the Java UDF, then the Java UDF has the same default time zone.

NULL Values

Snowflake supports two distinct NULL values: SQL NULL and VARIANT’s JSON null. (For information about Snowflake VARIANT NULL, see NULL Values.)

Java supports one null value, which is only for non-primitive data types.

A SQL NULL argument to a Java UDF translates to the Java null value, but only for Java data types that support null.

A returned Java null value translates back to SQL NULL.

Arrays and Variable Number of Arguments

Java UDFs can receive arrays of any of the following Java data types:

  • String

  • boolean

  • double

  • float

  • int

  • long

  • short

The data type of the SQL values passed must be compatible with the corresponding Java data type. For details about data type compatibility, see SQL-Java Data Type Mappings for Parameters and Return Types.

The following additional rules apply for each of the specified Java data types:

  • boolean: The Snowflake ARRAY must contain only BOOLEAN elements, and must not contain any NULL values.

  • int/short/long: The Snowflake ARRAY must contain only fixed-point elements with a scale of 0, and must not contain any NULL values.

  • float/double: The Snowflake ARRAY must contain either:

    The ARRAY must not contain any NULL values.

Java methods can receive these arrays in either of two ways:

  • Using Java’s array feature.

  • Using Java’s varargs (variable number of arguments) feature.

In both cases, your SQL code must pass an ARRAY.

Passing via an ARRAY

Declare the Java parameter as an array. For example, the third parameter in the following method is a String array:

static int MyMethod(int fixed_argument_1, int fixed_argument_2, String[] string_array)

Below is a complete example:

Create and load the table:

CREATE TABLE string_array_table(id INTEGER, a ARRAY);
INSERT INTO string_array_table (id, a) SELECT
        1, ARRAY_CONSTRUCT('Hello');
INSERT INTO string_array_table (id, a) SELECT
        2, ARRAY_CONSTRUCT('Hello', 'Jay');
INSERT INTO string_array_table (id, a) SELECT
        3, ARRAY_CONSTRUCT('Hello', 'Jay', 'Smith');

Create the UDF:

create or replace function concat_varchar_2(a ARRAY)
returns varchar
language java
handler='TestFunc_2.concat_varchar_2'
target_path='@~/TestFunc_2.jar'
as
$$
    class TestFunc_2 {
        public static String concat_varchar_2(String[] string_array) {
            return String.join(" ", string_array);
        }
    }
$$;

Call the UDF:

SELECT concat_varchar_2(a)
    FROM string_array_table
    ORDER BY id;
+---------------------+
| CONCAT_VARCHAR_2(A) |
|---------------------|
| Hello               |
| Hello Jay           |
| Hello Jay Smith     |
+---------------------+

Passing via Varargs

Using varargs is very similar to using an array.

In your Java code, use Java’s varargs declaration style:

static int MyMethod(int fixed_argument_1, int fixed_argument_2, String ... string_array)

Below is a complete example. The only significant difference between this example and the preceding example (for arrays) is the declaration of the parameters to the method.

Create and load the table:

CREATE TABLE string_array_table(id INTEGER, a ARRAY);
INSERT INTO string_array_table (id, a) SELECT
        1, ARRAY_CONSTRUCT('Hello');
INSERT INTO string_array_table (id, a) SELECT
        2, ARRAY_CONSTRUCT('Hello', 'Jay');
INSERT INTO string_array_table (id, a) SELECT
        3, ARRAY_CONSTRUCT('Hello', 'Jay', 'Smith');

Create the UDF:

create or replace function concat_varchar(a ARRAY)
returns varchar
language java
handler='TestFunc.concat_varchar'
target_path='@~/TestFunc.jar'
as
$$
    class TestFunc {
        public static String concat_varchar(String ... string_array) {
            return String.join(" ", string_array);
        }
    }
$$;

Call the UDF:

SELECT concat_varchar(a)
    FROM string_array_table
    ORDER BY id;
+-------------------+
| CONCAT_VARCHAR(A) |
|-------------------|
| Hello             |
| Hello Jay         |
| Hello Jay Smith   |
+-------------------+

Designing Java UDFs that Stay Within Snowflake-Imposed Constraints

To ensure stability within the Snowflake environment, Snowflake places the following constraints on Java UDFs. Unless stated otherwise, these limitations are enforced when the UDF is executed, not when it is created.

Memory

Avoid consuming too much memory.

  • Large data values (typically BINARY, long VARCHAR, or large OBJECT or VARIANT data types) can consume a large amount of memory.

  • Excessive stack depth can consume a large amount of memory. Snowflake has tested simple function calls nested 50 levels deep without error. The practical maximum limit depends upon how much information is put on the stack.

UDFs return an error if they consume too much memory. The specific limit is subject to change.

Time

Avoid algorithms that take a large amount of time per call.

If a UDF takes too long to complete, Snowflake kills the SQL statement and returns an error to the user. This limits the impact and cost of errors such as infinite loops.

Libraries

Although your Java method can use classes and methods in the standard Java libraries, Snowflake security restrictions disable some capabilities, such as writing to files. For details about library restrictions, see the section titled Following Good Security Practices.

Designing the Class

When a SQL statement calls your Java UDF, Snowflake calls a Java method you have written. Your Java method is called a “handler method”, or “handler” for short.

As with any Java method, your method must be declared as part of a class. Your handler method can be a static method or an instance method of the class. If your handler is an instance method, and your class defines a zero-argument constructor, then Snowflake invokes your constructor at initialization time to create an instance of your class. If your handler is a static method, your class is not required to have a constructor.

The handler is called once for each row passed to the Java UDF. (Note: a new instance of the class is not created for each row; Snowflake can call the same instance’s handler method more than once, or call the same static method more than once.)

To optimize execution of your code, Snowflake assumes that initialization might be slow, while execution of the handler method is fast. Snowflake sets a longer timeout for executing initialization (including the time to load your UDF and the time to call the constructor of the handler method’s containing class, if a constructor is defined) than for executing the handler (the time to call your handler with one row of input).

Additional information about designing the class is in Creating Java UDFs.

Optimizing Initialization and Controlling Global State in Scalar UDFs

Most scalar UDFs should follow the guidelines below:

  • If you need to initialize shared state that does not change across rows, initialize it in the UDF class’s constructor.

  • Write your handler method to be thread safe.

  • Avoid storing and sharing dynamic state across rows.

If your UDF cannot follow these guidelines, or if you would like a deeper understanding of the reasons for these guidelines, please read the next few subsections.

Introduction

Snowflake expects scalar UDFs to be processed independently. Relying on state shared between invocations can result in unexpected behavior, as the system can process rows in any order and spread those invocations across several JVMs. UDFs should avoid relying on shared state across calls to the handler method. However, there are two situations in which you might want a UDF to store shared state:

  • Code that contains expensive initialization logic that you do not want to repeat for each row.

  • Code that leverages shared state across rows, such as a cache.

If you need to share state across multiple rows, and if that state does not change over time, then use a constructor to create shared state by setting instance-level variables. The constructor is executed only once per instance, while the handler is called once per row, so initializing in the constructor is cheaper when a handler processes multiple rows. And because the constructor is called only once, the constructor does not need to be written to be thread-safe.

If your UDF stores shared state that changes, then your code must be prepared to handle concurrent access to that state. The next two sections provide more information about parallelism and shared state.

Understanding Java UDF Parallelization

To improve performance, Snowflake parallelizes both across and within JVMs.

  • Across JVMs:

    Snowflake parallelizes across workers in a warehouse. Each worker runs one (or more) JVMs. This means that there is no global shared state. At most, state can be shared only within a single JVM.

  • Within JVMs:

    • Each JVM can execute multiple threads that can call the same instance’s handler method in parallel. This means that each handler method needs to be thread-safe.

    • If a SQL statement calls the same UDF more than once with the same arguments for the same row, then the UDF returns the same value for each call for that row. For example, the following returns the same value twice for each row:

      select
             my_java_udf(42),
             my_java_udf(42)
          from table1;
      

      If you would like multiple calls to return independent values even when passed the same arguments, then bind multiple separate UDFs to the same handler method. For example:

      1. Create a JAR file named @java_udf_stage/rand.jar with code:

        class MyClass {
        
            private double x;
        
            // Constructor
            public MyClass()  {
                x = Math.random();
            }
        
            // Handler
            public double myHandler() {
                return x;
            }
        }
        
      2. Create the Java UDFs as shown below. These UDFs have different names, but use the same JAR file and the same handler within that JAR file.

        create my_java_udf_1()
            returns double
            language java
            imports = ('@java_udf_stage/rand.jar')
            handler = 'MyClass.myHandler';
        
        create my_java_udf_2()
            returns double
            language java
            imports = ('@java_udf_stage/rand.jar')
            handler = 'MyClass.myHandler';
        
      3. The following code calls both UDFs. The UDFs point to the same JAR file and handler. These calls create two instances of the same class. Each instance returns an independent value, so the example below returns two independent values, rather than returning the same value twice:

        select
                my_java_udf_1(),
                my_java_udf_2()
            from table1;
        

Storing JVM State Information

One reason to avoid relying on dynamic shared state is that rows are not necessarily processed in a predictable order. Each time a SQL statement is executed, Snowflake can vary the number of batches, the order in which batches are processed, and the order of rows within a batch. If a scalar UDF is designed so that one row affects the return value for a subsequent row, then the UDF can return different results each time that the UDF is executed.

Handling Errors

A Java method used as a UDF can use the normal Java exception-handling techniques to catch errors within the method.

If an exception occurs inside the method and is not caught by the method, Snowflake raises an error that includes the stack trace for the exception.

You can explicitly throw an exception without catching it in order to end the query and produce a SQL error. For example:

if (x < 0)  {
    throw new IllegalArgumentException("x must be non-negative.");
    }

When debugging, you can include values in the SQL error message text. To do so, place an entire Java method body in a try-catch block; append argument values to the caught error’s message; and throw an exception with the extended message. To avoid revealing sensitive data, remove argument values prior to deploying JAR files to a production environment.

Following Best Practices

  • Write platform-independent code.

    • Avoid code that assumes a specific CPU architecture (e.g. x86).

    • Avoid code that assumes a specific operating system.

  • If you need to execute initialization code and do not want to include it in the method that you call, you can put the initialization code into a static initialization block.

See also:

Following Good Security Practices

  • Your method (and any library methods that you call) must act as a pure function, acting only on the data it receives and returning a value based on that data, without causing side-effects. Your code should not attempt to affect the state of the underlying system, other than consuming a reasonable amount of memory and processor time.

  • Java UDFs are executed within a restricted engine. Neither your code nor the code in library methods that you use should employ any prohibited system calls, including:

    • Process control. For example, you cannot fork a process. (However, you can use multiple threads.)

    • File system access.

      With the following exceptions, Java UDFs should not read or write files:

      • Java UDFs can read files specified in the imports clause of the CREATE FUNCTION command. For more information, see CREATE FUNCTION.

      • Java UDFs can write files, such as log files, to the /tmp directory.

        Each query gets its own memory-backed file system in which its own /tmp is stored, so different queries cannot have file name conflicts.

        However, conflicts within a query are possible if a single query calls more than one UDF, and those UDFs try to write to the same file name.

    • Network access.

      Note

      Because your code cannot access the network directly or indirectly, you cannot use the code in the Snowflake JDBC Driver to access the database. Your UDF cannot itself act as a client of Snowflake.

  • JNI (Java Native Interface) is not supported. Snowflake prohibits loading libraries that contain native code (as opposed to Java bytecode).