Platform-Independent Information for Creating External Functions

Much of the information related to developing a proxy service and remote service is platform-dependent. However, there are many additional details that are platform-independent.

This topic describes these platform-independent details.

In this Topic:

Remote Service

Data Format

Snowflake makes a call to a resource in a proxy service by issuing an HTTP POST request. This POST contains data in a specific format, along with headers. The data returned as a result of the POST must also conform to a specific format. Both formats are described below.

Note also the importance of using compatible data types in both the sender and receiver of the POST. For more details, see also Ensure that Arguments to the External Function Correspond to Arguments Parsed By the Remote Service.

Data Format Sent from Snowflake

Each HTTP request made from Snowflake is a POST. The POST request contains headers and a body. Metadata is passed in HTTP headers. Data is batched and passed in the POST’s request body.

The header information includes:

  • Metadata described in the “headers” and “context_headers” properties of CREATE EXTERNAL FUNCTION.

  • The following HTTP headers:

    • Headers that describe how data is serialized in the request body:

      • “sf-external-function-format”: This is currently always set to “json”.

      • “sf-external-function-format-version”: This is currently always set to “1.0”.

    • “sf-external-function-current-query-id”: This contains the query ID of the query that called this external function. You can use this to correlate Snowflake queries to calls of the remote service, for example to help debug issues.

    • The sf-external-function-query-batch-id: The batch ID uniquely identifies the specific batch of rows processed with this request. The remote service can use this ID to track the status of a batch that is being processed. The ID can also be used as an idempotency token if requests are retried due to an error. The ID can also be used for logging/tracing of requests by the remote service.

      The batch ID is an opaque value generated by Snowflake. The format could change in future releases, so remote services should not rely on a specific format or try to interpret the value.

The body of the POST request contains the data, serialized in JSON format. The schema for the JSON is:

  • The top-level JSON object is a dictionary of name/value pairs.

  • Currently, there is exactly one item in the dictionary; the key for that item is named “data”.

  • That “data” item’s value is a JSON array, in which each element of the array is one row of data.

  • Each row of data is a JSON array of one or more columns.

  • The first column is always the row number (i.e. the 0-based index of the row within the batch).

  • The remaining columns contain the arguments to the function.

  • Data types are serialized as follows:

    • Numbers are serialized as JSON numbers.

    • Booleans are serialized as JSON booleans.

    • Strings are serialized as JSON strings.

    • Variants are serialized as JSON objects.

    • All other supported data types are serialized as JSON strings.

    • NULL is serialized as JSON null.

Here’s an example of a serialized request for a remote service with the signature f(integer, integer, varchar, timestamp):

{
    "data": [
                [0, 10, "Alex", "Wed, 01 Jan 2014 16:00:00 -0800"],
                [1, 20, "Steve", "Wed, 01 Jan 2015 16:00:00 -0800"],
                [2, 30, "Alice", "Wed, 01 Jan 2016 16:00:00 -0800"],
                [3, 40, "Adrian", "Wed, 01 Jan 2017 16:00:00 -0800"]
            ]
}

Optionally, the JSON can be compressed for transmission over the network. Compression is documented in CREATE EXTERNAL FUNCTION.

Snowflake sends this data to the proxy service, not directly to the remote service. Therefore, the proxy service must receive (and return) data in a Snowflake-compatible format. Although typically the proxy service passes data through unchanged, the proxy can reformat data (both sending and receiving) to meet the needs of both the remote service and Snowflake.

For simplicity, and to help illustrate the formats that Snowflake expects to send and receive, most of the examples in this section assume that the remote service reads and writes data in the same format as Snowflake expects, and the proxy service passes data through unchanged in both directions.

Data Format Received by Snowflake

The format of data received by Snowflake is similar to the format of the data sent by Snowflake. The returned value is in JSON format.

{
    "data":
        [
            [ 0, 1995 ],
            [ 1, 1974 ],
            [ 2, 1983 ],
            [ 3, 2001 ]
        ]
}

The JSON response returned to Snowflake should contain one row for each row sent by Snowflake. Each returned row contains two values:

  • The row number (i.e. the 0-based index of the row within the batch).

  • The value returned from the function for that row. The value can be a compound value (for example, a VARIANT), but it must be exactly one value because all scalar Snowflake functions (external or otherwise) return a single value.

The row numbers in the returned data must correspond to the row numbers in the data that Snowflake sent and must be returned in the same order as they were received.

This JSON is the body of the HTTP response. The response also contains HTTP headers and a status code (e.g. status code 200 indicates success).

Optionally, the JSON can be compressed for transmission over the network. Compression is documented in CREATE EXTERNAL FUNCTION.

Snowflake recognizes the following HTTP status codes:

Code

Description

200

No error

Other values are treated as errors.

This list of status codes might expand over time.

For information about timeouts and retries, see Account for Timeout Errors and Do Not Assume that the Remote Service is Passed Each Row Exactly Once.

Best Practices

The following best practices apply to most external functions, especially the remote service portion.

Process One Row at a Time

To minimize networking overhead, Snowflake typically batches rows to send to remote services. The number of batches and the size of each batch can vary.

In addition, the order of batches can vary, and the order of rows within a batch can vary. Even if the query contains an ORDER BY clause, the ORDER BY is usually applied after the external function(s) have been called.

Because batch size and row order are not guaranteed, writing a function that returns a value for a row that depends upon any other row in this batch or previous batches can produce non-deterministic results.

Snowflake strongly recommends that the remote service process each row independently. The return value for each input row should depend on only that input row, not on other input rows. (Currently, external functions do not support window functions, for example.)

Note also that because batch size is not guaranteed, counting batches is not meaningful.

See also Ensure Your External Function is Stateless.

Do Not Assume that the Remote Service is Passed Each Row Exactly Once

If Snowflake calls a remote service, and the remote service receives the request and returns a result, but Snowflake does not receive the result due to a temporary network problem, Snowflake might repeat the request. If Snowflake retries, the remote service might see the same row twice (or more).

This can cause unexpected effects. For example, because the remote service might get called more than once for the same value, a remote service that assigns unique IDs might have gaps in the sequence of those IDs. In some cases, such effects can be reduced by tracking the batch ID in the sf-external-function-query-batch-id field of the request header to determine whether a particular batch of rows has been processed previously. When Snowflake retries a request for a specific batch, Snowflake uses the same batch ID as it used earlier for the same batch.

Snowflake retries when it receives the following errors:

  • All transient network transport errors.

  • All requests that fail with 429 status code.

  • All requests that fail with 5XX status code.

Requests are retried with exponential backoff until a total retry timeout is reached. The total retry timeout is currently undocumented and not user-configurable. Snowflake might adjust this limit in the future.

When the total retry timeout is reached without a successful retry, the query fails.

If your external function call times out when the remote service is working, and all the elements between Snowflake and the remote service seem to be working, you can try a smaller batch size to see if that reduces the timeout errors.

To learn how to set the maximum batch size, see CREATE EXTERNAL FUNCTION.

Ensure Your External Function is Stateless

All aspects of an external function (including the remote service) should avoid storing state information, both:

  • Internal state (state that the remote service stores internally).

  • External state (state stored outside the remote service, for example state information sent to and/or read from another remote service that itself retains state).

If the remote service changes state information and then uses that information to affect future outputs, the function might return different values than expected.

For example, consider a simple remote service that contains an internal counter and returns the number of rows received since the remote service first started. If there is a temporary network problem, and Snowflake repeats a request with the same data, the remote service will count the re-sent rows twice (or more).

For an example involving external state, see Avoid Side-effects.

In the very rare cases where a function is not stateless, the documentation for callers should say clearly that the function is not stateless, and the function should be marked volatile.

Avoid Side-effects

An external function (including the remote service) should avoid side effects, such as changing external state (information stored outside the remote service).

For example, if the remote service reports out-of-range values to a government agency, that is a side effect.

Side-effects can be useful, but the side-effects of calling an external function are not always predictable. For example, suppose that you call a remote service that analyzes an anonymized health record and returns a diagnosis. Suppose also that if the diagnosis is that the patient has a contagious disease, then the diagnosis is reported to an agency that keeps count of the number of cases of that disease. This is a useful side effect. However, it is vulnerable to problems such as:

  • If an external function call is inside a transaction that is rolled back, the side effects are not rolled back.

  • If the remote service is called more than once with the same row (e.g. due to temporary network failures and retries), the side-effect could occur more than once. For example, an infected patient might be counted twice in the statistics.

There are also situations in which rows could be undercounted rather than overcounted.

In the very rare cases where a function has side effects, the documentation for callers should say clearly what the side effects are, and the function should be marked volatile.

Categorize Your Function as Volatile or Immutable

Functions can be categorized as volatile or immutable. (The CREATE EXTERNAL FUNCTION statement allows the user to specify whether the function is volatile or immutable.)

For an external function to be considered immutable, it should meet the following criteria:

  • If given the same input value, the function returns the same output value. (For example, the SQRT function returns the same output when given the same input, but the CURRENT_TIMESTAMP function does not necessarily return the same output when given the same input.)

  • The function has no side effects.

If a function meets these two criteria, then Snowflake can use certain types of optimizations to reduce the number of rows or batches sent to the remote service. (These optimizations might evolve over time, and are not described in detail here.)

Snowflake cannot detect or enforce immutability, or factors that affect immutability (for example, side effects). The writer of a remote service should document whether the remote service meets the criteria to be labeled immutable. If a remote service has side effects, then the external function that calls that remote service should be marked volatile, even if the function call returns the same output value for the same input value. If you are not certain that a remote service is immutable, then any external function that calls that remote service should be labeled volatile.

Account for Timeout Errors

An external function call involves Snowflake, a remote service, a proxy service, and potentially other elements in the chain. None of these elements know how long a particular function call should take, so none know exactly when to stop waiting and return a timeout error. Each step might have its own independent timeout. For more information about timeouts, see Do Not Assume that the Remote Service is Passed Each Row Exactly Once.

Minimize Latency

To minimize latency and improve performance of external function calls, Snowflake recommends doing the following when practical:

  • Put the API Gateway in the same cloud platform and region as Snowflake instances that call it most frequently (or with the largest amount of data).

  • If you wrote the remote service (rather than using an existing service), deploy that remote service in the same cloud platform and region as it is called from.

  • Send as little data as possible. For example, if the remote service will examine inputs values and operate on only a subset of them, then it is usually more efficient to filter in SQL and send only the relevant rows to the remote service, rather than send all rows to the remote service and let it filter.

    As another example, if you are processing a column that contains large semi-structured data values, and the remote service will operate on only a small piece of each of those data values, it is usually more efficient to extract the relevant piece using Snowflake SQL and send only that piece, rather than send the entire column and have the remote service do the extraction of the small piece before processing.

Develop and Test External Functions One Step at a Time

Snowflake recommends that you test without Snowflake before testing with Snowflake.

During the early steps of developing an external function, use the cloud platform proxy service console (e.g. the AWS API Gateway console) and remote service development console (e.g. the AWS Lambda console) to help develop and test the proxy service and remote service.

For example, if you have developed a Lambda function, you might want to test it extensively through the Lambda console before testing it by calling it from Snowflake.

Testing through the proxy service console and remote service console usually has the following advantages:

  • It can make diagnosing the problem easier because there are fewer places to look for the cause of the problem.

  • Viewing the data payload might provide useful debugging information. Snowflake does not show any portion of the data payload in error messages; although this enhances security, it can slow debugging.

  • Snowflake auto-retries HTTP 5xx errors, which can make debugging slower or more difficult in some situations.

  • Testing through Snowflake consumes Snowflake credits in addition to cloud platform credits.

Of course, after you’ve tested the remote service and the proxy service as much as you can without Snowflake, you should test them with Snowflake. The advantages of testing with Snowflake include:

  • You’re testing all the steps involved in the external function.

  • Using a Snowflake table as the data source makes it easy to test with large volumes of data to get a realistic estimate of the performance of the external function.

Consider the following test cases:

  • NULL values.

  • “Empty” values (for example, empty strings, empty semi-structured data types).

  • Very long VARCHAR and BINARY values, if appropriate.

Ensure that Arguments to the External Function Correspond to Arguments Parsed By the Remote Service

When passing arguments to or from an external function, ensure that the data types are appropriate. If the value sent can’t fit into the data type being received, the value might be truncated or corrupted, or the remote service call might fail.

For example, because some Snowflake SQL numeric data types can store larger values than commonly-used JavaScript data types, de-serializing large numbers from JSON is particularly sensitive in JavaScript.

If you write your own remote service, and if you change the number, data types, or order of the arguments to the remote service, remember to make the corresponding changes to the external function. Currently, the ALTER EXTERNAL FUNCTION command does not have an option to change parameters, so you must drop and re-create the external function to change the arguments.

Proxy Service

Secure Your Proxy Service Endpoint

Snowflake strongly recommends securing your proxy service endpoints.

Snowflake uses credential-less API integration objects to authenticate to the proxy service endpoint. Credential-less API integrations separate responsibilities between administrators and users. An API integration allows an administrator to create a trust policy between Snowflake and the cloud provider using the cloud provider’s native authentication and authorization mechanism. When Snowflake connects to the cloud provider, the cloud provider authenticates and authorizes access through this trust policy. Using a specific API integration, the administrator can also restrict which proxy services and resources Snowflake can use, enabling the administrator to enforce organizational policies for data egress and ingress.

More detailed instructions for securing specific proxy service endpoints, such as an AWS API Gateway, are in the platform-specific instructions.