Remote Service Input and Output Data Formats

When Snowflake sends data to a remote service, or receives data from a remote service, the data must be formatted correctly. This topic provides information about the proper data formats.

Data received from and returned to Snowflake must also be of an appropriate data type .

In this Topic:

Data Format Sent by Snowflake

Each HTTP request from Snowflake is a POST or a GET.

A POST request contains headers and a request body. The request body includes a batch of rows.

A GET contains only headers, and is used only for polling when the remote service returns results asynchronously.

The header information is generally available to the remote service as a set of key/value pairs. The header information includes:

  • The following HTTP headers:

    • Headers that describe how data is serialized in the request body:

      • “sf-external-function-format”: This is currently always set to “json”.

      • “sf-external-function-format-version”: This is currently always set to “1.0”.

    • “sf-external-function-current-query-id”: This contains the query ID of the query that called this external function. You can use this to correlate Snowflake queries to calls of the remote service, for example to help debug issues.

    • The “sf-external-function-query-batch-id”: The batch ID uniquely identifies the specific batch of rows processed with this request. The remote service can use this ID to track the status of a batch that is being processed. The ID can also be used as an idempotency token if requests are retried due to an error. The ID can also be used for logging/tracing of requests by the remote service.

      The batch ID in a GET is the same as the batch ID in the corresponding POST.

      The batch ID is an opaque value generated by Snowflake. The format could change in future releases, so remote services should not rely on a specific format or try to interpret the value.

    • Headers that describe the signature (name and argument types) and return type of the external function that was called in the SQL query. These values can have characters that are not standard characters for Snowflake identifiers, so base64 versions of the information are included, and non-standard characters are replaced with a blank in the non-base64 versions. The specific headers are:

      • sf-external-function-name

      • sf-external-function-name-base64

      • sf-external-function-signature

      • sf-external-function-signature-base64

      • sf-external-function-return-type

      • sf-external-function-return-type-base64

      For example, the headers sent for the function ext_func(n integer)  returns varchar are:

      • sf-external-function-name: ext_func

      • sf-external-function-name-base64: <base64 value>

      • sf-external-function-signature: (N NUMBER)

      • sf-external-function-signature-base64: <base64 value>

      • sf-external-function-return-type: VARCHAR(16777216)

      • sf-external-function-return-type-base64: <base64 value>

      Note that because SQL INTEGER values are treated as SQL NUMBER, the SQL argument declared as type INTEGER is described as type NUMBER.

  • Additional optional metadata described in the “headers” and “context_headers” properties of CREATE EXTERNAL FUNCTION.

For example, to extract the “sf-external-function-signature” header from inside an AWS Lambda function written in Python, which receives the headers as a python dictionary, execute the following:

def handler(event, context):

    request_headers = event["headers"]
    signature = request_headers["sf-external-function-signature"]

The details will be different for other languages and on other cloud platforms.

(For remote services developed on AWS, more information about headers and lambda proxy integration is available at the API Gateway documentation .)

The body of the POST request contains the data, serialized in JSON format. The schema for the JSON is:

  • The top-level is a JSON object (a set of name/value pairs, also called a “dictionary”).

  • Currently, there is exactly one item in that object; the key for that item is named “data”.

  • That “data” item’s value is a JSON array, in which each element of the array is one row of data.

  • Each row of data is a JSON array of one or more columns.

  • The first column is always the row number (i.e. the 0-based index of the row within the batch).

  • The remaining columns contain the arguments to the function.

  • Data types are serialized as follows:

    • Numbers are serialized as JSON numbers.

    • Booleans are serialized as JSON booleans.

    • Strings are serialized as JSON strings.

    • Variants are serialized as JSON objects.

    • All other supported data types are serialized as JSON strings.

    • NULL is serialized as JSON null.

Here’s an example of a serialized request for an external function with the signature f(integer, varchar, timestamp). Note that the first column is the row number within the batch, and the next 3 values are the arguments to the external function.

{
    "data": [
                [0, 10, "Alex", "2014-01-01 16:00:00"],
                [1, 20, "Steve", "2015-01-01 16:00:00"],
                [2, 30, "Alice", "2016-01-01 16:00:00"],
                [3, 40, "Adrian", "2017-01-01 16:00:00"]
            ]
}

Examples of extracting data are included in the documentation for creating a remote service on each platform:

Optionally, the JSON can be compressed for transmission over the network. Compression is documented in CREATE EXTERNAL FUNCTION.

Snowflake sends this data to the proxy service, not directly to the remote service. Therefore, the proxy service must receive (and return) data in a Snowflake-compatible format. Although typically the proxy service passes data through unchanged, the proxy can reformat data (both sending and receiving) to meet the needs of both the remote service and Snowflake.

For simplicity, and to help illustrate the formats that Snowflake expects to send and receive, most of the examples in this section assume that the remote service reads and writes data in the same format as Snowflake expects, and the proxy service passes data through unchanged in both directions.

Data Format Received by Snowflake

When the remote service finishes processing a batch, the remote service should send data to Snowflake in a format similar to the format of the data sent by Snowflake. The returned value is in JSON format. Here is an example of the data section of such a response:

{
    "data":
        [
            [ 0, 1995 ],
            [ 1, 1974 ],
            [ 2, 1983 ],
            [ 3, 2001 ]
        ]
}

The JSON response returned to Snowflake should contain one row for each row sent by Snowflake. Each returned row contains two values:

  • The row number (i.e. the 0-based index of the row within the batch).

  • The value returned from the function for that row. The value can be a compound value (for example, a VARIANT), but it must be exactly one value because all scalar Snowflake functions (external or otherwise) return a single value.

The row numbers in the returned data must correspond to the row numbers in the data that Snowflake sent and must be returned in the same order as they were received.

The code below shows an example that contains a VARIANT value after the row number:

{
    "data":
        [
            [ 0, { "City" : "Warsaw",  "latitude" : 52.23, "longitude" :  21.01 } ],
            [ 1, { "City" : "Toronto", "latitude" : 43.65, "longitude" : -79.38 } ]
        ]
}

In Python, the code to compose one returned row, including its VARIANT value, would be similar to the following:

...
row_number = 0
output_value = {}

output_value["city"] = "Warsaw"
output_value["latitude"] = 21.01
output_value["longitude"] = 52.23
row_to_return = [row_number, output_value]
...

To access the elements of the returned VARIANT in a SQL statement, use the same notation described in Traversing Semi-structured Data. For example:

select val:city, val:latitude, val:longitude
    from (select ext_func_city_lat_long(city_name) as val from table_of_city_names);

The response also contains an HTTP status code. Snowflake recognizes the following HTTP status codes:

Code

Description

200

Batch processed successfully.

202

Batch received and still being processed.

Other values are treated as errors.

This list of status codes might expand over time.

The response can also contain the following optional HTTP headers:

  • Content-MD5: Snowflake uses the optional Content-MD5 header to check the integrity of the response. If this header is included in the response, Snowflake computes an MD5 checksum on the response body to ensure that it matches the corresponding checksum in the returned header. If the values do not match, the SQL query fails. The checksum should be encoded in a base64 representation before being returned in the header. See the example code below.

Optionally, the JSON can be compressed for transmission over the network. Compression is documented in CREATE EXTERNAL FUNCTION.

For information about timeouts and retries, see Account for Timeout Errors and Do Not Assume that the Remote Service is Passed Each Row Exactly Once.

The example Python code below returns a proper response, including the HTTP response code, the processed data, and an MD5 header (which is optional). (This was written using Python 3.8.)

This example is based on an AWS Lambda function. Some code might need customization for different platforms.

import json
import hashlib
import base64

def handler(event, context):

    # The return value should contain an array of arrays (one inner array per input row for a scalar function).
    array_of_rows_to_return = [ ]

    ...

    json_compatible_string_to_return = json.dumps({"data" : array_of_rows_to_return})

    # Calculate MD5 checksum for the response
    md5digest = hashlib.md5(json_compatible_string_to_return.encode('utf-8')).digest()
    response_headers = {
        'Content-MD5' : base64.b64encode(md5digest)
    }

    # Return the HTTP status code, the processed data, and the headers (including the Content-MD5 header).
    return {
        'statusCode': 200,
        'body': json_compatible_string_to_return,
        'headers': response_headers
    }