External Function Concepts¶
This topic provides a high-level description of how external functions work.
How External Functions Work¶
Snowflake does not call a remote service directly. Instead, Snowflake calls the remote service through a cloud provider's native HTTPS proxy service, for example, API Gateway on AWS.
The main steps to call an external function are:
A user’s client program passes Snowflake a SQL statement that calls an external function.
When evaluating the external function as part of the query execution, Snowflake reads the external function definition, which contains the URL of the proxy service and the name of the API integration that contains authentication information for that proxy service.
Snowflake reads information from the API integration and composes an HTTP POST request that contains:
HTTP header information. (Details are documented in CREATE EXTERNAL FUNCTION.)
The data to be processed. This data is in JSON format.
The proxy service “resource” to use. The resource contains information about the remote service, such as the location of that service.
The authentication information for that proxy service resource.
The POST request is sent to the proxy service.
The proxy service receives the POST and then processes and forwards the request to the actual remote service. You can loosely think of the proxy service and resource as a “relay function” that calls the remote service.
The remote service processes the data and returns the result, which is passed back through the chain to the original SQL statement.
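The data flow above can be sketched as a minimal remote-service handler. This is an illustrative sketch, not Snowflake's reference implementation: it assumes the documented batch format, in which the request body is a JSON object whose `data` key holds one array per row, each beginning with a row number followed by the argument values, and the response must echo each row number alongside its result.

```python
import json

def handle_request(body: str) -> str:
    """Minimal remote-service sketch.

    Request body:  {"data": [[row_number, arg1, arg2, ...], ...]}
    Response body: {"data": [[row_number, result], ...]}
    """
    rows = json.loads(body)["data"]
    # Example computation: sum the argument values for each row.
    results = [[row[0], sum(row[1:])] for row in rows]
    return json.dumps({"data": results})
```

For example, a request body of `{"data": [[0, 1, 2], [1, 10, 20]]}` would produce the response `{"data": [[0, 3], [1, 30]]}`.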
This diagram illustrates the steps described above:
Before a user can call an external function, developers and Snowflake account administrators must configure Snowflake to access the proxy service. Typically, the steps are done in approximately the order shown below (starting from the right-hand side of the diagram above and moving leftward towards Snowflake).
A developer must write the remote service, and that remote service must be exposed via the HTTPS proxy service. For example, the remote service might be a Python function running on AWS Lambda and exposed via a resource in the AWS API Gateway.
In Snowflake, an ACCOUNTADMIN or a role with the CREATE INTEGRATION privilege must create an “API integration” object that contains authentication information that enables Snowflake to communicate with the proxy service. The API integration is created with the SQL command CREATE API INTEGRATION.
A Snowflake user must execute the SQL command CREATE EXTERNAL FUNCTION. The user must use a role that has USAGE privilege on the API integration and has sufficient privileges to create functions.
The CREATE EXTERNAL FUNCTION command does not actually create an external function in the sense of loading code that will be “executed outside Snowflake”. Instead, the CREATE EXTERNAL FUNCTION command creates a database object that indirectly references the code that executes outside Snowflake. More precisely, the CREATE EXTERNAL FUNCTION command creates an object that contains:
The URL of the resource in the HTTPS proxy service that acts as a relay function.
The name of the API integration to use to authenticate to the proxy service.
A name that is effectively an alias for the remote service. This alias is used in SQL commands, for example
SELECT MyAliasForRemoteServiceXYZ(col1) ...;
The alias in Snowflake, the HTTPS proxy service resource’s name, and the remote service’s name can all be different. (Using the same name for all three can simplify administration, however.)
Although the steps described above are the most common way of executing an external function, some variations are allowed. For example:
The remote service might not be the final step in the chain; the remote service could call yet another remote service to do part of the work.
If the remote service doesn’t accept and return JSON-formatted data, then the HTTPS proxy service’s resource (the relay function) could convert the data from JSON format to another format (and convert the returned data back to JSON).
Although Snowflake recommends that the remote service behave as a true function (i.e. a piece of code that accepts 0 or more input parameters and returns an output) that has no side effects and keeps no state information, this is not strictly required. The remote service could perform other tasks, for example sending alerts if a value (such as a temperature reading in the data) is dangerously high. In rare cases, the remote service might keep state information, for example the total number of alerts issued.
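The second variation above (a relay that converts between formats) can be sketched as follows. This is a hypothetical example, assuming a legacy service that accepts and returns CSV rather than JSON; `call_legacy_service` is a placeholder for the actual call to that service.

```python
import csv
import io
import json

def relay(snowflake_body: str, call_legacy_service) -> str:
    """Hypothetical relay function: convert Snowflake's JSON batch to CSV
    for a legacy service, then convert the CSV reply back to JSON."""
    rows = json.loads(snowflake_body)["data"]

    # JSON -> CSV for the legacy service.
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    csv_reply = call_legacy_service(buf.getvalue())

    # CSV reply -> JSON for Snowflake (row number, result).
    out = [[int(r[0]), r[1]] for r in csv.reader(io.StringIO(csv_reply)) if r]
    return json.dumps({"data": out})
```

The relay itself stays format-agnostic about the computation; only the serialization differs on each side.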
The remote service, the proxy service, and any other steps between Snowflake and the remote service must be able to handle the peak workloads sent to them.
Scalability of the Remote Service¶
Developers who write remote services should consider:
The frequency with which the remote service will be called.
The number of rows sent per call.
The resources required to process each row.
The time distribution of calls (peak vs. average).
Capacity might need to increase over time as the callers change from a few developers and testers to an entire organization. If the remote service is used by multiple organizations, capacity might need to increase as the number of organizations increases. Furthermore, as the number and diversity of organizations increase, the size and timing of workloads might become more difficult to predict.
The remote service provider is responsible for providing enough capacity to handle peak workloads. Different techniques can be used to scale the service. If the remote service is managed by the author of the remote service, then the author might need to explicitly provision the service with enough capacity to handle peaks. Alternatively, the author might decide to use a hosted auto-scaled/elastic service, such as AWS Lambda.
Remote services should return HTTP response code 429 (Too Many Requests) when they are overloaded. If Snowflake sees an HTTP 429 response, Snowflake scales back the rate at which it sends rows and retries sending the batches of rows that were not processed successfully.
For more information about troubleshooting scalability issues, see Scalability.
Scalability of the Proxy Service¶
The proxy service should also be scalable. Fortunately, proxy services provided by major cloud providers are generally scalable. However, users who develop or administer external functions should remember the following information:
- AWS API Gateway
The AWS API Gateway is itself a managed AWS service that auto-scales to the user's workload. Users should be familiar with the API Gateway limits: https://docs.aws.amazon.com/apigateway/latest/developerguide/limits.html.
The AWS API Gateway can be configured to help scale the remote service. Specifically, the API Gateway can be configured to enable caching and/or throttling of requests to reduce the load on the remote service if needed.
Since throttling can affect timeouts and retries, users might also want to review information about how Snowflake handles timeouts and retries.
Resource requirements depend upon the way that rows are distributed across calls (many parallel calls with a few rows each vs. one call with the same total number of rows). A system that supports high capacity does not necessarily support high concurrency, and vice-versa. You should estimate the peak concurrency required, as well as the largest reasonable individual workloads, and provide enough resources to handle both types of peaks.
Furthermore, the concurrency estimate should take into account that Snowflake can parallelize external function calls. A single query from a single user might cause multiple calls to the remote service in parallel. Several factors affect the number of concurrent calls from Snowflake to a proxy service or remote service, including:
The number of concurrent users who are running queries with external functions.
The size of each user’s query.
The number of servers in the virtual warehouse (i.e. the warehouse size).
The number of warehouse clusters.
Handling concurrency properly can be particularly complex if external functions have side effects. The results can vary depending upon the order in which users' rows are processed. (Snowflake recommends that you avoid writing or using remote services that have side effects.)
Depending upon where the remote service is running, you might need to consider upgrading the service over time (for example, if the remote service adds new features or needs bug fixes).
If the remote service is not stateless, you might also need to consider recovery after failure. (Snowflake strongly recommends that remote services be stateless.)
For information about timeouts and retries, see Account for Timeout Errors and Do Not Assume that the Remote Service is Passed Each Row Exactly Once.