Deploy models for real-time inference (REST API)¶
Note
Generally available since snowflake-ml-python version 1.25.0.
Use real-time inference for interactive workflows that require low latency. You can deploy any model from the Snowflake Model Registry as a managed service with a dedicated HTTP endpoint. Managed services feature autoscaling and are fully integrated within the Snowflake ecosystem, offering comprehensive observability.
Use online inference for your workflow when:
Your application requires low latency for immediate responses
Your model serves as a backend for a user-facing web or mobile application.
The input to your model can fit within the HTTP payload of the request.
The service must automatically scale horizontally to handle fluctuating request volumes.
How It Works¶
Snowflake simplifies the deployment pipeline by hosting your model as an HTTP server within Snowpark Container Services (SPCS). This architecture enables you to:
Abstract Complexity: Deploy sophisticated models without managing Docker images or Kubernetes clusters.
Scale Performance: Run large-scale models on distributed GPU clusters for high-performance requirements.
Ensure Reliability: Utilize built-in observability, traffic splitting, and shadow/canary deployments for seamless model upgrades.
Prerequisites¶
Before you begin, make sure you have the following:
A Snowflake account in any commercial AWS, Azure, or Google Cloud region. Government regions are not supported.
Version 1.8.0 or later of the snowflake-ml-python Python package.
A model logged into Snowflake Model Registry.
An understanding of compute pools and related privileges on SPCS.
Required privileges¶
Model Serving runs on top of Snowpark Container Services. You need the following privileges to use Model Serving:
USAGE or OWNERSHIP on the compute pool where the service runs. Alternatively, you can use the default System Compute Pools.
BIND SERVICE ENDPOINT privilege on account to be able to create a public endpoint.
OWNERSHIP or READ privilege on the model.
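If these grants are not yet in place, a minimal sketch of granting them (assuming an existing Snowpark session created with a role that can grant these privileges; the role and compute pool names are placeholders) looks like this:
# "session" is an existing snowflake.snowpark.Session.
# "my_role" and "my_compute_pool" are placeholder names.
session.sql("GRANT USAGE ON COMPUTE POOL my_compute_pool TO ROLE my_role").collect()
session.sql("GRANT BIND SERVICE ENDPOINT ON ACCOUNT TO ROLE my_role").collect()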
Limitations¶
The following limitations apply to online model serving in Snowpark Container Services.
Table functions aren’t supported. Your model must have a regular (non-table) function to be deployed to Snowflake.
Models developed using Snowpark ML modeling classes can’t be deployed to environments that have a GPU. As a workaround, you can extract the native model and deploy that. For more information, see Deploy a model for online inference.
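For example, a minimal sketch of that workaround, assuming snowml_model is a fitted snowflake.ml.modeling.xgboost.XGBClassifier, reg is a snowflake.ml.registry.Registry object, and train_features is a small representative DataFrame (all placeholder names):
# Extract the underlying native XGBoost model from the Snowpark ML estimator.
native_model = snowml_model.to_xgboost()

# Log the native model as a new model version; this version can then be
# deployed to a GPU compute pool with create_service.
native_mv = reg.log_model(
    native_model,
    model_name="my_native_model",
    version_name="v1",
    sample_input_data=train_features,  # used to infer the model signature
)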
Deploy a model for online inference¶
Snowflake ML uses a model version object to create a model service that handles inference requests. To create a model version object, you can either log a new model version or obtain a reference to an existing model version. After you get your model version object, you can use the following Python code to create a model service and deploy that service to SPCS:
# reg is a snowflake.ml.registry.Registry object
example_mv_object = reg.get_model("mymodel_name").version("version_name")  # a snowflake.ml.model.ModelVersion object
example_mv_object.create_service(
    service_name="myservice",
    service_compute_pool="my_compute_pool",
    ingress_enabled=True,
    gpu_requests=None,
)
create_service requires the following arguments:
service_name: The name of the service that you’re creating. This name must be unique within your Snowflake account.
service_compute_pool: The name of the compute pool that you’re using to run the model. The compute pool must already exist. If the model fits well in the system compute pools, you can use them (SYSTEM_COMPUTE_POOL_GPU or SYSTEM_COMPUTE_POOL_CPU) too.
ingress_enabled: This must be True to call online inference from outside of Snowflake.
gpu_requests: A string specifying the number of GPUs. For a model that can run on either CPU or GPUs, this argument determines whether the model runs on the CPU or on the GPUs. If the model is of a known type that can only run on a CPU (such as a scikit-learn model), the image build fails if you request GPUs.
If you’re deploying a new model, it can take up to 10 minutes to create the service for CPU-powered models and 20 minutes for GPU-powered models. If the compute pool is idle or requires resizing, creating the service might take longer.
The preceding example only shows the required and most commonly used arguments. For a complete list of arguments, see the ModelVersion API reference.
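If you don’t already have a reg object, a minimal sketch of creating one (assuming an existing Snowpark session; the database and schema names are placeholders) looks like the following:
from snowflake.ml.registry import Registry

# The registry is scoped to a database and schema; replace with your own.
reg = Registry(session=session, database_name="ML_DB", schema_name="ML_SCHEMA")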
Default service configuration¶
The server running the model that you’ve deployed uses defaults that work for most use cases:
Number of Worker Processes: For a CPU-powered model, the number of worker processes that the server uses is twice the number of CPUs plus one. GPU-powered models use one worker process. You can override this using the num_workers argument in the create_service call. For example, if the smallest available node has 4 GPUs and the model needs only 2, use num_workers=2 (that is, GPUs available divided by GPUs needed by the model).
Thread Safety: Some models are not thread-safe. Therefore, the service loads a separate copy of the model for each worker process. This can result in resource depletion for large models.
Node Utilization: By default, one inference server instance requests the whole node, using all the CPU and memory of the node it runs on. To customize resource allocation per instance, use arguments like cpu_requests, memory_requests, and gpu_requests.
Endpoint: The inference endpoint is named inference and uses port 5000. These cannot be customized.
For optimal resource utilization, specify the smallest GPU node that can fit the model into memory, and scale by increasing the number of instances. For example, if the model fits in the GPU_NV_S (GPU_NV_SM on Azure) instance type, use gpu_requests=1 and scale up by increasing max_instances.
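As a hedged illustration of overriding these defaults, the following sketch uses only arguments named in this section; the pool name and values are placeholders, not recommendations:
example_mv_object.create_service(
    service_name="myservice_cpu",
    service_compute_pool="my_cpu_pool",  # placeholder compute pool
    ingress_enabled=True,
    num_workers=4,           # override the default worker-process count
    cpu_requests="2",        # request 2 CPUs per instance instead of the whole node
    memory_requests="8Gi",   # request 8 GiB of memory per instance
    max_instances=4,         # scale out by adding instances
)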
Container image build behavior¶
The Snowflake conda channel is available only in warehouses and is the only source of conda dependencies for models running in warehouses. By default, models deployed to SPCS obtain their conda dependencies from conda-forge.
By default, Snowflake Model Serving builds the container image using the same compute pool that’s used to run the model. The compute pool is likely overpowered for the process of building images (for example, GPUs are not used in building container images). For the most part, this won’t significantly impact compute costs. However, if you’re concerned, you can specify a less powerful compute pool to build images with the image_build_compute_pool argument.
Calling create_service() multiple times does not trigger an image build on every call. However, container images might be rebuilt when Snowflake has made updates to the inference service, such as fixes for vulnerabilities in dependent packages. When this happens, create_service automatically triggers a rebuild of the image.
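For example, a minimal sketch that routes the image build to a smaller CPU pool (both pool names are placeholders):
example_mv_object.create_service(
    service_name="myservice",
    service_compute_pool="my_gpu_pool",            # runs the inference service
    image_build_compute_pool="my_small_cpu_pool",  # builds the container image
    ingress_enabled=True,
)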
User interface¶
You can manage deployed models in the Model Registry Snowsight UI. For more information, see Model inference services.
Invoking a deployed model¶
HTTP endpoints¶
Every service comes with an internal DNS name. Deploying a service with ingress_enabled also creates a public HTTP endpoint available outside of Snowflake. Either endpoint can be used to call the service.
You can find the public HTTP endpoint of a service with ingress enabled using the SHOW ENDPOINTS command. The output contains an ingress_url column, which has an entry of the format unique-service-id-account-id.snowflakecomputing.app. This is the publicly available HTTP endpoint for your service. For private link users, use privatelink_ingress_url instead of ingress_url.
To get the internal DNS name on Snowflake, use the DESCRIBE SERVICE command. The dns_name column of output from this command contains a service’s internal DNS name. To find your service’s port, use the SHOW ENDPOINTS IN SERVICE command. The port or port_range column contains the port used by a service. You can make internal calls to your service through the URL http://dns_name:port.
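For example, a minimal sketch of looking up a service’s endpoints from Python (assuming an existing Snowpark session and that myservice lives in the current database and schema):
# SHOW ENDPOINTS IN SERVICE returns one row per endpoint; the ingress_url
# column holds the public endpoint for ingress-enabled services.
rows = session.sql("SHOW ENDPOINTS IN SERVICE myservice").collect()
for row in rows:
    print(row["name"], row["port"], row["ingress_url"])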
To call a particular method of the model, use the method name as the path in the URL (for example, https://unique-service-id-account-id.snowflakecomputing.app/method-name or http://dns_name:port/method-name). In the URL, underscores (_) in the method name are replaced by dashes (-). For example, the method name predict_proba becomes predict-proba in the URL.
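A minimal sketch of building the method URL in Python (the base URL is a placeholder for either endpoint form above):
# Underscores in the method name become dashes in the URL path.
base_url = "https://unique-service-id-account-id.snowflakecomputing.app"  # or "http://dns_name:port"
method_name = "predict_proba"
method_url = f"{base_url}/{method_name.replace('_', '-')}"
# -> https://unique-service-id-account-id.snowflakecomputing.app/predict-proba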
To simplify things, you can call the list_services() API on the ModelVersion object in Python:
# mv: snowflake.ml.model.ModelVersion
mv.list_services()
The output includes both the public endpoint (inference_endpoint) and the internal endpoint (internal_endpoint).
Authentication¶
Snowflake supports multiple authentication protocols. The simplest is to use a programmatic access token (PAT), which you pass in the request header as Authorization: Snowflake Token="your_pat_token".
Note
All authorization failures, such as an incorrect token or a missing network route to the service, result in a 404 error. Currently, there is no way to distinguish authentication errors from invalid URLs.
Request body (data format)¶
Snowflake supports two data formats for REST requests. Both are modeled on pandas DataFrame JSON representations because they are well known in the industry and easy for customers to verify with a simple Python script and a pandas DataFrame.
Tip
Method-to-URL Mapping: When constructing your request URL, note that underscores (_) in your model’s method names are
automatically replaced by dashes (-). For example, if your model method is predict_proba, the endpoint URL path becomes
/predict-proba.
Here are the details about the formats:
dataframe_split is a compact index/columns/data representation that mirrors pandas_df.to_json(orient="split").
dataframe_records is a key/value (record-oriented) representation that mirrors df.to_json(orient="records").
It is recommended to use the dataframe_split format. Because dataframe_records repeats column names for each row, it typically produces a larger request body than dataframe_split, which can have a performance impact for large batches or frequent calls.
Model endpoints continue to return a single output format, regardless of which input format you use.
dataframe_split format (Recommended)
This matches the structure produced by the Pandas “split” orientation. The request body wraps the following structure under a
dataframe_split key:
index: A list of row indices.
columns: A list of column names.
data: A list of rows, where each row is a list of values aligned with the columns.
Example cURL Request:
curl -X POST "<endpoint_url>" \
-H 'Authorization: Snowflake Token="<pat_token>"' \
-H 'Content-Type: application/json' \
-w "\n\n=== RESULT ===\nHTTP Status: %{http_code}\nTotal Time: %{time_total}s\nConnect Time: %{time_connect}s\nServer Processing: %{time_starttransfer}s\nResponse Size: %{size_download} bytes\nRequest Size: %{size_upload} bytes\n" \
-d '{
  "dataframe_split": {
    "index": [0, 1],
    "columns": ["customer_id", "age", "monthly_spend"],
    "data": [
      [101, 32, 85.5],
      [102, 45, 120.0]
    ]
  }
}'
dataframe_records format
dataframe_records matches the structure produced by the Pandas “records” orientation:
A list of records, where each record is a dictionary mapping column names to values.
The request body wraps this list under the dataframe_records key:
Example cURL Request:
curl -X POST "<endpoint_url>" \
-H 'Authorization: Snowflake Token="<pat_token>"' \
-H 'Content-Type: application/json' \
-w "\n\n=== RESULT ===\nHTTP Status: %{http_code}\nTotal Time: %{time_total}s\nConnect Time: %{time_connect}s\nServer Processing: %{time_starttransfer}s\nResponse Size: %{size_download} bytes\nRequest Size: %{size_upload} bytes\n" \
-d '{
  "dataframe_records": [
    {
      "customer_id": 101,
      "age": 32,
      "monthly_spend": 85.5
    },
    {
      "customer_id": 102,
      "age": 45,
      "monthly_spend": 120.0
    }
  ]
}'
Python Examples¶
dataframe_split format
Snowflake recommends generating the payload by using Pandas JSON serialization, and then deserializing with json.loads before
sending the request. This ensures that data types are handled consistently.
import json
import pandas as pd
import requests
# Example DataFrame
df = pd.DataFrame(
{
"customer_id": [101, 102],
"age": [32, 45],
"monthly_spend": [85.5, 120.0],
}
)
ENDPOINT_URL = "<your endpoint URL>"
PAT = "<your PAT token>"

HEADERS = {
    "Authorization": f'Snowflake Token="{PAT}"',
    "Content-Type": "application/json",
}
# Use Pandas to generate the JSON, then load it back to a Python dict
split_obj = json.loads(df.to_json(orient="split"))
payload = {
"dataframe_split": split_obj
}
response = requests.post(
ENDPOINT_URL,
headers=HEADERS,
json=payload,
timeout=30,
)
result = response.json()
Key points:
Use pd.DataFrame.to_json (for example, df.to_json(orient="split")) to correctly handle types such as timestamps, floats, nulls, and categoricals, which the standard json serializer does not handle.
json.loads(...) converts the JSON string to a Python dictionary so the payload can be constructed properly.
requests.post(..., json=payload) serializes the dictionary back to JSON for the HTTP request.
dataframe_records format
As with dataframe_split, use Pandas JSON serialization and json.loads:
import json
import pandas as pd
import requests
df = pd.DataFrame(
{
"customer_id": [101, 102],
"age": [32, 45],
"monthly_spend": [85.5, 120.0],
}
)
ENDPOINT_URL = "<your endpoint URL>"
PAT = "<your PAT token>"

HEADERS = {
    "Authorization": f'Snowflake Token="{PAT}"',
    "Content-Type": "application/json",
}
records_obj = json.loads(df.to_json(orient="records"))
payload = {
"dataframe_records": records_obj
}
response = requests.post(
ENDPOINT_URL,
headers=HEADERS,
json=payload,
timeout=30,
)
response.raise_for_status()
result = response.json()
Next Steps¶
Explore these detailed guides to optimize and manage your inference services:
Service Management & Scaling: Learn about autoscaling, manual suspension, and hardware configuration.
Stable Endpoints & API Reference: Deep dive into the Snowflake Gateway, authentication, and data protocols (dataframe_split).
Auto-capture Inference Logs: Set up automated logging for model monitoring.
Real-World Examples & Quickstarts: See end-to-end code for XGBoost (CPU) and Hugging Face (GPU) models.
Troubleshooting: Common fixes for package conflicts, OOM errors, and build failures.