Model Serving in Snowpark Container Services¶
Note
The ability to run models in Snowpark Container Services (SPCS) described in this topic is available in snowflake-ml-python version 1.6.4 and later.
The Snowflake Model Registry allows you to run models either in a warehouse (the default), or in a Snowpark Container Services (SPCS) compute pool through Model Serving. Running models in a warehouse imposes a few limitations on the size and kinds of models you can use (specifically, small-to-medium size CPU-only models whose dependencies can be satisfied by packages available in the Snowflake conda channel).
Running models on Snowpark Container Services (SPCS) eases these restrictions, or eliminates them entirely. You can use any packages you want, including those from the Python Package Index (PyPI) or other sources. Large models can be run on distributed clusters of GPUs. And you don’t need to know anything about container technologies, such as Docker or Kubernetes. Snowflake Model Serving takes care of all the details.
Key concepts¶
A simplified high-level overview of the Snowflake Model Serving inference architecture is shown below.
The main components of the architecture are:
Inference server: The server that runs the model and serves predictions. The inference server can use multiple inference processes to fully utilize the node’s capabilities. Requests to the model are dispatched by admission control, which manages the incoming request queue to avoid out-of-memory conditions, rejecting clients when the server is overloaded. Today, Snowflake provides a simple and flexible Python-based inference server that can run inference for all types of models. Over time, Snowflake plans to offer inference servers optimized for specific model types.
Model-specific Python environment: To reduce the latency of starting a model, which includes the time required to download dependencies and load the model, Snowflake builds a container that encapsulates the dependencies of the specific model.
Service functions: To talk to the inference server from code running in a warehouse, Snowflake Model Serving builds functions that have the same signature as the model, but which instead call the inference server via the external function protocol.
Ingress endpoint: To allow applications outside Snowflake to call the model, Snowflake Model Serving can provision an optional HTTP endpoint, accessible to the public Internet.
How does it work?¶
The following diagram shows how Snowflake Model Serving deploys and serves models in either a warehouse or on SPCS.
As you can see, the path to SPCS deployment is more complex than the path to warehouse deployment, but Snowflake Model Serving does all the work for you, including building the container image that holds the model and its dependencies, and creating the inference server that runs the model.
Prerequisites¶
Before you begin, make sure you have the following:
A Snowflake account in any commercial AWS region. Gov regions are not supported. Contact your account representative if your account is in Azure.
Version 1.6.4 or later of the snowflake-ml-python Python package.
A model you want to run on Snowpark Container Services.
Familiarity with the Snowflake Model Registry.
Familiarity with Snowpark Container Services. In particular, you should understand compute pools, image repositories, and related privileges.
Create a compute pool¶
Snowpark Container Services (SPCS) runs container images in compute pools. If you don’t already have a suitable compute pool, create one as follows:
CREATE COMPUTE POOL IF NOT EXISTS mypool
MIN_NODES = 2
MAX_NODES = 4
INSTANCE_FAMILY = 'CPU_X64_M'
AUTO_RESUME = TRUE;
See the family names table for a list of valid instance families.
Make sure the role that will run the model is the owner of the compute pool or else has the USAGE or OPERATE privilege on the pool.
Create an image repository¶
Snowflake Model Serving builds a container image that holds the model and its dependencies. To store this image, you need an image repository. If you don’t already have one, create one as follows:
CREATE IMAGE REPOSITORY IF NOT EXISTS my_inference_images;
If you will be using an image repository that you do not own, make sure the role that will build the container image has the SERVICE READ, SERVICE WRITE, READ, and WRITE privileges on the repository. Grant these privileges as follows:
GRANT WRITE ON IMAGE REPOSITORY my_inference_images TO ROLE myrole;
GRANT READ ON IMAGE REPOSITORY my_inference_images TO ROLE myrole;
GRANT SERVICE WRITE ON IMAGE REPOSITORY my_inference_images TO ROLE myrole;
GRANT SERVICE READ ON IMAGE REPOSITORY my_inference_images TO ROLE myrole;
Limitations¶
While this capability is in preview, the following limitations apply. Snowflake intends to address these limitations before general availability.
Only the owner of a model can deploy it to Snowpark Container Services.
Size of compute cluster does not auto-scale. You can manually alter the number of instances at runtime using ALTER SERVICE myservice SET MIN_INSTANCES = <n>. In some cases, this causes existing nodes to fail.
Scaling up services and compute pools is slower than expected. This should be improved before general availability.
Auto-suspend of a container service is not supported. If you expect sporadic usage, you might want to manually suspend the service after each use.
Image building fails if it takes more than an hour.
Table functions are not supported. Models with no regular function cannot currently be deployed to Snowpark Container Services.
Deploying a model to SPCS¶
Either log a new model version (using reg.log_model) or obtain a reference to an existing model version (reg.get_model(...).version()). In either situation, you end up with a reference to a ModelVersion object.
Model dependencies and eligibility¶
A model’s dependencies determine if it can run in a warehouse, in a SPCS service, or both. You can, if necessary, intentionally specify dependencies to make a model ineligible to run in one of these environments.
The Snowflake conda channel is available only in warehouses and is the only source for warehouse dependencies. By default, SPCS models obtain their conda dependencies from conda-forge.
When you log a model version, the model’s conda dependencies are validated against the Snowflake conda channel. If all the model’s conda dependencies are available there, the model is deemed eligible to run in a warehouse. It may also be eligible to run in a SPCS service if all of its dependencies are available from conda-forge, although this is not checked until you create a service.
Models logged with PyPI dependencies must be run on SPCS. Specifying at least one PyPI dependency is one way to make a model ineligible to run in a warehouse. If your model has only conda dependencies, specify at least one with an explicit channel (even conda-forge), as shown in the following example.
reg.log_model(
    model_name="my_model",
    version_name="v1",
    model=model,
    conda_dependencies=["conda-forge::scikit-learn"])
For SPCS-deployed models, conda dependencies, if any, are installed first, then any PyPI dependencies are installed in the conda environment using pip.
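For example, a minimal sketch of logging a model that combines conda and pip dependencies might look like the following. It assumes the reg registry object, model, and train sample data used in the examples later in this topic; the pip package shown is only an illustrative placeholder.
mv = reg.log_model(
    model_name="my_model",
    version_name="v2",
    model=model,
    conda_dependencies=["conda-forge::scikit-learn"],  # installed first, via conda
    pip_requirements=["shap"],  # hypothetical extra package, installed into the conda environment via pip
    sample_input_data=train,
)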
Create a service¶
To create a SPCS service and deploy the model to it, call the model version’s create_service method, as shown in the following example.
mv.create_service(service_name="myservice",
                  service_compute_pool="my_compute_pool",
                  image_repo="mydb.myschema.my_image_repo",
                  ingress_enabled=True,
                  gpu_requests=None)
The following are the required arguments to create_service:
service_name: The name of the service to create. This name must be unique within the account.
service_compute_pool: The name of the compute pool to use to run the model. The compute pool must already exist.
image_repo: The name of the image repository to use to store the container image. The repo must already exist and the user must have the SERVICE WRITE privilege on it (or OWNERSHIP).
ingress_enabled: If True, the service is made accessible via an HTTP endpoint. To create the endpoint, the user must have the BIND SERVICE ENDPOINT privilege.
gpu_requests: A string specifying the number of GPUs. For a model that can be run on either CPU or GPU, this argument determines whether the model will be run on the CPU or on the GPUs. If the model is of a known type that can only be run on CPU (for example, scikit-learn models), the image build fails if GPUs are requested.
This example shows only the required and most commonly used arguments. See the ModelVersion API reference for a complete list of arguments.
Default service configuration¶
By default, a CPU-powered model uses a number of worker processes equal to twice the number of CPUs plus one. GPU-powered models use one worker process. You can override this using the num_workers argument.
Some models are not thread-safe. Therefore, the service loads a separate copy of the model for each worker process. This can result in resource depletion for large models.
By default, the inference server optimizes for running a single inference at a time, aiming to make full use of all CPU and memory of each node.
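For example, a minimal sketch of overriding the worker count might look like the following; the service, pool, and repository names are the placeholders used earlier in this topic.
mv.create_service(
    service_name="myservice",
    service_compute_pool="my_compute_pool",
    image_repo="mydb.myschema.my_image_repo",
    ingress_enabled=True,
    num_workers=2)  # load only two copies of the model per node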
Container image build behavior¶
By default, Snowflake Model Serving builds the container image using the same compute pool that will be used to run the model. This inference compute pool is likely overpowered for this task (for example, GPUs are not used in building container images). In most cases, this won’t have a significant impact on compute costs, but if it is a concern, you can choose a less powerful compute pool for building images by specifying the image_build_compute_pool argument.
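For example, a minimal sketch of using a separate, smaller pool for the image build might look like the following; the pool names are placeholders and are assumed to exist.
mv.create_service(
    service_name="myservice",
    service_compute_pool="my_gpu_pool",  # pool that runs the inference service
    image_build_compute_pool="my_cpu_build_pool",  # smaller CPU pool used only to build the image
    image_repo="mydb.myschema.my_image_repo",
    ingress_enabled=True,
    gpu_requests="1")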
create_service is an idempotent function. Calling it multiple times does not trigger image building every time. However, container images may be rebuilt based on updates in the inference service, including fixes for vulnerabilities in dependent packages. When this happens, create_service automatically triggers a rebuild of the image.
Using a model deployed to SPCS¶
You can call a model’s methods using SQL, Python, or an HTTP endpoint.
SQL¶
Snowflake Model Serving creates service functions when deploying a model to SPCS. These functions serve as a bridge from SQL to the model running in the SPCS compute pool. One service function is created for each method of the model, and they are named like service_name_method_name. For example, if the model has two methods named PREDICT and EXPLAIN and is being deployed to a service named MY_SERVICE, the resulting service functions are MY_SERVICE_PREDICT and MY_SERVICE_EXPLAIN.
Note
Service functions are contained within the service. For this reason, they have only a single point of access control, the service. You can’t have different access control privileges for different functions in a single service.
Calling a model’s service functions in SQL is done using code like the following:
SELECT MY_SERVICE_PREDICT(...) FROM ...;
Python¶
Call a service’s methods using the run method of a model version object, including the service_name argument to specify the service where the method will run. For example:
service_prediction = mv.run(
    test_df,
    function_name="predict",
    service_name="my_service")
If you do not include the service_name argument, the model runs in a warehouse.
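For example, a minimal sketch contrasting the two execution targets, assuming the mv and test_df objects from the example above:
wh_prediction = mv.run(test_df, function_name="predict")  # runs in the current warehouse
spcs_prediction = mv.run(test_df, function_name="predict", service_name="my_service")  # runs in the SPCS service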
HTTP endpoint¶
After deploying a service with ingress enabled, a new HTTP endpoint is available for the service. You can find the endpoint using the SHOW ENDPOINTS IN SERVICE command.
SHOW ENDPOINTS IN SERVICE my_service;
Take note of the ingress_url column, which should look like random_str-account-id.snowflakecomputing.app.
To learn more about using this endpoint, see the SPCS tutorial Create a Snowpark Container Services service and the topic Using a service from outside Snowflake in the Developer Guide. For more information about the required data format, see Remote service input and output data formats.
See Deploying a Hugging Face sentence transformer for GPU-powered inference for an example of using a model service HTTP endpoint.
Managing services¶
Snowpark Container Services offers a SQL interface for managing services. You can use the DESCRIBE SERVICE and ALTER SERVICE commands with SPCS services created by Snowflake Model Serving just as you would for managing any other SPCS service. For example, you can:
Change MIN_INSTANCES and other properties of a service
Drop (delete) a service
Share a service to another account
Change ownership of a service (the new owner must have access to the model)
Note
If the owner of a service loses access to the underlying model for any reason, the service continues to run until it is restarted; after a restart, it stops working.
To ensure reproducibility and debuggability, you may not change the specification of an existing inference service. You can, however, copy the specification, customize it, and use the customized specification to create your own service to host the model. Note that this method does not protect the underlying model from being deleted. It is generally best to allow Snowflake Model Serving to create services.
Suspending services¶
When you are done using a model, or if the model is not being used, it is a good practice to suspend the service to save costs. You can do this using the ALTER SERVICE command.
ALTER SERVICE my_service SUSPEND;
The service restarts automatically when it receives a request, subject to scheduling and startup delays. Scheduling delay depends on the availability of the compute pool, and startup delay depends on the size of the model.
Deleting models¶
You can manage models and model versions as usual with either the SQL interface or the Python API, with the restriction that a model or model version that is being used by a service (whether running or suspended) cannot be dropped (deleted). To drop a model or model version, drop the service first.
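For example, a minimal sketch of tearing down a deployment, assuming reg is a Registry object and mv is a model version deployed to a service named myservice (all names are placeholders):
mv.delete_service("myservice")  # a model cannot be dropped while a service is using it
reg.delete_model("my_model")    # with the service gone, the model and its versions can be dropped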
Examples¶
These examples assume you have already created a compute pool, an image repository, and have granted privileges as needed. See Prerequisites for details.
Deploying an XGBoost model for CPU-powered inference¶
The following code illustrates the key steps in deploying an XGBoost model for inference in SPCS, then using the deployed model for inference. A notebook for this example is available.
from snowflake.ml.registry import registry
from snowflake.ml.utils.connection_params import SnowflakeLoginOptions
from snowflake.snowpark import Session
from snowflake.ml.modeling.xgboost import XGBRegressor

session = Session.builder.configs(SnowflakeLoginOptions("connection_name")).create()

# Your model training code here, the output of which is a trained model named xgb_model

# Open the model registry
reg = registry.Registry(session=session, database_name='my_registry_db', schema_name='my_registry_schema')

# Log the model in the Snowflake Model Registry
model_ref = reg.log_model(
    model_name="my_xgb_forecasting_model",
    version_name="v1",
    model=xgb_model,
    conda_dependencies=["scikit-learn", "xgboost"],
    sample_input_data=train,
    comment="XGBoost model for forecasting customer demand"
)

# Deploy the model to SPCS
model_ref.create_service(
    service_name="ForecastModelServicev1",
    service_compute_pool="my_cpu_pool",
    image_repo="my_db.data.my_images",
    ingress_enabled=True)

# See all services running the model
model_ref.list_services()

# Run on SPCS
model_ref.run(input_data, function_name="predict", service_name="ForecastModelServicev1")

# Delete the service
model_ref.delete_service("ForecastModelServicev1")
Deploying a Hugging Face sentence transformer for GPU-powered inference¶
This code deploys a pretrained Hugging Face sentence transformer, including an HTTP endpoint.
from snowflake.ml.registry import registry
from snowflake.ml.utils.connection_params import SnowflakeLoginOptions
from snowflake.snowpark import Session
from sentence_transformers import SentenceTransformer
session = Session.builder.configs(SnowflakeLoginOptions("connection_name")).create()
reg = registry.Registry(session=session, database_name='my_registry_db', schema_name='my_registry_schema')
# Take an example sentence transformer from HF
embed_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
# Have some sample input data
input_data = [
    "This is the first sentence.",
    "Here's another sentence for testing.",
    "The quick brown fox jumps over the lazy dog.",
    "I love coding and programming.",
    "Machine learning is an exciting field.",
    "Python is a popular programming language.",
    "I enjoy working with data.",
    "Deep learning models are powerful.",
    "Natural language processing is fascinating.",
    "I want to improve my NLP skills.",
]
# Log the model with pip dependencies
pip_model = reg.log_model(
    embed_model,
    model_name="sentence_transformer_minilm",
    version_name='pip',
    sample_input_data=input_data,  # Needed for determining signature of the model
    pip_requirements=["sentence-transformers", "torch", "transformers"],  # If you want to run this model in the Warehouse, you can use conda_dependencies instead
)
# Force Snowflake to skip the warehouse eligibility check
conda_forge_model = reg.log_model(
    embed_model,
    model_name="sentence_transformer_minilm",
    version_name='conda_forge_force',
    sample_input_data=input_data,
    # specifying any package from conda-forge is sufficient to mark the model as not runnable in a warehouse
    conda_dependencies=["sentence-transformers", "conda-forge::pytorch", "transformers"]
)
# Deploy the model to SPCS
pip_model.create_service(
    service_name="my_minilm_service",
    service_compute_pool="my_gpu_pool",  # Using GPU_NV_S - smallest GPU node that can run the model
    image_repo="my_db.data.my_images",
    ingress_enabled=True,
    gpu_requests="1",  # Model fits in GPU memory; only needed for GPU pool
    max_instances=4,  # 4 instances were able to run 10M inferences from an XS warehouse
)
# See all services running a model
pip_model.list_services()
# Run on SPCS
pip_model.run(input_data, function_name="encode", service_name="my_minilm_service")
# Delete the service
pip_model.delete_service("my_minilm_service")
Since this model has ingress enabled, you can call its HTTP endpoint as follows.
import json
from pprint import pprint
import requests
import snowflake.connector
from snowflake.ml.utils.connection_params import SnowflakeLoginOptions
# Generate the authorization headers
# Note that, ideally, you should use key-pair authentication for API access (see this).
def initiate_snowflake_connection():
    connection_parameters = SnowflakeLoginOptions("connection_name")
    connection_parameters["session_parameters"] = {"PYTHON_CONNECTOR_QUERY_RESULT_FORMAT": "json"}
    snowflake_conn = snowflake.connector.connect(**connection_parameters)
    return snowflake_conn

def get_headers(snowflake_conn):
    token = snowflake_conn._rest._token_request('ISSUE')
    headers = {'Authorization': f'Snowflake Token=\"{token["data"]["sessionToken"]}\"'}
    return headers
headers = get_headers(initiate_snowflake_connection())
# Put the endpoint url with method name
URL='https://<random_str>-<account>.snowflakecomputing.app/encode'
# Prepare data to be sent
data = {
    'data': []
}

for idx, x in enumerate(input_data):
    data['data'].append([idx, x])
# Send over HTTP
def send_request(data: dict):
    output = requests.post(URL, json=data, headers=headers)
    assert (output.status_code == 200), f"Failed to get response from the service. Status code: {output.status_code}"
    return output.content
# Test
results = send_request(data=data)
pprint(json.loads(results))
Deploying a PyTorch model for GPU-powered inference¶
See this quickstart for an example of training and deploying a PyTorch deep learning recommendation model (DLRM) to SPCS for GPU inference.
Best practices¶
- Sharing the image repository
It is common for multiple users or roles to use the same model. Using a single image repository allows the image to be built once and reused by all users, saving time and expense. All roles that will use the repo need the SERVICE READ, SERVICE WRITE, READ, and WRITE privileges on the repo. Since the image might need to be rebuilt to update dependencies, you should keep the write privileges; don’t revoke them after the image is initially built.
- Scaling the inference service
Snowpark Container Services autoscaling is very conservative and does not scale up or down fast enough for most ML workloads. For this reason, Snowflake recommends that you set both MIN_INSTANCES and MAX_INSTANCES to the same value, choosing these values to get the performance you need for your typical workloads. Use SQL like the following:
ALTER SERVICE myservice SET MIN_INSTANCES = <new_num> MAX_INSTANCES = <new_num>;
For the same reason, when initially creating the service using the Python API, the create_service method accepts only max_instances and uses that same value for min_instances.
- Choosing node type and number of instances
Use the smallest GPU node where the model fits into memory. Scale by increasing the number of instances, as opposed to increasing num_workers in a larger GPU node. For example, if the model fits in the GPU_NV_S instance type, use gpu_requests=1 and scale up by increasing max_instances rather than using a combination of gpu_requests and num_workers on a larger GPU instance.
- Choosing warehouse size
The larger the warehouse, the more parallel requests are sent to the inference servers. Inference is an expensive operation, so use a smaller warehouse where possible. Using a warehouse size larger than MEDIUM does not accelerate query performance and incurs additional cost.
- Separate schema for model deployments
Creating a service creates multiple schema-level objects (the service itself and one service function per model function). To avoid clutter, use separate schemas for storing models (Snowflake Model Registry) and deploying them (Snowflake Model Serving).
Troubleshooting¶
Monitoring SPCS Deployments¶
You can monitor deployment by inspecting the services being launched using the following SQL query.
SHOW SERVICES IN COMPUTE POOL my_compute_pool;
Two jobs are launched:
MODEL_BUILD_xxxxx: The final characters of the name are randomized to avoid name conflicts. This job builds the image and ends after the image has been built. If an image already exists, the job is skipped.
The logs are useful for debugging issues such as conflicts in package dependencies. To see the logs from this job, run the SQL below, being sure to use the same final characters:
CALL SYSTEM$GET_SERVICE_LOGS('MODEL_BUILD_xxxxx', 0, 'model-build');
MYSERVICE: The name of the service as specified in the call to create_service. This job is started if the MODEL_BUILD job is successful or skipped. To see the logs from this job, run the SQL below:
CALL SYSTEM$GET_SERVICE_LOGS('MYSERVICE', 0, 'model-inference');
Package conflicts¶
Two systems dictate the packages installed in the service container: the model itself and the inference server. To minimize conflicts with your model’s dependencies, the inference server requires only the following packages:
gunicorn<24.0.0
starlette<1.0.0
uvicorn-standard<1.0.0
Make sure your model dependencies, along with the above, are resolvable by pip or conda, whichever you use.
If a model has both conda_dependencies and pip_requirements set, these will be installed as follows via conda:
channels:
- conda-forge
- nodefaults
dependencies:
- all_conda_packages
- pip:
  - all_pip_packages
By default, Snowflake uses conda-forge for Anaconda packages, since the Snowflake conda channel is available only in warehouses and the defaults channel requires users to accept the Anaconda terms of use. To specify packages from the defaults channel, prefix the package name with the channel: defaults::pkg_name.
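For example, a minimal sketch of requesting packages from the defaults channel might look like the following; the package list is illustrative only, and reg and model are assumed to be defined as in the earlier examples.
reg.log_model(
    model_name="my_model",
    version_name="v3",
    model=model,
    conda_dependencies=["defaults::scikit-learn", "conda-forge::xgboost"])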
Service out of memory¶
Some models are not thread-safe, so Snowflake loads a copy of the model in memory for each worker process. This can cause out-of-memory conditions for large models with a higher number of workers. Try reducing num_workers.
Unsatisfactory query performance¶
Usually, inference is bottlenecked by the number of instances in the inference service. Try passing a higher value for max_instances while deploying the model.
Unable to alter the service spec¶
The specifications of the model build and inference services cannot be changed using ALTER SERVICE. You can only change attributes such as TAG, MIN_INSTANCES, and so forth.
Since the image is published in the image repo, however, you can copy the spec, modify it, and create a new service from it, which you can start manually.