Model Serving in Snowpark Container Services

Note

The ability to run models in Snowpark Container Services (SPCS) described in this topic is available in snowflake-ml-python version 1.6.4 and later.

The Snowflake Model Registry allows you to run models either in a warehouse (the default), or in a Snowpark Container Services (SPCS) compute pool through Model Serving. Running models in a warehouse imposes a few limitations on the size and kinds of models you can use (specifically, small-to-medium size CPU-only models whose dependencies can be satisfied by packages available in the Snowflake conda channel).

Running models on Snowpark Container Services (SPCS) eases these restrictions, or eliminates them entirely. You can use any packages you want, including those from the Python Package Index (PyPI) or other sources. Large models can be run on distributed clusters of GPUs. And you don’t need to know anything about container technologies, such as Docker or Kubernetes. Snowflake Model Serving takes care of all the details.

Key concepts

A simplified high-level overview of the Snowflake Model Serving inference architecture is shown below.

Model inference on Snowpark Container Services architecture

The main components of the architecture are:

  • Inference server: The server that runs the model and serves predictions. The inference server can use multiple inference processes to fully utilize the node’s capabilities. Requests to the model are dispatched by admission control, which manages the incoming request queue to avoid out-of-memory conditions, rejecting clients when the server is overloaded. Today, Snowflake provides a simple and flexible Python-based inference server that can run inference for all types of models. Over time, Snowflake plans to offer inference servers optimized for specific model types.

  • Model-specific Python environment: To reduce the latency of starting a model, which includes the time required to download dependencies and load the model, Snowflake builds a container that encapsulates the dependencies of the specific model. This may require an external access integration to allow the container build process to download the necessary dependencies using pip or conda.

    Note

    External access integrations are required only when dependencies need to be downloaded from an external repository, such as conda-forge or PyPI. Snowflake intends to remove this requirement in a future release.

  • Service functions: To talk to the inference server from code running in a warehouse, Snowflake Model Serving builds functions that have the same signature as the model, but which instead call the inference server via the external function protocol.

  • Ingress endpoint: To allow applications outside Snowflake to call the model, Snowflake Model Serving can provision an optional HTTP endpoint, accessible to the public Internet.

How does it work?

The following diagram shows how Snowflake Model Serving deploys and serves models in either a warehouse or on SPCS.

Model deployment on Snowpark Container Services

As you can see, the path to SPCS deployment is more complex than the path to warehouse deployment, but Snowflake Model Serving does all the work for you, including building the container image that holds the model and its dependencies, and creating the inference server that runs the model.

Prerequisites

Before you begin, make sure you have the following:

  • A Snowflake account in any commercial AWS region. Government regions are not supported. If your account is in Azure, contact your account representative.

  • Version 1.6.4 or later of the snowflake-ml-python Python package.

  • A model you want to run on Snowpark Container Services.

  • Familiarity with the Snowflake Model Registry.

  • Familiarity with Snowpark Container Services. In particular, you should understand compute pools, image repositories, and related privileges.

Create a compute pool

Snowpark Container Services (SPCS) runs container images in compute pools. If you don’t already have a suitable compute pool, create one as follows:

CREATE COMPUTE POOL IF NOT EXISTS mypool
    MIN_NODES = 2
    MAX_NODES = 4
    INSTANCE_FAMILY = 'CPU_X64_M'
    AUTO_RESUME = TRUE;

See the family names table for a list of valid instance families.

Make sure the role that will run the model is the owner of the compute pool or else has the USAGE or OPERATE privilege on the pool.
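
For example, to grant these privileges on the pool created above (myrole is a placeholder for the role that will run the model):

GRANT USAGE ON COMPUTE POOL mypool TO ROLE myrole;
GRANT OPERATE ON COMPUTE POOL mypool TO ROLE myrole;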

Create an image repository

Snowflake Model Serving builds a container image that holds the model and its dependencies. To store this image, you need an image repository. If you don’t already have one, create one as follows:

CREATE IMAGE REPOSITORY IF NOT EXISTS my_inference_images;

If you will be using an image repository that you do not own, make sure the role that will build the container image has the SERVICE READ, SERVICE WRITE, READ, and WRITE privileges on the repository. Grant these privileges as follows:

GRANT WRITE ON IMAGE REPOSITORY my_inference_images TO ROLE myrole;
GRANT READ ON IMAGE REPOSITORY my_inference_images TO ROLE myrole;
GRANT SERVICE WRITE ON IMAGE REPOSITORY my_inference_images TO ROLE myrole;
GRANT SERVICE READ ON IMAGE REPOSITORY my_inference_images TO ROLE myrole;

Create an external access integration

The container build process needs access to various Internet sites to download dependencies from conda-forge, PyPI, or other repositories or sites. These must be set up as external access integrations (EAIs) by the ACCOUNTADMIN role.

Note

External access integrations are required only when dependencies need to be downloaded from an external repository, such as conda-forge or PyPI. Snowflake intends to remove this requirement in a future release.

External access integrations are account-level objects and cannot be shared.

First, create the necessary network rules. Access to conda-forge is always needed; if you don’t need access to any other package repositories, create only this rule:

CREATE OR REPLACE NETWORK RULE conda_forge_rule
    MODE = 'EGRESS'
    TYPE = 'HOST_PORT'
    VALUE_LIST = ('conda.anaconda.org:443');

Note

You cannot use the Snowflake conda channel with Snowpark Container Services. All conda packages are installed from conda-forge when building an SPCS container image.

If you need to install packages from PyPI, also create the following rule. All four of the listed hosts are necessary for pip to work.

CREATE OR REPLACE NETWORK RULE pypi_rule
    MODE = 'EGRESS'
    TYPE = 'HOST_PORT'
    VALUE_LIST = ('pypi.org:443', 'pypi.python.org:443', 'pythonhosted.org:443',
                  'files.pythonhosted.org:443');

If you need access to many other sites, you can create a rule that allows broad access to the Internet. There is little security risk involved as long as this rule applies only to the role used to build container images.

CREATE OR REPLACE NETWORK RULE all_access_rule
    MODE = 'EGRESS'
    TYPE = 'HOST_PORT'
    VALUE_LIST = ('0.0.0.0:443', '0.0.0.0:80');

Create the external access integration using one or more of your rules. In the example below, we use both the conda_forge_rule and pypi_rule defined earlier, allowing access only to conda-forge and PyPI.

CREATE OR REPLACE EXTERNAL ACCESS INTEGRATION model_service_build_access
    ALLOWED_NETWORK_RULES = (conda_forge_rule, pypi_rule)
    ENABLED = true;

Finally, grant USAGE on the EAI to the role that will build container images.

GRANT USAGE ON INTEGRATION model_service_build_access TO ROLE model_users;

Limitations

While this capability is in preview, the following limitations apply. Snowflake intends to address these limitations before general availability.

  • Only the owner of a model can deploy it to Snowpark Container Services.

  • The size of the compute cluster does not auto-scale. You can manually change the number of instances at runtime using ALTER SERVICE myservice SET MIN_INSTANCES = <n>. In some cases, this causes existing nodes to fail.

  • Scaling up services and compute pools is slower than expected. This should be improved before general availability.

  • Auto-suspend of a container service is not supported. If you expect sporadic usage, you might want to manually suspend the service after each use.

  • The create_service Python method does not allow specifying multiple external access integrations (EAIs). If a model needs both conda and pip, create a single EAI that allows both of them. (Note that you can create an EAI with multiple network rules.)

  • Image building fails if it takes more than an hour.

  • Table functions are not supported. Models that have no regular (non-table) functions cannot currently be deployed to Snowpark Container Services.

Deploying a model to SPCS

You can either log a new model version (using registry.log_model) or obtain a reference to an existing model version (registry.get_model(...).version()). In either situation, you end up with a reference to a ModelVersion object.
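
For example, a minimal sketch of obtaining a reference to an existing model version, assuming an existing Snowpark session (the model and version names are placeholders):

from snowflake.ml.registry import registry

reg = registry.Registry(session=session, database_name="my_registry_db", schema_name="my_registry_schema")

# Obtain a reference to an already-logged model version
mv = reg.get_model("my_model").version("v1")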

Note

Models logged with PyPI dependencies must be run in SPCS; they cannot be run in a warehouse. Conversely, the Snowflake conda channel is available only in warehouses; conda packages for SPCS models are installed only from conda-forge. If an SPCS model has PyPI dependencies, they are installed with pip inside the conda environment.

To deploy the model version to SPCS, call the model version’s create_service method, as shown here.

mv.create_service(service_name="myservice",
                  service_compute_pool="my_compute_pool",
                  image_repo="mydb.myschema.my_image_repo",
                  build_external_access_integration="my_external_access",
                  ingress_enabled=True,
                  gpu_requests=None)

The following are the required arguments to create_service:

  • service_name: The name of the service to create. This name must be unique within the account.

  • service_compute_pool: The name of the compute pool to use to run the model. The compute pool must already exist.

  • image_repo: The name of the image repository to use to store the container image. The repo must already exist and the user must have the SERVICE WRITE privilege on it (or OWNERSHIP).

  • build_external_access_integration: The name of the external access integration to use when downloading dependencies. This EAI should always allow access to conda-forge and should also include PyPI hosts if any dependencies are installed with pip.

  • ingress_enabled: If True, the service is made accessible via an HTTP endpoint. To create the endpoint, the user must have the BIND SERVICE ENDPOINT privilege.

  • gpu_requests: A string specifying the number of GPUs. For a model that can be run on either CPU or GPU, this argument determines whether the model will be run on the CPU or on the GPUs. If the model is of a known type that can only be run on CPU (for example, scikit-learn models), the image build fails if GPUs are requested.

This example shows only the required and most commonly used arguments. See the ModelVersion API reference for a complete list of arguments.

Default service configuration

By default, a CPU-powered model uses a number of worker processes equal to twice the number of CPUs, plus one. GPU-powered models use one worker process. You can override this using the num_workers argument.

Some models are not thread-safe. Therefore, the service loads a separate copy of the model for each worker process. This can result in resource depletion for large models.

By default, the inference server optimizes for running a single inference at a time, aiming to make full use of all CPU and memory of each node.
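
For example, a sketch of overriding the worker count when creating the service (argument values here are illustrative, not recommendations):

mv.create_service(service_name="myservice",
                  service_compute_pool="my_compute_pool",
                  image_repo="mydb.myschema.my_image_repo",
                  build_external_access_integration="my_external_access",
                  num_workers=1)  # load a single copy of the model to limit memory use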

Container image build behavior

By default, Snowflake Model Serving builds the container image using the same compute pool that will be used to run the model. This inference compute pool is likely overpowered for this task (for example, GPUs are not used in building container images). In most cases, this won’t have a significant impact on compute costs, but if it is a concern, you can choose a less powerful compute pool for building images by specifying the image_build_compute_pool argument.
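
For example, a sketch of building the image on a smaller CPU pool while serving on a GPU pool (the pool names are placeholders):

mv.create_service(service_name="myservice",
                  service_compute_pool="my_gpu_pool",          # pool that runs inference
                  image_build_compute_pool="my_cpu_pool",      # smaller pool used only to build the image
                  image_repo="mydb.myschema.my_image_repo",
                  build_external_access_integration="my_external_access")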

create_service is idempotent; calling it multiple times does not trigger an image build every time. However, container images may be rebuilt based on updates in the inference service, including fixes for vulnerabilities in dependent packages. When this happens, create_service automatically triggers a rebuild of the image. Therefore, do not disable the external access integration after the first call to create_service.

Using a model deployed to SPCS

You can call a model’s methods using SQL, Python, or an HTTP endpoint.

SQL

Snowflake Model Serving creates service functions when deploying a model to SPCS. These functions serve as a bridge from SQL to the model running in the SPCS compute pool. One service function is created for each method of the model, named in the form <service_name>_<method_name>. For example, if the model has two methods named PREDICT and EXPLAIN and is being deployed to a service named MY_SERVICE, the resulting service functions are MY_SERVICE_PREDICT and MY_SERVICE_EXPLAIN.

Note

Service functions are contained within the service. For this reason, they have only a single point of access control, the service. You can’t have different access control privileges for different functions in a single service.

Call a model’s service functions in SQL using code like the following:

SELECT MY_SERVICE_PREDICT(...) FROM ...;

Python

Call a service’s methods using the run method of a model version object, including the service_name argument to specify the service where the method will run. For example:

service_prediction = mv.run(
    test_df,
    function_name="predict",
    service_name="my_service")

If you do not include the service_name argument, the model runs in a warehouse.
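
For example:

# Without service_name, the same call runs the model in the current warehouse
warehouse_prediction = mv.run(test_df, function_name="predict")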

HTTP endpoint

After deploying a service with ingress enabled, a new HTTP endpoint is available for the service. You can find the endpoint using the SHOW ENDPOINTS IN SERVICE command.

SHOW ENDPOINTS IN SERVICE my_service;

Take note of the ingress_url column, which should look like random_str-account-id.snowflakecomputing.app.

To learn more about using this endpoint, see the SPCS tutorial Create a Snowpark Container Services service and the topic Using a service from outside Snowflake in the Developer Guide. For more information about the required data format, see Remote service input and output data formats.

See Deploying a Hugging Face sentence transformer for GPU-powered inference for an example of using a model service HTTP endpoint.

Managing services

Snowpark Container Services offers a SQL interface for managing services. You can use the DESCRIBE SERVICE and ALTER SERVICE commands with SPCS services created by Snowflake Model Serving just as you would for managing any other SPCS service. For example, you can:

  • Change MIN_INSTANCES and other properties of a service

  • Drop (delete) a service

  • Share a service to another account

  • Change ownership of a service (the new owner must have access to the model)
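
For example, the following standard SPCS commands work on a service created by Snowflake Model Serving (my_service is a placeholder):

DESCRIBE SERVICE my_service;

ALTER SERVICE my_service
    SET MIN_INSTANCES = 2
        MAX_INSTANCES = 2;

DROP SERVICE my_service;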

Note

If the owner of a service loses access to the underlying model for any reason, the service continues running until it is restarted; after a restart, it stops working.

To ensure reproducibility and debuggability, you cannot change the specification of an existing inference service. You can, however, copy the specification, customize it, and use the customized specification to create your own service to host the model. This method does not, however, protect the underlying model from being deleted. It is generally best to allow Snowflake Model Serving to create services.

Suspending services

Once you are done using a model, or if there is no usage, it is a good practice to suspend the service to save costs. You can do this using the ALTER SERVICE command.

ALTER SERVICE my_service SUSPEND;

The service restarts automatically when it receives a request, subject to scheduling and startup delays. Scheduling delay depends on the availability of the compute pool, and startup delay depends on the size of the model.

Deleting models

You can manage models and model versions as usual with either the SQL interface or the Python API, with the restriction that a model or model version that is being used by a service (whether running or suspended) cannot be dropped (deleted). To drop a model or model version, drop the service first.
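
For example, a sketch using the Python API (delete_model on the registry object is assumed here; names are placeholders):

# Drop the service first; a model in use by a service cannot be dropped
mv.delete_service("myservice")

# With no services using it, the model can now be deleted
# (you can also use DROP MODEL in SQL)
reg.delete_model("my_model")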

Examples

These examples assume you have already created a compute pool, an image repository, and an external access integration, and have granted privileges as needed. See Prerequisites for details.

Deploying an XGBoost model for CPU-powered inference

The following code illustrates the key steps in deploying an XGBoost model for inference in SPCS, then using the deployed model for inference. A notebook for this example is available.

from snowflake.ml.registry import registry
from snowflake.ml.utils.connection_params import SnowflakeLoginOptions
from snowflake.snowpark import Session
from snowflake.ml.modeling.xgboost import XGBRegressor

# Your model training code goes here; its output is a trained model, xgb_model

session = Session.builder.configs(SnowflakeLoginOptions("connection_name")).create()

# Open the model registry
reg = registry.Registry(session=session, database_name='my_registry_db', schema_name='my_registry_schema')

# Log the model in Snowflake Model Registry
model_ref = reg.log_model(
    model_name="my_xgb_forecasting_model",
    version_name="v1",
    model=xgb_model,
    conda_dependencies=["scikit-learn","xgboost"],
    sample_input_data=train,
    comment="XGBoost model for forecasting customer demand"
)

# Deploy the model to SPCS
model_ref.create_service(
    service_name="ForecastModelServicev1",
    service_compute_pool="my_cpu_pool",
    image_repo="my_db.data.my_images",
    build_external_access_integration="my_egress_access_integration_for_conda_pip",
    ingress_enabled=True)

# See all services running a model
model_ref.list_services()

# Run on SPCS
model_ref.run(input_data, function_name="predict", service_name="ForecastModelServicev1")

# Delete the service
model_ref.delete_service("ForecastModelServicev1")

Deploying a Hugging Face sentence transformer for GPU-powered inference

This code trains and deploys a Hugging Face sentence transformer, including an HTTP endpoint.

from snowflake.ml.registry import registry
from snowflake.ml.utils.connection_params import SnowflakeLoginOptions
from snowflake.snowpark import Session
from sentence_transformers import SentenceTransformer

session = Session.builder.configs(SnowflakeLoginOptions("connection_name")).create()
reg = registry.Registry(session=session, database_name='my_registry_db', schema_name='my_registry_schema')

# Take an example sentence transformer from HF
embed_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# Have some sample input data
input_data = [
    "This is the first sentence.",
    "Here's another sentence for testing.",
    "The quick brown fox jumps over the lazy dog.",
    "I love coding and programming.",
    "Machine learning is an exciting field.",
    "Python is a popular programming language.",
    "I enjoy working with data.",
    "Deep learning models are powerful.",
    "Natural language processing is fascinating.",
    "I want to improve my NLP skills.",
]

# Log the model with pip dependencies
pip_model = reg.log_model(
    embed_model,
    model_name="sentence_transformer_minilm",
    version_name='pip',
    sample_input_data=input_data,  # Needed for determining signature of the model
    pip_requirements=["sentence-transformers", "torch", "transformers"],  # If you want to run this model in the warehouse, you can use conda_dependencies instead
)

# Force Snowflake to skip the warehouse compatibility check
conda_forge_model = reg.log_model(
    embed_model,
    model_name="sentence_transformer_minilm",
    version_name='conda_forge_force',
    sample_input_data=input_data,
    # Specifying any package from conda-forge is sufficient to indicate that the model can't run in a warehouse
    conda_dependencies=["sentence-transformers", "conda-forge::pytorch", "transformers"]
)

# Deploy the model to SPCS
pip_model.create_service(
    service_name="my_minilm_service",
    service_compute_pool="my_gpu_pool",  # Using GPU_NV_S - smallest GPU node that can run the model
    image_repo="my_db.data.my_images",
    ingress_enabled=True,
    build_external_access_integration="my_egress_access_integration_for_conda_pip",
    gpu_requests="1", # Model fits in GPU memory; only needed for GPU pool
    max_instances=4, # 4 instances were able to run 10M inferences from an XS warehouse
)

# See all services running a model
pip_model.list_services()

# Run on SPCS
pip_model.run(input_data, function_name="encode", service_name="my_minilm_service")

# Delete the service
pip_model.delete_service("my_minilm_service")

Since this model has ingress enabled, you can call its HTTP endpoint as follows.

import json
from pprint import pprint
import requests
import snowflake.connector
from snowflake.ml.utils.connection_params import SnowflakeLoginOptions

# Generate the correct authorization header.
# Note: ideally, use key-pair authentication for API access.
def initiate_snowflake_connection():
    connection_parameters = SnowflakeLoginOptions("connection_name")
    connection_parameters["session_parameters"] = {"PYTHON_CONNECTOR_QUERY_RESULT_FORMAT": "json"}
    snowflake_conn = snowflake.connector.connect(**connection_parameters)
    return snowflake_conn

def get_headers(snowflake_conn):
    token = snowflake_conn._rest._token_request('ISSUE')
    headers = {'Authorization': f'Snowflake Token=\"{token["data"]["sessionToken"]}\"'}
    return headers

headers = get_headers(initiate_snowflake_connection())

# The endpoint URL, with the model method name appended as the path
URL='https://<random_str>-<account>.snowflakecomputing.app/encode'

# Prepare data to be sent
data = {
    'data': []
}
for idx, x in enumerate(input_data):
    data['data'].append([idx, x])

# Send over HTTP
def send_request(data: dict):
    output = requests.post(URL, json=data, headers=headers)
    assert (output.status_code == 200), f"Failed to get response from the service. Status code: {output.status_code}"
    return output.content

# Test
results = send_request(data=data)
pprint(json.loads(results))

Deploying a PyTorch model for GPU-powered inference

See this quickstart for an example of training and deploying a PyTorch deep learning recommendation model (DLRM) to SPCS for GPU inference.

Best practices

Sharing the image repository

It is common for multiple users or roles to use the same model. Using a single image repository allows the image to be built once and reused by all users, saving time and expense. All roles that will use the repo need the SERVICE READ, SERVICE WRITE, READ, and WRITE privileges on the repo. Because the image might need to be rebuilt to update dependencies, keep the write privileges; don’t revoke them after the image is initially built.

Scaling the inference service

Snowpark Container Services autoscaling is very conservative and does not scale up or down fast enough for most ML workloads. For this reason, Snowflake recommends that you set both MIN_INSTANCES and MAX_INSTANCES to the same value, choosing these values to get the performance you need for your typical workloads. Use SQL like the following:

ALTER SERVICE myservice
    SET MIN_INSTANCES = <new_num>
        MAX_INSTANCES = <new_num>;

For the same reason, when initially creating the service using the Python API, the create_service method accepts only max_instances and uses that same value for min_instances.

Choosing node type and number of instances

Use the smallest GPU node where the model fits into memory. Scale by increasing the number of instances, as opposed to increasing num_workers in a larger GPU node. For example, if the model fits in the GPU_NV_S instance type, use gpu_requests=1 and scale up by increasing max_instances rather than using a combination of gpu_requests and num_workers on a larger GPU instance.

Choosing warehouse size

The larger the warehouse, the more parallel requests are sent to the inference servers. Inference is an expensive operation, so use a smaller warehouse where possible. Using a warehouse size larger than MEDIUM does not improve query performance and incurs additional cost.

Separate schema for model deployments

Creating a service creates multiple schema-level objects (the service itself and one service function per model function). To avoid clutter, use separate schemas for storing models (Snowflake Model Registry) and deploying them (Snowflake Model Serving).
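
For example, one possible layout (database and schema names are illustrative):

-- Schema that holds the Model Registry (models and versions)
CREATE SCHEMA IF NOT EXISTS ml_db.model_registry;

-- Separate schema that holds deployed services and their service functions
CREATE SCHEMA IF NOT EXISTS ml_db.model_services;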

Troubleshooting

Monitoring SPCS Deployments

You can monitor deployment by inspecting the services being launched using the following SQL query.

SHOW SERVICES IN COMPUTE POOL my_compute_pool;

Two jobs are launched:

  • MODEL_BUILD_xxxxx: The final characters of the name are randomized to avoid name conflicts. This job builds the image and ends after the image has been built. If an image already exists, the job is skipped.

    The logs are useful for debugging conflicts in package dependencies or external access integration issues that are preventing access to package repositories, among other potential issues. To see the logs from this job, run the SQL below, being sure to use the same final characters:

    CALL SYSTEM$GET_SERVICE_LOGS('MODEL_BUILD_xxxxx', 0, 'model-build');
    
  • MYSERVICE: The name of the service as specified in the call to create_service. This job is started if the MODEL_BUILD job is successful or skipped. To see the logs from this job, run the SQL below:

    CALL SYSTEM$GET_SERVICE_LOGS('MYSERVICE', 0, 'model-inference');
    

Package conflicts

Two systems dictate the packages installed in the service container: the model itself and the inference server. To minimize conflicts with your model’s dependencies, the inference server requires only the following packages:

  • gunicorn<24.0.0

  • starlette<1.0.0

  • uvicorn-standard<1.0.0

Make sure your model dependencies, along with the above, are resolvable by pip or conda, whichever you use.

If a model has both conda_dependencies and pip_requirements set, these will be installed via conda using an environment specification of the following form:

channels:
  - conda-forge
  - nodefaults
dependencies:
  - all_conda_packages
  - pip:
      - all_pip_packages

By default, Snowflake uses conda-forge for Anaconda packages, since the Snowflake conda channel is available only in warehouses, and the defaults channel requires users to accept the Anaconda terms of use. To specify a package from the defaults channel, prefix the package name with the channel: defaults::pkg_name.
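
For example, a sketch of a dependency list that pulls one package from the defaults channel (the model, data, and package names are placeholders):

mv = reg.log_model(
    my_model,
    model_name="my_model",
    version_name="v1",
    sample_input_data=sample_df,
    conda_dependencies=["scikit-learn", "defaults::some_package"],  # defaults:: selects the defaults channel
)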

Service out of memory

Some models are not thread-safe, so Snowflake loads a copy of the model in memory for each worker process. This can cause out-of-memory conditions for large models with a higher number of workers. Try reducing num_workers.

Unsatisfactory query performance

Usually, inference is bottlenecked by the number of instances in the inference service. Try passing a higher value for max_instances while deploying the model.

Unable to alter the service spec

The specifications of the model build and inference services cannot be changed using ALTER SERVICE. You can only change attributes such as TAG, MIN_INSTANCES, and so forth.

Since the image is published in the image repo, however, you can copy the spec, modify it, and create a new service from it, which you can start manually.
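
A sketch of that workflow (the specification shown is a placeholder for your modified copy):

-- The spec column of the output contains the service specification YAML
DESCRIBE SERVICE my_service;

-- Create your own service from a modified copy of that specification
CREATE SERVICE my_custom_service
    IN COMPUTE POOL my_compute_pool
    FROM SPECIFICATION $$
    <modified specification YAML goes here>
    $$;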