Service Management & Scaling¶
Once a model is deployed to Snowpark Container Services (SPCS), you must manage its lifecycle, resource consumption, and reliability. This page covers standard operations, observability, and configuring high availability for production workloads.
Managing services¶
Snowpark Container Services offers a SQL interface for managing services. You can use the DESCRIBE SERVICE and ALTER SERVICE commands with SPCS services created by Snowflake Model Serving just as you would for managing any other SPCS service. For example, you can:
Change MIN_INSTANCES and other properties of a service
Drop (delete) a service
Share a service to another account
Change ownership of a service (the new owner must have READ access to the model)
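As a rough sketch, a few of the operations above can be expressed as SQL statements built from Python. The service names used here are hypothetical, and the helper functions are illustrative wrappers that only assemble statement text; run the result through whichever Snowflake client you already use.

```python
import re

# Plain, unquoted Snowflake identifiers, optionally qualified as db.schema.name.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_$]*(\.[A-Za-z_][A-Za-z0-9_$]*){0,2}$")


def _check_name(name: str) -> str:
    """Reject anything that is not a simple (optionally qualified) identifier."""
    if not _IDENT.match(name):
        raise ValueError(f"unsupported service name: {name!r}")
    return name


def describe_service_sql(service: str) -> str:
    """Statement that shows the properties of a service."""
    return f"DESCRIBE SERVICE {_check_name(service)}"


def set_min_instances_sql(service: str, min_instances: int) -> str:
    """Statement that changes MIN_INSTANCES on an existing service."""
    if min_instances < 0:
        raise ValueError("min_instances must be 0 or greater")
    return f"ALTER SERVICE {_check_name(service)} SET MIN_INSTANCES = {min_instances}"


def drop_service_sql(service: str) -> str:
    """Statement that drops (deletes) a service."""
    return f"DROP SERVICE {_check_name(service)}"


# Hypothetical service name, used only for illustration.
print(set_min_instances_sql("my_db.my_schema.my_model_service", 2))
```

The name check is deliberately strict so that the helpers never interpolate arbitrary text into a statement; quoted identifiers would need separate handling.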
Note
If the owner of a service loses access to the underlying model for any reason, the service continues running until its next restart and stops working after that restart.
To ensure reproducibility and debuggability, you cannot change the specification of an existing inference service. You can, however, copy the specification, customize it, and use the customized specification to create your own service to host the model. This approach does not protect the underlying model from being deleted, and it does not track lineage, so it is best to let Snowflake Model Serving create services.
Scaling services¶
Note
Starting with snowflake-ml-python 1.25.0, you can define the scaling boundaries for your inference service by setting min_instances and max_instances within the create_service method.
How Autoscaling Works¶
The service starts with the number of instances specified in min_instances and scales dynamically within your defined range based on real-time traffic volume and hardware utilization.
Scale-to-Zero (Auto-Suspend): If min_instances is set to 0 (the default), the service will automatically suspend if no traffic is detected for 30 minutes.
Scaling Latency: Scaling triggers typically activate after one minute of meeting the required condition. Note that total spin-up time includes this trigger period plus the time required to provision and initialize new service instances.
Configuration Best Practices¶
| Parameter | Recommended Strategy |
|---|---|
| min_instances | Set to 1 or more for production workloads to ensure immediate availability and avoid cold-start delays. |
| max_instances | Set to accommodate peak demand while maintaining a ceiling on resource consumption and cost. |
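A minimal sketch of how a scaling range might be prepared before it is passed to create_service. The validation rules, the `scaling_kwargs` helper, and the service and compute pool names in the comment are assumptions made for illustration; only the min_instances and max_instances parameter names come from this page.

```python
def scaling_kwargs(min_instances: int, max_instances: int) -> dict:
    """Validate a scaling range and return it as keyword arguments.

    The checks below are illustrative assumptions, not documented rules.
    """
    if min_instances < 0:
        raise ValueError("min_instances must be 0 or greater")
    if max_instances < max(min_instances, 1):
        raise ValueError("max_instances must be at least 1 and not below min_instances")
    return {"min_instances": min_instances, "max_instances": max_instances}


# Production-style range: always one warm instance, at most four under load.
kwargs = scaling_kwargs(min_instances=1, max_instances=4)
print(kwargs)

# With a model version object `mv` obtained from the Snowflake Model Registry
# (not created in this sketch), the range would be passed along the lines of:
#   mv.create_service(service_name="my_model_service",
#                     service_compute_pool="my_pool", **kwargs)
```

Setting min_instances to 0 is still accepted by this helper, because scale-to-zero is a valid choice for non-production services.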
Suspending services¶
The default min_instances=0 setting allows the service to auto-suspend after 30 minutes of inactivity. Incoming requests will trigger a resume, with the total delay determined by compute pool availability and the model’s loading time (startup delay).
To manually suspend or resume a service, use the ALTER SERVICE command.
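For illustration, the manual suspend and resume statements can be generated as follows. The function name and the service name are hypothetical; the sketch only builds the statement text and does not connect to Snowflake.

```python
def service_state_sql(service: str, action: str) -> str:
    """Build an ALTER SERVICE statement that suspends or resumes a service."""
    normalized = action.strip().upper()
    if normalized not in ("SUSPEND", "RESUME"):
        raise ValueError("action must be 'suspend' or 'resume'")
    return f"ALTER SERVICE {service} {normalized}"


# Hypothetical service name, used only for illustration.
print(service_state_sql("my_model_service", "suspend"))
print(service_state_sql("my_model_service", "resume"))
```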
Deleting models¶
You can manage models and model versions as usual with either the SQL interface or the Python API, with the restriction that a model or model version that is being used by a service (whether running or suspended) cannot be dropped (deleted). To drop a model or model version, drop the service first.
Monitoring services¶
When running models in Snowpark Container Services, you can monitor service health and troubleshoot issues by accessing container logs and metrics. Model serving services generate logs that can help you understand service behavior, diagnose errors, and optimize performance.
For comprehensive information about monitoring SPCS services, including accessing metrics and logs, see Monitoring services.
In Snowsight¶
You can monitor model serving services in Snowsight:
In the navigation menu, select Monitoring » Services & jobs.
On the Services tab, select your service to view the service details page.
The Overview tab displays service information including the compute pool, endpoints, and instance count.
The Logs, Metrics, and Events tabs provide logs, performance metrics, and service events (such as instance provisioning and shutdowns). Filter results by instance and container name as needed.
Accessing service logs¶
You can access logs for your model serving services using any of the following methods:
Using the service helper function¶
Model serving services include a built-in helper function that retrieves logs from the event table for running or suspended services:
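A minimal sketch, assuming the helper in question is the SPCS `SPCS_GET_LOGS` table function invoked on the service with the `!` syntax. The service name, the one-hour default window, and the `helper_logs_sql` wrapper are hypothetical; check the function name and arguments against the SPCS reference before relying on them.

```python
def helper_logs_sql(service: str, hours_back: int = 1) -> str:
    """Build a query over the service's log helper for the last `hours_back` hours."""
    if hours_back < 1:
        raise ValueError("hours_back must be at least 1")
    start = f"DATEADD('hour', -{hours_back}, CURRENT_TIMESTAMP())"
    # Assumed form: <service>!SPCS_GET_LOGS(START_TIME => <expr>)
    return f"SELECT * FROM TABLE({service}!SPCS_GET_LOGS(START_TIME => {start}))"


# Hypothetical fully qualified service name.
print(helper_logs_sql("my_db.my_schema.my_model_service", hours_back=2))
```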
Querying the event table directly¶
If you have an event table configured for your account, you can query it directly to retrieve service logs:
Using the system function (Running instances only)¶
For real-time debugging of active containers, you can use the SYSTEM$GET_SERVICE_LOGS function:
Note
The container name for model inference services is model-inference. For troubleshooting image build issues, use model-build as the container name.
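As a sketch, the call can be assembled with the container names given in the note above. The service name, instance ID `0`, and the 100-line default are illustrative assumptions, and the wrapper function is hypothetical; it returns statement text only.

```python
# Container names taken from the note above.
CONTAINERS = ("model-inference", "model-build")


def service_logs_sql(service: str, instance_id: int = 0,
                     container: str = "model-inference",
                     num_lines: int = 100) -> str:
    """Build a SYSTEM$GET_SERVICE_LOGS call for one container of one instance."""
    if container not in CONTAINERS:
        raise ValueError(f"container must be one of {CONTAINERS}")
    if instance_id < 0 or num_lines < 1:
        raise ValueError("instance_id must be >= 0 and num_lines must be >= 1")
    return (
        f"SELECT SYSTEM$GET_SERVICE_LOGS('{service}', '{instance_id}', "
        f"'{container}', {num_lines})"
    )


# Inference logs from the first instance of a hypothetical service.
print(service_logs_sql("my_model_service"))
# Image build logs for the same service.
print(service_logs_sql("my_model_service", container="model-build"))
```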
Accessing service metrics¶
Model serving services emit performance and health metrics that can help you monitor resource utilization, request rates, latency, and other operational characteristics. These metrics are captured in the event table and can be queried to analyze service performance over time.
For more information about SPCS service metrics, see Accessing event table service metrics.
Using the service helper function¶
Model serving services include a built-in helper function that retrieves metrics from the event table for running or suspended services:
Querying the event table directly¶
You can query the event table directly to retrieve and filter specific metrics:
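A rough sketch of such a query, assuming the standard event table columns (`timestamp`, `record_type`, `record`, `resource_attributes`, `value`) and the `snow.service.name` resource attribute. The event table name, the service name, and the assumption that the service name is stored in uppercase are all hypothetical and should be adapted to your account.

```python
def metrics_query(event_table: str, service: str, hours_back: int = 1) -> str:
    """Build a query that lists recent metric records for one service."""
    if hours_back < 1:
        raise ValueError("hours_back must be at least 1")
    # Assumption: unquoted service names are recorded in uppercase.
    service_name = service.upper()
    return "\n".join([
        "SELECT timestamp, record:metric.name::string AS metric_name, value",
        f"FROM {event_table}",
        "WHERE record_type = 'METRIC'",
        f"  AND resource_attributes:\"snow.service.name\" = '{service_name}'",
        f"  AND timestamp > DATEADD('hour', -{hours_back}, CURRENT_TIMESTAMP())",
        "ORDER BY timestamp DESC",
    ])


# Hypothetical event table and service names.
print(metrics_query("my_db.my_schema.my_events", "my_model_service", hours_back=6))
```

Adding a filter on the metric name in the `record` column narrows the result to a single metric.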
Fault tolerance¶
In any distributed system, failures happen. For mission-critical workloads, you are responsible for configuring the service to be resilient against node and zonal failures.
Node Failure Resilience¶
To tolerate standard node failures, Snowflake recommends over-provisioning by 50% or maintaining a minimum of 3 instances (whichever is higher).
Example: If you need 4 instances to support peak traffic, you should provision 6 instances.
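The rule above can be sketched as a small calculation. Rounding fractional results up to the next whole instance is an assumption, since the recommendation does not say how to round; the function name is hypothetical.

```python
def recommended_instances(peak_instances: int) -> int:
    """Apply the 50% over-provisioning rule with a floor of 3 instances."""
    if peak_instances < 1:
        raise ValueError("peak_instances must be at least 1")
    # ceil(1.5 * n) using integer arithmetic (assumed rounding direction: up).
    over_provisioned = (3 * peak_instances + 1) // 2
    return max(over_provisioned, 3)


for peak in (1, 2, 4, 10):
    print(peak, "->", recommended_instances(peak))
```

For a peak requirement of 4 instances this yields 6, matching the example; for 1 or 2 instances the minimum of 3 applies.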
Zonal Failure Resilience¶
For mission-critical workloads that require resilience against a full zonal failure, you can use a distributed compute pool when creating a service. Distributed compute pools are created with the PLACEMENT_GROUP parameter set to DISTRIBUTED. For more information about distributed compute pools, see Compute pool placement.
Configuration Guide¶
Convert an Existing Pool¶
Warning
You cannot change the placement group setting on an active compute pool. You must suspend the pool first.
Revert an Existing Pool¶
Warning
You cannot change the placement group setting on an active compute pool. You must suspend the pool first.
Verification¶
To confirm your pool is correctly configured for HA, check the placement_group column:
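A minimal sketch of the check, assuming the rows come from a `SHOW COMPUTE POOLS` result that has been converted to dictionaries keyed by lowercase column name. The pool names and row values below are made up for illustration, and the helper function is hypothetical.

```python
def is_distributed(rows: list, pool_name: str) -> bool:
    """Return True if the named pool reports a DISTRIBUTED placement group."""
    for row in rows:
        if str(row.get("name", "")).upper() == pool_name.upper():
            placement = row.get("placement_group") or ""
            return str(placement).upper() == "DISTRIBUTED"
    raise LookupError(f"compute pool not found: {pool_name}")


# Illustrative rows only; real output contains many more columns.
sample_rows = [
    {"name": "HA_POOL", "placement_group": "DISTRIBUTED"},
    {"name": "DEV_POOL", "placement_group": None},
]

print(is_distributed(sample_rows, "ha_pool"))
print(is_distributed(sample_rows, "dev_pool"))
```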