Scale an application using Ray¶
The Snowflake container runtime integrates with Ray, an open-source unified framework for scaling AI and Python applications. This integration allows you to use Ray’s distributed computing capabilities on Snowflake for your machine learning workloads.
Ray is pre-installed and runs as a background process within the Snowflake ML container runtime. You can access Ray from the Container Runtime in the following ways:
Snowflake Notebooks: An interactive environment where you can connect to Ray, define tasks, and scale your cluster dynamically for development and experimentation.
Snowflake ML Jobs: Submit your Ray applications as structured, repeatable jobs. You can specify the cluster size as part of the job configuration for production workloads.
When you run the container runtime within a Snowflake Notebook or ML Job, the Ray process is automatically initiated as part of that container.
Use the following Python code to connect to the cluster:
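The original snippet is not reproduced on this page; a minimal sketch follows, assuming only the `ray` package that is pre-installed in the runtime:

```python
import ray

# "auto" resolves to the head node of the Ray cluster that Snowflake
# provisioned for this session; ignore_reinit_error keeps a re-run of
# this cell from raising if Ray is already initialized.
ray.init(address="auto", ignore_reinit_error=True)
```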
Important
Make sure you always use the "auto" address when you’re connecting to the Ray cluster.
Initializing with the "auto" address directs your application to the head node of the Ray cluster that Snowflake has provisioned for your session.
Scaling your Ray cluster¶
After you connect to the Ray cluster, you can adjust its size to meet the computational demands of your workload.
Use the following approaches to scale your Ray cluster:
Within a notebook, you can dynamically scale your cluster up or down using the scale_cluster function. This is ideal for interactive workflows where resource needs might change.
When you specify expected_cluster_size=5, you get 1 head node and 4 worker nodes.
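As a sketch, the call might look like the following. The `expected_cluster_size` parameter comes from the text above; the `snowflake.ml.runtime_cluster` import path is an assumption and may differ in your runtime version:

```python
# Hedged sketch: scale the session's Ray cluster to 5 total nodes
# (1 head node + 4 worker nodes). The import path is assumed.
from snowflake.ml.runtime_cluster import scale_cluster

scale_cluster(expected_cluster_size=5)
```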
For ML Jobs, you define the cluster size declaratively within your job definition. Specifying the cluster size in the job definition ensures that the required number of nodes is provisioned when the job starts.
For example, your job decorator might include:
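A hedged sketch of such a decorator follows; the compute pool and stage names are placeholders, and `target_instances` is assumed to be the parameter that controls multi-node provisioning:

```python
from snowflake.ml.jobs import remote

# Placeholder names: replace MY_COMPUTE_POOL and PAYLOAD_STAGE with your own.
@remote(
    "MY_COMPUTE_POOL",
    stage_name="PAYLOAD_STAGE",
    target_instances=5,  # 1 head node + 4 worker nodes
)
def train():
    import ray

    ray.init(address="auto", ignore_reinit_error=True)
    ...  # distributed training logic
```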
After you’ve finished using your cluster, you can scale it down. For more information, see Cleaning up.
Monitoring with the Ray Dashboard¶
If you’re running a job from a Snowflake Notebook, you can use the Ray Dashboard to monitor your cluster. The dashboard is a web interface that allows you to view the cluster’s resources, jobs, tasks, and performance. Use the following code to get the dashboard URL:
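The snippet is not reproduced on this page; as a sketch, the context object returned by `ray.init` exposes the dashboard URL:

```python
import ray

# ray.init returns a context object with the dashboard URL for this cluster.
context = ray.init(address="auto", ignore_reinit_error=True)
print(context.dashboard_url)
```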
Open the URL in a new browser tab and log in with your Snowflake credentials.
Advanced use cases¶
This section covers advanced Ray features for complex workloads and for migrating existing applications.
Creating and operating distributed workloads with Ray¶
Ray provides components for creating and operating distributed workloads. At the foundation is Ray Core, which supplies the essential primitives for building and scaling these workloads.
It also includes the following libraries, which enable you to build your own workflows for data preprocessing, ML training, hyperparameter tuning, and model inference:
Ray Data: Scalable data processing and transformation
Ray Train: Distributed training and fine-tuning of ML models
Ray Tune: Hyperparameter optimization with advanced search algorithms
Ray Serve: Model serving and inference
The following sections describe how you can use these libraries directly, while native Snowflake interfaces built over Ray provide additional tools to build, deploy, and operationalize Ray-based applications.
Ray Core: Tasks and Actors¶
Ray provides the following distributed computing primitives:
Tasks: Stateless functions that run remotely and return values
Actors: Stateful classes that can be instantiated remotely and called multiple times
Objects: Immutable values stored in Ray’s distributed object store
Resources: CPU, GPU, and custom resource requirements for tasks and actors
The following example demonstrates how to use basic Ray Tasks and Actors to perform linear regression:
Ray Train: Distributed Training¶
Ray Train is a library that enables distributed training and fine-tuning of models. You can run your training code on a single machine or an entire cluster.
You can use Ray Train for both single-node and multi-node execution.
For multi-node training, you must handle the following:
Distributed storage for checkpoints (no shared filesystem across nodes)
Custom data loading
Manual resource configuration to coordinate between data ingestion and training resource usage
For a streamlined experience, use the Optimized Training functions for XGBoost, LightGBM, and PyTorch. Running on the same Ray cluster, these functions handle:
Snowflake stage-based checkpointing
Native Snowflake data ingestion
Built-in resource allocation for data ingestion and training
Ray Data: Scalable Data Processing¶
Ray Data provides scalable, distributed data processing for ML workloads. It can handle datasets larger than cluster memory through streaming execution and lazy evaluation.
Note
Snowflake offers a native integration to transform any Snowflake data source to Ray Data. For more information, see the Data Connector and Ray Data Ingestion pages.
Use Ray Data for:
Processing large datasets that don’t fit in single-node memory
Distributed data preprocessing and feature engineering
Building data pipelines that integrate with other Ray libraries
Ray Tune: Distributed Hyperparameter Tuning¶
Ray Tune provides distributed hyperparameter optimization with advanced search algorithms and early stopping capabilities. For a more integrated and optimized experience when reading from Snowflake data sources, use the native Hyperparameter Optimization (HPO) API. For more information, see Optimize a model’s hyperparameters.
If you’re looking for a more customizable approach to a distributed HPO implementation, use Ray Tune.
You can use Ray Tune for the following use cases:
Hyperparameter optimization across multiple trials in parallel
Advanced search algorithms (Bayesian optimization, population-based training)
Large-scale hyperparameter sweeps requiring distributed execution
Model Serving¶
For model serving, you can use Snowflake’s native capabilities. For more information, see Deploy models for Real time Inference (REST API).
Submit and manage distributed applications on Ray clusters¶
Use Ray Jobs to submit and manage distributed applications on Ray clusters with better resource isolation and lifecycle management. For job-based executions that require access to a Ray cluster, Snowflake recommends using an ML Job, in which you define the Ray application logic. Where you need direct access to the Ray Job interface, such as when migrating an existing implementation, you can use the Ray Job primitive as described in the Ray documentation.
Use Ray jobs for:
Production ML pipelines and scheduled workflows
Long-running workloads requiring fault tolerance
Batch processing and large-scale data processing
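For direct use of the Ray Job interface, a sketch with Ray's standard job-submission client might look like the following; the entrypoint script name and dashboard address are placeholders:

```python
from ray.job_submission import JobSubmissionClient

# Placeholder address: point this at the head node's dashboard endpoint.
client = JobSubmissionClient("http://127.0.0.1:8265")

job_id = client.submit_job(
    entrypoint="python my_ray_app.py",   # placeholder script
    runtime_env={"working_dir": "./"},   # ships local files with the job
)
print(client.get_job_status(job_id))
```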
Scaling Ray Clusters with Options¶
From a Snowflake Notebook, you can scale your Ray clusters to precisely match computational demands. A cluster consists of a head node (coordinator) and worker nodes (for task execution).
Resource monitoring¶
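Ray exposes cluster capacity and current availability directly; as a sketch:

```python
import ray

ray.init(address="auto", ignore_reinit_error=True)

# Total resources registered across all nodes (CPUs, GPUs, memory, ...).
print(ray.cluster_resources())

# Resources not currently claimed by running tasks or actors.
print(ray.available_resources())
```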
Cleaning up¶
After you’re finished with the cluster, you can scale it down to avoid additional charges. Use the following code to scale it down:
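The snippet is not reproduced on this page; as a sketch, mirroring the scale-up call (the `snowflake.ml.runtime_cluster` import path is assumed):

```python
# Hedged sketch: shrink the cluster back to just the head node.
from snowflake.ml.runtime_cluster import scale_cluster

scale_cluster(expected_cluster_size=1)
```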