Snowflake Multi-Node ML Jobs¶
Use Snowflake Multi-Node ML Jobs to run distributed machine learning (ML) workflows inside Snowflake ML container runtimes across multiple compute nodes. Distribute work across multiple nodes to process large datasets and complex models with improved performance. For information about Snowflake ML Jobs, see Snowflake ML Jobs.
Snowflake Multi-Node ML Jobs extend Snowflake ML Job capabilities by enabling distributed execution across multiple nodes. This brings you:
Scalable Performance: Horizontally scale to process datasets too large to fit on a single node
Reduced Training Time: Speed up complex model training through parallelization
Resource Efficiency: Optimize resource utilization for data-intensive workloads
Framework Integration: Seamlessly use distributed frameworks like Distributed Modeling Classes and Ray
When you run a Snowflake ML Job with multiple nodes, the following occurs:
One node serves as the head node (coordinator)
Additional nodes serve as worker nodes (compute resources)
Together, the nodes form a single logical ML job entity in Snowflake
A single-node ML Job only has a head node. A multi-node job with three active nodes has one head node and two worker nodes. All three nodes participate in running your workload.
Prerequisites¶
To set up Snowflake Multi-Node ML Jobs, complete the following prerequisites:
Install the Snowflake ML Python package.
Create a compute pool with enough nodes to support your multi-node job:
Important
You must set MAX_NODES to be greater than or equal to the number of target instances for your training job. If you request more instances than the compute pool can provide, the job might fail or behave unpredictably. For information about running a training job, see Running multi-node ML jobs.
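The sizing rule above can be sketched as follows. The pool name and instance family are illustrative placeholders; run the generated statement through your Snowpark session (for example, with session.sql(...).collect()).

```python
# Sketch: build a CREATE COMPUTE POOL statement sized for a multi-node job.
# MY_ML_POOL and CPU_X64_M are placeholder values.
TARGET_INSTANCES = 3

create_pool_sql = f"""
CREATE COMPUTE POOL IF NOT EXISTS MY_ML_POOL
  MIN_NODES = 1
  MAX_NODES = {TARGET_INSTANCES}
  INSTANCE_FAMILY = CPU_X64_M
"""
# MAX_NODES must be >= the target_instances you pass to the ML Job.
# Execute with a Snowpark session: session.sql(create_pool_sql).collect()
print(create_pool_sql)
```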
Writing code for multi-node jobs¶
For multi-node jobs, your code needs to be designed for distributed processing using Distributed Modeling Classes or Ray.
The following are key patterns and considerations when you use Distributed Modeling Classes or Ray:
Understanding node initialization and availability¶
In multi-node jobs, worker nodes can initialize asynchronously and at different times:
Nodes might not all start simultaneously, especially if compute pool resources are limited
Some nodes might start seconds or even minutes after others
ML Jobs automatically wait for the specified target_instances to be available before executing your payload. The job fails with an error if the expected nodes aren't available within the timeout period. For more information on customizing this behavior, see Advanced Configuration: Using min_instances.
You can check available nodes in your job through Ray:
Distributed Processing Patterns¶
There are multiple patterns you can apply in the payload body of the multi-node job for distributed processing. These patterns leverage Distributed Modeling Classes and Ray:
Using Snowflake’s Distributed Training API¶
Snowflake provides optimized trainers for common ML frameworks:
For more information about the available APIs, see Distributed Modeling Classes.
Using Native Ray Tasks¶
Another approach is to use Ray’s task-based programming model:
For more information, see Ray’s task programming documentation.
Running multi-node ML jobs¶
You can run multi-node ML jobs using the same methods as single-node jobs, using the target_instances parameter:
Using the Remote Decorator¶
Running a Python File¶
Running a Directory¶
Advanced Configuration: Using min_instances¶
For more flexible resource management, you can use the optional min_instances parameter to specify a minimum number of instances required for the job to proceed.
If min_instances is set, the job payload is executed as soon as the minimum number of nodes becomes available, even if that number is smaller than target_instances.
This is useful when you want to:
Start training with fewer nodes if the full target isn’t immediately available
Reduce wait times when compute pool resources are limited
Implement fault-tolerant workflows that can adapt to varying resource availability
Managing Multi-Node Jobs¶
Monitoring Job Status¶
Job status monitoring is unchanged from single-node jobs:
Accessing Logs by Node¶
In multi-node jobs, you can access logs from specific instances:
Common Issues and Limitations¶
Use the following information to address common issues that you might encounter.
Node Connection Failures: If worker nodes fail to connect to the head node, the head node might have completed its task and shut down before the workers finished theirs. To avoid connection failures, implement result collection logic in the job so the head node waits for all worker results.
Memory Exhaustion: If jobs fail due to memory issues, increase the node size or use more nodes with less data per node.
Node Availability Timeout: If the required number of instances (either target_instances or min_instances) is not available within the predefined timeout, the job fails. Ensure your compute pool has sufficient capacity or adjust your instance requirements.