Apr 24, 2025: Container Runtime for ML on multi-node clusters (Preview)
Snowflake announces the preview of Container Runtime for ML on multi-node clusters, a new capability that allows you to scale your ML workloads across multiple compute nodes in Snowflake Notebooks.
Container Runtime for ML on multi-node clusters enables you to:
- Scale ML workloads: Dynamically adjust the number of nodes in your compute pool to match the resource needs of your ML tasks. 
- Run distributed training: Train ML models on larger datasets using distributed frameworks like PyTorch, LightGBM, and XGBoost. 
- Manage cluster resources: Easily scale up for resource-intensive tasks and scale down when fewer resources are needed. 
- Control scaling operations: Configure asynchronous scaling, timeout thresholds, and minimum node requirements to match your workflow needs. 
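The scaling controls listed above (asynchronous scaling, a timeout threshold, and a minimum node requirement) can be pictured with a small, self-contained simulation. This is only an illustrative sketch and does not use the Snowflake API: `FakePool`, `wait_for_nodes`, `scale`, and every parameter name here are hypothetical stand-ins chosen for the example. For the real calls, consult the Container Runtime for ML documentation.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ScaleResult:
    requested: int   # target number of nodes
    active: int      # nodes active when the wait ended
    satisfied: bool  # True if active >= the required minimum


class FakePool:
    """Hypothetical stand-in for a compute pool: gains one node per poll."""

    def __init__(self, start: int, target: int):
        self.nodes, self.target = start, target

    def count(self) -> int:
        current = self.nodes
        self.nodes = min(self.nodes + 1, self.target)
        return current


def wait_for_nodes(
    count_active: Callable[[], int],
    requested: int,
    min_nodes: Optional[int] = None,
    timeout_s: float = 5.0,
    poll_s: float = 0.01,
) -> ScaleResult:
    """Poll until at least min_nodes are active or timeout_s elapses."""
    required = requested if min_nodes is None else min_nodes
    deadline = time.monotonic() + timeout_s
    active = count_active()
    while active < required and time.monotonic() < deadline:
        time.sleep(poll_s)
        active = count_active()
    return ScaleResult(requested, active, active >= required)


_executor = ThreadPoolExecutor(max_workers=1)


def scale(count_active, requested, is_async=False, **options):
    """Synchronous: block and return a ScaleResult.
    Asynchronous: return a Future immediately so other work can continue."""
    if is_async:
        return _executor.submit(wait_for_nodes, count_active, requested, **options)
    return wait_for_nodes(count_active, requested, **options)


if __name__ == "__main__":
    pool = FakePool(start=1, target=4)
    # Proceed as soon as 3 of the 4 requested nodes are available.
    print(scale(pool.count, requested=4, min_nodes=3, timeout_s=2.0))
```

In this sketch, a synchronous call blocks until the minimum is met or the timeout expires, while `is_async=True` returns a `Future` whose `result()` can be collected later. A caller would check `satisfied` before launching distributed training.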
Key benefits of Container Runtime for ML on multi-node clusters include:
- Improved performance: Process larger datasets and accelerate training of complex models through parallelization. 
- Resource efficiency: Scale resources up or down based on workload requirements without provisioning new compute pools. 
- Flexibility: Run scaling operations synchronously or asynchronously to match your development workflow.
- Simplicity: Straightforward APIs for scaling clusters and monitoring active nodes with minimal configuration. 
To get started, see Container Runtime for ML on multi-node clusters.