April 22, 2025 — Container Runtime for ML on multi-node clusters — Preview
Snowflake announces the preview of Container Runtime for ML on multi-node clusters, a new capability that allows you to scale your ML workloads across multiple compute nodes in Snowflake Notebooks.
Container Runtime for ML on multi-node clusters enables you to:
- Scale ML workloads: Dynamically adjust the number of nodes in your compute pool to match the resource needs of your ML tasks.
- Run distributed training: Train ML models on larger datasets using distributed frameworks like PyTorch, LightGBM, and XGBoost.
- Manage cluster resources: Easily scale up for resource-intensive tasks and scale down when fewer resources are needed.
- Control scaling operations: Configure asynchronous scaling, timeout thresholds, and minimum node requirements to match your workflow needs.
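The scaling controls described above (asynchronous mode, a timeout threshold, and a minimum node requirement) can be sketched as a small helper. This is a minimal illustration, not the Snowflake API: the `scale_cluster` name, its parameters, and the in-memory `pool` dict are all hypothetical stand-ins for the real service.

```python
import time

def scale_cluster(pool, target_nodes, *, is_async=False,
                  min_nodes=1, timeout_s=30.0, poll_s=0.01):
    """Request target_nodes; optionally block until at least min_nodes
    are ready or timeout_s elapses. (Hypothetical sketch only.)"""
    if min_nodes > target_nodes:
        raise ValueError("min_nodes cannot exceed target_nodes")
    pool["target"] = target_nodes
    if is_async:
        return pool  # return immediately; nodes come up in the background
    deadline = time.monotonic() + timeout_s
    while pool["active"] < min_nodes:
        if time.monotonic() >= deadline:
            raise TimeoutError(f"only {pool['active']}/{min_nodes} nodes ready")
        # A real client would poll the service here; this simulated pool
        # simply brings one node online per poll interval.
        pool["active"] = min(pool["active"] + 1, pool["target"])
        time.sleep(poll_s)
    return pool

pool = {"active": 1, "target": 1}
scale_cluster(pool, 4, min_nodes=3, timeout_s=5.0)
print(pool["active"])  # at least 3 once the synchronous call returns
```

In synchronous mode the call blocks until the minimum node count is met or the timeout fires; with `is_async=True` it returns immediately so notebook work can continue while nodes provision.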
Key benefits of Container Runtime for ML on multi-node clusters include:
- Improved performance: Process larger datasets and accelerate training of complex models through parallelization.
- Resource efficiency: Scale resources up or down based on workload requirements without provisioning new compute pools.
- Flexibility: Support for synchronous or asynchronous scaling operations to match your development workflow.
- Simplicity: Straightforward APIs for scaling clusters and monitoring active nodes with minimal configuration.
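To make the "scale up, then scale back down" workflow concrete, here is a minimal sketch of what a scale-and-monitor flow might look like. Everything here is a hypothetical illustration: `get_nodes`, `scale_up`, `scale_down`, and the node records are stand-ins, not the actual Snowflake API.

```python
# Hypothetical in-memory stand-in for a compute pool; the real service
# tracks node state server-side.
_NODES = [{"name": f"node-{i}", "status": "ACTIVE"} for i in range(2)]

def get_nodes():
    """Return the currently active nodes. (Hypothetical sketch.)"""
    return [n for n in _NODES if n["status"] == "ACTIVE"]

def scale_up(count):
    """Add nodes for a resource-intensive task. (Hypothetical sketch.)"""
    start = len(_NODES)
    _NODES.extend({"name": f"node-{start + i}", "status": "ACTIVE"}
                  for i in range(count))

def scale_down(count):
    """Drain and remove nodes when fewer resources are needed."""
    for node in _NODES[-count:]:
        node["status"] = "DRAINING"
    del _NODES[-count:]

scale_up(2)                 # burst to 4 nodes for a training run
print(len(get_nodes()))     # 4
scale_down(3)               # return to a single node afterwards
print(len(get_nodes()))     # 1
```

The point of the sketch is the shape of the workflow: a short scale call before a heavy job, a monitoring call to confirm node availability, and a symmetric scale-down once the job completes.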
To get started, see the Container Runtime for ML on multi-node clusters documentation.