April 22, 2025 — Container Runtime for ML on multi-node clusters — Preview

Snowflake announces the preview of Container Runtime for ML on multi-node clusters, a new capability that allows you to scale your ML workloads across multiple compute nodes in Snowflake Notebooks.

Container Runtime for ML on multi-node clusters enables you to:

  • Scale ML workloads: Dynamically adjust the number of nodes in your compute pool to match the resource needs of your ML tasks.

  • Run distributed training: Train ML models on larger datasets with distributed training using frameworks such as PyTorch, LightGBM, and XGBoost.

  • Manage cluster resources: Easily scale up for resource-intensive tasks and scale down when fewer resources are needed.

  • Control scaling operations: Configure asynchronous scaling, timeout thresholds, and minimum node requirements to match your workflow needs.
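To make the distributed-training bullet concrete, here is a minimal, framework-free sketch of the idea behind data-parallel training: each node computes a gradient on its own shard of the data, and the per-node gradients are averaged before the update. This is plain Python for illustration only; it is not the Snowflake API, and in practice frameworks such as PyTorch handle the sharding and communication for you.

```python
# Conceptual sketch of data-parallel training (NOT a Snowflake API):
# each "node" computes a gradient on its shard, and a simulated
# all-reduce averages them before the weight update.

def local_gradient(w, shard):
    """Gradient of mean squared error for a 1-D linear model y = w * x."""
    n = len(shard)
    # d/dw of (1/n) * sum((w*x - y)^2) = (2/n) * sum((w*x - y) * x)
    return (2.0 / n) * sum((w * x - y) * x for x, y in shard)

def averaged_gradient(w, shards):
    """Simulate an all-reduce: average the per-node gradients."""
    grads = [local_gradient(w, shard) for shard in shards]
    return sum(grads) / len(grads)

# Two "nodes", each holding a shard of (x, y) pairs drawn from y = 3x.
shards = [
    [(1.0, 3.0), (2.0, 6.0)],
    [(3.0, 9.0), (4.0, 12.0)],
]

w = 0.0
for _ in range(200):
    w -= 0.01 * averaged_gradient(w, shards)

print(round(w, 3))  # converges toward the true slope 3.0
```

Adding nodes in this scheme shrinks each shard, which is why scaling the cluster lets the same training loop handle larger datasets.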

Key benefits of Container Runtime for ML on multi-node clusters include:

  • Improved performance: Process larger datasets and accelerate training of complex models through parallelization.

  • Resource efficiency: Scale resources up or down based on workload requirements without provisioning new compute pools.

  • Flexibility: Support for synchronous or asynchronous scaling operations to match your development workflow.

  • Simplicity: Straightforward APIs for scaling clusters and monitoring active nodes with minimal configuration.
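The synchronous-versus-asynchronous distinction above boils down to whether a scaling request blocks until a minimum node count is available. The sketch below illustrates that semantics with a polling loop and a timeout; every name in it is hypothetical (including the simulated compute pool), and it is not the Snowflake scaling API — see the documentation for the actual functions.

```python
# Hypothetical sketch of synchronous-scaling semantics: block until a
# minimum node count is reached or a timeout expires. These names are
# illustrative only, not Snowflake APIs.
import time

def wait_for_min_nodes(get_active_nodes, min_nodes, timeout_s, poll_s=0.01):
    """Poll an active-node counter until min_nodes is met or timeout_s passes."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if get_active_nodes() >= min_nodes:
            return True   # synchronous path: minimum capacity is ready
        time.sleep(poll_s)
    return False          # timeout: caller decides whether to proceed or abort

class FakePool:
    """Simulated compute pool that brings one node online per poll."""
    def __init__(self):
        self.nodes = 1
    def active_nodes(self):
        self.nodes += 1
        return self.nodes

pool = FakePool()
ok = wait_for_min_nodes(pool.active_nodes, min_nodes=4, timeout_s=1.0)
print(ok)  # True: the fake pool reaches 4 nodes well before the timeout
```

An asynchronous scaling request would simply skip the wait and return immediately, letting the workload start while remaining nodes come online.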

To get started, see Container Runtime for ML on multi-node clusters.