Cluster Manager

The cluster manager provides utilities for managing Ray clusters in Snowflake ML Runtime, including scaling operations, cluster monitoring, and resource management.

Functions

Cluster Scaling

snowflake.ml.runtime_cluster.cluster_manager.scale_cluster(expected_cluster_size: int, *, notebook_name: str | None = None, is_async: bool = False, options: Dict[str, Any] | None = None) → bool

Scale the Ray cluster associated with a notebook to the desired size.

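By default, this function sends a scaling request and then blocks until the cluster reaches the expected size, reaches the minimum size set by block_until_min_cluster_size, or the operation times out. With is_async=True, it returns as soon as the request has been accepted.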

Parameters:

expected_cluster_size (int) – Desired total number of nodes in the cluster, including the head node. For example, a value of 3 means 1 head node plus 2 worker nodes.
notebook_name (str, optional) – Name of the notebook whose cluster should be scaled. If omitted, the name is read from the OBJECT_NAME environment variable.
is_async (bool, optional) –
Controls whether the function blocks waiting for scaling:
- If False (default): The function blocks until either the cluster is fully ready or the operation times out. If scaling doesn’t complete in time, the cluster will automatically roll back to its previous state.
- If True: The function returns immediately after confirming that the scaling request has been accepted, without waiting for the full cluster to be ready. Use this when the caller should continue with other work while the nodes start.
options (Dict[str, Any], optional) –
Advanced low-level configuration options for cluster scaling. These options are primarily for advanced use cases and debugging:
- rollback_after_seconds (int): Maximum time in seconds before the scaling operation is automatically rolled back if not completed. Defaults to 720 seconds.
- block_until_min_cluster_size (int): Number of nodes that must be fully running before the function returns. If this threshold is reached before the timeout, the function returns and leaves any remaining nodes to come up (or fail) in the background. If is_async is True, this value is overridden to 1 regardless of what is specified here.

Returns:

True if scaling succeeded and the cluster reached the expected size (or the minimum required size).

Return type:

bool
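The options described above can be assembled and passed as in the following minimal sketch. The helper names build_scale_options and scale_notebook_cluster, the min_ready_nodes argument, and the scale_fn injection point are hypothetical and exist only for illustration; the local range checks are assumptions of this sketch, not documented behavior of scale_cluster. The real function is imported lazily because the module is expected to be available only inside Snowflake ML Runtime.

```python
from typing import Any, Callable, Dict, Optional


def build_scale_options(
    rollback_after_seconds: int = 720,
    block_until_min_cluster_size: Optional[int] = None,
) -> Dict[str, Any]:
    """Assemble the `options` dict, leaving out keys that were not set."""
    options: Dict[str, Any] = {"rollback_after_seconds": rollback_after_seconds}
    if block_until_min_cluster_size is not None:
        options["block_until_min_cluster_size"] = block_until_min_cluster_size
    return options


def scale_notebook_cluster(
    expected_cluster_size: int,
    *,
    min_ready_nodes: Optional[int] = None,
    rollback_after_seconds: int = 720,
    scale_fn: Optional[Callable[..., bool]] = None,
) -> bool:
    """Hypothetical wrapper: blocking scale that may return once
    `min_ready_nodes` nodes are running."""
    # Local sanity checks (an assumption of this sketch, not documented API behavior).
    if expected_cluster_size < 1:
        raise ValueError("expected_cluster_size counts the head node, so it must be >= 1")
    if min_ready_nodes is not None and not 1 <= min_ready_nodes <= expected_cluster_size:
        raise ValueError("min_ready_nodes must be between 1 and expected_cluster_size")

    if scale_fn is None:
        # Imported lazily so the sketch can be read and tested outside the runtime.
        from snowflake.ml.runtime_cluster import cluster_manager

        scale_fn = cluster_manager.scale_cluster

    options = build_scale_options(rollback_after_seconds, min_ready_nodes)
    return scale_fn(expected_cluster_size, options=options)
```

Inside a notebook, scale_notebook_cluster(3, min_ready_nodes=2) would request 1 head node plus 2 workers and return once 2 nodes are running. Passing a stand-in for scale_fn makes the wrapper testable without a cluster.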

Cluster Information

snowflake.ml.runtime_cluster.cluster_manager.get_cluster_size() → int

Get the current number of alive nodes in the Ray cluster.

Returns: The number of alive nodes (including the head node).
Return type: int
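Because scale_cluster with is_async=True returns before the cluster is fully ready, a caller may want to poll get_cluster_size afterwards. The following is a minimal sketch of such a loop; wait_for_cluster_size and its parameters are hypothetical names, and the default timeout simply mirrors the 720-second rollback default rather than any documented polling contract.

```python
import time
from typing import Callable


def wait_for_cluster_size(
    target_size: int,
    get_size: Callable[[], int],
    timeout_seconds: float = 720.0,
    poll_interval_seconds: float = 10.0,
) -> bool:
    """Poll `get_size` until it reports at least `target_size` alive nodes.

    Returns True once the target is reached, or False if the timeout
    elapses first.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        if get_size() >= target_size:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_seconds)


# Intended use inside Snowflake ML Runtime (not executed here):
#
#   from snowflake.ml.runtime_cluster import cluster_manager
#
#   cluster_manager.scale_cluster(3, is_async=True)
#   ready = wait_for_cluster_size(3, cluster_manager.get_cluster_size)
```

Taking the size getter as an argument keeps the loop independent of the runtime, so it can be exercised with a stub.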

snowflake.ml.runtime_cluster.cluster_manager.get_nodes() → list

Get the current active nodes in the Ray cluster.

Returns: A list of active node details. Each entry contains only the node name and the node's number of CPUs and GPUs.
Return type: list

snowflake.ml.runtime_cluster.cluster_manager.get_ray_dashboard_url(notebook_name: str | None = None) → str

Get the public Ray dashboard URL for the given notebook.

Returns: The Ray dashboard URL.
Return type: str

Resource Management

snowflake.ml.runtime_cluster.cluster_manager.get_available_cpu() → int

Get the number of available CPUs in the current Ray cluster.

Returns: The number of available CPUs.
Return type: int

snowflake.ml.runtime_cluster.cluster_manager.get_available_gpu() → int

Get the number of available GPUs in the current Ray cluster.

Returns: The number of available GPUs.
Return type: int

snowflake.ml.runtime_cluster.cluster_manager.get_num_cpus_per_node() → int

Get the number of CPUs per node in the current Ray cluster.

Returns: The number of CPUs per node.
Return type: int
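The resource functions above can inform simple capacity planning. The sketch below is illustrative only: max_concurrent_tasks and nodes_needed are hypothetical helpers, and the getters are passed in as callables so the arithmetic can run outside the runtime. It assumes that every node has the same CPU count, which is what a single per-node value suggests.

```python
from typing import Callable


def max_concurrent_tasks(cpus_per_task: int, get_available_cpu: Callable[[], int]) -> int:
    """How many tasks of `cpus_per_task` CPUs fit in the currently available CPUs."""
    if cpus_per_task < 1:
        raise ValueError("cpus_per_task must be >= 1")
    return get_available_cpu() // cpus_per_task


def nodes_needed(total_cpus_required: int, get_num_cpus_per_node: Callable[[], int]) -> int:
    """Smallest node count whose combined CPUs cover `total_cpus_required`.

    Assumes homogeneous nodes; the result could serve as a candidate
    expected_cluster_size for scale_cluster.
    """
    per_node = get_num_cpus_per_node()
    if per_node < 1:
        raise ValueError("CPUs per node must be >= 1")
    # Ceiling division without floating point.
    return max(1, -(-total_cpus_required // per_node))


# Intended use inside Snowflake ML Runtime (not executed here):
#
#   from snowflake.ml.runtime_cluster import cluster_manager
#
#   tasks = max_concurrent_tasks(2, cluster_manager.get_available_cpu)
#   size = nodes_needed(20, cluster_manager.get_num_cpus_per_node)
```

The floor of 1 in nodes_needed reflects that the cluster size always includes the head node.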