Cluster Manager
The cluster manager provides utilities for managing Ray clusters in Snowflake ML Runtime, including scaling operations, cluster monitoring, and resource management.
Functions
Cluster Scaling
- snowflake.ml.runtime_cluster.cluster_manager.scale_cluster(expected_cluster_size: int, *, notebook_name: str | None = None, is_async: bool = False, options: Dict[str, Any] | None = None) -> bool
Scale the Ray cluster associated with a notebook to the desired size.
This function sends a request to scale the cluster. By default, it then blocks until the scaling operation completes, the minimum cluster size requirement is met, or the operation times out.
- Parameters:
expected_cluster_size (int) – Desired cluster size: the total number of nodes, including the head node. For example, a value of 3 means 1 head node + 2 worker nodes.
notebook_name (str, optional) – The name of the notebook to scale. If not provided, it will be retrieved from the OBJECT_NAME environment variable.
is_async (bool, optional) –
Controls whether the function blocks waiting for scaling:
If False (default): The function blocks until either the cluster is fully ready or the operation times out. If scaling doesn’t complete in time, the cluster will automatically roll back to its previous state.
If True: The function returns immediately after confirming the scaling request has been accepted, without waiting for the full cluster to be ready. This is useful for non-blocking operations.
options (Dict[str, Any], optional) –
Advanced low-level configuration options for cluster scaling. These options are primarily for advanced use cases and debugging:
rollback_after_seconds (int): Maximum time in seconds before the scaling operation is automatically rolled back if not completed. Defaults to 720 seconds.
block_until_min_cluster_size (int): Number of nodes in the cluster that must be fully running before the function returns. If this threshold is reached before the timeout, the function returns, leaving any remaining nodes to come up (or fail) in the background. If is_async is True, this value is overridden to 1 regardless of what is specified here.
- Returns:
True if scaling succeeded and the expected cluster size (or minimum required size) is reached.
- Return type:
bool
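The parameters above can be combined as in the following minimal sketch. The helpers `build_scale_options` and `scale_to` are hypothetical names introduced only for illustration; they are not part of the library. The import path is taken from the qualified name documented above and is assumed to resolve only inside Snowflake ML Runtime, so the scaling function can also be passed in explicitly.

```python
from typing import Any, Callable, Dict, Optional


def build_scale_options(
    rollback_after_seconds: int = 720,
    min_ready_nodes: Optional[int] = None,
) -> Dict[str, Any]:
    """Assemble the `options` dict using the two keys documented above.

    Hypothetical helper, not part of snowflake.ml.
    """
    options: Dict[str, Any] = {"rollback_after_seconds": rollback_after_seconds}
    if min_ready_nodes is not None:
        options["block_until_min_cluster_size"] = min_ready_nodes
    return options


def scale_to(
    num_workers: int,
    scale_fn: Optional[Callable[..., bool]] = None,
    **kwargs: Any,
) -> bool:
    """Scale to 1 head node plus `num_workers` worker nodes.

    Hypothetical wrapper: `scale_fn` defaults to the documented
    scale_cluster, which is assumed importable only in ML Runtime.
    """
    if scale_fn is None:
        from snowflake.ml.runtime_cluster.cluster_manager import (
            scale_cluster as scale_fn,
        )
    # expected_cluster_size counts the head node, hence the +1.
    return scale_fn(num_workers + 1, **kwargs)
```

Inside a notebook, a blocking call that returns once two nodes are running might look like `scale_to(2, options=build_scale_options(600, min_ready_nodes=2))`. For a non-blocking request, pass `is_async=True` instead.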
Cluster Information
- snowflake.ml.runtime_cluster.cluster_manager.get_cluster_size() -> int
Get the current number of alive nodes in the Ray cluster.
- Returns:
The number of alive nodes (including the head node).
- Return type:
int
- snowflake.ml.runtime_cluster.cluster_manager.get_nodes() -> list
Get the current active nodes in the Ray cluster.
- Returns:
A list of active node details.
- Return type:
list
- snowflake.ml.runtime_cluster.cluster_manager.get_ray_dashboard_url(notebook_name: str | None = None) -> str
Get the public Ray dashboard URL for the given notebook.
- Returns:
The Ray dashboard URL.
- Return type:
str
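When a scaling request is issued with is_async set to True, the caller may want to check later whether the cluster reached the desired size. The sketch below shows one possible polling loop built on a size getter such as get_cluster_size. The function `wait_for_cluster_size` and its parameters are hypothetical and not part of the library; the getter is passed in so that the loop does not depend on the runtime being importable.

```python
import time
from typing import Callable


def wait_for_cluster_size(
    target_size: int,
    get_size: Callable[[], int],
    timeout_s: float = 720.0,
    poll_s: float = 5.0,
) -> bool:
    """Poll `get_size` until it reports at least `target_size` alive nodes.

    Returns True if the target was reached, False if `timeout_s` elapsed first.
    Hypothetical helper, not part of snowflake.ml.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if get_size() >= target_size:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

In ML Runtime, this could be called as `wait_for_cluster_size(3, get_cluster_size)` after importing get_cluster_size from snowflake.ml.runtime_cluster.cluster_manager. The default timeout simply mirrors the documented rollback_after_seconds default of 720 seconds; that pairing is a choice made for this sketch, not documented behavior.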
Resource Management
- snowflake.ml.runtime_cluster.cluster_manager.get_available_cpu() -> int
Get the number of available CPUs in the current Ray cluster.
- Returns:
The number of available CPUs.
- Return type:
int
- snowflake.ml.runtime_cluster.cluster_manager.get_available_gpu() -> int
Get the number of available GPUs in the current Ray cluster.
- Returns:
The number of available GPUs.
- Return type:
int
- snowflake.ml.runtime_cluster.cluster_manager.get_num_cpus_per_node() -> int
Get the number of CPUs per node in the current Ray cluster.
- Returns:
The number of CPUs per node.
- Return type:
int
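The per-node CPU count can be used to estimate a cluster size for a workload. The following sketch is plain arithmetic on a value such as the one returned by get_num_cpus_per_node. The function `cluster_size_for_cpus` is a hypothetical helper, and it assumes that every node, including the head node, contributes the same number of usable CPUs, which may not hold in every deployment.

```python
import math


def cluster_size_for_cpus(required_cpus: int, cpus_per_node: int) -> int:
    """Return the smallest node count whose combined CPUs cover `required_cpus`.

    Assumes all nodes (head included) offer `cpus_per_node` CPUs.
    Hypothetical helper, not part of snowflake.ml.
    """
    if required_cpus < 0:
        raise ValueError("required_cpus must be non-negative")
    if cpus_per_node <= 0:
        raise ValueError("cpus_per_node must be positive")
    # A cluster always has at least the head node.
    return max(1, math.ceil(required_cpus / cpus_per_node))
```

The result could then be passed as expected_cluster_size to scale_cluster, for example `cluster_size_for_cpus(32, get_num_cpus_per_node())`.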