Compute Pool Metrics

Compute pool metrics provide data points that include:

  • Data points about nodes in the compute pool (for example, the amount of free memory available for use by containers on a node).

  • Data points about the services and jobs running on compute pool nodes (for example, memory used by a specific container).

Each node provides node-specific metrics, including metrics for services running on the node. Metrics are not aggregated across nodes. The metrics publisher on each node listens on TCP port 9001. To access the metrics, connect to the node directly using its IP address. To discover the nodes’ IP addresses, retrieve the SRV records (or A records) for the discover.monitor.<compute_pool_name>.snowflakecomputing.internal hostname from DNS.
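The discovery step can be sketched in Python as follows. This is a minimal sketch rather than an official client: the helper names (discovery_hostname, resolve_node_ips) are hypothetical, it looks up A records through the system resolver instead of SRV records, and it assumes the code runs inside a service or job that is allowed to reach the nodes of the compute pool.

```python
from __future__ import annotations

import socket

METRICS_PORT = 9001  # TCP port the metrics publisher listens on, per the text above


def discovery_hostname(compute_pool_name: str) -> str:
    """Build the DNS name used to discover the nodes of a compute pool."""
    return f"discover.monitor.{compute_pool_name}.snowflakecomputing.internal"


def resolve_node_ips(hostname: str, resolver=socket.getaddrinfo) -> list[str]:
    """Return the distinct IPv4 addresses (A records) behind hostname.

    The resolver parameter is an illustrative hook so the lookup can be
    replaced in tests; by default the system resolver is used.
    """
    infos = resolver(
        hostname, METRICS_PORT, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


# Example (only works from inside a service or job in the account):
#   ips = resolve_node_ips(discovery_hostname("<compute_pool_name>"))
```

Because metrics are not aggregated across nodes, a collector would typically repeat this lookup periodically and scrape every address it returns.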

Services and jobs in your account can query these IP addresses to retrieve metrics from nodes in a compute pool. The role used to access the metrics needs either the OWNERSHIP or MONITOR privilege on the compute pool.

Services and jobs can connect to each node’s metrics publisher using an HTTP GET request with the path /metrics. The response body provides the metrics in the Prometheus format, as shown in the following example:

# HELP spcs_node_allocatable Defines SPCS compute pool resources available on the node
# TYPE spcs_node_allocatable gauge
spcs_node_allocatable{resource="cpu",snow_containers_compute_pool_name="MY_POOL",snow_instance_family="CPU_X64_S",snow_node_id="775776fd"} 1
spcs_node_allocatable{resource="memory",snow_containers_compute_pool_name="MY_POOL",snow_instance_family="CPU_X64_S",snow_node_id="775776fd"} 7.21397383168e+09

Note that:

  • Each metric starts with # HELP and # TYPE lines, which provide a short description and the type of the metric. In this example, the spcs_node_allocatable metric is of type gauge.

  • These lines are followed by the metric’s name, a list of labels describing a specific resource (data point), and its value. In this example, the metric (named spcs_node_allocatable) provides CPU and memory information, indicating that the node has 1 CPU core and about 7.2 GB of memory that can be allocated. For example, the CPU data point includes these labels:

    resource="cpu",
    snow_containers_compute_pool_name="MY_POOL",
    snow_instance_family="CPU_X64_S",
    snow_node_id="775776fd"
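To make the structure of these lines concrete, the following minimal sketch fetches the text from one node and splits each sample line into a name, a label dictionary, and a value. The function names are hypothetical, the fetch assumes a plain http:// URL on port 9001, and the parser handles only simple sample lines like the ones above (it does not unescape label values or interpret # HELP and # TYPE lines).

```python
from __future__ import annotations

import re
import urllib.request

# Matches: metric_name{label="value",...} value
_SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text: str) -> list[tuple[str, dict[str, str], float]]:
    """Parse Prometheus-format text into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and # HELP / # TYPE comments
        match = _SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore lines this simple parser does not understand
        name, raw_labels, raw_value = match.groups()
        labels = dict(_LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(raw_value)))
    return samples


def fetch_metrics(node_ip: str, port: int = 9001, timeout: float = 5.0) -> str:
    """HTTP GET /metrics from one node (assumes plain HTTP)."""
    url = f"http://{node_ip}:{port}/metrics"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

For production use, an existing Prometheus client or scraper is likely a better choice than a hand-written parser; the sketch only illustrates the line format.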
    

You can process these metrics any way you choose. For example, you might store the metrics in a database and use a UI to display the information, or you might create a Grafana dashboard. For examples, see the compute pool metrics tutorials.

For a list of available compute pool metrics, see List of available metrics.

General guidelines

The following guidelines apply when working with compute pool metrics:

  • To access the metrics for a compute pool, the compute pool must have a DNS-compatible name.

  • The endpoint exposed by a compute pool can be accessed only by roles that have the OWNERSHIP or MONITOR privilege on the compute pool.

List of available metrics

Each compute pool publishes a set of metrics about the nodes in the compute pool and the containers running in that compute pool. You can create a service to monitor these metrics. For examples, see the Compute pool metrics tutorials.

The metric values represent the state of the system at the time you query the metrics.

| Metric Name | Type | Description | Labels |
|---|---|---|---|
| spcs_container_cpu_usage | Counter | The cumulative CPU time, in seconds, that the container has consumed since it started running. | snow_account_name, snow_database_name, snow_schema_name, snow_containers_compute_pool_name, snow_executable_name, snow_containers_container_name, snow_containers_instance_name, snow_node_id |
| spcs_container_memory_usage | Gauge | The amount of memory, in bytes, used by a container. | snow_account_name, snow_database_name, snow_schema_name, snow_containers_compute_pool_name, snow_executable_name, snow_containers_container_name, snow_containers_instance_name, snow_node_id |
| spcs_container_restarts | Counter | The number of times Snowflake restarted the container. | snow_account_name, snow_database_name, snow_schema_name, snow_containers_compute_pool_name, snow_executable_name, snow_containers_container_name, snow_containers_instance_name, snow_node_id |
| spcs_container_resource_requests | Gauge | The requested resources (as per the service specification) for the container. The resource label indicates whether the value is a count of CPU cores, bytes of CPU memory, a count of GPUs, or bytes of GPU memory. | snow_account_name, snow_database_name, snow_schema_name, snow_containers_compute_pool_name, snow_executable_name, snow_containers_container_name, snow_containers_instance_name, snow_node_id |
| spcs_container_resource_limits | Gauge | The resource limit (as per the service specification) for the container. The resource label indicates whether the value is a count of CPU cores, bytes of CPU memory, a count of GPUs, or bytes of GPU memory. | snow_account_name, snow_database_name, snow_schema_name, snow_containers_compute_pool_name, snow_executable_name, snow_containers_container_name, snow_containers_instance_name, snow_node_id |
| spcs_container_gpu_utilization | Gauge | The GPU compute utilization, where 1.0 indicates 100% utilization. The gpu label identifies the GPU. | snow_account_name, snow_database_name, snow_schema_name, snow_containers_compute_pool_name, snow_executable_name, snow_containers_container_name, snow_containers_instance_name, snow_node_id, gpu |
| spcs_container_gpu_memory_utilization | Gauge | The amount of GPU memory, in bytes, used by a container at a given time. The gpu label identifies the GPU. | snow_account_name, snow_database_name, snow_schema_name, snow_containers_compute_pool_name, snow_executable_name, snow_containers_container_name, snow_containers_instance_name, snow_node_id, gpu |
| spcs_container_status_unschedulable | Gauge | When reported, this metric’s value is always 1. It indicates that Snowflake is not able to schedule the container on the compute pool. | snow_account_name, snow_database_name, snow_schema_name, snow_containers_compute_pool_name, snow_executable_name, snow_containers_container_name, snow_containers_instance_name, snow_node_id |
| spcs_container_state_started | Gauge | When reported, this metric’s value is always 1. It indicates that Snowflake has started the container but it is not yet running. | snow_account_name, snow_database_name, snow_schema_name, snow_containers_compute_pool_name, snow_executable_name, snow_containers_container_name, snow_containers_instance_name, snow_node_id |
| spcs_container_state_running | Gauge | When reported, this metric’s value is always 1. It indicates that the container is running and available through the defined endpoints. | snow_account_name, snow_database_name, snow_schema_name, snow_containers_compute_pool_name, snow_executable_name, snow_containers_container_name, snow_containers_instance_name, snow_node_id |
| spcs_container_state_pending | Gauge | When reported, this metric’s value is always 1. It indicates that Snowflake is scheduling the container. | snow_account_name, snow_database_name, snow_schema_name, snow_containers_compute_pool_name, snow_executable_name, snow_containers_container_name, snow_containers_instance_name, snow_node_id |
| spcs_container_state_pending_reason | Gauge | When reported, this metric’s value is always 1, and the reason label provides additional information. | snow_account_name, snow_database_name, snow_schema_name, snow_containers_compute_pool_name, snow_executable_name, snow_containers_container_name, snow_containers_instance_name, snow_node_id, reason |
| spcs_container_state_finished | Gauge | When reported, this metric’s value is always 1. It indicates that the container finished execution. | snow_account_name, snow_database_name, snow_schema_name, snow_containers_compute_pool_name, snow_executable_name, snow_containers_container_name, snow_containers_instance_name, snow_node_id |
| spcs_container_state_last_finished_reason | Gauge | When reported, this metric’s value is always 1, and the reason label provides additional information. | snow_account_name, snow_database_name, snow_schema_name, snow_containers_compute_pool_name, snow_executable_name, snow_containers_container_name, snow_containers_instance_name, snow_node_id, reason |
| spcs_node_allocatable | Gauge | The available resources for each node. The resource label indicates whether the value is a count of CPU cores, bytes of CPU memory, a count of GPUs, or bytes of GPU memory. | snow_instance_family, snow_node_id, resource |
| spcs_volume_available_bytes | Gauge | The space available for service use from the amount of space specified in spcs_volume_free_bytes. | snow_account_name, snow_database_name, snow_schema_name, snow_containers_compute_pool_name, snow_executable_name, snow_containers_container_name, snow_containers_instance_name, snow_node_id |
| spcs_volume_free_bytes | Gauge | The total space available in the filesystem. | snow_account_name, snow_database_name, snow_schema_name, snow_containers_compute_pool_name, snow_executable_name, snow_containers_container_name, snow_containers_instance_name, snow_node_id |
| spcs_volume_io_inflight | Gauge | The number of active filesystem I/O operations. | snow_account_name, snow_database_name, snow_schema_name, snow_containers_compute_pool_name, snow_executable_name, snow_containers_container_name, snow_containers_instance_name, snow_node_id |
| spcs_volume_read_bytes_total | Counter | The total number of bytes read from the filesystem. | snow_account_name, snow_database_name, snow_schema_name, snow_containers_compute_pool_name, snow_executable_name, snow_containers_container_name, snow_containers_instance_name, snow_node_id |
| spcs_volume_read_completed_total | Counter | The total number of completed filesystem read operations. | snow_account_name, snow_database_name, snow_schema_name, snow_containers_compute_pool_name, snow_executable_name, snow_containers_container_name, snow_containers_instance_name, snow_node_id |
| spcs_volume_read_time_seconds_total | Counter | The cumulative time, in seconds, spent on all filesystem read operations. | snow_account_name, snow_database_name, snow_schema_name, snow_containers_compute_pool_name, snow_executable_name, snow_containers_container_name, snow_containers_instance_name, snow_node_id |
| spcs_volume_size_bytes | Gauge | The total capacity of the filesystem. | snow_account_name, snow_database_name, snow_schema_name, snow_containers_compute_pool_name, snow_executable_name, snow_containers_container_name, snow_containers_instance_name, snow_node_id |
| spcs_volume_write_bytes_total | Counter | The total number of bytes written to the filesystem. | snow_account_name, snow_database_name, snow_schema_name, snow_containers_compute_pool_name, snow_executable_name, snow_containers_container_name, snow_containers_instance_name, snow_node_id |
| spcs_volume_write_completed_total | Counter | The total number of completed filesystem write operations. | snow_account_name, snow_database_name, snow_schema_name, snow_containers_compute_pool_name, snow_executable_name, snow_containers_container_name, snow_containers_instance_name, snow_node_id |
| spcs_volume_write_time_seconds_total | Counter | The cumulative time, in seconds, spent on all filesystem write operations. | snow_account_name, snow_database_name, snow_schema_name, snow_containers_compute_pool_name, snow_executable_name, snow_containers_container_name, snow_containers_instance_name, snow_node_id |
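Counter metrics such as spcs_container_cpu_usage only ever grow, so a single reading says little about current load; the usual approach is to compare two readings taken some time apart. The following minimal sketch shows that calculation. The function name and the choice to return None when the counter goes backwards (assumed here to mean the counter was reset, for example after a container restart) are illustrative assumptions, not documented behavior.

```python
from __future__ import annotations

from typing import Optional


def cpu_cores_used(
    prev_seconds: float, curr_seconds: float, interval_seconds: float
) -> Optional[float]:
    """Average number of CPU cores used between two scrapes.

    prev_seconds and curr_seconds are two values of the
    spcs_container_cpu_usage counter for the same container, and
    interval_seconds is the wall-clock time between the two scrapes.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta = curr_seconds - prev_seconds
    if delta < 0:
        # Assumption: a decreasing counter means it was reset, so no
        # meaningful rate can be computed for this interval.
        return None
    return delta / interval_seconds
```

For example, if the counter moves from 100 to 130 CPU-seconds over a 60-second interval, the container used half a core on average. The same delta-over-interval pattern applies to the other Counter metrics in the table, such as spcs_volume_read_bytes_total.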

The labels are described below:

  • snow_account_name: Name of the account that launched the service.

  • snow_database_name: Name of the database that owns the service.

  • snow_schema_name: Name of the schema that owns the service.

  • snow_containers_compute_pool_name: Name of the compute pool on which the service was scheduled.

  • snow_executable_name: Service name.

  • snow_containers_container_name: Container name.

  • snow_containers_instance_name: ID of the container instance.

  • snow_node_id: ID of the node in the compute pool.

  • gpu: Number of the GPU allocated to the container, starting with 0.

  • reason: Explains the container state. This label appears only for metrics whose names end with the reason suffix.

    • spcs_container_state_pending_reason

      • FailedToPullImage: The container cannot pull the image.

      • FailingToStartContainer: The container cannot be started. It is scheduled to the node but then fails.

      • ServiceRunError: A runtime error occurred, resulting in the eviction of the container.

      • ServiceSpecError: The container cannot be scheduled because of an error in the service specification.

      • ServiceCreateError: An error occurred during container initialization.

      • Initializing: The container is currently initializing.

      • Creating: The container is in the process of being created (for example, pulling an image).

  • snow_instance_family: Type of compute pool to which the node belongs (see CREATE COMPUTE POOL).

  • resource: Node resource (cpu, memory, gpu, gpu_memory).

  • unit: Metric granularity (for example, bytes, seconds).
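Labels are what let you join data points from different metrics. As a minimal sketch, the following code combines spcs_container_memory_usage with spcs_container_resource_limits to estimate how close each container is to its memory limit. It assumes samples are already available as hypothetical (name, labels, value) tuples, that limit samples carry a resource label with the value memory (as the metric description suggests), and that service, instance, and container names together identify a container.

```python
from __future__ import annotations


def container_key(labels: dict[str, str]) -> tuple:
    """Identify a container by its service, instance, and container name labels."""
    return (
        labels.get("snow_executable_name"),
        labels.get("snow_containers_instance_name"),
        labels.get("snow_containers_container_name"),
    )


def memory_utilization(samples) -> dict[tuple, float]:
    """Return used/limit memory ratios for containers that report both values."""
    usage: dict[tuple, float] = {}
    limit: dict[tuple, float] = {}
    for name, labels, value in samples:
        if name == "spcs_container_memory_usage":
            usage[container_key(labels)] = value
        elif (
            name == "spcs_container_resource_limits"
            and labels.get("resource") == "memory"
        ):
            limit[container_key(labels)] = value
    # Skip containers with no (or a zero) memory limit to avoid dividing by zero.
    return {key: usage[key] / limit[key] for key in usage if limit.get(key)}
```

A ratio close to 1.0 would suggest that the container is near its configured memory limit. Containers without a reported memory limit are left out of the result rather than guessed at.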

Examples

The following tutorials are provided. The instructions explain how to run the sample code that consumes the metrics and visualizes them.