Compute Pool Metrics
Compute pool metrics provide data points that include:
Data points about nodes in the compute pool (for example, the amount of free memory available for use by containers on a node).
Data points about the services and jobs running on compute pool nodes (for example, memory used by a specific container).
Each node provides node-specific metrics, including metrics for services running on the node. Note that there is no metrics aggregation across nodes. The metrics publisher on each node listens on TCP port 9001. To access the metrics, connect to the node directly using its IP address. To discover the node’s IP address, retrieve the SRV records (or A records) for the discover.monitor.<compute_pool_name>.snowflakecomputing.internal hostname from DNS.
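The discovery step can be sketched in Python as follows. This is a minimal sketch, not an official client: the function names are hypothetical, and it resolves only A records through the standard library's socket.getaddrinfo (the standard library has no SRV lookup, so SRV-based discovery would need a separate DNS library). The resolver parameter exists only so the logic can be exercised without real DNS.

```python
import socket

METRICS_PORT = 9001  # TCP port of the per-node metrics publisher, as stated above


def discovery_hostname(compute_pool_name: str) -> str:
    """Build the discovery hostname for a compute pool (helper name is hypothetical)."""
    return f"discover.monitor.{compute_pool_name}.snowflakecomputing.internal"


def resolve_node_ips(hostname: str, resolver=socket.getaddrinfo) -> list[str]:
    """Return the distinct IPv4 addresses (A records) behind the hostname."""
    infos = resolver(hostname, METRICS_PORT, socket.AF_INET, socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr is (ip, port).
    return sorted({info[4][0] for info in infos})
```

Inside a service, a call such as resolve_node_ips(discovery_hostname("my_pool")) would return one address per node that DNS reports; the pool name here is a placeholder.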
Services and jobs in your account can query IPs to retrieve metrics from nodes in a compute pool. The role needs either the OWNERSHIP or MONITOR privilege on the compute pool to access metrics.
Services and jobs can connect to each node’s metrics publisher using an HTTP GET request with the path /metrics. The body of the response provides the metrics in the Prometheus format, as shown in the following example:
# HELP spcs_node_allocatable Defines SPCS compute pool resources available on the node
# TYPE spcs_node_allocatable gauge
spcs_node_allocatable{resource="cpu",snow_containers_compute_pool_name="MY_POOL",snow_instance_family="CPU_X64_S",snow_node_id="775776fd"} 1
spcs_node_allocatable{resource="memory",snow_containers_compute_pool_name="MY_POOL",snow_instance_family="CPU_X64_S",snow_node_id="775776fd"} 7.21397383168e+09
Note that:
Each metric starts with # HELP and # TYPE lines that provide a short description and the type of the metric. In this example, the spcs_node_allocatable metric is of type gauge.
It is then followed by the metric’s name, a list of labels describing a specific resource (data point), and its value. In this example, the metric (named spcs_node_allocatable) provides CPU and memory information, indicating that the node has 1 CPU core that can be allocated and 7.2 GB of available memory. For example, the first (CPU) data point includes these labels:
resource="cpu", snow_containers_compute_pool_name="MY_POOL", snow_instance_family="CPU_X64_S", snow_node_id="775776fd"
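The line structure described above (metric name, labels, value) can be read with a small parser. The following is a simplified sketch rather than a complete implementation of the Prometheus text format: the function name and the returned dictionary shape are hypothetical, escaped characters inside label values are not unescaped, and lines that do not match are skipped.

```python
import re

# One sample line: name{label="value",...} value [optional timestamp]
_SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)(?:\s+-?\d+)?$'
)
_LABEL_RE = re.compile(r'(?P<key>[A-Za-z_][A-Za-z0-9_]*)="(?P<val>(?:[^"\\]|\\.)*)"')


def parse_metrics(text: str) -> list[dict]:
    """Parse metrics text into dicts with 'name', 'labels', and 'value' keys."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the # HELP / # TYPE comments
        match = _SAMPLE_RE.match(line)
        if match is None:
            continue  # simplified sketch: ignore anything unexpected
        labels = {
            m.group("key"): m.group("val")
            for m in _LABEL_RE.finditer(match.group("labels") or "")
        }
        samples.append(
            {"name": match.group("name"), "labels": labels, "value": float(match.group("value"))}
        )
    return samples
```

Applied to the example response above, this yields two samples for spcs_node_allocatable, one with resource="cpu" and one with resource="memory".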
You can process these metrics any way you choose; for example, you might store the metrics in a database and use a UI to display the information, or you might create a Grafana dashboard. For examples, see compute pool metrics tutorials.
For a list of available compute pool metrics, see List of available metrics.
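Before any processing, the metrics have to be retrieved from each node. The sketch below shows one way to issue the HTTP GET request described earlier, using only the Python standard library. It assumes plain http (the text says HTTP GET but does not name a scheme), the function names are hypothetical, and the opener parameter is there only so the code can be exercised without a live node. Because metrics are not aggregated across nodes, the last helper queries every node separately.

```python
import urllib.request

METRICS_PORT = 9001        # port of the metrics publisher, as stated above
METRICS_PATH = "/metrics"  # request path, as stated above


def metrics_url(node_ip: str) -> str:
    """Build the metrics URL for one node (assumes plain http)."""
    return f"http://{node_ip}:{METRICS_PORT}{METRICS_PATH}"


def fetch_metrics(node_ip: str, timeout: float = 5.0, opener=urllib.request.urlopen) -> str:
    """Fetch the Prometheus-format response body from one node."""
    with opener(metrics_url(node_ip), timeout=timeout) as response:
        return response.read().decode("utf-8")


def fetch_all(node_ips, opener=urllib.request.urlopen) -> dict:
    """Fetch metrics from every node; returns a mapping of node IP to response body."""
    return {ip: fetch_metrics(ip, opener=opener) for ip in node_ips}
```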
General guidelines
The following guidelines apply when working with compute pool metrics:
In order for you to access the metrics for a compute pool, the compute pool must have a DNS-compatible name.
The endpoint exposed by a compute pool can be accessed only by roles that have the OWNERSHIP or MONITOR privilege on the compute pool.
List of available metrics
Each compute pool publishes a set of metrics about the nodes in the compute pool and the containers running in that compute pool. You can create a service to monitor these metrics. For examples, see the Compute pool metrics tutorials.
The metric values represent the state of the system at the time you query the metrics.
| Metric Name | Type | Description | Labels |
|---|---|---|---|
| spcs_container_cpu_usage | Counter | The cumulative (total) CPU time, in seconds, that a specific container has consumed since it started running. | snow_account_name, snow_database_name, snow_schema_name, snow_containers_compute_pool_name, snow_executable_name, snow_containers_container_name, snow_containers_instance_name, snow_node_id |
| spcs_container_memory_usage | Gauge | The amount of memory, in bytes, used by a container. | snow_account_name, snow_database_name, snow_schema_name, snow_containers_compute_pool_name, snow_executable_name, snow_containers_container_name, snow_containers_instance_name, snow_node_id |
| spcs_container_restarts | Counter | The number of times Snowflake restarted the container. | snow_account_name, snow_database_name, snow_schema_name, snow_containers_compute_pool_name, snow_executable_name, snow_containers_container_name, snow_containers_instance_name, snow_node_id |
| spcs_container_resource_requests | Gauge | The requested resources (as per service specification) for the container. The | snow_account_name, snow_database_name, snow_schema_name, snow_containers_compute_pool_name, snow_executable_name, snow_containers_container_name, snow_containers_instance_name, snow_node_id |
| spcs_container_resource_limits | Gauge | The resource limit (as per service specification) for the container. The | snow_account_name, snow_database_name, snow_schema_name, snow_containers_compute_pool_name, snow_executable_name, snow_containers_container_name, snow_containers_instance_name, snow_node_id |
| spcs_container_gpu_utilization | Gauge | The GPU compute utilization, where 1.0 indicates 100% utilization. The | snow_account_name, snow_database_name, snow_schema_name, snow_containers_compute_pool_name, snow_executable_name, snow_containers_container_name, snow_containers_instance_name, snow_node_id, gpu |
| spcs_container_gpu_memory_utilization | Gauge | The amount of GPU memory, in bytes, used by a container at a given time. The | snow_account_name, snow_database_name, snow_schema_name, snow_containers_compute_pool_name, snow_executable_name, snow_containers_container_name, snow_containers_instance_name, snow_node_id, gpu |
| spcs_container_status_unschedulable | Gauge | When reported, this metric value will always be 1 and it indicates Snowflake is not able to schedule the container on the compute pool. | snow_account_name, snow_database_name, snow_schema_name, snow_containers_compute_pool_name, snow_executable_name, snow_containers_container_name, snow_containers_instance_name, snow_node_id |
| spcs_container_state_started | Gauge | When reported, this metric value will always be 1 and it indicates Snowflake has started the container but it is not yet running. | snow_account_name, snow_database_name, snow_schema_name, snow_containers_compute_pool_name, snow_executable_name, snow_containers_container_name, snow_containers_instance_name, snow_node_id |
| spcs_container_state_running | Gauge | When reported, this metric value will always be 1 and it indicates the container is running and available through the defined endpoints. | snow_account_name, snow_database_name, snow_schema_name, snow_containers_compute_pool_name, snow_executable_name, snow_containers_container_name, snow_containers_instance_name, snow_node_id |
| spcs_container_state_pending | Gauge | When reported, this metric value will always be 1 and it indicates Snowflake is scheduling the container. | snow_account_name, snow_database_name, snow_schema_name, snow_containers_compute_pool_name, snow_executable_name, snow_containers_container_name, snow_containers_instance_name, snow_node_id |
| spcs_container_state_pending_reason | Gauge | When reported, this metric value will always be 1 and the | snow_account_name, snow_database_name, snow_schema_name, snow_containers_compute_pool_name, snow_executable_name, snow_containers_container_name, snow_containers_instance_name, snow_node_id, reason |
| spcs_container_state_finished | Gauge | When reported, this metric value will always be 1 and it indicates the container finished execution. | snow_account_name, snow_database_name, snow_schema_name, snow_containers_compute_pool_name, snow_executable_name, snow_containers_container_name, snow_containers_instance_name, snow_node_id |
| spcs_container_state_last_finished_reason | Gauge | When reported, this metric value will always be 1 and the | snow_account_name, snow_database_name, snow_schema_name, snow_containers_compute_pool_name, snow_executable_name, snow_containers_container_name, snow_containers_instance_name, snow_node_id, reason |
| spcs_node_allocatable | Gauge | The available resources for each node. The | snow_instance_family, snow_node_id, resource |
| spcs_volume_available_bytes | Gauge | The space available for service use from the amount of space specified in | snow_account_name, snow_database_name, snow_schema_name, snow_containers_compute_pool_name, snow_executable_name, snow_containers_container_name, snow_containers_instance_name, snow_node_id |
| spcs_volume_free_bytes | Gauge | The total space available in the filesystem. | snow_account_name, snow_database_name, snow_schema_name, snow_containers_compute_pool_name, snow_executable_name, snow_containers_container_name, snow_containers_instance_name, snow_node_id |
| spcs_volume_io_inflight | Gauge | The number of active filesystem I/O operations. | snow_account_name, snow_database_name, snow_schema_name, snow_containers_compute_pool_name, snow_executable_name, snow_containers_container_name, snow_containers_instance_name, snow_node_id |
| spcs_volume_read_bytes_total | Counter | The total number of bytes read from the filesystem. | snow_account_name, snow_database_name, snow_schema_name, snow_containers_compute_pool_name, snow_executable_name, snow_containers_container_name, snow_containers_instance_name, snow_node_id |
| spcs_volume_read_completed_total | Counter | The total number of completed filesystem read operations. | snow_account_name, snow_database_name, snow_schema_name, snow_containers_compute_pool_name, snow_executable_name, snow_containers_container_name, snow_containers_instance_name, snow_node_id |
| spcs_volume_read_time_seconds_total | Counter | The cumulative time, in seconds, spent on all filesystem read operations. | snow_account_name, snow_database_name, snow_schema_name, snow_containers_compute_pool_name, snow_executable_name, snow_containers_container_name, snow_containers_instance_name, snow_node_id |
| spcs_volume_size_bytes | Gauge | The total capacity of the filesystem. | snow_account_name, snow_database_name, snow_schema_name, snow_containers_compute_pool_name, snow_executable_name, snow_containers_container_name, snow_containers_instance_name, snow_node_id |
| spcs_volume_write_bytes_total | Counter | The total number of bytes written to the filesystem. | snow_account_name, snow_database_name, snow_schema_name, snow_containers_compute_pool_name, snow_executable_name, snow_containers_container_name, snow_containers_instance_name, snow_node_id |
| spcs_volume_write_completed_total | Counter | The total number of completed filesystem write operations. | snow_account_name, snow_database_name, snow_schema_name, snow_containers_compute_pool_name, snow_executable_name, snow_containers_container_name, snow_containers_instance_name, snow_node_id |
| spcs_volume_write_time_seconds_total | Counter | The cumulative time, in seconds, spent on all filesystem write operations. | snow_account_name, snow_database_name, snow_schema_name, snow_containers_compute_pool_name, snow_executable_name, snow_containers_container_name, snow_containers_instance_name, snow_node_id |
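Counter metrics such as spcs_container_cpu_usage are cumulative, so a single reading says little on its own; a rate is obtained by comparing two readings taken at different times. The sketch below illustrates that arithmetic. The function name is hypothetical, and the handling of a decreasing value assumes the counter restarts from zero (for example, after a container restart), which is the usual Prometheus convention rather than something this page states.

```python
def counter_rate(prev_value: float, prev_time: float,
                 curr_value: float, curr_time: float) -> float:
    """Average per-second increase of a counter between two readings.

    Times are in seconds (for example, values from time.time()).
    """
    elapsed = curr_time - prev_time
    if elapsed <= 0:
        raise ValueError("curr_time must be later than prev_time")
    delta = curr_value - prev_value
    if delta < 0:
        # Assumption: the counter was reset to zero between the two readings,
        # so the current value is the increase since the reset.
        delta = curr_value
    return delta / elapsed
```

For spcs_container_cpu_usage, which counts CPU seconds, readings of 120 and 150 taken 60 seconds apart give 30 / 60 = 0.5, that is, an average of half a CPU core over the interval.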
The labels are described below:
snow_account_name: Name of the account that launched the service.
snow_database_name: Name of the database that owns the service.
snow_schema_name: Name of the schema that owns the service.
snow_containers_compute_pool_name: Name of the compute pool on which the service was scheduled.
snow_executable_name: Service name.
snow_containers_container_name: Container name.
snow_containers_instance_name: ID of the container instance.
snow_node_id: ID of the node in the compute pool.
gpu: GPU number allocated to the container, starting with 0.
reason: Explains the container state. This label appears only for metrics that end with the reason suffix.
  spcs.container.state.pending.reason:
  FailedToPullImage: Container cannot pull the image.
  FailingToStartContainer: Container cannot be started. It is scheduled on the node, but then fails.
  ServiceRunError: A runtime error occurred, resulting in eviction of the container.
  ServiceSpecError: Container cannot be scheduled because of an error in the service specification.
  ServiceCreateError: Error during container initialization.
  Initializing: Container is currently initializing.
  Creating: Container is in the process of being created, for example, pulling an image.
snow_instance_family: Compute pool type to which the node belongs (see CREATE COMPUTE POOL).
resource: Node resource (cpu, memory, gpu, gpu_memory).
unit: Metric granularity (for example, bytes, seconds).
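As an illustration of how the labels can be used, the sketch below collects the reason label from spcs_container_state_pending_reason samples, keyed by service, container, and instance. The function name is hypothetical, and the input shape (a list of dictionaries with name, labels, and value keys) is an assumption about how the metrics were parsed, not something the metrics endpoint returns directly.

```python
PENDING_REASON_METRIC = "spcs_container_state_pending_reason"


def pending_reasons(samples: list[dict]) -> dict:
    """Map (service, container, instance) to the reported pending reason.

    Each sample is assumed to look like:
    {"name": "<metric name>", "labels": {"<label>": "<value>", ...}, "value": 1.0}
    """
    reasons = {}
    for sample in samples:
        if sample["name"] != PENDING_REASON_METRIC:
            continue  # only the pending-reason metric carries the label we want
        labels = sample["labels"]
        key = (
            labels.get("snow_executable_name"),
            labels.get("snow_containers_container_name"),
            labels.get("snow_containers_instance_name"),
        )
        reasons[key] = labels.get("reason")
    return reasons
```

A monitoring service could, for example, alert when a container stays in FailedToPullImage across several consecutive queries.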
Examples
The following tutorials are provided. The instructions explain how you run the sample code that consumes the metrics and provides visualization.