Snowpark Container Services: Monitoring Services

Accessing container logs

Snowflake collects whatever your application containers output to standard output and standard error. You should ensure that your code outputs useful information to debug a service.

Snowflake provides two ways to access these service (including job service) container logs:

  • Using the SYSTEM$GET_SERVICE_LOGS system function: Gives access to logs from a specific container. After a container exits, you can continue to access the logs using the system function for a short time. System functions are most useful during development and testing, when you are initially authoring a service or a job. For more information, see SYSTEM$GET_SERVICE_LOGS.

  • Using an event table: The account’s event table gives you access to logs from multiple containers for services that enable log collection in their specification. Snowflake persists the logs in the event table for later access. Event tables are best used for the retrospective analysis of services and jobs. For more information, see Using event table.

Using SYSTEM$GET_SERVICE_LOGS

You provide the service name, instance ID, container name, and optionally the number of most recent log lines to retrieve. If only one service instance is running, the service instance ID is 0. For example, the following statement command retrieves the trailing 10 lines from the log of a container named echo that belongs to instance 0 of a service named echo_service:

SELECT SYSTEM$GET_SERVICE_LOGS('echo_service', '0', 'echo', 10);
Copy

Example output:

+--------------------------------------------------------------------------+
| SYSTEM$GET_SERVICE_LOGS                                                  |
|--------------------------------------------------------------------------|
| 10.16.6.163 - - [11/Apr/2023 21:44:03] "GET /healthcheck HTTP/1.1" 200 - |
| 10.16.6.163 - - [11/Apr/2023 21:44:08] "GET /healthcheck HTTP/1.1" 200 - |
| 10.16.6.163 - - [11/Apr/2023 21:44:13] "GET /healthcheck HTTP/1.1" 200 - |
| 10.16.6.163 - - [11/Apr/2023 21:44:18] "GET /healthcheck HTTP/1.1" 200 - |
+--------------------------------------------------------------------------+
1 Row(s) produced. Time Elapsed: 0.878s

If you don’t have the information about the service that you need to call the function (such as the instance ID or container name), you can first run the SHOW SERVICE CONTAINERS IN SERVICE command to get information about the service instances and containers running in each instance.

The SYSTEM$GET_SERVICE_LOGS function has the following limitations:

  • It merges standard output and standard error streams. The function provides no indication of which stream the output came from.

  • It reports the captured data for a specific container in a single service instance.

  • It only reports logs for a running container. The function cannot fetch logs from a previous container that was restarted or from a container of a service that is stopped or deleted.

  • The function returns up to 100 KB of data.

Using event table

Snowflake can capture logs sent from containers to the standard output and standard error streams into the event table configured for your account. For more information about configuring an event table, see Logging, tracing, and metrics.

You control which streams are collected (all, standard error only, or none) that you want stored in an event table by using the spec.logExporters field in the service specification file.

You can then query the event table for events. For example, the following SELECT statement retrieves Snowflake service and job events recorded in the past hour:

SELECT TIMESTAMP, RESOURCE_ATTRIBUTES, RECORD_ATTRIBUTES, VALUE
FROM <current-event-table-for-your-account>
WHERE timestamp > dateadd(hour, -1, current_timestamp())
AND RESOURCE_ATTRIBUTES:"snow.service.name" = <service-name>
AND RECORD_TYPE = 'LOG'
ORDER BY timestamp DESC
LIMIT 10;
Copy

Snowflake recommends that you include a timestamp in the WHERE clause of event table queries, as shown in this example. This is particularly important because of the potential volume of data generated by various Snowflake components. By applying filters, you can retrieve a smaller subset of data, which improves query performance.

The event table includes the following columns, which provide useful information regarding the logs collected by Snowflake from your container:

  • TIMESTAMP: Shows when Snowflake collected the log.

  • RESOURCE_ATTRIBUTES: Provides a JSON object that identifies the Snowflake service and the container in the service that generated the log message. For example, it furnishes details such as the service name, container name, and compute pool name that were specified when the service was run.

    {
      "snow.account.name": "SPCSDOCS1",
      "snow.compute_pool.id": 20,
      "snow.compute_pool.name": "TUTORIAL_COMPUTE_POOL",
      "snow.compute_pool.node.id": "a17e8157",
      "snow.compute_pool.node.instance_family": "CPU_X64_XS",
      "snow.database.id": 26,
      "snow.database.name": "TUTORIAL_DB",
      "snow.schema.id": 212,
      "snow.schema.name": "DATA_SCHEMA",
      "snow.service.container.instance": "0",
      "snow.service.container.name": "echo",
      "snow.service.container.run.id": "b30566",
       "snow.service.id": 114,
      "snow.service.name": "ECHO_SERVICE2",
      "snow.service.type": "Service"
    }
    
    Copy
  • RECORD_ATTRIBUTES: For a Snowflake service, it identifies an error source (standard output or standard error).

    { "log.iostream": "stdout" }
    
    Copy
  • VALUE: Standard output and standard error are broken into lines, and each line generates a record in the event table.

    "echo-service [2023-10-23 17:52:27,429] [DEBUG] Sending response: {'data': [[0, 'Joe said hello!']]}"
    

Accessing metrics

Snowflake provides metrics for compute pools in your account and services running on those compute pools. These metrics, provided by Snowflake, are also referred to as platform metrics.

  • Event-table service metrics: Individual services publish metrics. These are a subset of the compute pool metrics that provide information specific to the service. The target use case for this is to observe the resource utilization of a specific service. In the service specification, you define which metrics you want Snowflake to record in the event table while the service is running.

  • Compute pool metrics: Each compute pool also publishes metrics that provide information about what is happening inside that compute pool. The target use case for this is to observe the compute pool utilization. To access your compute pool metrics, you will need to write a service that uses Prometheus-compatible API to poll the metrics that the compute pool publishes.

Accessing event-table service metrics

To log metrics from a service into the event table configured for your account, include the following section in your service specification:

platformMonitor:
  metricConfig:
    groups:
    - <group 1>
    - <group 2>
    - ...
Copy

Where each group N refers to a predefined metrics group that you are interested in (for example, system, network, or storage). For more information, see the spec.platformMonitor field section in the documentation on the service specification.

While the service is running, Snowflake records these metrics to the event table in your account. You can query your event table to read the metrics. The following query retrieves the service metrics that were recorded in the past hour for the service my_service:

SELECT timestamp, value
  FROM my_event_table_db.my_event_table_schema.my_event_table
  WHERE timestamp > DATEADD(hour, -1, CURRENT_TIMESTAMP())
    AND RESOURCE_ATTRIBUTES:"snow.service.name" = 'MY_SERVICE'
    AND RECORD_TYPE = 'METRIC'
    ORDER BY timestamp DESC
    LIMIT 10;
Copy

If you don’t know the name of the active event table for the account, execute the SHOW PARAMETERS command to display the value of the account-level EVENT_TABLE parameter:

SHOW PARAMETERS LIKE 'event_table' IN ACCOUNT;
Copy

For more information about event tables, see Using event table.

Example

Follow these steps to create an example service that records metrics to the event table configured for your account.

  1. Follow Tutorial 1 to create a service named echo_service, with one change. In step 3, where you create a service, use the following CREATE SERVICE command that add the platformMonitor field in the modified service specification.

    CREATE SERVICE echo_service
      IN COMPUTE POOL tutorial_compute_pool
      FROM SPECIFICATION $$
        spec:
          containers:
          - name: echo
            image: /tutorial_db/data_schema/tutorial_repository/my_echo_service_image:latest
            env:
              SERVER_PORT: 8000
              CHARACTER_NAME: Bob
            readinessProbe:
              port: 8000
              path: /healthcheck
          endpoints:
          - name: echoendpoint
            port: 8000
            public: true
          platformMonitor:
            metricConfig:
              groups:
              - system
              - system_limits
          $$
        MIN_INSTANCES=1
        MAX_INSTANCES=1;
    
    Copy
  2. After the service is running, Snowflake starts recording the metrics in the specified metric groups to the event table, which you can then query. The following query retrieves metrics reported in the last hour by the Echo service.

    SELECT timestamp, value
      ROM my_events
      WHERE timestamp > DATEADD(hour, -1, CURRENT_TIMESTAMP())
        AND RESOURCE_ATTRIBUTES:"snow.service.name" = 'ECHO_SERVICE'
        AND RECORD_TYPE = 'METRIC'
        ORDER BY timestamp, group DESC
        LIMIT 10;
    
    Copy

Accessing compute pool metrics

Compute pool metrics offer insights into the nodes in the compute pool and the services running on them. Each node reports node-specific metrics, such as the amount of available memory for containers, as well as service metrics, like the memory usage by individual containers. The compute pool metrics provide information from a node’s perspective.

Each node has a metrics publisher that listens on TCP port 9001. Other services can make an HTTP GET request with the path /metrics to port 9001 on the node. To discover the node’s IP address, retrieve SRV records (or A records) from DNS for the discover.monitor.compute_pool_name.snowflakecomputing.internal hostname. Then, create another service in your account that actively polls each node to retrieve the metrics.

The body in the response provides the metrics using the Prometheus format as shown in the following example metrics:

# HELP node_memory_capacity Defines SPCS compute pool resource capacity on the node
# TYPE node_memory_capacity gauge
node_memory_capacity{snow_compute_pool_name="MY_POOL",snow_compute_pool_node_instance_family="CPU_X64_S",snow_compute_pool_node_id="10.244.3.8"} 1
node_cpu_capacity{snow_compute_pool_name="MY_POOL",snow_compute_pool_node_instance_family="CPU_X64_S",snow_compute_pool_node_id="10.244.3.8"} 7.21397383168e+09

Note the following:

  • The response body starts with # HELP and # TYPE, which provide a short description and the type of the metric. In this example, the node_memory_capacity metric is of type gauge.

  • It is then followed by the metric’s name, a list of labels describing a specific resource (data point), and its value. In this example, the metric (named node_memory_capacity) provides memory information, indicating that the node has 7.2 GB available memory. The metric also includes metadata in the form of labels as shown:

    snow_compute_pool_name="MY_POOL",
    snow_compute_pool_node_instance_family="CPU_X64_S",snow_compute_pool_node_id="10.244.3.8"
    

You can process these metrics any way you choose; for example, you might store metrics in a database and use a UI (such as a Grafana dashboard) to display the information.

Note

  • Snowflake does not provide any aggregation of metrics. For example, to get metrics for a given service, you must query all nodes that are running instances of that service.

  • The compute pool must have a DNS-compatible name for you to access the metrics.

  • The endpoint exposed by a compute pool can be accessed by a service using a role that has the OWNERSHIP or MONITOR privilege on the compute pool.

For a list of available compute pool metrics, see Available platform metrics.

Example

For an example of configuring Prometheus to poll your compute pool for metrics, see the compute pool metrics tutorials.

Available platform metrics

The following is a list of available platform metrics groups and metrics within each group. Note that storage metrics are currently only collected from block storage volumes.

Metric group . Metric name

Unit

Type

Description

system . container.cpu.usage

cpu cores

gauges

Average number of CPU cores used since last measurement. 1.0 indicates full utilization of 1 CPU core. Max value is number of cpu cores available to the container.

system . container.memory.usage

bytes

gauge

Memory used, in bytes.

system . container.gpu.memory.usage

bytes

gauge

Per-GPU memory used, in bytes. The source GPU is denoted in the ‘gpu’ attribute.

system . container.gpu.utilization

ratio

gauge

Ratio of per-GPU usage to capacity. The source GPU is denoted in the ‘gpu’ attribute.

system_limits . container.cpu.limit

cpu cores

gauge

CPU resource limit from the service specification. If no limit is defined, defaults to node capacity.

system_limits . container.gpu.limit

gpus

gauge

GPU count limit from the service specification. If no limit is defined, the metric is not emitted.

system_limits . container.memory.limit

bytes

gauge

Memory limit from the service specification. If no limit is defined, defaults to node capacity.

system_limits . container.cpu.requested

cpu cores

gauge

CPU resource request from the service specification. If no limit is defined, this defaults to a value chosen by Snowflake.

system_limits . container.gpu.requested

gpus

gauge

GPU count from the service specification. If no limit is defined, the metric is not emitted.

system_limits . container.memory.requested

bytes

gauge

Memory request from the service specification. If no limit is defined, this defaults to a value chosen by Snowflake.

system_limits . container.gpu.memory.capacity

bytes

gauge

Per-GPU memory capacity. The source GPU is denoted in the ‘gpu’ attribute.

status . container.restarts

restarts

gauge

Number of times Snowflake restarted the container.

status . container.state.finished

boolean

gauge

When the container is in the ‘finished’ state, this metric will be emitted with the value 1.

status . container.state.last.finished.reason

boolean

gauge

If the container has restarted previously, this metric will be emitted with the value 1. The ‘reason’ label describes why the container last finished.

status . container.state.last.finished.exitcode

integer

gauge

If a container has restarted previously, this metric will contain the exit code of the previous run.

status . container.state.pending

boolean

gauge

When a container is in the ‘pending’ state, this metric will be emitted with the value 1.

status . container.state.pending.reason

boolean

gauge

When a container is in the ‘pending’ state, this metric will be emitted with value the 1. The ‘reason’ label describes why the container was most recently in the pending state.

status . container.state.running

boolean

gauge

When a container is in the ‘running’ state, this metric will have value the 1.

status . container.state.started

boolean

gauge

When a container is in the ‘started’ state, this metric will have value the 1.

network . network.egress.denied.packets

packets

gauge

Network egress total denied packets due to policy validation failures.

network . network.egress.received.bytes

bytes

gauge

Network egress total bytes received from remote destinations.

network . network.egress.received.packets

packets

gauge

Network egress total packets received from remote destinations.

network . network.egress.transmitted.bytes

byte

gauge

Network egress total bytes transmitted out to remote destinations.

network . network.egress.transmitted.packets

packets

gauge

Network egress total packets transmitted out to remote destinations.

storage . volume.capacity

bytes

gauge

Size of the filesystem.

storage . volume.io.inflight

operations

gauge

Number of active filesystem I/O operations.

storage . volume.read.throughput

bytes/sec

gauge

Filesystem reads throughput in bytes per second.

storage . volume.read.iops

operations/sec

gauge

Filesystem read operations per second.

storage . volume.usage

bytes

gauge

Total number of bytes used in the filesystem.

storage . volume.write.throughput

bytes/sec

gauge

Filesystem write throughput in bytes per second.

storage . volume.write.iops

operations/sec

gauge

Filesystem write operations per second.