Snowpark Container Services: Monitoring Services

Accessing container logs

Snowflake collects whatever your application containers write to standard output and standard error. Make sure that your code outputs information that is useful for debugging a service.

Snowflake provides two ways to access container logs for services (including job services):

  • Using the SYSTEM$GET_SERVICE_LOGS system function: Gives access to logs from a specific container. After a container exits, you can continue to access the logs using the system function for a short time. System functions are most useful during development and testing, when you are initially authoring a service or a job. For more information, see SYSTEM$GET_SERVICE_LOGS.

  • Using an event table: The account’s event table gives you access to logs from multiple containers for services that enable log collection in their specification. Snowflake persists the logs in the event table for later access. Event tables are best used for the retrospective analysis of services and jobs. For more information, see Using event table.

Using SYSTEM$GET_SERVICE_LOGS

You provide the service name, instance ID, container name, and optionally the number of most recent log lines to retrieve. If only one service instance is running, the service instance ID is 0. For example, the following statement retrieves the trailing 10 lines from the log of a container named echo that belongs to instance 0 of a service named echo_service:

SELECT SYSTEM$GET_SERVICE_LOGS('echo_service', '0', 'echo', 10);

Example output:

+--------------------------------------------------------------------------+
| SYSTEM$GET_SERVICE_LOGS                                                  |
|--------------------------------------------------------------------------|
| 10.16.6.163 - - [11/Apr/2023 21:44:03] "GET /healthcheck HTTP/1.1" 200 - |
| 10.16.6.163 - - [11/Apr/2023 21:44:08] "GET /healthcheck HTTP/1.1" 200 - |
| 10.16.6.163 - - [11/Apr/2023 21:44:13] "GET /healthcheck HTTP/1.1" 200 - |
| 10.16.6.163 - - [11/Apr/2023 21:44:18] "GET /healthcheck HTTP/1.1" 200 - |
+--------------------------------------------------------------------------+
1 Row(s) produced. Time Elapsed: 0.878s

If you don’t have the information needed to call the function (such as the instance ID or container name), first run the SHOW SERVICE CONTAINERS IN SERVICE command to get information about the service instances and the containers running in each instance.

The SYSTEM$GET_SERVICE_LOGS function has the following limitations:

  • It merges standard output and standard error streams. The function provides no indication of which stream the output came from.

  • It reports the captured data for a specific container in a single service instance.

  • It only reports logs for a running container. The function cannot fetch logs from a previous container that was restarted or from a container of a service that is stopped or deleted.

  • The function returns up to 100 KB of data.

Using event table

Snowflake can capture logs that containers send to the standard output and standard error streams and store them in the event table configured for your account. For more information about configuring an event table, see Logging, tracing, and metrics.

You use the spec.logExporters field in the service specification file to control which streams (all, standard error only, or none) are collected and stored in the event table.

You can then query the event table for events. To find the active event table for the account, use the SHOW PARAMETERS command to check the value of the EVENT_TABLE parameter:

SHOW PARAMETERS LIKE 'event_table' IN ACCOUNT;

The parameter specifies the active event table for the account.

Next, query that event table. The following SELECT statement retrieves Snowflake service and job events recorded in the past hour:

SELECT TIMESTAMP, RESOURCE_ATTRIBUTES, RECORD_ATTRIBUTES, VALUE
FROM <current-event-table-for-your-account>
WHERE timestamp > dateadd(hour, -1, current_timestamp())
AND RESOURCE_ATTRIBUTES:"snow.service.name" = <service-name>
AND RECORD_TYPE = 'LOG'
ORDER BY timestamp DESC
LIMIT 10;

Snowflake recommends that you include a timestamp in the WHERE clause of event table queries, as shown in this example. This is particularly important because of the potential volume of data generated by various Snowflake components. By applying filters, you can retrieve a smaller subset of data, which improves query performance.

The event table includes the following columns, which provide useful information regarding the logs collected by Snowflake from your container:

  • TIMESTAMP: Shows when Snowflake collected the log.

  • RESOURCE_ATTRIBUTES: Provides a JSON object that identifies the Snowflake service and the container in the service that generated the log message. For example, it furnishes details such as the service name, container name, and compute pool name that were specified when the service was run.

    {
      "snow.account.name": "SPCSDOCS1",
      "snow.compute_pool.id": 20,
      "snow.compute_pool.name": "TUTORIAL_COMPUTE_POOL",
      "snow.compute_pool.node.id": "a17e8157",
      "snow.compute_pool.node.instance_family": "CPU_X64_XS",
      "snow.database.id": 26,
      "snow.database.name": "TUTORIAL_DB",
      "snow.schema.id": 212,
      "snow.schema.name": "DATA_SCHEMA",
      "snow.service.container.instance": "0",
      "snow.service.container.name": "echo",
      "snow.service.container.run.id": "b30566",
       "snow.service.id": 114,
      "snow.service.name": "ECHO_SERVICE2",
      "snow.service.type": "Service"
    }
    
  • RECORD_ATTRIBUTES: For a Snowflake service, it identifies the log source (standard output or standard error).

    { "log.iostream": "stdout" }
    
  • VALUE: Standard output and standard error are broken into lines, and each line generates a record in the event table.

    "echo-service [2023-10-23 17:52:27,429] [DEBUG] Sending response: {'data': [[0, 'Joe said hello!']]}"
    

Accessing platform metrics

Snowflake provides metrics for the compute pools in your account and the services running on those compute pools. These Snowflake-provided metrics are also referred to as platform metrics.

  • Event-table service metrics: Individual services publish metrics. These are a subset of the compute pool metrics that provide information specific to the service. The target use case for this is to observe the resource utilization of a specific service. In the service specification, you define which metrics you want Snowflake to record in the event table while the service is running.

  • Compute pool metrics: Each compute pool also publishes metrics that provide information about what is happening inside that compute pool. The target use case is to observe compute pool utilization. To access compute pool metrics, you write a service that uses a Prometheus-compatible API to poll the metrics that the compute pool publishes.

Accessing event-table service metrics

To log metrics from a service into the event table configured for your account, include the following section in your service specification:

platformMonitor:
  metricConfig:
    groups:
    - <group 1>
    - <group 2>
    - ...

Each <group N> refers to a predefined metrics group that you are interested in (for example, system, network, or storage). For more information, see the spec.platformMonitor field section in the documentation on the service specification.

While the service is running, Snowflake records these metrics to the event table in your account. You can query your event table to read the metrics. The following query retrieves the service metrics that were recorded in the past hour for the service my_service:

SELECT timestamp, value
  FROM my_event_table_db.my_event_table_schema.my_event_table
  WHERE timestamp > DATEADD(hour, -1, CURRENT_TIMESTAMP())
    AND RESOURCE_ATTRIBUTES:"snow.service.name" = 'MY_SERVICE'
    AND RECORD_TYPE = 'METRIC'
    ORDER BY timestamp DESC
    LIMIT 10;

If you don’t know the name of the active event table for the account, execute the SHOW PARAMETERS command to display the value of the account-level EVENT_TABLE parameter:

SHOW PARAMETERS LIKE 'event_table' IN ACCOUNT;

For more information about event tables, see Using event table.

Example

Follow these steps to create an example service that records metrics to the event table configured for your account.

  1. Follow Tutorial 1 to create a service named echo_service, with one change. In step 3, where you create the service, use the following CREATE SERVICE command, which adds the platformMonitor field to the service specification.

    CREATE SERVICE echo_service
      IN COMPUTE POOL tutorial_compute_pool
      FROM SPECIFICATION $$
        spec:
          containers:
          - name: echo
            image: /tutorial_db/data_schema/tutorial_repository/my_echo_service_image:latest
            env:
              SERVER_PORT: 8000
              CHARACTER_NAME: Bob
            readinessProbe:
              port: 8000
              path: /healthcheck
          endpoints:
          - name: echoendpoint
            port: 8000
            public: true
          platformMonitor:
            metricConfig:
              groups:
              - system
              - system_limits
          $$
        MIN_INSTANCES=1
        MAX_INSTANCES=1;
    
  2. After the service is running, Snowflake starts recording the metrics in the specified metric groups to the event table, which you can then query. The following query retrieves the container.cpu.usage metrics reported in the past hour by the echo service.

    SELECT timestamp, value
      FROM my_events
      WHERE timestamp > DATEADD(hour, -1, CURRENT_TIMESTAMP())
        AND RESOURCE_ATTRIBUTES:"snow.service.name" = 'ECHO_SERVICE'
        AND RECORD_TYPE = 'METRIC'
        AND RECORD:metric.name = 'container.cpu.usage'
        ORDER BY timestamp DESC
        LIMIT 100;
    

Accessing compute pool metrics

Compute pool metrics offer insights into the nodes in the compute pool and the services running on them. Each node reports node-specific metrics, such as the amount of available memory for containers, as well as service metrics, like the memory usage by individual containers. The compute pool metrics provide information from a node’s perspective.

Each node has a metrics publisher that listens on TCP port 9001. Other services can make an HTTP GET request with the path /metrics to port 9001 on the node. To discover the node’s IP address, retrieve SRV records (or A records) from DNS for the discover.monitor.compute_pool_name.cp.spcs.internal hostname. Then, create another service in your account that actively polls each node to retrieve the metrics.
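
For example, the following is a minimal Python sketch of such a polling service. It uses only the standard library; the compute pool name is a placeholder, and the hostname pattern and port come from the description above.

import socket
import urllib.request

# Placeholder compute pool name; the discovery hostname pattern is described above.
COMPUTE_POOL_NAME = "tutorial_compute_pool"
DISCOVERY_HOST = f"discover.monitor.{COMPUTE_POOL_NAME}.cp.spcs.internal"

# Resolve the DNS records to get one IP address per compute pool node.
node_ips = sorted({
    info[4][0]
    for info in socket.getaddrinfo(DISCOVERY_HOST, 9001, proto=socket.IPPROTO_TCP)
})

for ip in node_ips:
    # Each node's metrics publisher listens on TCP port 9001 at the /metrics path.
    with urllib.request.urlopen(f"http://{ip}:9001/metrics", timeout=10) as response:
        print(response.read().decode("utf-8"))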

The response body provides the metrics in the Prometheus format, as shown in the following example:

# HELP node_memory_capacity Defines SPCS compute pool resource capacity on the node
# TYPE node_memory_capacity gauge
node_memory_capacity{snow_compute_pool_name="MY_POOL",snow_compute_pool_node_instance_family="CPU_X64_S",snow_compute_pool_node_id="10.244.3.8"} 7.21397383168e+09
node_cpu_capacity{snow_compute_pool_name="MY_POOL",snow_compute_pool_node_instance_family="CPU_X64_S",snow_compute_pool_node_id="10.244.3.8"} 1

Note the following:

  • The response body starts with # HELP and # TYPE, which provide a short description and the type of the metric. In this example, the node_memory_capacity metric is of type gauge.

  • It is then followed by the metric’s name, a list of labels describing a specific resource (data point), and its value. In this example, the metric (named node_memory_capacity) provides memory information, indicating that the node has 7.2 GB available memory. The metric also includes metadata in the form of labels as shown:

    snow_compute_pool_name="MY_POOL",
    snow_compute_pool_node_instance_family="CPU_X64_S",snow_compute_pool_node_id="10.244.3.8"
    

You can process these metrics any way you choose; for example, you might store metrics in a database and use a UI (such as a Grafana dashboard) to display the information.
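
For example, the following Python sketch parses a node's response body with the prometheus-client parser (an assumption; any Prometheus text-format parser works) and prints each sample:

from prometheus_client.parser import text_string_to_metric_families

def print_node_metrics(body: str) -> None:
    # Walk every metric family and sample in the Prometheus text payload.
    for family in text_string_to_metric_families(body):
        for sample in family.samples:
            # sample.labels is a dict, for example {"snow_compute_pool_name": "MY_POOL", ...}
            print(sample.name, sample.labels, sample.value)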

Note

  • Snowflake does not provide any aggregation of metrics. For example, to get metrics for a given service, you must query all nodes that are running instances of that service.

  • The compute pool must have a DNS-compatible name for you to access the metrics.

  • The endpoint exposed by a compute pool can be accessed by a service using a role that has the OWNERSHIP or MONITOR privilege on the compute pool.

For a list of available compute pool metrics, see Available platform metrics.

Example

For an example of configuring Prometheus to poll your compute pool for metrics, see the compute pool metrics tutorials.

Available platform metrics

The following table lists the available platform metric groups and the metrics within each group. Note that storage metrics are currently only collected from block storage volumes.

| Metric group | Metric name | Unit | Type | Description |
|---|---|---|---|---|
| system | container.cpu.usage | cpu cores | gauge | Average number of CPU cores used since the last measurement. 1.0 indicates full utilization of one CPU core. The maximum value is the number of CPU cores available to the container. |
| system | container.memory.usage | bytes | gauge | Memory used, in bytes. |
| system | container.gpu.memory.usage | bytes | gauge | Per-GPU memory used, in bytes. The source GPU is denoted in the 'gpu' attribute. |
| system | container.gpu.utilization | ratio | gauge | Ratio of per-GPU usage to capacity. The source GPU is denoted in the 'gpu' attribute. |
| system_limits | container.cpu.limit | cpu cores | gauge | CPU resource limit from the service specification. If no limit is defined, defaults to node capacity. |
| system_limits | container.gpu.limit | gpus | gauge | GPU count limit from the service specification. If no limit is defined, the metric is not emitted. |
| system_limits | container.memory.limit | bytes | gauge | Memory limit from the service specification. If no limit is defined, defaults to node capacity. |
| system_limits | container.cpu.requested | cpu cores | gauge | CPU resource request from the service specification. If no limit is defined, this defaults to a value chosen by Snowflake. |
| system_limits | container.gpu.requested | gpus | gauge | GPU count from the service specification. If no limit is defined, the metric is not emitted. |
| system_limits | container.memory.requested | bytes | gauge | Memory request from the service specification. If no limit is defined, this defaults to a value chosen by Snowflake. |
| system_limits | container.gpu.memory.capacity | bytes | gauge | Per-GPU memory capacity. The source GPU is denoted in the 'gpu' attribute. |
| status | container.restarts | restarts | gauge | Number of times Snowflake restarted the container. |
| status | container.state.finished | boolean | gauge | When the container is in the 'finished' state, this metric is emitted with the value 1. |
| status | container.state.last.finished.reason | boolean | gauge | If the container has restarted previously, this metric is emitted with the value 1. The 'reason' label describes why the container last finished. |
| status | container.state.last.finished.exitcode | integer | gauge | If the container has restarted previously, this metric contains the exit code of the previous run. |
| status | container.state.pending | boolean | gauge | When the container is in the 'pending' state, this metric is emitted with the value 1. |
| status | container.state.pending.reason | boolean | gauge | When the container is in the 'pending' state, this metric is emitted with the value 1. The 'reason' label describes why the container was most recently in the pending state. |
| status | container.state.running | boolean | gauge | When the container is in the 'running' state, this metric has the value 1. |
| status | container.state.started | boolean | gauge | When the container is in the 'started' state, this metric has the value 1. |
| network | network.egress.denied.packets | packets | gauge | Total network egress packets denied due to policy validation failures. |
| network | network.egress.received.bytes | bytes | gauge | Total network egress bytes received from remote destinations. |
| network | network.egress.received.packets | packets | gauge | Total network egress packets received from remote destinations. |
| network | network.egress.transmitted.bytes | bytes | gauge | Total network egress bytes transmitted to remote destinations. |
| network | network.egress.transmitted.packets | packets | gauge | Total network egress packets transmitted to remote destinations. |
| storage | volume.capacity | bytes | gauge | Size of the filesystem. |
| storage | volume.io.inflight | operations | gauge | Number of active filesystem I/O operations. |
| storage | volume.read.throughput | bytes/sec | gauge | Filesystem read throughput in bytes per second. |
| storage | volume.read.iops | operations/sec | gauge | Filesystem read operations per second. |
| storage | volume.usage | bytes | gauge | Total number of bytes used in the filesystem. |
| storage | volume.write.throughput | bytes/sec | gauge | Filesystem write throughput in bytes per second. |
| storage | volume.write.iops | operations/sec | gauge | Filesystem write operations per second. |

Accessing service query history

You can find queries executed by your service by filtering the QUERY_HISTORY view or QUERY_HISTORY function where user_type is SNOWFLAKE_SERVICE.

Example 1: Fetch queries run by a service.

SELECT *
FROM snowflake.account_usage.query_history
WHERE user_type = 'SNOWFLAKE_SERVICE'
AND user_name = '<service_name>'
AND user_database_name = '<service_db_name>'
AND user_schema_name = '<service_schema_name>'
ORDER BY start_time;

In the WHERE clause:

  • user_name = '<service_name>': You specify the service name as the user name because a service executes queries as the service user, and the service user’s name is the same as the service name.

  • user_type = 'SNOWFLAKE_SERVICE' and user_name = '<service_name>': This limits the query result to retrieve only queries executed by a service.

  • user_database_name and user_schema_name: For a service user, these are the service’s database and schema.

You can get the same results by calling the QUERY_HISTORY table function:

SELECT *
FROM TABLE(<service_db_name>.information_schema.query_history())
WHERE user_database_name = '<service_db_name>'
AND user_schema_name = '<service_schema_name>'
AND user_type = 'SNOWFLAKE_SERVICE'
AND user_name = '<service_name>'
ORDER BY start_time;

In the WHERE clause:

  • user_type = 'SNOWFLAKE_SERVICE' and user_name = '<service_name>' limit the query result to retrieve only queries executed by a service.

  • user_database_name and user_schema_name (for a service user) are the service’s database and schema.

Example 2: Fetch queries run by services and the corresponding service information.

SELECT query_history.*, services.*
FROM snowflake.account_usage.query_history
JOIN snowflake.account_usage.services
ON query_history.user_name = services.service_name
AND query_history.user_schema_id = services.service_schema_id
AND query_history.user_type = 'SNOWFLAKE_SERVICE';

The query joins the QUERY_HISTORY and SERVICES views to retrieve information about the queries and services that executed the queries. Note the following:

  • For queries run by services, the query_history.user_name is the service user’s name, which is the same as the service name.

  • The query joins the views using the schema IDs (not schema name) to ensure you refer to the same schema, because if you drop and recreate a schema, the schema ID changes but the name remains the same.

You can add optional filters to the query. For example:

  • Filter query_history to retrieve only services that executed specific queries.

  • Filter services to retrieve only queries executed by specific services.

Example 3: For every service, fetch service user information.

SELECT services.*, users.*
FROM snowflake.account_usage.users
JOIN snowflake.account_usage.services
ON users.name = services.service_name
AND users.schema_id = services.service_schema_id
AND users.type = 'SNOWFLAKE_SERVICE';

The query joins the SERVICES and USERS views in the ACCOUNT_USAGE schema to retrieve service and service user information. Note the following:

  • When a service runs queries, it runs them as the service user, and the service user’s name is the same as the service name. Therefore, the query specifies the join condition users.name = services.service_name.

  • Service names are unique only within a schema. Therefore, the query also specifies the join condition users.schema_id = services.service_schema_id to ensure that each service user is matched with the specific service it belongs to (and not a same-named service in a different schema).

Publishing and accessing application metrics

In contrast to platform metrics, which Snowflake generates, application metrics and traces are generated by your service. Your service containers can generate OTLP or Prometheus metrics, and Snowflake publishes them to the event table configured for your account.

Ensure that your service container code emits metrics with the correct units, aggregations, and instrument types so that the metrics are meaningful and effective for your analysis.

Publishing OTLP application metrics and traces

Snowflake runs an OTel collector that your service container can use to publish OTLP application metrics and traces. That is, a service container can push metrics and traces to the OTel collector endpoints, and Snowflake writes them to the event table configured for your Snowflake account along with details that identify the originating service.

It works as follows:

  • Snowflake automatically populates the following environment variables in your service container. They provide the OTel collector endpoints where containers can publish application metrics and traces:

    • OTEL_EXPORTER_OTLP_METRICS_ENDPOINT

    • OTEL_EXPORTER_OTLP_TRACES_ENDPOINT

  • The standard OTLP client looks for these environment variables to discover the OTel collector automatically. This enables your service container to publish metrics and traces using this client.
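
For illustration, the following Python sketch publishes a counter through the OTLP exporter, which picks up the endpoint from the environment variable that Snowflake sets. The choice of the HTTP/protobuf exporter and the metric and scope names are assumptions; adjust them for your application.

from opentelemetry import metrics
from opentelemetry.exporter.otlp.proto.http.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader

# With no endpoint argument, the exporter reads OTEL_EXPORTER_OTLP_METRICS_ENDPOINT,
# which Snowflake sets inside the service container.
reader = PeriodicExportingMetricReader(OTLPMetricExporter())
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

meter = metrics.get_meter("echo_service")  # placeholder instrumentation scope name
request_counter = meter.create_counter(
    "app.requests", unit="1", description="Number of requests handled"
)
request_counter.add(1, {"endpoint": "/echo"})  # record one handled request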

Configuring OTLP application Trace IDs

Traces must use the Snowflake Trace ID format to be viewable in Snowflake Trail and allow for performant lookup.

Snowflake provides Python and Java libraries to simplify Trace ID generator setup. The following examples show how to override the default OpenTelemetry trace ID generator with these libraries.

from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from snowflake.telemetry.trace import SnowflakeTraceIdGenerator

SERVICE_NAME = "my-service"  # replace with your service name

trace_id_generator = SnowflakeTraceIdGenerator()
tracer_provider = TracerProvider(
    resource=Resource.create({"service.name": SERVICE_NAME}),
    id_generator=trace_id_generator
)

For more information, see snowflake-telemetry-python on PyPI.

import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.sdk.autoconfigure.AutoConfiguredOpenTelemetrySdk;
import com.snowflake.telemetry.trace.SnowflakeTraceIdGenerator;

import java.util.Map;

static OpenTelemetry initOpenTelemetry() {
  return AutoConfiguredOpenTelemetrySdk.builder()
      .addPropertiesSupplier(
          () ->
              Map.of(...config options...))
      .addTracerProviderCustomizer(
          (tracerProviderBuilder, configProperties) -> {
            tracerProviderBuilder.setIdGenerator(SnowflakeTraceIdGenerator.INSTANCE);
            return tracerProviderBuilder;
          })
      .build()
      .getOpenTelemetrySdk();
}

For more information about installing com.snowflake.telemetry, see Setting up your Java and Scala environment to use the Telemetry class.

A trace ID generator can be implemented for any other programming language as well. The 16-byte ID (big endian) must contain a timestamp in the four highest-order bytes. The other bytes should contain random bits. For more information, see Python reference implementation.
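
For illustration only, a Python sketch of that layout might look like the following; the timestamp precision (seconds) is an assumption, so follow the Python reference implementation for the exact scheme.

import os
import struct
import time

def snowflake_compatible_trace_id() -> int:
    # 4 highest-order bytes: big-endian timestamp (seconds assumed); remaining 12 bytes: random.
    high_order = struct.pack(">I", int(time.time()))
    return int.from_bytes(high_order + os.urandom(12), "big")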

Publishing Prometheus application metrics

Snowflake also supports Prometheus metrics: instead of pushing OTLP metrics, your application can expose Prometheus metrics for a Snowflake-provided collector to poll. For Snowflake to collect these application metrics from your service and publish them to the event table, follow these steps:

  • Have your service listen on a port that exposes your Prometheus metrics (a sketch follows the note below).

  • Include in your service a Snowflake-provided container (also referred to as a “sidecar” container) with the configuration necessary to pull the metrics from your service container.

The Prometheus sidecar pulls the application metrics from the container at a scheduled frequency, converts the Prometheus format to OTLP format, and pushes the metrics to the OTel collector. The OTel collector then publishes those metrics into the event table configured for your Snowflake account.

Note

Snowflake doesn’t support the Prometheus Summary metric type because it is deprecated by OpenTelemetry. Use the Histogram type instead.
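
For example, your service container might expose metrics as in the following minimal Python sketch, which uses the prometheus-client library to serve a Counter and a Histogram on port 8000 at the /metrics path (matching the sidecar configuration shown next). The metric names are placeholders.

import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("app_requests_total", "Total requests handled")
LATENCY = Histogram("app_request_latency_seconds", "Request latency in seconds")  # use Histogram, not Summary

if __name__ == "__main__":
    start_http_server(8000)  # serves the metrics at http://localhost:8000/metrics
    while True:
        REQUESTS.inc()
        LATENCY.observe(random.random())  # stand-in for real request handling
        time.sleep(5)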

You add the Prometheus sidecar container to the service specification as another container and include an argument to specify the HTTP endpoint exposed by your container, using the following format:

localhost:{PORT}/{METRICS_PATH}, {SCRAPE_FREQUENCY}

The argument specifies the port number, the path, and the frequency at which the sidecar should pull the metrics.

The following example service specification fragment shows the sidecar container scraping metrics from your service container every minute, using port 8000 and the path “/metrics”:

spec:
  containers:
  - name: <name>
    image: <image-name>
    .....
  - name: prometheus
    image: /snowflake/images/snowflake_images/monitoring-prometheus-sidecar:0.0.1
    args:
      - "-e"
      - "localhost:8000/metrics,1m"

In the specification:

  • image is the Snowflake-provided sidecar container image.

  • args provides the configuration the prometheus container needs to scrape metrics:

    • From port 8000, which your container exposes. The port is required in this prometheus container configuration.

    • Using the path “/metrics”. The path is optional; if not specified, “/metrics” is the default.

    • Every minute. The frequency is optional; if not specified, “1m” is the default.

    If you rely on the defaults, the following configuration for scraping metrics is equivalent:

    spec:
        ...
        args:
          - "-e"
          - "localhost:8000"
    

Note

The Prometheus sidecar container is only supported for services (not jobs). If you want to collect application metrics for a job, it must push the metrics to the OTel collector.

Accessing application metrics and traces in the event table

You can query the event table to retrieve application metrics. The following query retrieves the application metrics collected in the past hour.

SELECT timestamp, record:metric.name, value
  FROM <current_event_table_for_your_account>
  WHERE timestamp > dateadd(hour, -1, CURRENT_TIMESTAMP())
    AND resource_attributes:"snow.service.name" = <service_name>
    AND scope:"name" != 'snow.spcs.platform'
    AND record_type = 'METRIC'
  ORDER BY timestamp DESC
  LIMIT 10;

For more information about event tables, see Event table overview. You can visualize these metrics in Snowflake dashboards.

You can also query your event table to view the application traces. For example, to retrieve application traces from the past hour, in the preceding query, replace the record_type condition as follows:

AND (record_type = 'SPAN' OR record_type = 'SPAN_EVENT')

Traces can be visualized in the Snowflake Trail viewer.

Metrics and traces contain both user-defined and Snowflake-defined attributes as resource and record attributes. Note that the snow. prefix is reserved for Snowflake-generated attributes; Snowflake ignores custom attributes that use this prefix. For a list of Snowflake-defined attributes, see Available platform metrics.

Example code in both Python and Java demonstrates instrumenting an application with custom metrics and traces using the OTLP SDK. The examples show how to configure Snowflake Trace ID generation so that traces are compatible with the Snowflake Trail viewer.
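
The following is a rough Python sketch of that setup. It assumes the collector accepts OTLP over HTTP/protobuf (use the gRPC exporter otherwise); the service name and attribute names are placeholders.

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from snowflake.telemetry.trace import SnowflakeTraceIdGenerator

# The exporter reads OTEL_EXPORTER_OTLP_TRACES_ENDPOINT, which Snowflake sets in the container.
tracer_provider = TracerProvider(
    resource=Resource.create({"service.name": "my-service"}),  # placeholder service name
    id_generator=SnowflakeTraceIdGenerator(),
)
tracer_provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(tracer_provider)

tracer = trace.get_tracer("my-service")
with tracer.start_as_current_span("handle-request") as span:
    # Custom attribute; avoid the reserved "snow." prefix.
    span.set_attribute("app.request.size_bytes", 128)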