Monitor Openflow

This topic describes how to monitor the state of Openflow and troubleshoot problems.

Accessing Openflow logs

Snowflake sends Openflow logs to the event table you configured when you set up Openflow. Snowflake recommends that you include a timestamp in the WHERE clause of event table queries. This is particularly important because of the potential volume of data generated by various Snowflake components. By applying filters, you can retrieve a smaller subset of data, which improves query performance.
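For example, a query along these lines (a sketch; `EVENTS_<account-id>` stands for your configured event table, as in the examples later in this topic) limits the scan to the last hour:

```sql
-- Sketch: restrict the event table scan with a timestamp filter.
-- Replace EVENTS_<account-id> with the name of your event table.
SELECT timestamp, value
FROM openflow.telemetry.EVENTS_<account-id>
WHERE timestamp > dateadd(hour, -1, sysdate())
AND resource_attributes:application = 'openflow'
LIMIT 100;
```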

The event table includes the following columns, which provide useful information regarding the logs collected by Snowflake from Openflow:

  • TIMESTAMP: Shows when Snowflake collected the log.

  • RESOURCE_ATTRIBUTES: Provides a JSON object that identifies the Snowflake service that generated the log message. For example, it provides details such as the application and data plane ID for Openflow.

    {
    "application": "openflow",
    "cloud.service.provider": "aws",
    "k8s.container.name": "pg-dev-server",
    "k8s.container.restart_count": "0",
    "k8s.namespace.name": "runtime-pg-dev",
    "k8s.node.name": "ip-10-10-62-36.us-east-2.compute.internal",
    "k8s.pod.name": "pg-dev-0",
    "k8s.pod.start_time": "2025-04-25T22:14:29Z",
    "k8s.pod.uid": "94610175-1685-4c8f-b0a1-42898d1058e6",
    "k8s.statefulset.name": "pg-dev",
    "openflow.dataplane.id": "abeddb4f-95ae-45aa-95b1-b4752f30c64a"
    }
    
  • RECORD_ATTRIBUTES: For a Snowflake service, identifies the log source (standard output or standard error).

    {
    "log.file.path": "/var/log/pods/runtime-pg-dev_pg-dev-0_94610175-1685-4c8f-b0a1-42898d1058e6/pg-dev-server/0.log",
    "log.iostream": "stdout",
    "logtag": "F"
    }
    
  • VALUE: Standard output and standard error are split into lines, with each line generating its own record in the event table.

    "{\"timestamp\":1746655642080,\"nanoseconds\":80591397,\"level\":\"INFO\",\"threadName\":\"Clustering Tasks Thread-2\",\"loggerName\":\"org.apache.nifi.controller.cluster.ClusterProtocolHeartbeater\",\"formattedMessage\":\"Heartbeat created at 2025-05-07T22:07:22.071Z and sent to pg-dev-0.pg-dev.runtime-pg-dev.svc.cluster.local:8445 at 2025-05-07T22:07:22.080590784Z; determining Cluster Coordinator took 7 millis; DNS lookup for coordinator took 0 millis; connecting to coordinator took 0 millis; sending heartbeat took 1 millis; receiving first byte from response took 1 millis; receiving full response took 1 millis; total time was 9 millis\",\"throwable\":null}"
    

Examples

Find error-level logs for runtimes

SELECT
    timestamp,
    resource_attributes:"k8s.namespace.name" AS runtime_key,
    parse_json(value::string):loggerName AS logger,
    parse_json(value::string):formattedMessage AS log_value
FROM openflow.telemetry.EVENTS_<account-id>
WHERE true
AND timestamp > dateadd('days', -1, sysdate())
AND record_type = 'LOG'
AND resource_attributes:"k8s.namespace.name" LIKE 'runtime-%'
AND resource_attributes:"k8s.container.name" LIKE '%-server'
AND parse_json(value::string):level = 'ERROR'
ORDER BY timestamp desc
LIMIT 5;
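A grouped variant of the query above (a sketch, not part of the original examples) counts error-level log lines per runtime over the last day, which can help identify which runtime is failing most often:

```sql
SELECT
    resource_attributes:"k8s.namespace.name" AS runtime_key,
    COUNT(*) AS error_count
FROM openflow.telemetry.EVENTS_<account-id>
WHERE timestamp > dateadd('days', -1, sysdate())
AND record_type = 'LOG'
AND resource_attributes:"k8s.namespace.name" LIKE 'runtime-%'
AND resource_attributes:"k8s.container.name" LIKE '%-server'
AND parse_json(value::string):level = 'ERROR'
GROUP BY runtime_key
ORDER BY error_count DESC;
```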

Find “caused by” exceptions in the logs

These exceptions can be expected for intermittent connection issues, data incompatibilities, or related causes.

SELECT
    timestamp,
    RESOURCE_ATTRIBUTES:"k8s.namespace.name" AS Namespace,
    RESOURCE_ATTRIBUTES:"k8s.pod.name" AS Pod,
    RESOURCE_ATTRIBUTES:"k8s.container.name" AS Container,
    value
FROM openflow.telemetry.EVENTS_<account-id>
WHERE true
AND record_type = 'LOG'
AND timestamp > dateadd(minute, -5, sysdate())
AND value LIKE '%Caused By%'
ORDER BY timestamp desc
LIMIT 10;
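To see which pods produce these exceptions most often, you can aggregate instead of listing individual records. This is a sketch that groups the same filter by pod:

```sql
SELECT
    RESOURCE_ATTRIBUTES:"k8s.pod.name" AS Pod,
    COUNT(*) AS exception_count
FROM openflow.telemetry.EVENTS_<account-id>
WHERE record_type = 'LOG'
AND timestamp > dateadd(hour, -1, sysdate())
AND value LIKE '%Caused By%'
GROUP BY Pod
ORDER BY exception_count DESC;
```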

Find which processors are running, have stopped, or are in other states

SELECT
    timestamp,
    RECORD_ATTRIBUTES:component AS Processor,
    RECORD_ATTRIBUTES:id AS Processor_ID,
    value AS Running
FROM openflow.telemetry.EVENTS_<account-id>
WHERE true
AND RECORD:metric:name = 'processor.run.status.running'
AND RECORD_TYPE='METRIC'
AND timestamp > dateadd(minute, -5, sysdate());
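The query above returns the run status for every processor. Assuming the metric value is 1 for a running processor and 0 otherwise (an assumption based on the metric name; verify against your own data), a variant can surface only processors that are not running:

```sql
-- Sketch: list processors whose most recent run-status metric is 0 (not running).
SELECT
    timestamp,
    RECORD_ATTRIBUTES:component AS Processor,
    RECORD_ATTRIBUTES:id AS Processor_ID
FROM openflow.telemetry.EVENTS_<account-id>
WHERE RECORD:metric:name = 'processor.run.status.running'
AND RECORD_TYPE = 'METRIC'
AND timestamp > dateadd(minute, -5, sysdate())
AND value::number = 0;
```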

Find high CPU usage for runtimes

Slow data flows or reduced throughput may be the result of a CPU bottleneck. Runtimes scale automatically between the minimum and maximum number of nodes you have configured. If your runtimes are already using their maximum number of nodes and CPU usage remains high, increase the maximum number of nodes allocated to them, or troubleshoot the connector or flow.

SELECT
    timestamp,
    RESOURCE_ATTRIBUTES:"k8s.namespace.name" AS Namespace,
    RESOURCE_ATTRIBUTES:"k8s.pod.name" AS Pod,
    RESOURCE_ATTRIBUTES:"k8s.container.name" AS Container,
    value AS CPU_Usage
FROM openflow.telemetry.EVENTS_<account-id>
WHERE TIMESTAMP > dateadd(minute, -1, sysdate())
AND RECORD_TYPE = 'METRIC'
AND RECORD:metric:name ilike 'container.cpu.usage'
AND RESOURCE_ATTRIBUTES:"k8s.namespace.name" ilike 'runtime%'
AND RESOURCE_ATTRIBUTES:"k8s.container.name" ilike '%server'
AND RESOURCE_ATTRIBUTES:"k8s.namespace.name" NOT IN ('runtime-infra', 'runtime-operator')
ORDER BY TIMESTAMP desc, CPU_Usage desc;
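To smooth out short spikes, you can average the same metric per pod over a longer window. This sketch (not from the source) casts the variant VALUE column to a float for aggregation:

```sql
SELECT
    RESOURCE_ATTRIBUTES:"k8s.pod.name" AS Pod,
    AVG(value::float) AS Avg_CPU_Usage
FROM openflow.telemetry.EVENTS_<account-id>
WHERE timestamp > dateadd(minute, -15, sysdate())
AND RECORD_TYPE = 'METRIC'
AND RECORD:metric:name ilike 'container.cpu.usage'
AND RESOURCE_ATTRIBUTES:"k8s.namespace.name" ilike 'runtime%'
AND RESOURCE_ATTRIBUTES:"k8s.namespace.name" NOT IN ('runtime-infra', 'runtime-operator')
GROUP BY Pod
ORDER BY Avg_CPU_Usage DESC;
```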

Available metrics

Metrics available in runtimes

The following is a list of available metrics for runtimes:

  • cores.load (unit: percentage; type: gauge): Average load across all cores available to the runtime. The maximum value is 1, when all available cores are fully used.

  • cores.available (unit: CPU cores; type: gauge): Number of CPU cores available to the runtime.

  • storage.free (unit: bytes; type: gauge): Amount of free storage available to the runtime, reported per storage type. There are three storage types: flowfile, content, and provenance. You can view the storage.type attribute in the RECORD_ATTRIBUTES column.

  • storage.used (unit: bytes; type: gauge): Amount of storage used, reported per storage type. There are three storage types: flowfile, content, and provenance. You can view the storage.type attribute in the RECORD_ATTRIBUTES column.

Sample query for CPU metrics

SELECT *
FROM events
WHERE true
AND record_type = 'METRIC'
AND resource_attributes:application = 'openflow'
AND record:metric.name IN ('cores.load', 'cores.available')
ORDER BY timestamp desc
LIMIT 1000;
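The storage metrics can be queried in the same way. This sketch assumes the per-type attribute key in RECORD_ATTRIBUTES is storage.type, as described above; inspect a sample row to confirm before relying on it:

```sql
SELECT
    timestamp,
    record:metric.name AS metric,
    record_attributes:"storage.type" AS storage_type,
    value
FROM events
WHERE record_type = 'METRIC'
AND resource_attributes:application = 'openflow'
AND record:metric.name IN ('storage.free', 'storage.used')
ORDER BY timestamp desc
LIMIT 1000;
```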

Metrics available in connectors

The following is a list of available metrics for connectors:

  • processgroup.bytes.received (unit: bytes; type: gauge): Average number of bytes consumed from the source.

  • processgroup.bytes.sent (unit: bytes; type: gauge): Average number of bytes written to the destination.

To query these metrics from the event table, find the process group name and ID on the Openflow runtime canvas, and then filter on them in the RECORD_ATTRIBUTES column.
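For example, a query along these lines retrieves recent connector throughput metrics; the exact attribute key that holds the process group name or ID may vary, so this sketch returns the full RECORD_ATTRIBUTES column for inspection rather than assuming a key:

```sql
SELECT
    timestamp,
    record_attributes,
    record:metric:name AS metric,
    value AS bytes
FROM openflow.telemetry.EVENTS_<account-id>
WHERE record_type = 'METRIC'
AND record:metric:name IN ('processgroup.bytes.received', 'processgroup.bytes.sent')
AND timestamp > dateadd(hour, -1, sysdate())
ORDER BY timestamp desc;
```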