Snowflake AI Observability Reference

This document provides a comprehensive reference for using Snowflake Cortex AI Observability to evaluate and monitor the performance of your generative AI applications.

It covers the following concepts:

  • Datasets and attributes

  • Evaluation metrics

  • Runs

  • Access control and storage

Datasets and attributes

A dataset is a set of inputs that you use to test the application. It can also contain a set of expected outputs (the ground truth).

You can use the TruLens Python SDK to specify the dataset as either a Snowflake table or a pandas dataframe. Each column in the dataset must be mapped to one of the following reserved attributes:

Reserved attributes

  • RECORD_ROOT.INPUT: Input prompt to the LLM. Type: string.

  • RECORD_ROOT.INPUT_ID: Unique identifier for the input prompt. If you don’t provide an input ID, an ID is automatically generated and assigned to each input. Type: string.

  • RETRIEVAL.QUERY_TEXT: User query for a RAG application. Type: string.

  • RECORD_ROOT.GROUND_TRUTH_OUTPUT: Expected response for the input prompt. Type: string.
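
For example, the following is a minimal sketch of a dataset specified as a pandas dataframe. The column names (QUERY and EXPECTED_ANSWER) are arbitrary examples; they are mapped to RECORD_ROOT.INPUT and RECORD_ROOT.GROUND_TRUTH_OUTPUT when you configure the run (see Runs).

    import pandas as pd

    # Hypothetical evaluation dataset; the column names are placeholders that
    # are mapped to reserved attributes when the run is configured.
    dataset = pd.DataFrame(
        {
            "QUERY": [
                "What is AI Observability?",
                "How is usage cost calculated?",
            ],
            "EXPECTED_ANSWER": [
                "It lets you evaluate and monitor generative AI applications.",
                "Cost is based on prompt and completion tokens per Cortex LLM call.",
            ],
        }
    )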

To instrument the application, you must map the input and output parameters of the instrumented function (or method) to the relevant input and output attributes. Use the @instrument decorator to map the parameters so that the metrics can be computed. In addition to the input attributes specified as part of the dataset, you can use the following output attributes to instrument the relevant functions (see the sketch after the table):

Output attributes

  • RETRIEVAL.RETRIEVED_CONTEXTS: Contexts retrieved from the search service or retriever. Type: List[string].

  • RECORD_ROOT.OUTPUT: Generated response from the LLM. Type: string.
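
The following is a minimal sketch of instrumenting a simple RAG class, assuming the TruLens OTel-based instrumentation interface (the instrument decorator and the SpanAttributes constants); import paths may vary across TruLens versions, and the retrieval and generation bodies are stubs.

    from trulens.core.otel.instrument import instrument
    from trulens.otel.semconv.trace import SpanAttributes

    class SimpleRAG:
        @instrument(
            span_type=SpanAttributes.SpanType.RETRIEVAL,
            attributes={
                SpanAttributes.RETRIEVAL.QUERY_TEXT: "query",
                SpanAttributes.RETRIEVAL.RETRIEVED_CONTEXTS: "return",
            },
        )
        def retrieve(self, query: str) -> list:
            # Call your retriever or Cortex Search service here (stubbed).
            return ["example context"]

        @instrument(
            span_type=SpanAttributes.SpanType.RECORD_ROOT,
            attributes={
                SpanAttributes.RECORD_ROOT.INPUT: "query",
                SpanAttributes.RECORD_ROOT.OUTPUT: "return",
            },
        )
        def answer(self, query: str) -> str:
            contexts = self.retrieve(query)
            # Call an LLM with the query and retrieved contexts (stubbed).
            return f"Answer based on {len(contexts)} retrieved context(s)."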

Evaluation metrics

Evaluation metrics provide a quantifiable way to measure the accuracy and performance of your application. These metrics are computed using specific inputs to the application, LLM-generated outputs, and any intermediate information (such as retrieved results from a RAG application). You can also compute metrics using a ground truth dataset.

You can compute metrics with the “LLM-as-a-judge” approach, in which an LLM generates a score (between 0 and 1), along with an explanation, for the application’s output based on the provided information. You can select any LLM available in Cortex AI as the judge. If no LLM judge is specified, llama3.1-70b is used as the default judge. AI Observability supports a variety of evaluation metrics.

Context Relevance

Context Relevance determines if the retrieved context from the retriever or the search service is relevant to the user query. Given the user query and retrieved context, an LLM judge is used to determine the relevance of the retrieved context to the query.

Required Attributes:

  • RETRIEVAL.QUERY_TEXT: User query in a RAG or search application

  • RETRIEVAL.RETRIEVED_CONTEXTS: Context retrieved from the search service or retriever

Groundedness

Groundedness determines if the generated response is supported by and grounded in the retrieved context from the retriever or the search service. Given the generated response and retrieved context, an LLM judge is used to determine groundedness. The underlying implementation uses Chain-of-thought reasoning when generating the groundedness scores.

Required Attributes:

  • RETRIEVAL.RETRIEVED_CONTEXTS: Context retrieved from the search service or retriever

  • RECORD_ROOT.OUTPUT: Final response generated by the LLM

Answer Relevance

Answer relevance determines if the generated response is relevant to the user query. Given the user query and generated response, an LLM judge is used to determine how relevant the response is when answering the user’s query. Note that this metric doesn’t rely on a ground truth reference answer, so it is not equivalent to assessing answer correctness.

Required Attributes:

  • RECORD_ROOT.INPUT: User query in a RAG or search application

  • RECORD_ROOT.OUTPUT: Final response generated by the LLM

Correctness

Correctness determines how closely the generated response aligns with the ground truth. A higher correctness score indicates a more accurate response with greater alignment to the ground truth.

Required Attributes:

  • RECORD_ROOT.INPUT: User query or prompt to the LLM

  • RECORD_ROOT.GROUND_TRUTH_OUTPUT: Expected response based on the user query

  • RECORD_ROOT.OUTPUT: Response generated by the LLM

Coherence

Coherence measures whether the generated response is coherent and doesn’t introduce logical gaps, inconsistencies, or contradictions. A higher coherence score indicates a more coherent response.

Required Attributes:

  • RECORD_ROOT.OUTPUT: Response generated by the LLM

Cost and Latency

Usage cost

Cost is calculated for each LLM call that relies on Cortex LLMs, based on the token usage information (prompt_tokens for input and completion_tokens for output) returned by the COMPLETE (SNOWFLAKE.CORTEX) function. As part of the trace information, you can view the token usage and the corresponding costs associated with each LLM call.
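
As a sketch, the usage fields can be inspected directly when calling COMPLETE with the options argument through a Snowpark session; the session setup and model name here are assumptions, and the exact response shape should be verified against the COMPLETE (SNOWFLAKE.CORTEX) documentation.

    import json
    from snowflake.snowpark import Session

    # Assumes a default connection is configured (for example, in connections.toml).
    session = Session.builder.create()

    row = session.sql("""
        SELECT SNOWFLAKE.CORTEX.COMPLETE(
            'llama3.1-70b',
            [{'role': 'user', 'content': 'Summarize AI observability in one sentence.'}],
            {}
        ) AS response
    """).collect()[0]

    # The response includes a usage object with the token counts used for cost.
    usage = json.loads(row["RESPONSE"])["usage"]
    print(usage["prompt_tokens"], usage["completion_tokens"])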

Latency

Latency is determined by measuring the time taken to complete each function call in the application. Application traces provide granular visibility into the latency of each function instrumented using the TruLens SDK. Individual function latencies are aggregated to compute the overall latency of the entire application corresponding to each input. Each run also provides an average latency across all inputs for easy comparison across multiple application configurations.

Runs

A run is an evaluation task used to measure the accuracy and performance of an application. It helps you select the best application configuration. Building a generative AI application involves experimenting with various LLMs, prompts, and inference parameters. You measure their accuracy, latency, and usage to find the optimal combination for production. Each combination corresponds to an application version.

A run uses the dataset that you specify to execute a batch evaluation for an application version. You can trigger multiple runs with the same dataset for different versions. You can compare the aggregated and record-level differences between the versions to identify improvements that you need to make and select the best version to deploy.

Creating and executing a run involves four main steps (see the sketch after this list):

  1. Creation: After creating an application and a version, add a new run for the version by specifying a dataset.

  2. Invocation: Start the run, which reads inputs from the dataset, invokes the application for each input, generates traces, and stores the information in your Snowflake account.

  3. Computation: After invocation, trigger computation by specifying metrics to be computed. You can trigger multiple computations and add new metrics later for an existing run.

  4. Visualization: Visualize the run results in Snowsight by logging into your Snowflake account. Runs are listed within their relevant applications in AI & ML under Evaluations.
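
The following sketch walks through steps 1 through 3 with the TruLens SDK, assuming `session` is an existing Snowpark session and SimpleRAG is the instrumented application from the earlier sketch; class and parameter names follow the Snowflake AI Observability quickstarts and may differ across TruLens versions.

    from trulens.apps.app import TruApp
    from trulens.connectors.snowflake import SnowflakeConnector
    from trulens.core.run import RunConfig

    # Register the application version before adding runs.
    connector = SnowflakeConnector(snowpark_session=session)
    tru_app = TruApp(SimpleRAG(), app_name="rag_app", app_version="v1", connector=connector)

    # Step 1: create a run against an evaluation dataset.
    run_config = RunConfig(
        run_name="experiment_1",
        description="baseline configuration",
        label="baseline",
        dataset_name="EVAL_DATASET",  # Snowflake table holding the inputs
        source_type="TABLE",
        dataset_spec={
            "RECORD_ROOT.INPUT": "QUERY",
            "RECORD_ROOT.GROUND_TRUTH_OUTPUT": "EXPECTED_ANSWER",
        },
    )
    run = tru_app.add_run(run_config=run_config)

    # Step 2: invocation -- reads inputs, calls the app, and stores traces.
    run.start()

    # Step 3: computation -- trigger metric computation for the run.
    run.compute_metrics(["answer_relevance", "correctness", "coherence"])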

You can label each run to group comparable runs across different application versions that use the same dataset. Use the labels to manage and filter the runs.

A run can have one of the following statuses:

Run status

  • CREATED: The run has been created but not started.

  • INVOCATION_IN_PROGRESS: The run invocation is in the process of generating the output and the traces.

  • INVOCATION_COMPLETED: The run invocation completed with all outputs and traces created.

  • INVOCATION_PARTIALLY_COMPLETED: The run invocation is partially completed due to failures in application invocation and trace generation.

  • COMPUTATION_IN_PROGRESS: The metric computation is in progress.

  • COMPLETED: The metric computation is completed with detailed outputs and traces.

  • PARTIALLY_COMPLETED: The run is partially completed due to failures during the metric computation.

  • CANCELLED: The run has been cancelled.

Access control and storage

Required privileges

You need the following privileges to use AI Observability (a grant sketch follows this list).

  • To use AI Observability, your role must have the CORTEX_USER database role. The CORTEX_USER role is required for database functions. For information on granting and revoking this role, see Required privileges.

  • To register an application, your role must have CREATE EXTERNAL AGENT privileges on the schema. For more information, see Applications.

  • To create and execute runs, your role must have the USAGE privilege on the EXTERNAL AGENT object representing the application, and the AI_OBSERVABILITY_EVENTS_LOOKUP or AI_OBSERVABILITY_ADMIN application role. For more information, see Runs and Observability data.
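
A sketch of the corresponding grants, issued through a Snowpark session: the role, database, schema, and agent names are placeholders, and the exact GRANT syntax (particularly for application roles) should be verified against the Snowflake access control documentation.

    from snowflake.snowpark import Session

    # Assumes an administrative connection with privileges to issue these grants.
    admin_session = Session.builder.create()

    for stmt in [
        "GRANT DATABASE ROLE SNOWFLAKE.CORTEX_USER TO ROLE eval_role",
        "GRANT CREATE EXTERNAL AGENT ON SCHEMA my_db.my_schema TO ROLE eval_role",
        "GRANT USAGE ON EXTERNAL AGENT my_db.my_schema.rag_app TO ROLE eval_role",
        "GRANT APPLICATION ROLE SNOWFLAKE.AI_OBSERVABILITY_EVENTS_LOOKUP TO ROLE eval_role",
    ]:
        admin_session.sql(stmt).collect()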

Applications

Creating an application for evaluation creates an EXTERNAL AGENT object to represent the application in Snowflake. A role used to create an application must have the following privileges at a minimum:

  • OWNERSHIP on the External Agent object: OWNERSHIP is a special privilege on an object that is automatically granted to the role that created the object, but it can also be transferred to a different role by the owning role (or any role with the MANAGE GRANTS privilege) using the GRANT OWNERSHIP command.

  • CREATE EXTERNAL AGENT on the schema: The USAGE privilege on the parent database and schema is also required to perform operations on any object in a schema.

Modifying and deleting the application require the OWNERSHIP privilege on the EXTERNAL AGENT object.

If a user’s role has USAGE or OWNERSHIP privileges on an application (EXTERNAL AGENT), the application appears in Evaluations under AI & ML within Snowsight.

Runs

A role used to add, modify, or delete a run for an application must have the USAGE privilege on the EXTERNAL AGENT object representing the application in Snowflake.

Deleting a run deletes the metadata associated with the run. The records created as part of the run are not deleted and remain stored. See Observability data for more information on how records and traces are stored.

For instructions on creating a custom role with a specified set of privileges, see Creating custom roles. For general information about roles and privilege grants for performing SQL actions on securable objects, see Overview of Access Control.

LLMs as judges

AI Observability uses Cortex LLMs as judges to compute the metrics for evaluating your applications. To successfully compute these metrics, you need permissions to access Cortex LLMs. To grant user roles access to Cortex LLMs, see Required privileges. The user must have access to the model configured as the LLM judge. The default LLM judge is llama3.1-70b, which is subject to change in the future.

Observability data

AI Observability data consists of records containing inputs, outputs, evaluation scores, and associated traces for your generative AI applications. All records are stored in a dedicated event table, AI_OBSERVABILITY_EVENTS, under the SNOWFLAKE.LOCAL schema in your account.

AI observability data ingested into the event table cannot be modified. Administrators with the AI_OBSERVABILITY_ADMIN application role have exclusive access to delete the data in the SNOWFLAKE.LOCAL.AI_OBSERVABILITY_EVENTS event table.

AI observability data can be accessed using the TruLens Python SDK or Snowsight. The following privileges are required to view the records for an application and associated runs:

  • The user role must have the application role SNOWFLAKE.AI_OBSERVABILITY_ADMIN or SNOWFLAKE.AI_OBSERVABILITY_EVENTS_LOOKUP.

  • The user role must have the USAGE privilege on the EXTERNAL AGENT object that represents the application.

For example, to view the runs for an externally instrumented RAG application, the user role requires the USAGE privilege on “my-db.my-schema.rag-application1”, where rag-application1 is the EXTERNAL AGENT object that represents the external RAG application in Snowflake.
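
As a sketch, a role holding the privileges above can read the raw records through Snowpark; the column set of the event table isn’t enumerated here, so the query simply samples rows.

    # Assumes `session` is a Snowpark session whose role has the required
    # application role and USAGE on the relevant EXTERNAL AGENT object.
    records = session.sql(
        "SELECT * FROM SNOWFLAKE.LOCAL.AI_OBSERVABILITY_EVENTS LIMIT 10"
    ).to_pandas()
    print(records.head())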

Information associated with runs and external agents (such as the run name, description, and dataset name) is classified as metadata.