Snowflake AI Observability Reference¶
This document provides a comprehensive reference for using Snowflake Cortex AI Observability to evaluate and monitor the performance of your generative AI applications.
It covers the following concepts:
Dataset and attributes
Evaluation metrics
Runs
Access control and storage
Dataset and attributes¶
A dataset is a set of inputs that you use to test the application. It can also contain a set of expected outputs (the ground truth).
You can use the TruLens Python SDK to specify the dataset as either a Snowflake table or a pandas dataframe. Each column in the dataset must be mapped to one of the following reserved attributes:
| Input attribute | Description |
|---|---|
| RECORD_ROOT.INPUT | Input prompt to the LLM. Type: string |
| RECORD_ROOT.INPUT_ID | Unique identifier for the input prompt. If you don’t provide an input ID, an ID is automatically generated and assigned to each input. Type: string |
| RETRIEVAL.QUERY_TEXT | User query for a RAG application. Type: string |
| RECORD_ROOT.GROUND_TRUTH_OUTPUT | Expected response for the input prompt. Type: string |
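For illustration, the following is a minimal sketch of a dataset prepared as a pandas DataFrame, with its columns mapped to the reserved input attributes above. The column names and the dataset_spec mapping are hypothetical; the mapping itself is supplied later when a run is configured (see Runs), and the exact key format expected by the SDK may vary by version.

```python
import pandas as pd

# Hypothetical evaluation dataset; the column names are arbitrary.
dataset = pd.DataFrame(
    {
        "QUERY": [
            "How do I create a virtual warehouse?",
            "What is a Snowflake stage?",
        ],
        "EXPECTED_ANSWER": [
            "Use the CREATE WAREHOUSE command ...",
            "A stage is a location used for loading and unloading data ...",
        ],
    }
)

# Mapping of reserved input attributes to dataset columns. This dictionary is
# passed as the dataset specification when a run is created (see the Runs section).
dataset_spec = {
    "RECORD_ROOT.INPUT": "QUERY",
    "RECORD_ROOT.GROUND_TRUTH_OUTPUT": "EXPECTED_ANSWER",
}
```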
To instrument the application, you must map the input and output parameters of each instrumented function (or method) to the relevant input and output attributes. Use the @instrument decorator to map the parameters and compute the metrics. In addition to the input attributes specified as part of the dataset, you can use the following output attributes to instrument the relevant functions (a sketch follows the table):
| Output attribute | Description |
|---|---|
| RETRIEVAL.RETRIEVED_CONTEXTS | Contexts retrieved from the search service or retriever. Type: List[string] |
| RECORD_ROOT.OUTPUT | Generated response from the LLM. Type: string |
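The following is a minimal sketch of instrumenting a simple RAG application with the TruLens SDK, mapping function parameters and return values to the attributes above. It assumes the OpenTelemetry-based @instrument decorator and the SpanAttributes semantic conventions from recent TruLens releases; the import paths may differ by SDK version, and the retrieval and generation bodies are placeholders rather than a working retriever or LLM.

```python
from trulens.core.otel.instrument import instrument
from trulens.otel.semconv.trace import SpanAttributes


class RagApplication:
    @instrument(
        span_type=SpanAttributes.SpanType.RETRIEVAL,
        attributes={
            # Map the "query" parameter and the return value to retrieval attributes.
            SpanAttributes.RETRIEVAL.QUERY_TEXT: "query",
            SpanAttributes.RETRIEVAL.RETRIEVED_CONTEXTS: "return",
        },
    )
    def retrieve(self, query: str) -> list:
        # Placeholder retriever; a real application would call a search service here.
        return ["<retrieved context>"]

    @instrument(
        span_type=SpanAttributes.SpanType.RECORD_ROOT,
        attributes={
            # Map the user query and the generated response to the record root.
            SpanAttributes.RECORD_ROOT.INPUT: "query",
            SpanAttributes.RECORD_ROOT.OUTPUT: "return",
        },
    )
    def answer_query(self, query: str) -> str:
        contexts = self.retrieve(query)
        # Placeholder generation step; a real application would call an LLM here.
        return f"Answer based on {len(contexts)} retrieved context(s)."
```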
Evaluation metrics¶
Evaluation metrics provide a quantifiable way to measure the accuracy and performance of your application. These metrics are computed using specific inputs to the application, LLM-generated outputs, and any intermediate information (such as retrieved results from a RAG application). You can also compute metrics using a ground truth dataset.
You can compute metrics with the “LLM-as-a-judge” approach. With this approach, an LLM generates a score (between 0 and 1), along with an explanation, for the application’s output based on the provided information. You can select any LLM available in Cortex AI as the judge. If no LLM judge is specified, llama3.1-70b is used as the default judge. AI Observability supports a variety of evaluation metrics.
Context Relevance¶
Context Relevance determines if the retrieved context from the retriever or the search service is relevant to the user query. Given the user query and retrieved context, an LLM judge is used to determine relevance of the retrieved context based on the query.
Required Attributes:
RETRIEVAL.QUERY_TEXT: User query in a RAG or search application
RETRIEVAL.RETRIEVED_CONTEXTS: Context retrieved from the search service or retriever
Groundedness¶
Groundedness determines if the generated response is supported by and grounded in the retrieved context from the retriever or the search service. Given the generated response and retrieved context, an LLM judge is used to determine groundedness. The underlying implementation uses Chain-of-thought reasoning when generating the groundedness scores.
Required Attributes:
RETRIEVAL.RETRIEVED_CONTEXTS: Context retrieved from the search service or retriever
RECORD_ROOT.OUTPUT: Final response generated by the LLM
Answer Relevance¶
Answer relevance determines if the generated response is relevant to the user query. Given the user query and generated response, an LLM judge is used to determine how relevant the response is when answering the user’s query. Note that this metric does not rely on a ground-truth reference answer, so it is not equivalent to assessing answer correctness.
Required Attributes:
RECORD_ROOT.INPUT: User query in a RAG or search application
RECORD_ROOT.OUTPUT: Final response generated by the LLM
Correctness¶
Correctness determines how closely the generated response aligns with the ground truth. A higher correctness score indicates a more accurate response with greater alignment with the ground truth.
Required Attributes:
RECORD_ROOT.INPUT: User query or prompt to the LLM
RECORD_ROOT.GROUND_TRUTH_OUTPUT: Expected response based on the user query
RECORD_ROOT.OUTPUT: Response generated by the LLM
Coherence¶
Coherence measures if the generated response of the model is coherent and doesn’t introduce logical gaps, inconsistencies or contradictions. A higher coherence score indicates a highly coherent response.
Required Attributes:
RECORD_ROOT.OUTPUT: Response generated by the LLM
Cost and Latency¶
Usage cost¶
Cost is calculated for each LLM call that uses Cortex LLMs, based on the token usage information (prompt_tokens for input and completion_tokens for output) returned by the COMPLETE (SNOWFLAKE.CORTEX) function. As part of the trace information, you can view the token usage and the corresponding costs associated with each LLM call.
Latency¶
Latency is determined by measuring the time taken to complete each function call in the application. Application traces provide granular visibility into the latency of each function instrumented using the TruLens SDK. Individual function latencies are aggregated to compute the overall latency of the entire application corresponding to each input. Each run also provides an average latency across all inputs for easy comparison across multiple application configurations.
Runs¶
A run is an evaluation task used to measure the accuracy and performance of an application. It helps you select the best application configuration. Building a generative AI application involves experimenting with various LLMs, prompts, and inference parameters. You measure their accuracy, latency, and usage to find the optimal combination for production. Each combination corresponds to an application version.
A run uses the dataset that you specify to execute a batch evaluation for an application version. You can trigger multiple runs with the same dataset for different versions. You can compare the aggregated and record-level differences between the versions to identify improvements that you need to make and select the best version to deploy.
Creating and executing a run involves four main steps (see the sketch after this list):
Creation: After creating an application and a version, add a new run for the version by specifying a dataset.
Invocation: Start the run, which reads inputs from the dataset, invokes the application for each input, generates traces, and stores the information in your Snowflake account.
Computation: After invocation, trigger computation by specifying metrics to be computed. You can trigger multiple computations and add new metrics later for an existing run.
Visualization: Visualize the run results in Snowsight by logging into your Snowflake account. Runs are listed within their relevant applications in AI & ML under Evaluations.
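The following is a minimal end-to-end sketch of these steps using the TruLens SDK, reusing the instrumented RagApplication class from the earlier sketch. The connection parameters, object names, dataset name, and metric identifiers are hypothetical, and the exact class and parameter names may vary by SDK version; visualization then happens in Snowsight rather than in code.

```python
from snowflake.snowpark import Session
from trulens.apps.app import TruApp
from trulens.connectors.snowflake import SnowflakeConnector
from trulens.core.run import RunConfig

# Hypothetical connection parameters.
session = Session.builder.configs(
    {
        "account": "<account>",
        "user": "<user>",
        "password": "<password>",
        "database": "MY_DB",
        "schema": "MY_SCHEMA",
        "warehouse": "MY_WH",
        "role": "MY_ROLE",
    }
).create()
connector = SnowflakeConnector(snowpark_session=session)

# Register the application version (backed by an EXTERNAL AGENT object).
app = RagApplication()  # instrumented as shown earlier
tru_app = TruApp(app, app_name="rag_application", app_version="v1", connector=connector)

# 1. Creation: add a run for this version, pointing at a dataset (a table here).
run_config = RunConfig(
    run_name="experiment_1",
    description="baseline configuration",
    label="baseline",
    dataset_name="MY_DB.MY_SCHEMA.EVAL_DATASET",
    source_type="TABLE",
    dataset_spec={
        "RECORD_ROOT.INPUT": "QUERY",
        "RECORD_ROOT.GROUND_TRUTH_OUTPUT": "EXPECTED_ANSWER",
    },
)
run = tru_app.add_run(run_config=run_config)

# 2. Invocation: read inputs from the dataset, invoke the application, store traces.
run.start()

# 3. Computation: request the metrics to compute for this run.
run.compute_metrics(
    metrics=["context_relevance", "groundedness", "answer_relevance", "correctness"]
)

# 4. Visualization: open Evaluations under AI & ML in Snowsight to inspect the results.
```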
You can label each run to categorize comparable runs between different application versions with the same dataset. Use the labels to manage and filter the runs.
A run can have one of the following statuses:
| Status | Description |
|---|---|
| CREATED | The run has been created but not started. |
| INVOCATION_IN_PROGRESS | The run invocation is in the process of generating the outputs and traces. |
| INVOCATION_COMPLETED | The run invocation completed with all outputs and traces created. |
| INVOCATION_PARTIALLY_COMPLETED | The run invocation is partially completed due to failures in application invocation and trace generation. |
| COMPUTATION_IN_PROGRESS | The metric computation is in progress. |
| COMPLETED | The metric computation is completed with detailed outputs and traces. |
| PARTIALLY_COMPLETED | The run is partially completed due to failures during the metric computation. |
| CANCELLED | The run has been cancelled. |
Access control and storage¶
Required privileges¶
You need the following privileges to use AI Observability.
To use AI Observability, your role must have the CORTEX_USER database role, which grants access to the Cortex LLM functions used for evaluation. For information on granting and revoking this role, see Required privileges.
To register an application, your role must have the CREATE EXTERNAL AGENT privilege on the schema. For more information, see Applications.
To create and execute runs, your role must have the USAGE privilege on the EXTERNAL AGENT object representing the application, and the AI_OBSERVABILITY_EVENTS_LOOKUP or AI_OBSERVABILITY_ADMIN application role. For more information, see Runs and Observability data.
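As an illustration of these grants, here is a hedged sketch that issues them through a Snowpark session. The role, database, schema, and application names are hypothetical, and the exact GRANT syntax for EXTERNAL AGENT objects and the SNOWFLAKE application roles should be confirmed against the linked references.

```python
from snowflake.snowpark import Session

# Hypothetical connection parameters for an administrator role allowed to manage grants.
session = Session.builder.configs(
    {"account": "<account>", "user": "<user>", "password": "<password>", "role": "ACCOUNTADMIN"}
).create()

# Cortex LLM access for the evaluating role (hypothetical role name MY_ROLE).
session.sql("GRANT DATABASE ROLE SNOWFLAKE.CORTEX_USER TO ROLE MY_ROLE").collect()

# Allow MY_ROLE to register applications in the schema.
session.sql("GRANT CREATE EXTERNAL AGENT ON SCHEMA MY_DB.MY_SCHEMA TO ROLE MY_ROLE").collect()

# Allow MY_ROLE to create and execute runs for an existing application and to read
# the observability records it produces.
session.sql("GRANT USAGE ON EXTERNAL AGENT MY_DB.MY_SCHEMA.MY_APP TO ROLE MY_ROLE").collect()
session.sql(
    "GRANT APPLICATION ROLE SNOWFLAKE.AI_OBSERVABILITY_EVENTS_LOOKUP TO ROLE MY_ROLE"
).collect()
```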
Applications¶
Creating an application for evaluation creates an EXTERNAL AGENT object to represent the application in Snowflake. Creating and modifying an application is subject to the following access control requirements.
A role used to create an application must have the following privileges at a minimum:
| Privilege | Object | Notes |
|---|---|---|
| OWNERSHIP | External Agent | OWNERSHIP is a special privilege on an object that is automatically granted to the role that created the object, but can also be transferred using the GRANT OWNERSHIP command to a different role by the owning role (or any role with the MANAGE GRANTS privilege). |
| CREATE EXTERNAL AGENT | Schema | |

The USAGE privilege on the parent database and schema is required to perform operations on any object in a schema.
Modifying and deleting the application require OWNERSHIP privileges on the EXTERNAL AGENT object.
If a user’s role has USAGE or OWNERSHIP privileges on an application (EXTERNAL AGENT), the application appears in Evaluations under AI & ML within Snowsight.
Runs¶
A role used to add, modify, or delete a run for an application must have the USAGE privilege on the EXTERNAL AGENT object representing the application in Snowflake.
Deleting a run deletes the metadata associated with the run. The records created as part of the run are not deleted and remain stored. See Observability data for more information on the storage of records and traces.
For instructions on creating a custom role with a specified set of privileges, see Creating custom roles. For general information about roles and privilege grants for performing SQL actions on securable objects, see Overview of Access Control.
LLMs as judges¶
AI Observability uses Cortex LLMs as judges to compute the metrics for evaluating your applications. To successfully compute these metrics, you need permissions to access Cortex LLMs. To grant user roles access to Cortex LLMs, see Required privileges. The user must have access to the model configured as the LLM judge. The default LLM judge model is llama3.1-70b and is subject to change in the future.
Observability data¶
AI Observability data consists of records containing inputs, outputs, evaluation scores, and associated traces for your generative AI applications. All records are stored in a dedicated event table, AI_OBSERVABILITY_EVENTS, in your account under the SNOWFLAKE.LOCAL schema.
AI observability data ingested into the event table cannot be modified. Administrators with the AI_OBSERVABILITY_ADMIN application role have exclusive access to delete the data in the SNOWFLAKE.LOCAL.AI_OBSERVABILITY_EVENTS event table.
AI Observability data can be accessed using the TruLens Python SDK or Snowsight. The following privileges are required to view the records for an application and its associated runs:
The user role must have the SNOWFLAKE.AI_OBSERVABILITY_ADMIN or SNOWFLAKE.AI_OBSERVABILITY_EVENTS_LOOKUP application role.
The user role must have the USAGE privilege on the EXTERNAL AGENT object that represents the application.
For example, to view the runs for an externally instrumented RAG application, the user role requires the USAGE privilege on “my-db.my-schema.rag-application1”, where rag-application1 is the EXTERNAL AGENT object that represents the external RAG application in Snowflake.
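For example, a role with these privileges can inspect recent observability records directly through a Snowpark session. This is a minimal sketch, assuming the table name stated above; the event table columns follow the standard Snowflake event table schema and are not listed in this document.

```python
from snowflake.snowpark import Session

# Hypothetical connection parameters for a role with the privileges described above.
session = Session.builder.configs(
    {"account": "<account>", "user": "<user>", "password": "<password>", "role": "MY_ROLE"}
).create()

# Fetch a handful of observability records for inspection.
for row in session.sql(
    "SELECT * FROM SNOWFLAKE.LOCAL.AI_OBSERVABILITY_EVENTS LIMIT 10"
).collect():
    print(row)
```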
Information associated with runs and external agents (such as the run name, description, and dataset name) is classified as metadata.