AI Observability in Snowflake Cortex¶
Use AI Observability in Snowflake Cortex to evaluate and trace your generative AI applications. Run systematic evaluations to measure application performance, use the results to iterate on your application configurations and optimize performance, and log application traces for debugging. Benchmarking performance in this way makes your applications more trustworthy and transparent and gives you greater confidence in production deployments.
AI Observability has the following features:
Evaluations: Use AI Observability to systematically evaluate the performance of your generative AI applications and agents using the LLM-as-a-judge technique. You can use metrics, such as accuracy, latency, usage, and cost, to quickly iterate on your application configurations and optimize performance.
Comparison: Compare multiple evaluations side by side and assess the quality and accuracy of responses. You can analyze the responses across different LLMs, prompts, and inference configurations to identify the best configuration for production deployments.
Tracing: Trace every step of application executions across input prompts, retrieved context, tool use, and LLM inference. Use it to debug individual records and refine the app for accuracy, latency, and cost.
AI Observability can be used to evaluate a variety of task types, such as retrieval-augmented generation (RAG) and summarization. For example, the context relevance score can help you detect the quality of the search results retrieval corresponding to a user query. You can use the answer relevance and groundedness scores to detect the truthfulness and relevance of the final response based on the retrieved context.
For summarization, you can measure the factual correctness and comprehensiveness of LLM-generated summaries against the original input, which helps you avoid prompts and LLMs that hallucinate more frequently in your generative AI applications.
To get started, learn about the Key concepts, and then take a quick walkthrough with the AI Observability Tutorial. You can then use the information in Evaluate AI applications for an in-depth walkthrough.
To review a specific concept, see the Snowflake AI Observability Reference.
Access control and prerequisites¶
Before you start using AI Observability:
Ensure that the user role has the following roles and privileges granted (for more information, see Required privileges):
CORTEX_USER database role
AI_OBSERVABILITY_EVENTS_LOOKUP application role
CREATE EXTERNAL AGENT privilege on the schema
Install the following TruLens Python packages in your Python project:
trulens-core
trulens-connectors-snowflake
trulens-providers-cortex
Before you connect to Snowflake, set the TRULENS_OTEL_TRACING environment variable to 1 in your Python environment (see the setup sketch after this list).
TruLens is the platform that Snowflake uses to track your applications. For more information, see the TruLens documentation.
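The following is a minimal setup sketch that covers these prerequisites. The connection parameters are placeholders, and the pip command is shown only as a comment; adapt both to your environment.

```python
# Install the TruLens packages first, for example:
#   pip install trulens-core trulens-connectors-snowflake trulens-providers-cortex

import os

# Enable OpenTelemetry-based tracing before connecting to Snowflake.
os.environ["TRULENS_OTEL_TRACING"] = "1"

from snowflake.snowpark import Session

# Placeholder connection parameters; replace with your own account details.
connection_parameters = {
    "account": "<account_identifier>",
    "user": "<user>",
    "password": "<password>",
    "role": "<role_with_required_grants>",
    "warehouse": "<warehouse>",
    "database": "<database>",
    "schema": "<schema>",
}
session = Session.builder.configs(connection_parameters).create()
```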
Key concepts¶
Applications¶
An application is an end-to-end generative AI application that is designed using multiple components such as LLMs, tools (such as search retrievers or APIs), and additional custom logic. For example, an application can contain a RAG pipeline with retrievers, re-rankers, and LLMs chained together. You can enable AI Observability for applications running in any environment (such as Snowflake, the cloud, or on-premises). Applications are represented in Snowflake as the EXTERNAL AGENT object type. For information about registering a generative AI application, see Evaluate AI applications.
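As an illustration, the sketch below instruments a hypothetical application and registers it with TruLens, which creates the corresponding EXTERNAL AGENT object in Snowflake. It assumes the Snowpark session from the setup sketch above; module paths such as trulens.apps.app.TruApp and trulens.core.otel.instrument.instrument reflect recent TruLens releases and may differ in your version, so treat this as a sketch of the registration flow rather than a definitive API reference.

```python
from trulens.apps.app import TruApp
from trulens.connectors.snowflake import SnowflakeConnector
from trulens.core.otel.instrument import instrument


class RagApplication:
    """Hypothetical RAG pipeline; replace with your own retrieval and generation logic."""

    @instrument()
    def answer_query(self, query: str) -> str:
        context = self.retrieve(query)        # e.g., a Cortex Search call
        return self.generate(query, context)  # e.g., an LLM completion call

    def retrieve(self, query: str) -> list:
        ...

    def generate(self, query: str, context: list) -> str:
        ...


# Store traces and evaluation results in Snowflake via the Snowpark session.
connector = SnowflakeConnector(snowpark_session=session)

# Registering the application creates the corresponding EXTERNAL AGENT object.
app = RagApplication()
tru_app = TruApp(
    app,
    app_name="rag_app",    # the EXTERNAL AGENT name
    app_version="v1",      # one of potentially many versions of this application
    connector=connector,
)
```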
Versions¶
Applications can have multiple versions. Each version represents a different implementation. For example, versions can represent different retrievers, prompts, LLMs, or inference configurations.
Dataset¶
A dataset represents a set of inputs. You can configure it to also represent a set of expected outputs (the ground truth) to test the application. Using the dataset, you can invoke the application to do the following tasks:
Generate the output.
Capture the traces.
Compute evaluation metrics.
You can use a dataset containing both the inputs and the generated outputs to compute the evaluation metrics without invoking the application. For a list of fields supported in the dataset, see Dataset and attributes.
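For illustration, the following sketch stages a small evaluation dataset as a Snowflake table using Snowpark, assuming the session from the setup sketch above. The table name, column names, and sample rows are hypothetical; you map the columns to dataset attributes when you configure a run.

```python
# Hypothetical evaluation dataset: one row per test input, with optional ground truth.
rows = [
    ("How do I enable tracing?", "Set TRULENS_OTEL_TRACING to 1 before connecting."),
    ("Which object type represents an application?", "EXTERNAL AGENT"),
]
df = session.create_dataframe(rows, schema=["QUERY", "GROUND_TRUTH_RESPONSE"])
df.write.save_as_table("RAG_EVAL_DATASET", mode="overwrite")
```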
Runs¶
A run is an evaluation job. It uses the dataset and the application version that you’ve specified to compute evaluation metrics.
A run has an invocation stage and a computation stage. The invocation stage triggers the application to generate the output and corresponding traces. The computation stage computes the evaluation metrics specified for the run. Multiple computations can be performed to add new metrics to an existing run. For the list of statuses associated with the execution of a run, see Runs.
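The sketch below creates and starts a run for the application version and dataset from the earlier sketches. The RunConfig fields, including the dataset_spec column mapping, are assumptions based on recent TruLens releases; see Evaluate AI applications for the exact parameters. The computation stage is shown under Metrics below.

```python
from trulens.core.run import RunConfig

run_config = RunConfig(
    run_name="rag_experiment_1",
    dataset_name="RAG_EVAL_DATASET",   # the table staged above
    source_type="TABLE",
    dataset_spec={
        # Map dataset columns to the input the run feeds to the application
        # and to the ground truth used by the evaluation metrics.
        "input": "QUERY",
        "ground_truth_output": "GROUND_TRUTH_RESPONSE",
    },
)

# Invocation stage: call the application on every dataset row and record traces.
run = tru_app.add_run(run_config=run_config)
run.start()
```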
Metrics¶
Evaluation metrics are scores that you use to assess generative AI application performance based on your own criteria. These metrics use LLMs to grade outputs and provide detailed scoring information. For a comprehensive list of metrics and their definitions, see Evaluation metrics.
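Continuing the run sketch above, the computation stage might look like the following. The metric identifiers are illustrative; see Evaluation metrics for the supported names.

```python
# Computation stage: score the recorded traces with LLM-judged metrics.
# You can call this again later to add new metrics to the same run.
run.compute_metrics(["answer_relevance", "context_relevance", "groundedness"])
```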
Traces¶
Traces are comprehensive records that capture the inputs, outputs, and intermediate steps of the interactions with an LLM application. Traces provide a detailed view of the application’s execution. Use traces to analyze and understand the model’s behavior at each stage. You can compare the traces of different application versions to identify improvements, debug issues, and verify intended performance. For information about accessing traces associated with each record, see Evaluate AI applications.
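As a sketch of span-level tracing, the decorator below annotates a hypothetical retrieval step so that the query and retrieved context are captured in the trace for each record. The SpanAttributes constants and decorator arguments are taken from recent TruLens releases and may differ in your version.

```python
from trulens.core.otel.instrument import instrument
from trulens.otel.semconv.trace import SpanAttributes


class RagApplication:
    # Annotating a sub-step with a span type records its inputs and outputs in the
    # trace, so the retrieved context for each record can be inspected and scored.
    @instrument(
        span_type=SpanAttributes.SpanType.RETRIEVAL,
        attributes={
            SpanAttributes.RETRIEVAL.QUERY_TEXT: "query",
            SpanAttributes.RETRIEVAL.RETRIEVED_CONTEXTS: "return",
        },
    )
    def retrieve(self, query: str) -> list:
        ...
```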
Pricing¶
AI Observability uses LLM judges to compute the evaluation metrics. For server-side evaluations, LLMs on Cortex AI are used as the judges and are invoked through the COMPLETE (SNOWFLAKE.CORTEX) function. You incur charges for these Cortex COMPLETE function calls, and the LLM used to perform the evaluations determines how much you're charged. Additionally, you're charged for the following:
Warehouse charges for queries used to compute evaluation metrics
Storage charges for the evaluation results
Warehouse charges to retrieve the evaluation results to be viewed in Snowsight