Cortex Agent evaluation features in private preview

Important

The main functionality for Cortex Agent evaluation is now generally available (GA). The following metrics remain in private preview:

  • Tool selection accuracy – Whether the agent’s orchestration layer invokes the tools you expect for a given query. A low score points to planning or routing logic that needs improvement. Extra tool calls add cost and latency, while missing calls can lead to incomplete answers.
  • Tool execution accuracy – Whether each tool that runs receives appropriate input from the agent orchestration layer and returns output that meets the specified requirements. A low score points to tool configuration prompts that produce weak tool arguments, or to data, semantic-layer, or implementation issues, depending on the tool.

For information on configuring, starting, and inspecting an agent evaluation run, see Cortex Agent evaluations.

Tool selection accuracy (TSA)

TSA compares the tool names you list in ground truth to the tool names the agent actually invoked. Agents can call tools in parallel or in multiple valid orders, so sequencing is not part of the score.

For each evaluation record, Snowflake counts how many expected tool names are matched to actual invocations (each actual call is matched at most once). The formula is:

matched tools / max(number of expected tool entries, number of actual tool calls)

This single formula penalizes every failure mode: too few calls, too many calls, or the wrong tools.
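
For example, suppose ground truth expects finance_analyst and product_docs_search, but the agent actually calls finance_analyst, web_search, and web_search. One expected tool is matched, so the score is 1 / max(2, 3) ≈ 0.33. The following statement is only an illustrative restatement of that arithmetic (Snowflake computes the metric itself; the tool names above are hypothetical):

-- Worked TSA example: 2 expected tool entries, 3 actual calls, 1 matched.
-- Score = matched / max(expected, actual) = 1 / max(2, 3)
SELECT 1 / GREATEST(2, 3)::FLOAT AS tool_selection_accuracy;  -- ~0.33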

Tool execution accuracy (TEA)

TEA scores input and output quality for the tool calls you describe in ground truth; it does not re-check tool selection.

For each entry in ground_truth_invocations, an LLM judge finds the closest semantic match among the agent’s actual invocations for that tool, such that each tool call is paired at most once. The judge then scores how well the expected input and/or output align with the real invocation’s input and output.

If your ground truth lists tool calls that the agent never made, those tool calls are not matched for TEA; rely on TSA to reflect missing or extra tools relative to your expectations.

tool_input and tool_output are optional. If you omit one or both for an entry, Snowflake evaluates only what you provided. If you enable TEA but supply no input or output anywhere the metric needs them, the metric can report not applicable (NA) for those parts. Use natural language freely: you can paste SQL, JSON, tabular data, or prose. The judge interprets the strings semantically, in the same manner as answer correctness ground truth (see Cortex Agent evaluations).
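
For example, if you only care that the right tool is selected, an entry can carry tool_name alone; TEA then has nothing to score for that entry, while TSA still counts it (the service name here is illustrative):

{
  "tool_name": "product_docs_search"
}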

Prepare an evaluation dataset for tool metrics

Tooling evaluations use an additional key in your dataset’s ground truth column, ground_truth_invocations. The value of this key is an array of JSON objects, one per expected tool-related check. Use the empty array [] when you expect no tools to be called.
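
For example, a record for a query the agent should answer directly, without calling any tools (the answer text is illustrative):

{
  "ground_truth_output": "Our support hours are 9am to 5pm Pacific, Monday through Friday.",
  "ground_truth_invocations": []
}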

Tip

Treat tool_input and tool_output as VARCHAR-style strings in your JSON (plain text). You are not limited to a rigid JSON shape: describe expected parameters, SQL, retrieved facts, or response snippets in whatever form is clearest. The LLM judge uses that text to assess semantic fit against what actually happened.

ground_truth_invocations entry keys

  • tool_name – The name of the tool as exposed to the agent: a Cortex Analyst name, a Cortex Search service name, a custom tool name, or the fixed name web_search for web search. Use the same identifier the agent sees in its traces. Used by TSA (required to define the expected tool call) and TEA (required to match actual tool calls to expected tool calls).
  • tool_input – Optional string. Natural-language or structured text describing what input you expect the agent to pass (for example a paraphrase of the question, a SQL string, or a short JSON blob as text). Used by TEA only; provide it if validating the tool input is important.
  • tool_output – Optional string. Natural-language or structured text describing what you expect the tool to return (result sets, citations, JSON, and so on). Used by TEA only; provide it if validating the tool output is important.

Cortex Analyst – Use the analyst object name your agent calls. In tool_input and tool_output you can include expected SQL, expected row summaries, or both in one string.

{
  "tool_name": "finance_analyst",
  "tool_input": "User wants Q1 revenue; expect a query scoped to REVENUE_V semantic view with date filter Jan-Mar 2025.",
  "tool_output": "SQL should aggregate REVENUE by month. Result should show three rows with totals roughly 1.2M, 1.4M, 1.1M USD."
}

Cortex Search – Use the search tool or service name your agent is configured with (for example the name shown in agent settings, not a generic label unless that is the registered name). Describe what should be retrieved instead of relying on a fixed schema.

{
  "tool_name": "product_docs_search",
  "tool_input": "Query should ask for return policy and warranty text for electronics.",
  "tool_output": "Top sources should include the policy page and mention 30-day returns and 1-year warranty."
}

Web search – The tool name in traces is always web_search and is not configurable. Use that exact value for tool_name. In tool_input and tool_output, describe the query you expect and what kinds of sources or facts should appear in results. For the product feature, see Web search.

{
  "tool_name": "web_search",
  "tool_input": "User wants the current Federal Reserve policy rate and any change in the last six months.",
  "tool_output": "Results should cite recent news or official sources and include a concrete rate or date range."
}

Custom tool – Use the registered tool name. Describe arguments and outcomes in text; embed JSON or key-value prose if that helps your reviewers.

{
  "tool_name": "get_weather",
  "tool_input": "City San Francisco and date August 2, 2019.",
  "tool_output": "Temperature near 14 C and units metric."
}

Full ground truth example

The following example combines ground_truth_output with an expected invocation for a single user query:

{
  "ground_truth_output": "The temperature was 14 degrees Celsius in San Francisco on August 2nd, 2019.",
  "ground_truth_invocations": [
    {
      "tool_name": "get_weather",
      "tool_input": "San Francisco, 2019-08-02",
      "tool_output": "Temp 14 C, conditions clear."
    }
  ]
}

ground_truth_invocations supports a list of multiple expected tool calls for agent queries that invoke multiple tools, as in the following sketch.
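
For example, a query that should trigger both a search retrieval and an analyst query could carry two entries (all names and text here are illustrative):

{
  "ground_truth_output": "Electronics can be returned within 30 days; Q1 2025 refunds totaled roughly 40K USD.",
  "ground_truth_invocations": [
    {
      "tool_name": "product_docs_search",
      "tool_input": "Query should ask for the electronics return policy.",
      "tool_output": "Top sources should mention 30-day returns."
    },
    {
      "tool_name": "finance_analyst",
      "tool_input": "Expect a query for total refunds in Q1 2025.",
      "tool_output": "Result should show a single Q1 2025 refund total."
    }
  ]
}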

Load the dataset into a table

To bring your JSON evaluation dataset into a Snowflake table, use the PARSE_JSON SQL function. The following example creates a table agent_evaluation_data and inserts a row that includes ground_truth_invocations:

CREATE OR REPLACE TABLE agent_evaluation_data (
    input_query VARCHAR,
    ground_truth VARIANT
);

INSERT INTO agent_evaluation_data
  SELECT
    'What was the temperature in San Francisco on August 2nd 2019?',
    PARSE_JSON('
      {
        "ground_truth_output": "The temperature was 14 degrees Celsius in San Francisco on August 2nd, 2019.",
        "ground_truth_invocations": [
            {
              "tool_name": "get_weather",
              "tool_input": "San Francisco, 2019-08-02",
              "tool_output": "Temperature about 14 C, clear."
            }
        ]
      }
    ');
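
As an optional sanity check, a query such as the following (a sketch that assumes the table and row above) confirms that the VARIANT parsed into the expected shape:

-- Inspect a nested field to verify the parsed VARIANT structure.
SELECT
  input_query,
  ground_truth:ground_truth_invocations[0]:tool_name::VARCHAR AS expected_tool
FROM agent_evaluation_data;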

Note

Data you provide in the ground_truth column that is not used by a selected metric is ignored.

Enable tool metrics in YAML and Snowsight

YAML metrics section

Add tool metrics to the top-level metrics sequence in your evaluation configuration file, using the same pattern as other built-in metrics. Include the identifiers tool_selection_accuracy and tool_execution_accuracy.

metrics:
  - "answer_correctness"
  - "tool_selection_accuracy"
  - "tool_execution_accuracy"
  - "logical_consistency"

When either tool metric is enabled, the ground_truth VARIANT column in your dataset must include ground_truth_invocations for the metrics to be computed successfully. Other enabled system metrics still require their usual keys (for example, ground_truth_output for answer correctness).

Snowsight UI (system metric toggles)

When you start an evaluation in Snowsight, follow the main flow in Cortex Agent evaluations until you reach the Select metrics step. Under System metrics, turn on the toggle for Tool selection accuracy and/or Tool execution accuracy when you want those scores computed for the run, the same way you enable Answer correctness or Logical consistency.

You still choose the dataset column that holds ground truth for the run; pick the column whose VARIANT contains ground_truth_output and ground_truth_invocations.