Cortex Agent evaluations

Snowflake Cortex Agent evaluations allow you to monitor your agent’s behavior and performance. Track tool selection and execution, and evaluate your agent against both ground truth-based and reference-free evaluation metrics. During evaluation, your agent’s activity is traced and monitored so you can ensure that each step in the process advances towards your end goal.

Snowflake offers the following metrics to evaluate your agent against:

  • Answer correctness – How closely the agent's answer to your prepared query matches an expected answer. This metric is most useful when the dataset powering your Cortex Agent is static.

  • Tool selection accuracy – Whether or not the agent selects the correct tools at the correct stages in response to your prepared query.

  • Tool execution accuracy – How closely the agent's tool invocations match the expected invocations, and whether the tool output matches an expected output.

  • Logical consistency – How consistent the answers produced by your agent are. This metric is reference-free, meaning you don’t need to prepare any information in your dataset for evaluation.

For additional details on how agent evaluations are conducted on Snowflake, including the LLM judging processes used for reference-free evaluations, see What is Your Agent’s GPA? A Framework for Evaluating Agent Goal-Plan-Action Alignment.

Access control requirements

Access to conduct a Cortex Agent evaluation requires the following:

  • Access to a role with the permissions needed for agent observability. This role is used for running agent evaluations, and must have the following:

    • The permissions required for agent monitoring. For the complete list, see Snowflake AI Observability reference – Access control.

    • The CREATE FILE FORMAT permission on the schema containing the agent.

    • The EXECUTE TASK ON ACCOUNT privilege.

    • The ability to impersonate the user who will run an evaluation. This must be the user who is logged in through Snowsight and who creates and runs the evaluation. For example, to enable impersonation of a user my_user for the role my_observability_role:

      GRANT IMPERSONATE ON USER my_user TO ROLE my_observability_role;
      
  • The CREATE TASK permission on the schema containing the agent.

  • The OWNERSHIP or MONITOR privilege on the agent being evaluated.

Caution

For security purposes, each user conducting an evaluation should be given their own role for running agent evaluations. The user impersonation requirement will be dropped before public preview.
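As a consolidated reference, the following is a minimal sketch of the grants described above. The role my_observability_role, user my_user, database my_db, schema my_schema, and agent my_agent are placeholder names, and the AGENT object-type keyword in the MONITOR grant is an assumption; confirm the exact privilege names against the Snowflake AI Observability reference.

-- Sketch only: role, user, database, schema, and agent names are placeholders.
-- Schema-level permissions on the schema containing the agent.
GRANT CREATE FILE FORMAT ON SCHEMA my_db.my_schema TO ROLE my_observability_role;
GRANT CREATE TASK ON SCHEMA my_db.my_schema TO ROLE my_observability_role;

-- Account-level privilege needed to run the evaluation tasks.
GRANT EXECUTE TASK ON ACCOUNT TO ROLE my_observability_role;

-- Allow the role to impersonate the Snowsight user who creates and runs the evaluation.
GRANT IMPERSONATE ON USER my_user TO ROLE my_observability_role;

-- Monitoring access on the agent being evaluated (assumes AGENT is the grantable object type).
GRANT MONITOR ON AGENT my_db.my_schema.my_agent TO ROLE my_observability_role;

-- Make the evaluation role available to the user.
GRANT ROLE my_observability_role TO USER my_user;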

Prepare an evaluation dataset

Note

Conducting reference-free evaluations doesn’t require a dataset.

Before starting a Cortex Agent evaluation, you need to prepare a dataset for your evaluation to run against. This dataset consists of an input query column of type VARCHAR that represents your query, and an output column of type OBJECT that contains a description of expected agent behavior. The output column is your ground truth for evaluation.

This object has two keys:

  • ground_truth_output – The expected final output of the agent, used in answer correctness evaluation. This value is evaluated on a semantic match, where agent answers that align more closely with your ground truth output receive higher scores.

  • ground_truth_invocations – The value of this key is an ARRAY of JSON objects, each describing a tool invocation, used for the tool selection accuracy and tool execution accuracy metrics.

ground_truth_invocations entry keys

  • tool_name – The name of the agent tool expected to run as part of an evaluation. Used by: Tool selection, Tool execution.

  • tool_sequence – The numbered order in which this tool, with its associated arguments and outputs, is expected to be called. Used by: Tool selection (optional), Tool execution (optional).

  • tool_input – A VARIANT of key-value pairs that map to parameters and values for this tool invocation. For Cortex Analyst and Cortex Search tools, this VARIANT is instead the query string. Used by: Tool execution (optional).

  • tool_output – A VARIANT describing the expected output from the tool. Used by: Tool execution (optional).

    For Cortex Analyst tools invoked by your agent, you can only inspect the generated SQL as part of the output. This value is contained in the SQL key:

    {
      "tool_name": "my_analyst",
      "tool_output": {
        "SQL": "SELECT * FROM EXPECTED_TABLE"
      }
    }

    For Cortex Search tools invoked by your agent, you can only inspect the sources searched. This value is an ARRAY containing the name of each source, contained in the search results key:

    {
      "tool_name": "my_search",
      "tool_output": {
        "search results": [
          "context searched 1",
          "context searched 2",
          "context searched N"
        ]
      }
    }

For example, if you expect the agent to call get_weather with the inputs city = "San Francisco" and date = "08/02/2019" and return a VARIANT of {"temp": "14", "units": "C"}, you would define an entry in the ground_truth_invocations array as:

{
  "tool_name": "get_weather",
  "tool_input": {"city": "San Francisco", "date": "08/02/2019"},
  "tool_output": {"temp": "14", "units": "C"}
}

The following example includes the above expected tool invocation as part of a full ground truth entry, where the expected answer from your agent is The temperature was 14 degrees Celsius in San Francisco on August 2nd, 2019.

{
  "ground_truth_output": "The temperature was 14 degrees Celsius in San Francisco on August 2nd, 2019.",
  "ground_truth_invocations": [
      {
        "tool_name": "get_weather",
        "tool_input": {"city": "San Francisco", "date": "08/02/2019"}
        "tool_output": {"temp": "14", "units": "C"}
      }
  ]
}

Note

Data you provide in this column that isn’t used by a selected evaluation is ignored.
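For illustration, a minimal sketch of a dataset table holding the example above might look like the following. The table and column names (weather_eval_dataset, input_query, expected_behavior) are placeholders, not names required by Snowflake.

-- Sketch only: table and column names are illustrative.
CREATE OR REPLACE TABLE my_db.my_schema.weather_eval_dataset (
    input_query VARCHAR,      -- The query sent to the agent.
    expected_behavior OBJECT  -- Ground truth: expected answer and tool invocations.
);

-- OBJECT and ARRAY values must be constructed in a SELECT rather than a VALUES clause.
INSERT INTO my_db.my_schema.weather_eval_dataset
SELECT
    'What was the temperature in San Francisco on August 2nd, 2019?',
    OBJECT_CONSTRUCT(
        'ground_truth_output',
            'The temperature was 14 degrees Celsius in San Francisco on August 2nd, 2019.',
        'ground_truth_invocations', ARRAY_CONSTRUCT(
            OBJECT_CONSTRUCT(
                'tool_name', 'get_weather',
                'tool_input', OBJECT_CONSTRUCT('city', 'San Francisco', 'date', '08/02/2019'),
                'tool_output', OBJECT_CONSTRUCT('temp', '14', 'units', 'C')
            )
        )
    );

In the Select metrics step described below, you would then choose input_query as the input query column and expected_behavior as the ground truth column.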

Start an agent evaluation

Begin your evaluation of a Cortex Agent by doing the following:

  1. Sign in to Snowsight.

  2. In the navigation menu, select AI & ML » Agents.

  3. Select the agent that you want to evaluate.

  4. Select the Evaluations tab.

  5. Select New evaluation run.

    The New evaluation run modal opens.

  6. In the Name field, provide a name for your evaluation. This name should be unique for the agent being evaluated.

  7. Optional: In the Description field, provide any comments for the evaluation.

  8. Select Next.

    This advances to the Select dataset modal.

  9. Select the dataset used to evaluate your agent. You can choose either Existing dataset or Create new dataset.

    To use an existing dataset:

    1. From the Database and schema list, select the database and schema containing your dataset.

    2. From the Select dataset list, select your dataset.

    To create a new dataset:

    1. From the Source table - Database and schema list, select the database and schema containing the table you want to import to a dataset.

    2. From the Select source table list, select your source table.

    3. From the New dataset location - Database and schema list, select the database and schema to place your new dataset.

    4. In the Dataset name field, enter your dataset name. This name needs to be unique among the schema-level objects in your selected schema.

  10. Select Next.

    This advances to the Select metrics modal.

  11. From the Input query list, select the column of your dataset which contains the input queries.

  12. Under Metrics, turn on the toggle for each metric you want to include in your evaluation, and select the column of your dataset that contains the ground truth for your evaluation.

  13. Select Create to create the evaluation and begin the evaluation process.

At any point, you can select Cancel to cancel creating the evaluation, or select Prev to return to the previous modal.

Inspect evaluation results

Evaluation results include information about the requested metrics, details of the agent’s threads of reasoning, and information about the LLM planning stage for each executed trace in the thread. The Evaluations tab for an agent in Snowsight gives you an overview of every evaluation run and its summary results.

To view evaluation results in Snowsight:

  1. Sign in to Snowsight.

  2. In the navigation menu, select AI & ML » Agents.

  3. Select the agent whose evaluation results you want to view.

  4. Select the Evaluations tab.

Evaluation runs listing

The summary information for each run includes:

  • RUN NAME – The name of the evaluation run.

  • # OF RECORDS – The number of queries performed and answered as part of the run.

  • STATUS – The status of the evaluation run, which is one of:

    • Success indicator – All inputs were evaluated and results are available.

    • Spinner – The run is in progress, with no information available yet.

    • Warning indicator – The run experienced an error at some point. Some or all metrics may be unavailable for the run.

  • AVG DURATION – The average time taken to execute an input query for the run.

  • LOGICAL CONSISTENCY – Average over all inputs of the logical consistency evaluation for the run, if requested.

  • TOOL EXECUTION ACCURACY – Average over all inputs of the tool execution accuracy evaluation for the run, if requested.

  • TOOL SELECTION ACCURACY – Average over all inputs of the tool selection accuracy evaluation for the run, if requested.

  • CREATED – The time at which the run was created and started.

Evaluation run overview

When you select an individual run in Snowsight, you’re presented with the run overview. This overview includes summary averages for each metric evaluated during the run, and a summary of each input execution. The overview for each input execution includes:

  • INPUT – The input query used for the evaluation.

  • OUTPUT – The output produced by the agent.

  • DURATION – The length of time taken to process the input and produce output.

  • LOGICAL CONSISTENCY – The logical consistency evaluation for the input, if requested.

  • TOOL EXECUTION ACCURACY – The tool execution accuracy evaluation for the input, if requested.

  • TOOL SELECTION ACCURACY – The tool selection accuracy evaluation for the input, if requested.

  • EVALUATED – The time at which the input was processed.

Record details

When you select an individual input in Snowsight, you’re presented with the Record details view. This view includes three panes: Evaluation results, Thread details, and Trace details.

Evaluation results

Your evaluation results are presented here in detail. Each metric has its own box showing its overall average across inputs; select a box to display a popover with more information. This popover contains a breakdown of the number of runs that performed at high accuracy (80% or more accurate), at medium accuracy (30% or more accurate, but below high accuracy), and that failed.

Thread details

The information logged during the execution of each agent thread. This includes planning and response generation by default, as well as a thread trace for each tool that the agent invoked during that thread.

Trace details

Each trace pane includes input, processing, and output information relevant to that stage of agent execution. This information is the same as that provided by agent monitoring.

Cost considerations

Using Cortex Agent evaluations can incur highly variable costs depending on the evaluation metrics selected, the optional evaluation information you supply, the structure of your agent, and other considerations. During private preview, please work directly with your Snowflake representative regarding cost.