Cortex Agent evaluation features in private preview

Important

The main functionality for Cortex Agent evaluation is now generally available (GA). The following metrics remain in private preview:

  • Tool selection accuracy – Whether the agent’s orchestration layer invokes the tools you expect for a given query. A low score points to planning or routing logic that needs improvement. Extra tool calls add cost and latency, while missing calls can lead to incomplete answers.
  • Tool execution accuracy – Whether each tool that runs receives appropriate input from the agent orchestration layer and returns output that meets the specified requirements. A low score points to tool configuration prompts that produce weak tool arguments, or to data, semantic-layer, or implementation issues, depending on the tool.

For information on configuring, starting, and inspecting an agent evaluation run, see Cortex Agent evaluations.

Tool selection accuracy (TSA)

TSA compares the tool names you list in ground truth to the tool names the agent actually invoked. Agents can call tools in parallel or in multiple valid orders, so sequencing is not part of the score.

For each evaluation record, Snowflake counts how many expected tool names are matched to actual invocations (each actual call is matched at most once). The formula is:

matched tools / max(number of expected tool entries, number of actual tool calls)

This single formula penalizes every failure mode: too few calls, too many calls, or the wrong tools.
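
For example, suppose ground truth expects finance_analyst and product_docs_search, but the agent actually calls finance_analyst, web_search, and web_search. One expected tool is matched, so the score is 1 / max(2, 3) ≈ 0.33. The following statement is only an illustrative restatement of that arithmetic (Snowflake computes the metric itself; the tool names above are hypothetical):

-- Worked TSA example: 2 expected tool entries, 3 actual calls, 1 matched.
-- Score = matched / max(expected, actual) = 1 / max(2, 3)
SELECT 1 / GREATEST(2, 3)::FLOAT AS tool_selection_accuracy;  -- ~0.33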

Tool execution accuracy (TEA)

TEA scores input and output quality for the tool calls you describe in ground truth; it does not re-check tool selection.

For each entry in ground_truth_invocations, an LLM judge finds the closest semantic match among the agent’s actual invocations for that tool, such that each tool call is paired at most once. The judge then scores how well the expected input and/or output align with the real invocation’s input and output.

If your ground truth lists tool calls that the agent never made, those tool calls are not matched for TEA; rely on TSA to reflect missing or extra tools relative to your expectations.

tool_input and tool_output are optional. If you omit one or both for an entry, Snowflake evaluates only what you provided. If you enable TEA but supply no input or output anywhere the metric needs them, the metric can report not applicable (NA) for those parts. Use natural language freely: you can paste SQL, JSON, tabular data, or prose. The judge interprets the strings semantically, in the same manner as answer correctness ground truth (see Cortex Agent evaluations).
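
For example, if you only care that the right tool is selected, an entry can carry tool_name alone; TEA then has nothing to score for that entry, while TSA still counts it (the service name here is illustrative):

{
  "tool_name": "product_docs_search"
}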

Prepare an evaluation dataset for tool metrics

Tooling evaluations use an additional key in your dataset’s ground truth column, ground_truth_invocations. The value of this key is an array of JSON objects, one per expected tool-related check. Use the empty array [] when you expect no tools to be called.
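
For example, a record for a query the agent should answer directly, without calling any tools (the answer text is illustrative):

{
  "ground_truth_output": "Our support hours are 9am to 5pm Pacific, Monday through Friday.",
  "ground_truth_invocations": []
}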

Tip

Treat tool_input and tool_output as VARCHAR-style strings in your JSON (plain text). You are not limited to a rigid JSON shape: describe expected parameters, SQL, retrieved facts, or response snippets in whatever form is clearest. The LLM judge uses that text to assess semantic fit against what actually happened.

ground_truth_invocations entry keys

  • tool_name – The name of the tool as exposed to the agent: a Cortex Analyst name, a Cortex Search service name, a custom tool name, or the fixed name web_search for web search. Use the same identifier the agent sees in its traces. Used by TSA (required to define the expected tool call) and TEA (required to match actual tool calls to expected tool calls).
  • tool_input – Optional string. Natural-language or structured text describing what input you expect the agent to pass (for example a paraphrase of the question, a SQL string, or a short JSON blob as text). Used by TEA only; provide it if validating the tool input is important.
  • tool_output – Optional string. Natural-language or structured text describing what you expect the tool to return (result sets, citations, JSON, and so on). Used by TEA only; provide it if validating the tool output is important.

Cortex Analyst – Use the analyst object name your agent calls. In tool_input and tool_output you can include expected SQL, expected row summaries, or both in one string.

{
  "tool_name": "finance_analyst",
  "tool_input": "User wants Q1 revenue; expect a query scoped to REVENUE_V semantic view with date filter Jan-Mar 2025.",
  "tool_output": "SQL should aggregate REVENUE by month. Result should show three rows with totals roughly 1.2M, 1.4M, 1.1M USD."
}

Cortex Search – Use the search tool or service name your agent is configured with (for example the name shown in agent settings, not a generic label unless that is the registered name). Describe what should be retrieved instead of relying on a fixed schema.

{
  "tool_name": "product_docs_search",
  "tool_input": "Query should ask for return policy and warranty text for electronics.",
  "tool_output": "Top sources should include the policy page and mention 30-day returns and 1-year warranty."
}

Web search – The tool name in traces is always web_search and is not configurable. Use that exact value for tool_name. In tool_input and tool_output, describe the query you expect and what kinds of sources or facts should appear in results. For the product feature, see Web search.

{
  "tool_name": "web_search",
  "tool_input": "User wants the current Federal Reserve policy rate and any change in the last six months.",
  "tool_output": "Results should cite recent news or official sources and include a concrete rate or date range."
}

Custom tool – Use the registered tool name. Describe arguments and outcomes in text; embed JSON or key-value prose if that helps your reviewers.

{
  "tool_name": "get_weather",
  "tool_input": "City San Francisco and date August 2, 2019.",
  "tool_output": "Temperature near 14 C and units metric."
}

Full ground truth example

The following example combines ground_truth_output with an expected invocation for a single user query:

{
  "ground_truth_output": "The temperature was 14 degrees Celsius in San Francisco on August 2nd, 2019.",
  "ground_truth_invocations": [
    {
      "tool_name": "get_weather",
      "tool_input": "San Francisco, 2019-08-02",
      "tool_output": "Temp 14 C, conditions clear."
    }
  ]
}

ground_truth_invocations supports a list of multiple expected tool calls for agent queries that invoke multiple tools, as in the following sketch.
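
For example, a query that should trigger both a search retrieval and an analyst query could carry two entries (all names and text here are illustrative):

{
  "ground_truth_output": "Electronics can be returned within 30 days; Q1 2025 refunds totaled roughly 40K USD.",
  "ground_truth_invocations": [
    {
      "tool_name": "product_docs_search",
      "tool_input": "Query should ask for the electronics return policy.",
      "tool_output": "Top sources should mention 30-day returns."
    },
    {
      "tool_name": "finance_analyst",
      "tool_input": "Expect a query for total refunds in Q1 2025.",
      "tool_output": "Result should show a single Q1 2025 refund total."
    }
  ]
}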

Load the dataset into a table

To bring your JSON evaluation dataset into a Snowflake table, use the PARSE_JSON SQL function. The following example creates a table agent_evaluation_data and inserts a row that includes ground_truth_invocations:

CREATE OR REPLACE TABLE agent_evaluation_data (
    input_query VARCHAR,
    ground_truth VARIANT
);

INSERT INTO agent_evaluation_data
  SELECT
    'What was the temperature in San Francisco on August 2nd 2019?',
    PARSE_JSON('
      {
        "ground_truth_output": "The temperature was 14 degrees Celsius in San Francisco on August 2nd, 2019.",
        "ground_truth_invocations": [
            {
              "tool_name": "get_weather",
              "tool_input": "San Francisco, 2019-08-02",
              "tool_output": "Temperature about 14 C, clear."
            }
        ]
      }
    ');
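
As an optional sanity check, a query such as the following (a sketch that assumes the table and row above) confirms that the VARIANT parsed into the expected shape:

-- Inspect a nested field to verify the parsed VARIANT structure.
SELECT
  input_query,
  ground_truth:ground_truth_invocations[0]:tool_name::VARCHAR AS expected_tool
FROM agent_evaluation_data;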

Note

Data you provide in the ground_truth column that is not used by a selected metric is ignored.

Enable tool metrics in YAML and Snowsight

YAML metrics section

Add tool metrics to the top-level metrics sequence in your evaluation configuration file, using the same pattern as other built-in metrics. Include the identifiers tool_selection_accuracy and tool_execution_accuracy.

metrics:
  - "answer_correctness"
  - "tool_selection_accuracy"
  - "tool_execution_accuracy"
  - "logical_consistency"

When either tool metric is enabled, the ground_truth VARIANT column in your dataset must include ground_truth_invocations for the metrics to be computed successfully. Other enabled system metrics still require their usual keys (for example, ground_truth_output for answer correctness).

Snowsight UI (system metric toggles)

When you start an evaluation in Snowsight, follow the main flow in Cortex Agent evaluations until you reach the Select metrics step. Under System metrics, turn on the toggle for Tool selection accuracy and/or Tool execution accuracy when you want those scores computed for the run, the same way you enable Answer correctness or Logical consistency.

You still choose the dataset column that holds ground truth for the run; pick the column whose VARIANT contains ground_truth_output and ground_truth_invocations.