Cortex Agent evaluation features in private preview

Important

The main functionality for Cortex Agent evaluation is now in GA. The following features are still in private preview:

  • Tool selection accuracy metric

  • Tool execution accuracy metric

For information on configuring, starting, and inspecting an agent evaluation run, see Cortex Agent evaluations.

Snowflake offers the following private preview metrics for evaluating your agent:

  • Tool selection accuracy – Whether or not the agent selects the correct tools at the correct stages in response to your prepared query.

  • Tool execution accuracy – How closely the agent matches the expected tool invocations, and whether each tool's output matches the expected output.

Prepare an evaluation dataset for tooling metrics

Tooling evaluations use an additional key, ground_truth_invocations, in the output column of your dataset. The value of this key is an array of JSON objects, each describing one expected tool invocation. Use the empty array [] to verify that the agent calls no tools. Each entry in ground_truth_invocations supports the following keys:

ground_truth_invocations entry keys

  • tool_name – The name of the agent tool expected to run as part of an evaluation. Used by: Tool selection, Tool execution.

  • tool_sequence – The numbered order in which this tool, with its associated arguments and outputs, is expected to be called. Used by: Tool selection (optional), Tool execution (optional).

  • tool_input – A VARIANT of key-value pairs that map to parameters and values for this tool invocation. For Cortex Analyst and Cortex Search tools, this VARIANT is instead the query string. Used by: Tool execution (optional).

  • tool_output – A VARIANT describing the expected output from the tool. Used by: Tool execution (optional).

For Cortex Analyst tools invoked by your agent, you can only inspect the generated SQL as part of the output. This value is contained in the SQL key:

{
  "tool_name": "my_analyst",
  "tool_output": {
    "SQL": "SELECT * FROM EXPECTED_TABLE"
  }
}

For Cortex Search tools invoked by your agent, you can only inspect the sources searched. This value is an ARRAY containing the name of each source, contained in the search results key:

{
  "tool_name": "my_search",
  "tool_output": {
    "search results": [
      "context searched 1",
      "context searched 2",
      "context searched N"
    ]
  }
}
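Before loading a dataset, it can help to check each ground_truth_invocations entry against the keys described above. The following Python sketch is one way to do that locally. The helper names check_invocation and check_invocations are hypothetical and not part of any Snowflake API. The sketch also assumes that tool_name is required (it is the only key not marked optional) and that tool_sequence, when present, is an integer.

```python
ALLOWED_KEYS = {"tool_name", "tool_sequence", "tool_input", "tool_output"}


def check_invocation(entry):
    """Return a list of problems found in one ground_truth_invocations entry."""
    if not isinstance(entry, dict):
        return ["entry must be a JSON object"]
    problems = []
    unknown = sorted(set(entry) - ALLOWED_KEYS)
    if unknown:
        problems.append(f"unknown keys: {unknown}")
    # Assumption: tool_name is required, because it is not marked optional.
    name = entry.get("tool_name")
    if not isinstance(name, str) or not name:
        problems.append("tool_name must be a non-empty string")
    # Assumption: tool_sequence, when present, is an integer position.
    if "tool_sequence" in entry:
        seq = entry["tool_sequence"]
        if isinstance(seq, bool) or not isinstance(seq, int):
            problems.append("tool_sequence must be an integer")
    return problems


def check_invocations(invocations):
    """Check a whole array; [] is valid and means no tools are expected."""
    if not isinstance(invocations, list):
        return ["ground_truth_invocations must be an array"]
    problems = []
    for i, entry in enumerate(invocations):
        problems.extend(f"entry {i}: {p}" for p in check_invocation(entry))
    return problems
```

An empty result list means the array passed these local checks. It does not guarantee that the evaluation itself accepts the data.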

For example, if you expect the agent to call get_weather with the inputs city = "San Francisco" and date = "08/02/2019", and expect the tool to return a VARIANT of {"temp": "14", "units": "C"}, define an entry in the ground_truth_invocations array as:

{
  "tool_name": "get_weather",
  "tool_input": {"city": "San Francisco", "date": "08/02/2019"},
  "tool_output": {"temp": "14", "units": "C"}
}

The following example includes the expected tool invocation above as part of a full ground truth entry, where the expected answer from your agent is "The temperature was 14 degrees Celsius in San Francisco on August 2nd, 2019.":

{
  "ground_truth_output": "The temperature was 14 degrees Celsius in San Francisco on August 2nd, 2019.",
  "ground_truth_invocations": [
      {
        "tool_name": "get_weather",
        "tool_input": {"city": "San Francisco", "date": "08/02/2019"},
        "tool_output": {"temp": "14", "units": "C"}
      }
  ]
}
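If you generate many ground truth entries, building them in code avoids hand-editing JSON. The following Python sketch assembles an entry with the same shape as the example above. The function name make_ground_truth is hypothetical, and the second expected answer is made up for illustration. The sketch assumes that omitting ground_truth_invocations is acceptable when you don't want to assert anything about tool use, while passing an explicit empty list asserts that no tools are called.

```python
import json


def make_ground_truth(expected_answer, invocations=None):
    """Build one ground truth entry and return it as a JSON string."""
    entry = {"ground_truth_output": expected_answer}
    # None means "say nothing about tools"; [] means "expect no tool calls".
    if invocations is not None:
        entry["ground_truth_invocations"] = list(invocations)
    return json.dumps(entry)


# The weather example from this section.
weather = make_ground_truth(
    "The temperature was 14 degrees Celsius in San Francisco on August 2nd, 2019.",
    [
        {
            "tool_name": "get_weather",
            "tool_input": {"city": "San Francisco", "date": "08/02/2019"},
            "tool_output": {"temp": "14", "units": "C"},
        }
    ],
)

# A hypothetical entry that expects the agent to answer without calling tools.
no_tools = make_ground_truth("I can help with weather questions.", [])
```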

To bring your JSON evaluation dataset into a Snowflake table, use the PARSE_JSON SQL function. The following example creates a new table, agent_evaluation_data, for the evaluation dataset, and inserts a row for the input query "What was the temperature in San Francisco on August 2nd 2019?" with the ground truth JSON from the previous example:

CREATE OR REPLACE TABLE agent_evaluation_data (
    input_query VARCHAR,
    ground_truth OBJECT
);

INSERT INTO agent_evaluation_data
  SELECT
    'What was the temperature in San Francisco on August 2nd 2019?',
    PARSE_JSON('
      {
        "ground_truth_output": "The temperature was 14 degrees Celsius in San Francisco on August 2nd, 2019.",
        "ground_truth_invocations": [
            {
              "tool_name": "get_weather",
              "tool_input": {"city": "San Francisco", "date": "08/02/2019"},
              "tool_output": {"temp": "14", "units": "C"}
            }
        ]
      }
    ');
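When the JSON is embedded in a single-quoted SQL string, as in the statement above, quotes and backslashes inside the data need escaping. The following Python sketch generates an equivalent INSERT statement from a Python dictionary. The helper names sql_string_literal and build_insert are hypothetical. The escaping assumes that single-quoted string constants accept a doubled single quote and treat a backslash as an escape character. The table name is inserted as-is and must come from a trusted source.

```python
import json


def sql_string_literal(text):
    """Quote text as a single-quoted SQL string constant."""
    # Double backslashes first so JSON escapes such as \" survive,
    # then double single quotes so they do not end the constant.
    escaped = text.replace("\\", "\\\\").replace("'", "''")
    return "'" + escaped + "'"


def build_insert(table, input_query, ground_truth):
    """Return an INSERT ... SELECT statement for one evaluation row."""
    payload = json.dumps(ground_truth)
    return (
        f"INSERT INTO {table}\n"
        f"  SELECT {sql_string_literal(input_query)},\n"
        f"         PARSE_JSON({sql_string_literal(payload)});"
    )


statement = build_insert(
    "agent_evaluation_data",
    "What was the temperature in San Francisco on August 2nd 2019?",
    {
        "ground_truth_output": "The temperature was 14 degrees Celsius "
        "in San Francisco on August 2nd, 2019.",
        "ground_truth_invocations": [
            {
                "tool_name": "get_weather",
                "tool_input": {"city": "San Francisco", "date": "08/02/2019"},
                "tool_output": {"temp": "14", "units": "C"},
            }
        ],
    },
)
```

In application code, bind parameters provided by your client library are generally a safer choice than building SQL text by hand.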

Note

Data you provide in the ground_truth column that isn’t used by a selected evaluation is ignored.