Cortex Agent evaluations

Cortex Agent evaluations allow you to monitor your agent’s behavior and performance. Evaluate your agent against both ground truth and reference-free evaluation metrics. During evaluation, your agent’s activity is traced and monitored so you can ensure that each step in the process advances towards your end goal.

For evaluation methodology and dataset design guidance, see Best Practices for Evaluating Cortex Agents.

Snowflake offers the following metrics to evaluate your agent against:

  • Answer correctness – How closely the actual response for a given input query to the agent matches the expected ground truth answer.
  • Logical consistency – Measures consistency across agent instructions, planning, and tool calls. This metric is reference-free, meaning you don’t need to prepare any information in your dataset for evaluation.

Snowflake also allows you to create custom evaluation metrics that use the LLM judging process to measure context critical to your Agent’s domain and use case. Custom metrics use an LLM prompt and scoring methodology, which are passed to the evaluation judging system to produce a score.

For additional details about how agent evaluations are conducted on Snowflake, including the LLM judging system used for reference-free evaluations, see the Snowflake engineering blog What’s Your Agent’s GPA? A Framework for Evaluating AI Agent Reliability. For an example of running an Agent Evaluation programmatically, see the guide Getting Started with Cortex Agent Evaluations.

Access control requirements

To run a Cortex Agent evaluation, the role that runs the evaluation needs the following:

  • The SNOWFLAKE.CORTEX_USER database role
  • The EXECUTE TASK ON ACCOUNT privilege
  • The USAGE privilege on the database and schema containing your agent
  • The USAGE privilege on the database and schema containing your evaluation data
    • If creating a dataset from an input table, CREATE DATASET ON SCHEMA
  • The following privileges on the current database and schema, which is where the evaluation runs from:
    • USAGE
    • CREATE FILE FORMAT ON SCHEMA
    • CREATE TASK

Note

In Snowsight, agent evaluations are run on the database and schema of the agent. With SQL, agent evaluations are run on the session’s database and schema.

  • The USAGE or OWNERSHIP privilege on your agent
  • The MONITOR or OWNERSHIP privilege on your agent
  • If using an agent evaluation configuration, READ privilege on the stage containing the configuration file.

If the agent being evaluated uses tools, your role also needs access to all of them.

Additionally, if working with evaluations in Snowsight, the role you use to run or inspect an evaluation needs the USAGE privilege on your default warehouse.

Prepare an evaluation dataset

Before starting a Cortex Agent evaluation, prepare a table containing your evaluation inputs. This table is used to create a dataset for your evaluation to run against. To learn more about datasets on Snowflake, see Snowflake Datasets.

Cortex Code

Cortex Code can help you create or update an evaluation dataset. Use the dataset-curation sub-skill of the Cortex Code cortex-agent skill in the CLI (see Cortex Code CLI - Skills), or select Create with Cortex Code or Manage datasets on an agent’s Evaluations tab in Snowsight, to:

  • Generate synthetic queries based on your agent configuration.
  • Import queries from production monitoring data.
  • Edit an existing dataset to add, remove, or modify queries using either source.

Cortex Code can also run the evaluation against the dataset, so you can go from dataset to results in a single flow.

Dataset format

The dataset table has two columns:

  • Input query (VARCHAR) — the user query to evaluate.
  • Ground truth (VARIANT) — a JSON object describing the expected agent behavior. This is the single value the LLM judges compare against.

The answer correctness system metric reads the ground_truth_output key from that JSON and compares its value to the agent’s streamed reply — everything the user sees, including LLM thinking, response generation, and chart generation. Because the value is fed into an LLM prompt, treat it as a plain-language rubric:

  • If the correct answer is known and stable, state it and include any rounding, tolerance, units, formatting, or scoping the response must observe (for example, “the value must be within ±2% of 123.45”).
  • If the answer changes over time or has a particular shape, describe what a correct response should and shouldn’t contain — including a format example like "Output is in the following JSON format: ..." if structure matters — in enough detail that two readers would agree on whether a given reply meets the bar.

Custom metrics read the entire VARIANT through the {{ground_truth}} placeholder, regardless of key. Use this to check process criteria the streamed reply doesn’t expose: for example, add a custom metric that references {{tool_info}} to verify which tools or tables the agent used. Keep output criteria in ground_truth_output and process criteria in the custom metric’s own keys.

For reference-free metrics like logical consistency, you don’t need a ground truth value at all. Any data not consumed by your selected metrics is ignored, so the column can be left empty for runs that use only reference-free metrics.
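As a minimal sketch of the layout described above, the following Python builds one ground truth value that keeps the output rubric under ground_truth_output and puts process criteria under a separate key. The helper name build_ground_truth and the key expected_tables are hypothetical, chosen only for illustration; the only key this page defines is ground_truth_output.

```python
import json


def build_ground_truth(output_rubric, **process_criteria):
    """Return a JSON string suitable for the ground truth VARIANT column."""
    if not output_rubric.strip():
        raise ValueError("output_rubric must be a non-empty plain-language rubric")
    if "ground_truth_output" in process_criteria:
        raise ValueError("pass the output rubric as the first argument only")
    # The answer correctness metric reads only this key.
    ground_truth = {"ground_truth_output": output_rubric}
    # Extra keys are visible only to custom metrics through {{ground_truth}}.
    ground_truth.update(process_criteria)
    return json.dumps(ground_truth, ensure_ascii=False)


payload = build_ground_truth(
    "There are 1,000 active customers as of December 31, 2025.",
    expected_tables=["CUSTOMERS"],  # hypothetical process-criteria key
)
print(payload)
```

The resulting string can be passed to PARSE_JSON when you load the table.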

Ground truth examples

The following examples show ground_truth_output values you can adapt for your own dataset. Each pairs an input query with the JSON you’d put in the ground truth VARIANT column.

Static factual query

Use when the correct answer is known and stable. State the expected value and decide whether the response must match it exactly or whether rounding or a tolerance is acceptable. Add any other facets a correct reply must cover, such as scoping to the right date or excluding specific categories of records.

Input query: How many active customers does my business have as of December 31, 2025?

{
  "ground_truth_output": "There are 1,000 active customers as of December 31, 2025. The response should reference that exact date (not a different date or 'as of today') and present the count as a factual number. Rounding to the nearest hundred or a value within ±1% is acceptable; values outside that range aren't. The count shouldn't include test or churned accounts."
}

Dynamic or live data query

Use when you can’t fix a number in advance but can describe what a good response looks like.

Input query: How many orders did customers place today?

{
  "ground_truth_output": "The response should give a specific whole-number count of orders placed today and scope the count explicitly to today's date. It shouldn't return results for a different time period or claim data is unavailable without attempting retrieval, and it shouldn't hedge with phrases like 'approximately' or 'I think'. The count should be presented as a fact derived from the data."
}

Boundary or out-of-scope query

Use when the agent should refuse rather than hallucinate.

Input query: What's the weather like in New York today?

{
  "ground_truth_output": "The response should state that weather information is outside the agent's capabilities and ideally point to the kinds of questions the agent can help with. It shouldn't fabricate a forecast or present any temperatures or conditions."
}

Complex investigation

Use when the response must connect multiple facts. Logical consistency (reference-free) catches contradictions in planning and tool use; ground_truth_output catches a coherent but factually wrong explanation.

Input query: Why did our checkout conversion rate drop between March 1–7, 2025?

{
  "ground_truth_output": "The response should acknowledge that checkout conversion dropped during March 1–7, 2025, link the drop to the payment gateway timeout issue that began March 3, 2025, and note that mobile users were disproportionately affected (more than 70% of failed checkouts were mobile). It should also quantify the drop (conversion fell from ~4.2% to ~2.8%). It shouldn't attribute the drop to causes the data doesn't support, such as a marketing campaign change or seasonal trends. The causal chain (gateway timeouts to failed checkouts to conversion drop, concentrated on mobile) matters more than the order the facts appear in."
}

| Scenario | What to put in ground_truth_output |
|---|---|
| Known answer, static data | The specific value, any tolerance, and what to exclude |
| Live or changing data | A description of what a correct response should and shouldn’t contain |
| Off-topic or refusal | What the agent should say, and that it shouldn’t fabricate an answer |
| Multi-fact investigation | Each fact the response should cover, plus explanations to avoid |

Insert into a Snowflake table

Populate the VARIANT column with PARSE_JSON. The following example creates agent_evaluation_data and inserts one input query with its expected answer:

CREATE OR REPLACE TABLE agent_evaluation_data (
    input_query VARCHAR,
    ground_truth VARIANT
);

INSERT INTO agent_evaluation_data
  SELECT
    'What was the temperature in San Francisco on August 2nd 2019?',
    PARSE_JSON('
      {
        "ground_truth_output": "The temperature was 14 degrees Celsius in San Francisco on August 2nd, 2019."
      }
    ');

Important

OBJECT_CONSTRUCT and ARRAY_CONSTRUCT return OBJECT and ARRAY, not VARIANT. Use PARSE_JSON, or wrap a value in TO_VARIANT, to guarantee the column type.
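Rubrics such as the ground truth examples earlier on this page contain apostrophes ("shouldn't", "aren't"), which would end a single-quoted SQL string early. One way around this, sketched below in Python, is to generate the INSERT statement with Snowflake's dollar-quoted string constants ($$...$$), which are not escape-processed. The helper names dollar_quote and insert_statement are hypothetical, and the table name is assumed to be trusted input.

```python
import json


def dollar_quote(text):
    """Wrap text in a dollar-quoted SQL string constant."""
    if "$$" in text:
        raise ValueError("text contains '$$' and can't be dollar-quoted")
    return "$$" + text + "$$"


def insert_statement(table, rows):
    """Build one INSERT for (input_query, ground_truth dict) pairs."""
    selects = []
    for query, ground_truth in rows:
        ground_truth_json = json.dumps(ground_truth, ensure_ascii=False)
        selects.append(
            "  SELECT " + dollar_quote(query)
            + ", PARSE_JSON(" + dollar_quote(ground_truth_json) + ")"
        )
    return "INSERT INTO " + table + "\n" + "\n  UNION ALL\n".join(selects) + ";"


rows = [
    (
        "What's the weather like in New York today?",
        {"ground_truth_output": "It shouldn't fabricate a forecast."},
    ),
]
statement = insert_statement("agent_evaluation_data", rows)
print(statement)
```

If you load data through a connector instead, bind parameters avoid the quoting question altogether.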

Create a dataset from a Snowflake table (SQL)

To create an evaluation dataset with SQL, call SYSTEM$CREATE_EVALUATION_DATASET.

Note

The column-mapping keys differ depending on how you create the dataset:

  • When you call SYSTEM$CREATE_EVALUATION_DATASET (SQL), use query_text and expected_tools in the mapping object.
  • When you define a dataset in the Agent Evaluation YAML (dataset.column_mapping), use query_text and ground_truth.

Start an agent evaluation

Cortex Code

You can also run an evaluation through Cortex Code. Use the evaluate-cortex-agent sub-skill of the Cortex Code cortex-agent skill in the CLI (see Cortex Code CLI - Skills), or continue the same Cortex Code flow from Prepare an evaluation dataset in Snowsight directly into running the evaluation against your dataset.

Snowsight

Note

Agent evaluations run as your currently selected role in Snowsight, not your default role. Make sure a role with the correct permissions is active before starting an evaluation.

Begin your evaluation of a Cortex Agent by doing the following:

  1. Sign in to Snowsight.

  2. In the navigation menu, select AI & ML » Agents.

  3. Select the agent you want to evaluate.

  4. Select the Evaluations tab.

  5. Select New evaluation run. The New evaluation run modal opens.

  6. In the Name field, provide a name for your evaluation. This name should be unique for the agent being evaluated.

  7. Optional: In the Description field, provide any comments for the evaluation.

  8. Select Next. This advances to the Select dataset modal.

  9. Select the dataset used to evaluate your agent. You can choose either Existing dataset or Create new dataset. To use an existing dataset:

    1. From the Database and schema list, select the database and schema containing your dataset.
    2. From the Select dataset list, select your dataset.

    To create a new dataset:

    1. From the Source table - Database and schema list, select the database and schema containing the table you want to import to a dataset.
    2. From the Select source table list, select your source table.
    3. From the New dataset location - Database and schema list, select the database and schema to place your new dataset.
    4. In the Dataset name field, enter your dataset name. This name needs to be unique among the schema-level objects in your selected schema.
  10. Select Next. This advances to the Select metrics modal.

  11. From the Input query list, select the column of your dataset which contains the input queries.

  12. For each of the System metrics, change the toggle to active for any metric you want included in your evaluation. Select the column of your dataset containing the ground truth for your evaluation.

  13. (Optional) To conduct a custom evaluation, toggle on Custom metrics.

  14. Select the database and schema containing the stage where your custom evaluation configuration is stored.

  15. Select the stage where your custom evaluation configuration is stored.

  16. Select the YAML configuration file for your custom evaluation.

    Note

    In Snowsight, only the custom evaluation definitions are loaded from your YAML configuration. The rest of the YAML file must still be valid. For the evaluation YAML specification, see Agent Evaluation YAML specification.

  17. For each custom metric, change the toggle to active if you want it included in your evaluation. Select the column of your dataset containing the ground truth for this evaluation.

  18. Select Create to create the evaluation and begin the evaluation process.

At any point, you can select Cancel to cancel creating the evaluation, or select Prev to return to the previous modal.

SQL

To start an evaluation run, check its status, or delete it with SQL, use the EXECUTE_AI_EVALUATION function. This function has the following required arguments:

  • evaluation_job: A string value of ‘START’, ‘STATUS’, or ‘DELETE’.
  • run_parameters: A SQL OBJECT containing the key run_name, with a value of the name of your run.
  • config_file_path: A stage file path pointing to your run configuration YAML file. This path can’t be a signed URL. For the evaluation YAML specification, see Agent Evaluation YAML specification.

Use the evaluation_job value ‘START’ to start an evaluation. The following example starts a run called run-1 using the agent evaluation configuration from @eval_db.eval_schema.metrics/agent_evaluation_config.yaml:

CALL EXECUTE_AI_EVALUATION(
  'START',
  OBJECT_CONSTRUCT('run_name', 'run-1'),
  '@eval_db.eval_schema.metrics/agent_evaluation_config.yaml'
);

After a run starts, you can query its progress with the evaluation_job value ‘STATUS’. This call returns a table in the format used for AI Observability Runs. The following example queries the status of the agent evaluation started from the previous example:

CALL EXECUTE_AI_EVALUATION(
  'STATUS',
  OBJECT_CONSTRUCT('run_name', 'run-1'),
  '@eval_db.eval_schema.metrics/agent_evaluation_config.yaml'
);

To delete an evaluation run, use the evaluation_job value ‘DELETE’. The following example deletes the run-1 run for the agent defined by the same configuration file:

CALL EXECUTE_AI_EVALUATION(
  'DELETE',
  OBJECT_CONSTRUCT('run_name', 'run-1'),
  '@eval_db.eval_schema.metrics/agent_evaluation_config.yaml'
);
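If you issue these calls from a script, a small builder can enforce the argument rules listed above before anything reaches Snowflake. The following Python is a sketch under stated assumptions: the function name evaluation_call is hypothetical, and checking for a leading @ is only a rough way to reject signed URLs in favor of stage paths.

```python
VALID_JOBS = ("START", "STATUS", "DELETE")


def evaluation_call(job, run_name, config_file_path):
    """Return the CALL EXECUTE_AI_EVALUATION statement as a string."""
    job = job.upper()
    if job not in VALID_JOBS:
        raise ValueError("evaluation_job must be one of " + ", ".join(VALID_JOBS))
    if not config_file_path.startswith("@"):
        raise ValueError("config_file_path must be a stage path, not a URL")
    for value in (run_name, config_file_path):
        # Keep the sketch simple: refuse values that would need SQL escaping.
        if "'" in value or "\\" in value:
            raise ValueError("quotes and backslashes are not supported here")
    return (
        "CALL EXECUTE_AI_EVALUATION(\n"
        f"  '{job}',\n"
        f"  OBJECT_CONSTRUCT('run_name', '{run_name}'),\n"
        f"  '{config_file_path}'\n"
        ");"
    )


print(evaluation_call(
    "status",
    "run-1",
    "@eval_db.eval_schema.metrics/agent_evaluation_config.yaml",
))
```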

Tip

You can call the EXECUTE_AI_EVALUATION function from a Task to regularly run an evaluation or check the status of one.

Inspect evaluation results

Evaluation results include information about the requested metrics, details of the agent’s threads of reasoning, and information about the LLM planning stage for each executed trace in the thread.

Cortex Code

In the Cortex Code CLI, the cortex-agent skill provides two sub-skills for working with completed evaluations:

  • investigate-cortex-agent-evals: Inspect evaluation runs and find any issues in your configuration or data.
  • optimize-cortex-agent: Use results from completed evaluations to suggest and test changes that improve your agent’s performance.

For more information about Cortex Code skills, see Cortex Code CLI - Skills.

Snowsight

The Evaluations tab for an agent in Snowsight gives you an overview of every evaluation run and its summary results.

To view evaluation results in Snowsight:

  1. Sign in to Snowsight.
  2. In the navigation menu, select AI & ML » Agents.
  3. Select the agent whose evaluation results you want to view.
  4. Select the Evaluations tab.

Evaluation runs listing

The summary of run information for each run includes:

  • RUN NAME – The name of the evaluation run.

  • # OF RECORDS – The number of queries performed and answered as part of the run.

  • STATUS – The status of the evaluation run, which is one of:

    • Success indicator – All inputs were evaluated and results are available.
    • A spinner is displayed – The run is in progress, with no information available yet.
    • Warning indicator – The run experienced an error at some point. Some or all metrics may be unavailable for the run.
  • DATASET – The name of the dataset used for the evaluation.

  • AVG DURATION – The average duration of time taken to execute an input query for the run.

  • LOGICAL CONSISTENCY – Average over all inputs of the logical consistency evaluation for the run, if requested.

  • DESCRIPTION – The description of the evaluation run.

  • CREATED – The time at which the run was created and started.

Each custom metric evaluated for this run also receives its own column, defined by the evaluation metric name value. For more information on custom metrics, see Defining a custom metric.

Evaluation run overview

When you select an individual run in Snowsight, you’re presented with the run overview. This overview includes summary averages for each metric evaluated during the run, and a summary of each input execution. The overview for each input execution includes:

  • STATUS – The status of the evaluation run, which is one of:

    • Success indicator – All inputs were evaluated and results are available.
    • A spinner is displayed – The run is in progress, with no information available yet.
    • Warning indicator – The run experienced an error at some point. Some or all metrics may be unavailable for the run.
  • INPUT – The input query used for the evaluation.

  • OUTPUT – The output produced by the agent.

  • DURATION – The length of time taken to process the input and produce output.

  • LOGICAL CONSISTENCY – The logical consistency evaluation for the input, if requested.

  • EVALUATED – The time at which the input was processed.

Each custom metric evaluated for this run also receives its own column, defined by the evaluation metric name value. For more information about custom metrics, see Defining a custom metric.

View details (errors and metric warnings)

After you open a run from the evaluation runs listing, select View details on the right to open the detailed view for that run. Scroll down in this view to find error logs and other diagnostic information when something fails or returns partial results.

In the per-input table for the run, if metric computation has a problem for a specific row, a warning indicator can appear on the left side of that row. Hover over the warning to see details about the metric issue.

Record details

When you select an individual input in Snowsight, you’re presented with the Record details view. This view includes three panes: Evaluation results, Thread details, and Trace details.

Evaluation results

Your evaluation results are presented here in detail. Each metric has its own box showing the overall average across inputs. Select a box to display a popover with more information, including a breakdown of the number of runs that performed at high accuracy (80% or more accurate), at medium accuracy (30% or more accurate, but not high accuracy), and that failed.
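The thresholds in that breakdown can be reproduced locally if you export scores, for example to compare runs outside Snowsight. The following Python sketch assumes scores are fractions between 0 and 1 and that anything below 30% counts as failed; both are assumptions, since this page states only the 80% and 30% thresholds.

```python
from collections import Counter


def accuracy_bucket(score):
    """Map a score in [0, 1] to the bucket names used in the popover."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score >= 0.8:
        return "high"
    if score >= 0.3:
        return "medium"
    return "failed"  # assumption: below 30% is what the popover calls failed


def breakdown(scores):
    """Count how many scores fall into each bucket."""
    counts = Counter(accuracy_bucket(score) for score in scores)
    return {name: counts.get(name, 0) for name in ("high", "medium", "failed")}


print(breakdown([0.95, 0.8, 0.5, 0.3, 0.1]))
# → {'high': 2, 'medium': 2, 'failed': 1}
```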

Thread details

This pane shows the information logged during the execution of each agent thread. This includes planning and response generation by default, as well as a thread trace for each tool that the agent invoked during that thread.

Trace details

Each trace pane includes input, processing, and output information relevant to that stage of agent execution. This information is the same as that provided by agent monitoring.

SQL

Important

Observability redaction and evaluations

The READ UNREDACTED AI OBSERVABILITY EVENTS TABLE account privilege and default redaction of certain raw fields in AI_OBSERVABILITY_EVENTS apply to Cortex Agent monitoring in Snowsight and to observability user-defined table functions used on the monitoring data path, as described in Monitor Cortex Agent requests and Account Privilege READ UNREDACTED AI OBSERVABILITY EVENTS TABLE. This does not change Cortex Agent evaluation run execution, how metrics are computed, or how evaluation results and scores are shown in the Evaluations experience.

To retrieve raw evaluation details, use the GET_AI_EVALUATION_DATA (SNOWFLAKE.LOCAL) function. This function has the following required arguments:

  • database: The database containing the agent.
  • schema: The schema containing the agent.
  • agent_name: The name of the agent.
  • agent_type: CORTEX AGENT or EXTERNAL AGENT. This value is case-insensitive.
  • run_name: The name of the evaluation run to retrieve.

This function returns a table of event data described in Evaluation results table format. The following example displays the full evaluation details for a run called run-1, where the agent is named evaluated_agent stored on the schema eval_db.eval_schema:

SELECT * FROM TABLE(SNOWFLAKE.LOCAL.GET_AI_EVALUATION_DATA(
  'eval_db',
  'eval_schema',
  'evaluated_agent',
  'CORTEX AGENT',
  'run-1'
));

Query traces for a single record

To access a single record from an evaluation trace, use the GET_AI_RECORD_TRACE (SNOWFLAKE.LOCAL) function. This function has the following required arguments:

  • database: The database containing the agent.
  • schema: The schema containing the agent.
  • agent_name: The name of the agent.
  • agent_type: CORTEX AGENT or EXTERNAL AGENT. This value is case-insensitive.
  • record_id: The record ID to filter by.

This function returns a table of event data described in Evaluation results table format. The following example displays the trace for the record 9346efc3-5dd6-4038-9b1a-72ca3d3b768c, where the agent is named evaluated_agent stored on the schema eval_db.eval_schema:

SELECT * FROM TABLE(SNOWFLAKE.LOCAL.GET_AI_RECORD_TRACE(
  'eval_db',
  'eval_schema',
  'evaluated_agent',
  'CORTEX AGENT',
  '9346efc3-5dd6-4038-9b1a-72ca3d3b768c'
));

Query evaluation errors and warnings for a run

To access logs for warnings and errors that happened during an evaluation run, use the GET_AI_OBSERVABILITY_LOGS (SNOWFLAKE.LOCAL) function. This function has the following required arguments:

  • database: The database containing the agent.
  • schema: The schema containing the agent.
  • agent_name: The name of the agent.
  • agent_type: CORTEX AGENT or EXTERNAL AGENT. This value is case-insensitive.

This function returns a table of event data described in Evaluation results table format. The following example checks for errors and warnings for a run called run-1, where the agent is named evaluated_agent stored on the schema eval_db.eval_schema:

SELECT * FROM TABLE(SNOWFLAKE.LOCAL.GET_AI_OBSERVABILITY_LOGS(
  'eval_db',
  'eval_schema',
  'evaluated_agent',
  'CORTEX AGENT'
))
WHERE (record:"severity_text" = 'ERROR' OR record:"severity_text" = 'WARN')
  AND record_attributes:"snow.ai.observability.run.name" = 'run-1';

Note

The fields of record and record_attributes are subject to change, but the fields record:"severity_text" and record_attributes:"snow.ai.observability.run.name" are guaranteed to be present in AI Observability logs.

Agent Evaluation YAML specification

The YAML file that configures an Agent Evaluation, including any custom metrics, has three top-level keys:

  • (Optional) dataset: A definition of how to create a dataset for the evaluation. This value is optional when using a YAML specification to start an evaluation in Snowsight, or when using an existing dataset.
  • evaluation: Settings for the agent to be evaluated.
  • metrics: The metrics recorded during an evaluation run, including definitions for custom metrics.

Dataset definition

The dataset value defines a new dataset from existing table data, mapping columns for the input query and ground truth. For the structure required for your ground_truth column, see Dataset format. The keys for the dataset value are:

  • dataset_type: The string constant “CORTEX AGENT”. This value is case-insensitive.
  • table_name: The fully qualified name of the table to use for the dataset’s contents.
  • dataset_name: The name of the created dataset.
  • column_mapping: The mapping of the required evaluation input column query_text and output column ground_truth to columns of the table to create the dataset from.

The resulting dataset is stored in the same database and schema as the table it’s constructed from.

Important

When you call EXECUTE_AI_EVALUATION with START and the YAML still contains dataset:, Snowflake attempts to create the dataset on every run. If a dataset with the same dataset_name already exists, the run can fail (for example with an error that a dataset or internal dataset version already exists). That can happen even when you only change run_name between runs, or after a previous attempt failed after the dataset was created.

Pattern for repeated runs on the same dataset: Remove the entire dataset: top-level block from the YAML. Keep evaluation: (with source_metadata referencing the existing dataset_name) and metrics:. This matches how you run another evaluation against an existing dataset without re-importing the table.

When you need a new dataset from the same or updated source table (for example after you change rows), use a new dataset_name in dataset:, or create a dataset with SYSTEM$CREATE_EVALUATION_DATASET and reference that name in evaluation.source_metadata without embedding dataset: in the YAML you use for the run.
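The repeated-run pattern above can be sketched in Python on an already-parsed configuration. The sketch assumes the YAML has been loaded into a dictionary by a YAML library of your choice (not shown here), and the function name config_for_rerun is hypothetical.

```python
import copy


def config_for_rerun(config):
    """Return a copy of the config without the top-level dataset block."""
    rerun = copy.deepcopy(config)
    rerun.pop("dataset", None)  # avoid re-creating an existing dataset
    source = rerun.get("evaluation", {}).get("source_metadata", {})
    if not source.get("dataset_name"):
        raise ValueError("evaluation.source_metadata.dataset_name must name the existing dataset")
    return rerun


config = {
    "dataset": {"dataset_name": "evaluation_input"},
    "evaluation": {
        "source_metadata": {"type": "dataset", "dataset_name": "evaluation_input"},
    },
    "metrics": ["logical_consistency"],
}
rerun = config_for_rerun(config)
print(sorted(rerun))
# → ['evaluation', 'metrics']
```

The original dictionary is left untouched, so the same parsed file can still be used for a first run that creates the dataset.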

The following example dataset definition creates a dataset named evaluation_input from the evals_db.evals_schema.evaluation_data table, using the user_question column as the input query and the expected_outcome column as the ground truth:

dataset:
  dataset_type: "CORTEX AGENT"
  table_name: "evals_db.evals_schema.evaluation_data"
  dataset_name: "evaluation_input"
  column_mapping:
    query_text: "user_question"
    ground_truth: "expected_outcome"

Agent configuration

The evaluation value configures the agent to evaluate and the dataset to evaluate it against. The keys for the evaluation value are:

  • agent_params: A dictionary describing the agent to conduct the evaluation for. This value uses the keys:

    • agent_name: The name of the agent to evaluate.
    • agent_type: The string constant “CORTEX AGENT”. This value is case-insensitive.
  • (Optional) run_params: Metadata for identifying this evaluation run. This value uses the keys:

    • (Optional) label: The label for this evaluation.
    • (Optional) description: A detailed description of the evaluation.
  • source_metadata: A dictionary describing the dataset used for the evaluation. This value uses the keys:

    • type: The string constant dataset. This value is case-sensitive.
    • dataset_name: The name of the dataset to use.

The following example agent configuration runs an agent named evaluated_agent with the label Basic evaluation, using the dataset evaluation_input:

evaluation:
  agent_params:
    agent_name: "evaluated_agent"
    agent_type: "CORTEX AGENT"
  run_params:
    label: "Basic evaluation"
  source_metadata:
    type: "dataset"
    dataset_name: "evaluation_input"

Note

The agent name is resolved relative to the current database and schema. You can also provide the fully qualified name of the agent.

Metrics selection

The metrics value is a sequence of metrics to evaluate, including your own custom metric definitions. The accepted values for pre-defined metrics are:

  • answer_correctness: Measure how closely the expected ground truth answer for a given input query matches the actual response streamed from the agent.
  • logical_consistency: Measure consistency across agent instructions, planning, and tool calls. This metric is reference-free and doesn’t use ground truth from your dataset.

Defining a custom metric

You can define your own custom metric by providing an identifier, prompt, and score ranges. The prompt you provide is passed to an LLM judge along with run traces to conduct your custom evaluation. Custom metrics have the following required key-value pairs:

  • name: The name of the metric.

  • score_ranges: A mapping that defines low, medium, and high-quality score ranges. This mapping uses the keys:

    • min_score: The score range used to identify low-quality results, as a two-element sequence of the inclusive lower bound to exclusive upper bound.
    • median_score: The score range used to identify medium-quality results, as a two-element sequence of the inclusive lower bound to inclusive upper bound.
    • max_score: The score range used to identify high-quality results, as a two-element sequence of the exclusive lower bound to inclusive upper bound.
  • prompt: The prompt template to pass to the LLM judge along with the agent run trace data.

    Important

    This template must include a scoring mechanism which produces a numeric value represented in the ranges provided for score_ranges.
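To see how the three ranges partition scores, the following Python sketch applies the bounds exactly as they are described above (inclusive or exclusive per range). It is a literal reading of this page, not a description of how Snowflake's judging system assigns buckets, and the function name classify_score is hypothetical.

```python
def classify_score(score, score_ranges):
    """Return 'low', 'medium', 'high', or None if no range matches."""
    low_start, low_end = score_ranges["min_score"]
    mid_start, mid_end = score_ranges["median_score"]
    high_start, high_end = score_ranges["max_score"]
    if low_start <= score < low_end:        # inclusive lower, exclusive upper
        return "low"
    if mid_start <= score <= mid_end:       # inclusive on both ends
        return "medium"
    if high_start < score <= high_end:      # exclusive lower, inclusive upper
        return "high"
    return None


ranges = {"min_score": [1, 3], "median_score": [4, 6], "max_score": [7, 10]}
print([classify_score(score, ranges) for score in (2, 3, 5, 7, 8)])
# → ['low', None, 'medium', None, 'high']
```

Under this literal reading, scores that sit exactly on an excluded bound (3 and 7 with these ranges) match no range, so it can be worth checking the boundaries of your own ranges against the scale your prompt asks the judge to use.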

A custom metric’s prompt can reference the trace data generated by the agent during an evaluation run. Snowflake passes the entire trace as input to the LLM judge, but you can emphasize certain information by using a replacement string that references data in a GET_AI_RECORD_TRACE column directly. The following replacement strings are available:

| Replacement string | GET_AI_RECORD_TRACE column |
|---|---|
| {{input}} | INPUT |
| {{output}} | OUTPUT |
| {{ground_truth}} | GROUND_TRUTH |
| {{tool_info}} | TOOL |
| {{start_timestamp}} | START_TIMESTAMP |
| {{duration}} | DURATION_MS |
| {{span_id}} | SPAN_ID |
| {{span_type}} | SPAN_TYPE |
| {{span_name}} | SPAN_NAME |
| {{llm_model}} | LLM_MODEL |
| {{error}} | ERROR |
| {{status}} | STATUS |
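Before uploading a configuration, it can help to check a prompt for misspelled replacement strings and to preview it with sample values. The following Python is a local sketch only: Snowflake performs the real substitution, the exact matching rules (for example, whether whitespace inside the braces is accepted) are an assumption, and the function names are hypothetical.

```python
import re

# Replacement strings listed in the table above.
PLACEHOLDERS = {
    "input", "output", "ground_truth", "tool_info", "start_timestamp",
    "duration", "span_id", "span_type", "span_name", "llm_model",
    "error", "status",
}
PATTERN = re.compile(r"\{\{(\w+)\}\}")


def unknown_placeholders(prompt):
    """Return placeholder names in the prompt that are not in the table."""
    return sorted(set(PATTERN.findall(prompt)) - PLACEHOLDERS)


def preview_prompt(prompt, values):
    """Substitute sample values; leave placeholders without a value as-is."""
    def substitute(match):
        return str(values.get(match.group(1), match.group(0)))
    return PATTERN.sub(substitute, prompt)


prompt = "Compare {{output}} with {{ground_truth}} and check {{tools}}."
print(unknown_placeholders(prompt))
# → ['tools']
print(preview_prompt("Q: {{input}} A: {{output}}", {"input": "hi", "output": "hello"}))
# → Q: hi A: hello
```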

Metrics configuration example

The following example defines a metrics configuration that enables the answer correctness and logical consistency checks, and also defines a custom relevance metric that returns a score from 1 to 10 based on how the agent output compares against the ground truth:

metrics:
  # Built-in metrics
  - "answer_correctness"
  - "logical_consistency"
  # Custom metric with prompt
  - name: "relevance"
    score_ranges:
      min_score: [1, 3]
      median_score: [4, 6]
      max_score: [7, 10]
    prompt: |
      Evaluate the relevance of the agent's response to the user's query.
      Rate from 1-10 where:
      1 = Completely irrelevant
      4 = Somewhat irrelevant
      6 = Neutral
      8 = Mostly relevant
      10 = Highly relevant and on-topic

      You can compare the {{output}} with the {{ground_truth}} to help you understand if the contents are relevant or not

      Consider:
      - Does the response address the user's question?
      - Is the information provided appropriate to the context?
      - Are there any tangential or off-topic elements?

Full example configuration

Combining all of the previous example sections gives a full Agent Evaluation configuration:

# Optional: Create dataset before running evaluation
dataset:
  dataset_type: "CORTEX AGENT"
  table_name: "EVALS_DB.EVALS_SCHEMA.EVALUATION_DATA"
  dataset_name: "EVALUATION_INPUT"
  column_mapping:
    query_text: "user_question"
    ground_truth: "expected_outcome"

# Evaluation task configuration
evaluation:
  agent_params:
    agent_name: "evaluated_agent"
    agent_type: "CORTEX AGENT"
  run_params:
    label: "Basic evaluation"
  source_metadata:
    type: "dataset"
    dataset_name: "EVALUATION_INPUT"

metrics:
  # Built-in metrics (simple strings)
  - "answer_correctness"
  - "logical_consistency"

  # Custom metric definition
  - name: "relevance"
    score_ranges:
      min_score: [1, 3]
      median_score: [4, 6]
      max_score: [7, 10]
    prompt: |
      Evaluate the relevance of the agent's response to the user's query.
      Rate from 1-10 where:
      1 = Completely irrelevant
      4 = Somewhat irrelevant
      6 = Neutral
      8 = Mostly relevant
      10 = Highly relevant and on-topic

      You can compare the {{output}} with the {{ground_truth}} to help you understand if the contents are relevant or not

      Consider:
      - Does the response address the user's question?
      - Is the information provided appropriate to the context?
      - Are there any tangential or off-topic elements?
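
Before uploading a configuration, it can help to sanity-check a custom metric's score_ranges. The following Python sketch verifies that the three bands are present, ordered, and leave no gaps. The helper check_score_ranges is hypothetical and not part of any Snowflake API, and the expectation that bands are contiguous integer ranges is an assumption made for this illustration, not a documented requirement.

```python
def check_score_ranges(score_ranges):
    """Return a list of problems found in a custom metric's score_ranges."""
    order = ["min_score", "median_score", "max_score"]
    problems = []
    previous_high = None
    for key in order:
        if key not in score_ranges:
            problems.append(f"missing {key}")
            continue
        low, high = score_ranges[key]
        if low > high:
            problems.append(f"{key}: low {low} exceeds high {high}")
        # Assumption: each band starts right after the previous band ends.
        if previous_high is not None and low != previous_high + 1:
            problems.append(f"{key}: starts at {low}, expected {previous_high + 1}")
        previous_high = high
    return problems


# Same ranges as the "relevance" metric in the example configuration.
relevance_ranges = {
    "min_score": [1, 3],
    "median_score": [4, 6],
    "max_score": [7, 10],
}
print(check_score_ranges(relevance_ranges))  # prints []
```

An empty list means the ranges passed this local check; it does not guarantee that Snowflake accepts the configuration.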

Upload configuration to a stage

Agent Evaluation configurations must use a specific file format so that Snowflake can parse them. The following snippet creates the required yaml_file_format file format in the schema evals_db.evals_schema, then creates the stage evaluation_config to hold the uploaded agent evaluation configuration:

CREATE OR REPLACE FILE FORMAT evals_db.evals_schema.yaml_file_format
  TYPE = 'CSV'
  FIELD_DELIMITER = NONE
  RECORD_DELIMITER = '\n'
  SKIP_HEADER = 0
  FIELD_OPTIONALLY_ENCLOSED_BY = NONE
  ESCAPE_UNENCLOSED_FIELD = NONE;

CREATE OR REPLACE STAGE evals_db.evals_schema.evaluation_config
  FILE_FORMAT = evals_db.evals_schema.yaml_file_format;

To upload your configuration to the stage through Snowsight, select Ingestion » Add Data in the navigation menu, then select Load files into a Stage. You can also use the SQL PUT command to upload a local YAML file. The following example uploads the local file /Users/dev/evaluation_config.yaml to the stage evals_db.evals_schema.evaluation_config:

PUT file:///Users/dev/evaluation_config.yaml @evals_db.evals_schema.evaluation_config
  AUTO_COMPRESS=FALSE
  OVERWRITE=TRUE;

If you create your YAML in a Workspace, you can copy it from your active workspace to a stage. The following example copies the file evaluation_config.yaml from your workspace to the stage evals_db.evals_schema.evaluation_config:

COPY FILES INTO @evals_db.evals_schema.evaluation_config
  FROM 'snow://workspace/USER$.PUBLIC.DEFAULT$/versions/live'
  FILES=('evaluation_config.yaml');

Tip

Snowflake recommends keeping your YAML file uncompressed.

Evaluation results table format

Functions that return information about a Cortex Agent evaluation all produce a table with the following columns:

Column | Data type | Description
RECORD_ID | VARCHAR | The unique identifier assigned by Snowflake for this evaluation record.
INPUT_ID | VARCHAR | The unique identifier assigned by Snowflake for this evaluation input.
REQUEST_ID | VARCHAR | The unique identifier assigned by Snowflake for this request.
TIMESTAMP | TIMESTAMP_TZ | The time (in UTC) at which the request was made.
DURATION_MS | INT | The amount of time, in milliseconds, that it took for the agent to return a response.
INPUT | VARCHAR | The query string used as input for this evaluation record.
OUTPUT | VARCHAR | The response returned by the Cortex Agent for this evaluation record.
ERROR | VARCHAR | Information about any errors that occurred during the request.
GROUND_TRUTH | VARCHAR | The ground truth information used to evaluate this record’s Cortex Agent output. This column holds the JSON from your dataset’s ground truth column, serialized as a string. For how {{ground_truth}} in custom metrics relates to this value, see the note that follows this table.
METRIC_NAME | VARCHAR | The name of the metric evaluated for this record.
EVAL_AGG_SCORE | NUMBER | The evaluation score assigned for this record.
METRIC_TYPE | VARCHAR | The type of metric being evaluated. For built-in metrics, the value is system. For custom metrics, the value is custom.
METRIC_STATUS | VARIANT | A map containing information about the agent’s HTTP response for this record, with the following keys:
  • status: The HTTP status code of the response.
  • message: The HTTP message sent in the status response.
METRIC_CALLS | ARRAY | An array of VARIANT values that contain information about the computed metric. Each array entry contains the metric’s criteria, an explanation of the metric score, and metadata. The keys of each entry are:
  • criteria: The criteria used by an LLM judge to evaluate response correctness.
  • explanation: An explanation of why the score was assigned.
  • full_metadata: A VARIANT value that contains metadata and information about this metric’s processing by the LLM judge. The keys of this map include:
    • completion_tokens: The number of output tokens generated by the LLM for this metric evaluation call.
    • normalized_score: The original evaluation score normalized to the range [0.0, 1.0], rounded to two decimal places.
    • original_score: The original score assigned by this metric evaluation for the record.
    • prompt_tokens: The number of tokens taken up by the prompt provided to the LLM judge.
    • total_tokens: The total number of tokens used by the LLM judge for this computation.
TOTAL_INPUT_TOKENS | INT | The total number of tokens used to process the input query.
TOTAL_OUTPUT_TOKENS | INT | The total number of output tokens produced by the Cortex Agent.
LLM_CALL_COUNT | INT | The number of times any LLM was called, either by the agent or by an evaluation judge.
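
Once a results row has been fetched into a client program, the METRIC_CALLS array can be summarized in a few lines. The following Python sketch assumes the array has already been parsed into a list of dictionaries with the keys described for METRIC_CALLS. The sample values and the helper summarize_metric_calls are invented for illustration and do not come from a real evaluation run.

```python
def summarize_metric_calls(metric_calls):
    """Total the judge token usage and collect normalized scores."""
    total_tokens = 0
    scores = []
    for call in metric_calls:
        metadata = call.get("full_metadata", {})
        total_tokens += metadata.get("total_tokens", 0)
        if "normalized_score" in metadata:
            scores.append(metadata["normalized_score"])
    return {
        "judge_calls": len(metric_calls),
        "total_tokens": total_tokens,
        "normalized_scores": scores,
    }


# Hypothetical METRIC_CALLS value for one record (values are made up).
sample_metric_calls = [
    {
        "criteria": "Response addresses the user's question.",
        "explanation": "The answer is on-topic and complete.",
        "full_metadata": {
            "prompt_tokens": 850,
            "completion_tokens": 120,
            "total_tokens": 970,
            "original_score": 10,
            "normalized_score": 1.0,
        },
    }
]

print(summarize_metric_calls(sample_metric_calls))
# prints {'judge_calls': 1, 'total_tokens': 970, 'normalized_scores': [1.0]}
```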

The GROUND_TRUTH column contains the full JSON from your dataset’s ground truth VARIANT, serialized as a string. In custom metric prompts, the {{ground_truth}} replacement string is substituted with that same serialized content, so a custom LLM judge can use any JSON shape you stored (not only keys such as ground_truth_output or ground_truth_invocations). System metrics still require JSON that matches what each metric expects (for example, ground_truth_output for answer correctness). For dataset column requirements, see Dataset format.
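
To make the substitution concrete, the following Python sketch imitates how a {{ground_truth}} replacement string might be filled with serialized JSON of an arbitrary shape. It illustrates the idea only and is not Snowflake's implementation: the render_prompt helper, the expected_topics key, and the sample text are all hypothetical, and the exact whitespace of Snowflake's serialization may differ from json.dumps.

```python
import json
import re


def render_prompt(template, values):
    """Replace each {{name}} placeholder with values[name] in a single pass.

    Placeholders without a matching key are left unchanged.
    """
    return re.sub(
        r"\{\{(\w+)\}\}",
        lambda match: values.get(match.group(1), match.group(0)),
        template,
    )


# Hypothetical ground truth using a custom JSON shape.
ground_truth = {"expected_topics": ["revenue", "Q1 2025"]}
serialized = json.dumps(ground_truth)  # string form, as in GROUND_TRUTH

prompt = render_prompt(
    "Compare {{output}} with {{ground_truth}}.",
    {"output": "Revenue for Q1 2025 was $1.2M.", "ground_truth": serialized},
)
print(prompt)

# The serialized string parses back into the original structure.
assert json.loads(serialized) == ground_truth
```

Because the judge receives the whole serialized value, a custom metric prompt should tell the judge which keys in the JSON matter.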

Model availability

Agent Evaluations currently only supports the following models, using cross-region inference. Snowflake automatically chooses from these models based on your account settings.

Model | Cross Cloud (Any Region) | AWS US | AWS US Commercial Gov | AWS EU | AWS APJ
claude-4-sonnet

Known limitations

Cortex Agent evaluations are subject to the following limitations:

  • Agent response times and throughput: The number of inputs that can be processed during an evaluation is constrained by agent response times and the amount of trace detail. If you experience timeouts or long delays in your evaluation, split your evaluation data. For example, if you have queries that are guaranteed to invoke many different tools, you can partition the data by common tool invocation. If a custom evaluation results in timeouts, refine or shorten your prompt. You can also split a custom evaluation so that each metric focuses on one specific element of your agent’s output.
  • Ground truth staleness: Depending on how you word your input queries, the correct answers may drift over time, which makes evaluation results less accurate. In particular, try to scope input queries to specific, absolute dates and times. For example, both of the input queries What was our revenue? and What was our revenue for the first quarter? will experience drift, while the query What was our revenue between January and March of 2025? is scoped to a specific window of time that can be consistently referenced in the evaluation data.
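
One way to split evaluation data as suggested in the first limitation is to group rows by the tool they are expected to invoke and then cap the size of each batch. The following Python sketch shows the idea on in-memory rows. The expected_tool field, the tool names, and the partition_rows helper are hypothetical; your dataset may need a different grouping key.

```python
from collections import defaultdict


def partition_rows(rows, key, max_batch_size):
    """Group rows by row[key], then split each group into capped batches."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    batches = []
    for group_key, group_rows in groups.items():
        for start in range(0, len(group_rows), max_batch_size):
            batches.append((group_key, group_rows[start:start + max_batch_size]))
    return batches


# Hypothetical evaluation inputs tagged with the tool each should invoke.
rows = [
    {"user_question": "What was our revenue between January and March of 2025?", "expected_tool": "analyst"},
    {"user_question": "How many orders shipped in February 2025?", "expected_tool": "analyst"},
    {"user_question": "List the top five customers for 2024.", "expected_tool": "analyst"},
    {"user_question": "Summarize the 2024 returns policy.", "expected_tool": "search"},
]

batches = partition_rows(rows, key="expected_tool", max_batch_size=2)
print([(tool, len(batch)) for tool, batch in batches])
# prints [('analyst', 2), ('analyst', 1), ('search', 1)]
```

Each batch can then be loaded into its own input table and evaluated as a separate run.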

Cost considerations

Agent Evaluations run a Cortex Agent to create output for evaluation, and LLM judges to compute the evaluation metrics. You’re charged for each run of the agent against a ground truth query. The evaluation’s LLM judges are run by the AI_COMPLETE function, and you incur charges based on the model Snowflake selects for judging. Additionally, you’re charged for the following:

  • Warehouse charges for tasks used to manage evaluation runs
  • Warehouse charges for queries used to compute evaluation metrics
  • Storage charges for datasets and evaluation results
  • Warehouse charges to retrieve evaluation results viewed in Snowsight

For more information on estimating costs, see Understanding overall cost. Refer to the Snowflake Service Consumption Table for full cost information.