Cortex Agent evaluations

Cortex Agent evaluations allow you to monitor your agent’s behavior and performance. Evaluate your agent against both ground truth and reference-free evaluation metrics. During evaluation, your agent’s activity is traced and monitored so you can ensure that each step in the process advances towards your end goal.

Snowflake offers the following metrics to evaluate your agent against:

  • Answer correctness – Measures how closely the agent’s answer to your prepared query matches an expected answer. This metric is most useful when the dataset powering your Cortex Agent is static.

  • Logical consistency – Measures consistency across agent instructions, planning, and tool calls. This metric is reference-free, meaning you don’t need to prepare any information in your dataset for evaluation.

Snowflake also allows you to create custom evaluation metrics that use the LLM judging process to measure context critical to your Agent’s domain and use case. Custom metrics use an LLM prompt and scoring methodology, which are passed to the evaluation judging system to produce a score.

For additional details about how agent evaluations are conducted on Snowflake, including the LLM judging system used for reference-free evaluations, see the Snowflake engineering blog What’s Your Agent’s GPA? A Framework for Evaluating AI Agent Reliability. For an example of running an Agent Evaluation programmatically, see the guide Getting Started with Cortex Agent Evaluations.

Access control requirements

The ability to run a Cortex Agent evaluation requires a role with the following:

  • The SNOWFLAKE.CORTEX_USER database role

  • The EXECUTE TASK ON ACCOUNT permission

  • The USAGE permission on the database containing your agent

  • The following permissions on the schema containing your agent:

    • USAGE

    • CREATE FILE FORMAT ON SCHEMA

    • CREATE TASK

    • EXECUTE TASK

  • The USAGE permission on the database containing your evaluation data

  • The following permissions on the schema containing your evaluation data:

    • USAGE

    • EXECUTE TASK

    • If creating a dataset from an input table, CREATE DATASET ON SCHEMA

  • The USAGE or OWNERSHIP privilege on your agent

  • The MONITOR or OWNERSHIP privilege on your agent

  • If using an agent evaluation configuration, READ privilege on the stage containing the configuration file.

If the agent being evaluated uses tools, your role also needs access to all of them.

Additionally, if working with evaluations in Snowsight, the role you use to run or inspect an evaluation needs the USAGE privilege on your default warehouse.
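As a sketch, the core grants might look like the following, assuming a hypothetical role eval_role, an agent in agent_db.agent_schema, and evaluation data in data_db.data_schema; agent-level and stage privileges are granted separately:

-- Hypothetical role and object names; adjust to your environment.
GRANT DATABASE ROLE SNOWFLAKE.CORTEX_USER TO ROLE eval_role;
GRANT EXECUTE TASK ON ACCOUNT TO ROLE eval_role;
GRANT USAGE ON DATABASE agent_db TO ROLE eval_role;
GRANT USAGE, CREATE FILE FORMAT, CREATE TASK ON SCHEMA agent_db.agent_schema TO ROLE eval_role;
GRANT USAGE ON DATABASE data_db TO ROLE eval_role;
GRANT USAGE ON SCHEMA data_db.data_schema TO ROLE eval_role;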

Prepare an evaluation dataset

Before starting a Cortex Agent evaluation, prepare a table containing your evaluation inputs. This table is used to create a dataset for your evaluation to run against. To learn more about datasets on Snowflake, see Snowflake Datasets.

Cortex Code

To have Cortex Code assist you with creating a dataset for your evaluation, use the dataset-curation sub-skill of the Cortex Code cortex-agent skill. For more information about Cortex Code skills, see Cortex Code CLI - Skills.

Dataset format

The table used to create a dataset for evaluation has an input query column of type VARCHAR that represents your query, and an output column of type VARIANT that contains a description of expected agent behavior. This single output column is used as the ground truth by the LLM judge.

Values in the output column have one key, ground_truth_output. The value of this key is used in answer correctness evaluation. LLM judges use ground truth to evaluate your agent’s output by including it in their prompt.

Tip

Take advantage of the fact that ground truth is included in an LLM prompt by using natural language to describe a type of response, in addition to exact or semantic response matches. For example, you could provide a ground truth of Output is in the following JSON format ... followed by a string containing either a description of the structure or a JSON example itself. If you need a more rigorous examination of output based on a full custom prompt, create a custom metric.

To bring a JSON dataset into a Snowflake table, use the PARSE_JSON SQL function. The following example creates a new table agent_evaluation_data to use for an evaluation dataset, and inserts a row for the input query What was the temperature in San Francisco on August 2nd 2019? with the ground truth of The temperature was 14 degrees Celsius in San Francisco on August 2nd, 2019..

CREATE OR REPLACE TABLE agent_evaluation_data (
    input_query VARCHAR,
    ground_truth VARIANT
);

INSERT INTO agent_evaluation_data
  SELECT
    'What was the temperature in San Francisco on August 2nd 2019?',
    PARSE_JSON('
      {
        "ground_truth_output": "The temperature was 14 degrees Celsius in San Francisco on August 2nd, 2019."
      }
    ');
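As the tip above suggests, a ground truth can describe a response format in natural language instead of an exact answer. The following row, which reuses the same table, is illustrative; the query text and format description are hypothetical:

INSERT INTO agent_evaluation_data
  SELECT
    'List our top three products by revenue for Q1 2025.',
    PARSE_JSON('
      {
        "ground_truth_output": "Output is a JSON array of exactly three objects, each with the keys name and revenue, sorted by revenue in descending order."
      }
    ');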

Important

The functions OBJECT_CONSTRUCT and ARRAY_CONSTRUCT return non-VARIANT results. Use a function that produces a VARIANT from your raw input, like PARSE_JSON, or call TO_VARIANT to guarantee the value type.

Data you provide in the ground_truth column that isn’t used by a selected metric is ignored. When conducting an evaluation run with only reference-free metrics, you can leave the output column empty.
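When using only reference-free metrics such as logical consistency, a row needs only the input query; the output column can stay NULL. For example (the query text is illustrative):

INSERT INTO agent_evaluation_data (input_query)
  VALUES ('Summarize last week''s support ticket volume.');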

When running your first evaluation, you’ll have the option to create a new dataset from an existing table.

Start an agent evaluation

Cortex Code

To have Cortex Code run an evaluation, use the evaluate-cortex-agent sub-skill of the Cortex Code cortex-agent skill. For more information about Cortex Code skills, see Cortex Code CLI - Skills.

Snowsight

Note

Agent evaluations run as your currently selected role in Snowsight, not your default role. Make sure a role with the correct permissions is active before starting an evaluation.

Begin your evaluation of a Cortex Agent by doing the following:

  1. Sign in to Snowsight.

  2. In the navigation menu, select AI & ML » Agents.

  3. Select the agent you want to conduct an evaluation of.

  4. Select the Evaluations tab.

  5. Select New evaluation run.

    The New evaluation run modal opens.

  6. In the Name field, provide a name for your evaluation. This name should be unique for the agent being evaluated.

  7. Optional: In the Description field, provide any comments for the evaluation.

  8. Select Next.

    This advances to the Select dataset modal.

  9. Select the dataset used to evaluate your agent. You can choose either Existing dataset or Create new dataset.

    To use an existing dataset:

    1. From the Database and schema list, select the database and schema containing your dataset.

    2. From the Select dataset list, select your dataset.

    To create a new dataset:

    1. From the Source table - Database and schema list, select the database and schema containing the table you want to import to a dataset.

    2. From the Select source table list, select your source table.

    3. From the New dataset location - Database and schema list, select the database and schema to place your new dataset.

    4. In the Dataset name field, enter your dataset name. This name needs to be unique among the schema-level objects in your selected schema.

  10. Select Next.

    This advances to the Select metrics modal.

  11. From the Input query list, select the column of your dataset which contains the input queries.

  12. For each of the System metrics, change the toggle to active for any metric you want included in your evaluation. Select the column of your dataset containing the ground truth for your evaluation.

  13. (Optional) To conduct a custom evaluation, toggle on Custom metrics.

    1. Select the database and schema containing the stage where your custom evaluation configuration is stored.

    2. Select the stage where your custom evaluation configuration is stored.

    3. Select the YAML configuration file for your custom evaluation.

      Note

      In Snowsight, only the custom evaluation definitions are loaded from your YAML configuration. The rest of the YAML file must still be valid. For the evaluation YAML specification, see Agent Evaluation YAML specification.

    4. For each custom metric, change the toggle to active if you want it included in your evaluation. Select the column of your dataset containing the ground truth for this evaluation.

  14. Select Create to create the evaluation and begin the evaluation process.

At any point, you can select Cancel to cancel creating the evaluation, or select Prev to return to the previous modal.

SQL

To start or retrieve information on an evaluation with SQL, use the EXECUTE_AI_EVALUATION function. This function has the following required arguments:

  • evaluation_job: A string value of ‘START’ or ‘STATUS’.

  • run_parameters: A SQL OBJECT containing the key run_name, with a value of the name of your run.

  • config_file_path: A stage file path pointing to your run configuration YAML file. This path can’t be a signed URL. For the evaluation YAML specification, see Agent Evaluation YAML specification.

Use the evaluation_job value ‘START’ to start an evaluation. The following example starts a run called run-1 using the agent evaluation configuration from @eval_db.eval_schema.metrics/agent_evaluation_config.yaml:

CALL EXECUTE_AI_EVALUATION(
  'START',
  OBJECT_CONSTRUCT('run_name', 'run-1'),
  '@eval_db.eval_schema.metrics/agent_evaluation_config.yaml'
);

After a run starts, you can query its progress with the evaluation_job value ‘STATUS’. This call returns a table in the format used for AI Observability Runs. The following example queries the status of the agent evaluation started from the previous example:

CALL EXECUTE_AI_EVALUATION(
  'STATUS',
  OBJECT_CONSTRUCT('run_name', 'run-1'),
  '@eval_db.eval_schema.metrics/agent_evaluation_config.yaml'
);

Tip

You can call the EXECUTE_AI_EVALUATION function from a Task to regularly run an evaluation or check the status of one.
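For example, a task that checks the status of run-1 every hour might look like the following sketch; the task name and schedule are illustrative:

-- Hypothetical task; reuses the run name and configuration path from the examples above.
CREATE OR REPLACE TASK eval_db.eval_schema.check_eval_status
  SCHEDULE = '60 MINUTE'
AS
  CALL EXECUTE_AI_EVALUATION(
    'STATUS',
    OBJECT_CONSTRUCT('run_name', 'run-1'),
    '@eval_db.eval_schema.metrics/agent_evaluation_config.yaml'
  );

ALTER TASK eval_db.eval_schema.check_eval_status RESUME;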

Inspect evaluation results

Evaluation results include information about the requested metrics, details of the agent’s threads of reasoning, and information about the LLM planning stage for each executed trace in the thread.

Cortex Code

Cortex Code offers two sub-skills of the cortex-agent skill. Use the investigate-cortex-agent-evals sub-skill to inspect evaluations and find any issues in your configuration or data. Use the optimize-cortex-agent sub-skill to take results from completed evaluations and improve the performance of your agent.

Snowsight

The Evaluations tab for an agent in Snowsight gives you an overview of every evaluation run and its summary results.

To view evaluation results in Snowsight:

  1. Sign in to Snowsight.

  2. In the navigation menu, select AI & ML » Agents.

  3. Select the agent whose evaluation results you want to view.

  4. Select the Evaluations tab.

Evaluation runs listing

The summary of run information for each run includes:

  • RUN NAME – The name of the evaluation run.

  • # OF RECORDS – The number of queries performed and answered as part of the run.

  • STATUS – The status of the evaluation run, which is one of:

    • Success indicator – All inputs were evaluated and results are available.

    • A spinner is displayed – The run is in progress, with no information available yet.

    • Warning indicator – The run experienced an error at some point. Some or all metrics may be unavailable for the run.

  • DATASET – The name of the dataset used for the evaluation.

  • AVG DURATION – The average duration of time taken to execute an input query for the run.

  • LOGICAL CONSISTENCY – Average over all inputs of the logical consistency evaluation for the run, if requested.

  • DESCRIPTION – The description of the evaluation run.

  • CREATED – The time at which the run was created and started.

Each custom metric evaluated for this run also receives its own column, defined by the evaluation metric name value. For more information on custom metrics, see Defining a custom metric.

Evaluation run overview

When you select an individual run in Snowsight, you’re presented with the run overview. This overview includes summary averages for each metric evaluated during the run, and a summary of each input execution. The overview for each input execution includes:

  • STATUS – The status of this input’s evaluation, which is one of:

    • Success indicator – All inputs were evaluated and results are available.

    • A spinner is displayed – The run is in progress, with no information available yet.

    • Warning indicator – The run experienced an error at some point. Some or all metrics may be unavailable for the run.

  • INPUT – The input query used for the evaluation.

  • OUTPUT – The output produced by the agent.

  • DURATION – The length of time taken to process the input and produce output.

  • LOGICAL CONSISTENCY – The logical consistency evaluation for the input, if requested.

  • EVALUATED – The time at which the input was processed.

Each custom metric evaluated for this run also receives its own column, defined by the evaluation metric name value. For more information about custom metrics, see Defining a custom metric.

Record details

When you select an individual input in Snowsight, you’re presented with the Record details view. This view includes three panes: Evaluation results, Thread details, and Trace details.

Evaluation results

Your evaluation results are presented here in detail. Each metric has its own box showing its overall average across inputs; select a box to display a popover with more information. The popover breaks down the number of inputs that achieved high accuracy (80% or more accurate), medium accuracy (30% or more accurate, but below high accuracy), and the number that failed.

Thread details

The information logged during the execution of each agent thread. This includes planning and response generation by default, as well as a thread trace for each tool that the agent invoked during that thread.

Trace details

Each trace pane includes input, processing, and output information relevant to that stage of agent execution. This information is the same as that provided by agent monitoring.

SQL

To retrieve raw evaluation details, use the GET_AI_EVALUATION_DATA (SNOWFLAKE.LOCAL) function. This function has the following required arguments:

  • database: The database containing the agent.

  • schema: The schema containing the agent.

  • agent_name: The name of the agent.

  • agent_type: The string constant ‘CORTEX AGENT’. This value is case-insensitive.

  • run_name: The name of the evaluation run to retrieve.

This function returns a table of event data described in Evaluation results table format. The following example displays the full evaluation details for a run called run-1, where the agent is named evaluated_agent stored on the schema eval_db.eval_schema:

SELECT * FROM TABLE(SNOWFLAKE.LOCAL.GET_AI_EVALUATION_DATA(
  'eval_db',
  'eval_schema',
  'evaluated_agent',
  'CORTEX AGENT',
  'run-1')
);

Query traces for a single record

To access a single record from an evaluation trace, use the GET_AI_RECORD_TRACE (SNOWFLAKE.LOCAL) function. This function has the following required arguments:

  • database: The database containing the agent.

  • schema: The schema containing the agent.

  • agent_name: The name of the agent.

  • agent_type: The string constant ‘CORTEX AGENT’. This value is case-insensitive.

  • record_id: The record ID to filter by.

This function returns a table of event data described in Evaluation results table format. The following example displays the trace for the record 9346efc3-5dd6-4038-9b1a-72ca3d3b768c, where the agent is named evaluated_agent stored on the schema eval_db.eval_schema:

SELECT * FROM TABLE(SNOWFLAKE.LOCAL.GET_AI_RECORD_TRACE(
  'eval_db',
  'eval_schema',
  'evaluated_agent',
  'CORTEX AGENT',
  '9346efc3-5dd6-4038-9b1a-72ca3d3b768c'
));

Query evaluation errors and warnings for a run

To access logs for warnings and errors that happened during an evaluation run, use the GET_AI_OBSERVABILITY_LOGS (SNOWFLAKE.LOCAL) function. This function has the following required arguments:

  • database: The database containing the agent.

  • schema: The schema containing the agent.

  • agent_name: The name of the agent.

  • agent_type: The string constant ‘CORTEX AGENT’. This value is case-insensitive.

This function returns a table of event data described in Evaluation results table format. The following example checks for errors and warnings for a run called run-1, where the agent is named evaluated_agent stored on the schema eval_db.eval_schema:

SELECT * FROM TABLE(SNOWFLAKE.LOCAL.GET_AI_OBSERVABILITY_LOGS(
  'eval_db',
  'eval_schema',
  'evaluated_agent',
  'CORTEX AGENT')
)
  WHERE TRUE
  AND (record:"severity_text"='ERROR' or record:"severity_text"='WARN')
  AND record_attributes:"snow.ai.observability.run.name"='run-1';

Note

The fields of record and record_attributes are subject to change, but the fields record:"severity_text" and record_attributes:"snow.ai.observability.run.name" are guaranteed to be present in AI Observability logs.
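Building on the query above, the following sketch counts logged warnings and errors for run-1 by severity:

SELECT
  record:"severity_text"::VARCHAR AS severity,
  COUNT(*) AS event_count
FROM TABLE(SNOWFLAKE.LOCAL.GET_AI_OBSERVABILITY_LOGS(
  'eval_db', 'eval_schema', 'evaluated_agent', 'CORTEX AGENT'))
WHERE record:"severity_text" IN ('ERROR', 'WARN')
  AND record_attributes:"snow.ai.observability.run.name" = 'run-1'
GROUP BY severity;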

Agent Evaluation YAML specification

The YAML file that configures an Agent Evaluation, including any custom metric definitions, has three top-level keys:

  • (Optional) dataset: A definition of how to create a dataset for the evaluation. This value is optional when using a YAML specification to start an evaluation in Snowsight, or when using an existing dataset.

  • evaluation: Settings for the agent to be evaluated.

  • metrics: The metrics recorded during an evaluation run, including definitions for custom metrics.

Dataset definition

The dataset value defines a new dataset from existing table data, mapping columns for the input query and ground truth. For the structure required for your ground_truth column, see Dataset format. The keys for the dataset value are:

  • dataset_type: The string constant “CORTEX AGENT”. This value is case-insensitive.

  • table_name: The fully qualified name of the table to use for the dataset’s contents.

  • dataset_name: The name of the created dataset.

  • column_mapping: The mapping of the required evaluation input column query_text and output column ground_truth to columns of the source table.

The resulting dataset is stored in the same database and schema as the table it’s constructed from.

The following example dataset definition shows a dataset named evaluation_input created from the evals_db.evals_schema.evaluation_data table, using the user_question as input and expected_outcome to define ground truth:

dataset:
  dataset_type: "CORTEX AGENT"
  table_name: "evals_db.evals_schema.evaluation_data"
  dataset_name: "evaluation_input"
  column_mapping:
    query_text: "user_question"
    ground_truth: "expected_outcome"

Agent configuration

The evaluation value sets the configuration for the agent to conduct an evaluation against. The keys for the evaluation value are:

  • agent_params: A dictionary describing the agent to conduct the evaluation for. This value uses the keys:

    • agent_name: The name of the agent to evaluate.

    • agent_type: The string constant “CORTEX AGENT”. This value is case-insensitive.

  • (Optional) run_params: Metadata for identifying this evaluation run. This value uses the keys:

    • (Optional) label: The label for this evaluation.

    • (Optional) description: A detailed description of the evaluation.

  • source_metadata: A dictionary describing the dataset used for the evaluation. This value uses the keys:

    • type: The string constant “DATASET”. This value is case-insensitive.

    • dataset_name: The name of the dataset to use.

The following example agent configuration runs an agent named evaluated_agent with the label Basic evaluation, using the dataset evaluation_input:

evaluation:
  agent_params:
    agent_name: "evaluated_agent"
    agent_type: "CORTEX AGENT"
  run_params:
    label: "Basic evaluation"
  source_metadata:
    type: "DATASET"
    dataset_name: "evaluation_input"

Metrics selection

The metrics value is a sequence of metrics to evaluate, including your own custom metric definitions. The accepted values for pre-defined metrics are:

  • answer_correctness: Measure the agent’s response correctness against a ground truth output.

  • logical_consistency: Measure consistency across agent instructions, planning, and tool calls. This metric is reference-free and doesn’t use a dataset.

Defining a custom metric

You can define your own custom metric by providing an identifier, prompt, and score ranges. The prompt you provide is passed to an LLM judge along with run traces to conduct your custom evaluation. Custom metrics have the following required key-value pairs:

  • name: The name of the metric.

  • score_ranges: A mapping that defines low, medium, and high-quality score ranges. This mapping uses the keys:

    • min_score: The score range used to identify low-quality results, as a two-element sequence of the inclusive lower bound to exclusive upper bound.

    • median_score: The score range used to identify medium-quality results, as a two-element sequence of the inclusive lower bound to inclusive upper bound.

    • max_score: The score range used to identify high-quality results, as a two-element sequence of the exclusive lower bound to inclusive upper bound.

  • prompt: The prompt template to pass to the LLM judge along with the agent run trace data.

    Important

    This template must include a scoring mechanism which produces a numeric value represented in the ranges provided for score_ranges.

A custom metric’s prompt is able to reference the trace data generated by the agent during an evaluation run. Snowflake passes the entire trace as input to the LLM judge, but you can emphasize certain information by using a replacement string that references data in a GET_AI_RECORD_TRACE column directly. The following replacement strings are available:

  • {{input}} – INPUT

  • {{output}} – OUTPUT

  • {{ground_truth}} – GROUND_TRUTH

  • {{tool_info}} – TOOL

  • {{start_timestamp}} – START_TIMESTAMP

  • {{duration}} – DURATION_MS

  • {{span_id}} – SPAN_ID

  • {{span_type}} – SPAN_TYPE

  • {{span_name}} – SPAN_NAME

  • {{llm_model}} – LLM_MODEL

  • {{error}} – ERROR

  • {{status}} – STATUS

Metrics configuration example

The following example defines a metrics configuration that enables answer correctness and logical consistency checks, and also defines a custom relevance metric which returns a score between 1-10 based on how ground truth compares against agent output:

metrics:
  # Built-in metrics
  - "answer_correctness"
  - "logical_consistency"
  # Custom metric with prompt
  - name: "relevance"
    score_ranges:
      min_score: [1, 3]
      median_score: [4, 6]
      max_score: [7, 10]
    prompt: |
      Evaluate the relevance of the agent's response to the user's query.
      Rate from 1-10 where:
      1 = Completely irrelevant
      4 = Somewhat irrelevant
      6 = Neutral
      8 = Mostly relevant
      10 = Highly relevant and on-topic

      You can compare the {{output}} with the {{ground_truth}} to help you understand if the contents are relevant or not

      Consider:
      - Does the response address the user's question?
      - Is the information provided appropriate to the context?
      - Are there any tangential or off-topic elements?

Full example configuration

Combining all of the previous example sections gives a full Agent Evaluation configuration:

# Optional: Create dataset before running evaluation
dataset:
  dataset_type: "CORTEX AGENT"
  table_name: "EVALS_DB.EVALS_SCHEMA.EVALUATION_DATA"
  dataset_name: "EVALUATION_INPUT"
  column_mapping:
    query_text: "user_question"
    ground_truth: "expected_outcome"

# Evaluation task configuration
evaluation:
  agent_params:
    agent_name: "evaluated_agent"
    agent_type: "CORTEX AGENT"
  run_params:
    label: "Basic evaluation"
  source_metadata:
    type: "DATASET"
    dataset_name: "EVALUATION_INPUT"

# Metrics to evaluate
metrics:
  # Built-in metrics (simple strings)
  - "answer_correctness"
  - "logical_consistency"

  # Custom metric definition
  - name: "relevance"
    score_ranges:
      min_score: [1, 3]
      median_score: [4, 6]
      max_score: [7, 10]
    prompt: |
      Evaluate the relevance of the agent's response to the user's query.
      Rate from 1-10 where:
      1 = Completely irrelevant
      4 = Somewhat irrelevant
      6 = Neutral
      8 = Mostly relevant
      10 = Highly relevant and on-topic

      You can compare the {{output}} with the {{ground_truth}} to help you understand if the contents are relevant or not

      Consider:
      - Does the response address the user's question?
      - Is the information provided appropriate to the context?
      - Are there any tangential or off-topic elements?

Upload configuration to a stage

Agent Evaluation configurations must be stored with a specific file format for Snowflake to parse them. The following snippet creates the required yaml_file_format file format on the schema evals_db.evals_schema, then creates the stage evaluation_config to hold an agent evaluation configuration:

CREATE OR REPLACE FILE FORMAT evals_db.evals_schema.yaml_file_format
  TYPE = 'CSV'
  FIELD_DELIMITER = NONE
  RECORD_DELIMITER = '\n'
  SKIP_HEADER = 0
  FIELD_OPTIONALLY_ENCLOSED_BY = NONE
  ESCAPE_UNENCLOSED_FIELD = NONE;

CREATE OR REPLACE STAGE evals_db.evals_schema.evaluation_config
  FILE_FORMAT = evals_db.evals_schema.yaml_file_format;

Upload your configuration to a created stage through Snowsight: in the navigation menu, select Ingestion » Add Data, then select Load files into a Stage. You can also use the SQL PUT command to upload a local YAML file. The following example demonstrates copying the local file /Users/dev/evaluation_config.yaml to the stage evals_db.evals_schema.evaluation_config:

PUT file:///Users/dev/evaluation_config.yaml @evals_db.evals_schema.evaluation_config
  AUTO_COMPRESS = FALSE
  OVERWRITE = TRUE;
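After the upload, you can confirm the file is present on the stage with the LIST command:

LIST @evals_db.evals_schema.evaluation_config;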

If you create your YAML in a Workspace, you can copy it from your active workspace to a stage. The following example copies the file evaluation_config.yaml from your workspace to the stage evals_db.evals_schema.evaluation_config:

COPY FILES INTO @evals_db.evals_schema.evaluation_config
  FROM 'snow://workspace/USER$.PUBLIC.DEFAULT$/versions/live'
  FILES=('evaluation_config.yaml');

Tip

Snowflake recommends keeping your YAML file uncompressed.

Evaluation results table format

Functions which return information about a Cortex Agent evaluation all produce a table with the following columns:

  • RECORD_ID (VARCHAR) – The unique identifier assigned by Snowflake for this evaluation record.

  • INPUT_ID (VARCHAR) – The unique identifier assigned by Snowflake for this evaluation input.

  • REQUEST_ID (VARCHAR) – The unique identifier assigned by Snowflake for this request.

  • TIMESTAMP (TIMESTAMP_TZ) – The time (in UTC) at which the request was made.

  • DURATION_MS (INT) – The amount of time, in milliseconds, that it took for the agent to return a response.

  • INPUT (VARCHAR) – The query string used as input for this evaluation record.

  • OUTPUT (VARCHAR) – The response returned by the Cortex Agent for this evaluation record.

  • ERROR (VARCHAR) – Information about any errors that occurred during the request.

  • GROUND_TRUTH (VARCHAR) – The ground truth information used to evaluate this record’s Cortex Agent output.

  • METRIC_NAME (VARCHAR) – The name of the metric evaluated for this record.

  • EVAL_AGG_SCORE (NUMBER) – The evaluation score assigned for this record.

  • METRIC_TYPE (VARCHAR) – The type of metric being evaluated. For built-in metrics, the value is system. For custom metrics, the value is custom.

  • METRIC_STATUS (VARIANT) – A map containing information about the agent’s HTTP response for this record, with the following keys:

    • status: The HTTP status code of the response.

    • message: The HTTP message sent in the status response.

  • METRIC_CALLS (ARRAY) – An array of VARIANT values that contain information about the computed metric. Each array entry contains the metric’s criteria, an explanation of the metric score, and metadata. The keys of each entry are:

    • criteria: The criteria used by an LLM judge to evaluate response correctness.

    • explanation: An explanation of why the score was assigned.

    • full_metadata: A VARIANT value that contains metadata and information about this metric’s processing by the LLM judge. The keys of this map include:

      • completion_tokens: The number of output tokens generated by the LLM for this metric evaluation call.

      • guard_tokens: The number of tokens consumed by Cortex Guard for this metric evaluation call.

      • normalized_score: The original evaluation score normalized to the range [0.0, 1.0], rounded to two decimal places.

      • original_score: The original score assigned by this metric evaluation for the record.

      • prompt_tokens: The number of tokens taken up by the prompt provided to the LLM judge.

      • total_tokens: The total number of tokens used by the LLM judge for this computation.

  • TOTAL_INPUT_TOKENS (INT) – The total number of tokens used to process the input query.

  • TOTAL_OUTPUT_TOKENS (INT) – The total number of output tokens produced by the Cortex Agent.

  • LLM_CALL_COUNT (INT) – The number of times any LLM was called, either by the agent or an evaluation judge.
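As an example of working with this table, the following query sketch aggregates the average score per metric for the run run-1, reusing the agent and schema names from the earlier examples:

SELECT
  METRIC_NAME,
  AVG(EVAL_AGG_SCORE) AS avg_score,
  COUNT(*) AS record_count
FROM TABLE(SNOWFLAKE.LOCAL.GET_AI_EVALUATION_DATA(
  'eval_db', 'eval_schema', 'evaluated_agent', 'CORTEX AGENT', 'run-1'))
GROUP BY METRIC_NAME;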

Model availability

Agent Evaluations currently only supports the claude-4-sonnet and claude-3-5-sonnet models, using cross-region inference. Snowflake automatically chooses from these models based on your account settings.

The availability table covers the models claude-4-sonnet and claude-3-5-sonnet across the following regions: Cross Cloud (Any Region), AWS US, AWS US Commercial Gov, AWS EU, and AWS APJ.

Known limitations

Cortex Agent evaluations are subject to the following limitations:

  • Agent response times and throughput: The number of inputs that can be processed during an evaluation is constrained by agent response times and the amount of trace detail. If you experience timeouts or long delays in your evaluation, split your evaluation data. For example, if you have queries which are guaranteed to invoke many different tools, you can partition data by common tool invocation. If you have a custom evaluation that results in timeouts, refine or shorten your prompt. You may also want to consider splitting custom evaluations to only focus on one specific element of your agent’s output.

  • Ground truth staleness: Depending on how you word your input queries, answers may drift over time, making evaluation results less accurate. In particular, scope input queries to specific, absolute dates and times. For example, both of the input queries What was our revenue? and What was our revenue for the first quarter? will experience drift, while the query What was our revenue between January and March of 2025? is scoped to a specific window of time that can be consistently referenced in the evaluation data.

Cost Considerations

Agent Evaluations run a Cortex Agent to create output for evaluation, and LLM judges to compute the evaluation metrics. You’re charged for each run of the agent against a ground truth query. The evaluation’s LLM judges are run by the AI_COMPLETE function, and you incur charges based on the model Snowflake selects for judging. Additionally, you’re charged for the following:

  • Warehouse charges for tasks used to manage evaluation runs

  • Warehouse charges for queries used to compute evaluation metrics

  • Storage charges for datasets and evaluation results

  • Warehouse charges to retrieve evaluation results viewed in Snowsight

For more information on estimating costs, see Understanding overall cost. Refer to the Snowflake Service Consumption Table for full cost information.