Cortex Agent evaluations

Cortex Agent evaluations allow you to monitor your agent’s behavior and performance. Evaluate your agent against both ground truth and reference-free evaluation metrics. During evaluation, your agent’s activity is traced and monitored so you can ensure that each step in the process advances towards your end goal.

Snowflake offers the following metrics to evaluate your agent against:

  • Answer correctness – Measures how closely the agent’s answer to your prepared query matches an expected answer. This metric is most useful when the dataset powering your Cortex Agent is static.

  • Logical consistency – Measures consistency across agent instructions, planning, and tool calls. This metric is reference-free, meaning you don’t need to prepare any information in your dataset for evaluation.

Snowflake also allows you to create custom evaluation metrics that use the LLM judging process to measure context critical to your Agent’s domain and use case. Custom metrics use an LLM prompt and scoring methodology, which are passed to the evaluation judging system to produce a score.

For additional details about how agent evaluations are conducted on Snowflake, including the LLM judging system used for reference-free evaluations, see the Snowflake engineering blog What’s Your Agent’s GPA? A Framework for Evaluating AI Agent Reliability. For an example of running an Agent Evaluation programmatically, see the guide Getting Started with Cortex Agent Evaluations.

Access control requirements

The ability to run a Cortex Agent evaluation requires a role with the following:

  • The SNOWFLAKE.CORTEX_USER database role

  • The EXECUTE TASK ON ACCOUNT permission

  • The USAGE permission on the database containing your agent

  • The following permissions on the schema containing your agent:

    • USAGE

    • CREATE FILE FORMAT ON SCHEMA

    • CREATE TASK

    • EXECUTE TASK

  • The USAGE permission on the database containing your evaluation data

  • The following permissions on the schema containing your evaluation data:

    • USAGE

    • EXECUTE TASK

    • If creating a dataset from an input table, CREATE DATASET ON SCHEMA

  • The USAGE or OWNERSHIP privilege on your agent

  • The MONITOR or OWNERSHIP privilege on your agent

  • If using an agent evaluation configuration, READ privilege on the stage containing the configuration file.

If the agent being evaluated uses tools, your role also needs access to all of them.

Additionally, if working with evaluations in Snowsight, the role you use to run or inspect an evaluation needs the USAGE privilege on your default warehouse.
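As a sketch, the core grants might look like the following, assuming a hypothetical role eval_role, an agent in agent_db.agent_schema, and evaluation data in data_db.data_schema; agent-level and stage privileges are granted separately:

-- Hypothetical role and object names; adjust to your environment.
GRANT DATABASE ROLE SNOWFLAKE.CORTEX_USER TO ROLE eval_role;
GRANT EXECUTE TASK ON ACCOUNT TO ROLE eval_role;
GRANT USAGE ON DATABASE agent_db TO ROLE eval_role;
GRANT USAGE, CREATE FILE FORMAT, CREATE TASK ON SCHEMA agent_db.agent_schema TO ROLE eval_role;
GRANT USAGE ON DATABASE data_db TO ROLE eval_role;
GRANT USAGE ON SCHEMA data_db.data_schema TO ROLE eval_role;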

Prepare an evaluation dataset

Before starting a Cortex Agent evaluation, prepare a table containing your evaluation inputs. This table is used to create a dataset for your evaluation to run against. To learn more about datasets on Snowflake, see Snowflake Datasets.

Cortex Code

To have Cortex Code assist you with creating a dataset for your evaluation, use the dataset-curation sub-skill of the Cortex Code cortex-agent skill. For more information about Cortex Code skills, see Cortex Code CLI - Skills.

Dataset format

The table used to create a dataset for evaluation has an input query column of type VARCHAR that represents your query, and an output column of type VARIANT that contains a description of expected agent behavior. This single output column is used as the ground truth by the LLM judge.

Values in the output column have one key, ground_truth_output. The value of this key is used in answer correctness evaluation. LLM judges use ground truth to evaluate your agent’s output by including it in their prompt.

Tip

Take advantage of the fact that ground truth is included in an LLM prompt by using natural language to describe a type of response, in addition to exact or semantic response matches. For example, you could provide a ground truth of Output is in the following JSON format ... followed by a string containing either a description of the structure or a JSON example itself. If you need a more rigorous examination of output based on a full custom prompt, create a custom metric.

To bring a JSON dataset into a Snowflake table, use the PARSE_JSON SQL function. The following example creates a new table agent_evaluation_data to use for an evaluation dataset, and inserts a row for the input query What was the temperature in San Francisco on August 2nd 2019? with the ground truth of The temperature was 14 degrees Celsius in San Francisco on August 2nd, 2019..

CREATE OR REPLACE TABLE agent_evaluation_data (
    input_query VARCHAR,
    ground_truth VARIANT
);

INSERT INTO agent_evaluation_data
  SELECT
    'What was the temperature in San Francisco on August 2nd 2019?',
    PARSE_JSON('
      {
        "ground_truth_output": "The temperature was 14 degrees Celsius in San Francisco on August 2nd, 2019."
      }
    ');
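As the tip above suggests, a ground truth can describe a response format in natural language instead of an exact answer. The following row, which reuses the same table, is illustrative; the query text and format description are hypothetical:

INSERT INTO agent_evaluation_data
  SELECT
    'List our top three products by revenue for Q1 2025.',
    PARSE_JSON('
      {
        "ground_truth_output": "Output is a JSON array of exactly three objects, each with the keys name and revenue, sorted by revenue in descending order."
      }
    ');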

Important

The functions OBJECT_CONSTRUCT and ARRAY_CONSTRUCT return non-VARIANT results. Use a function that produces a VARIANT from your raw input, like PARSE_JSON, or call TO_VARIANT to guarantee the value type.

Data you provide in the ground_truth column that isn’t used by a selected metric is ignored. When conducting an evaluation run with only reference-free metrics, you can leave the output column empty.
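When using only reference-free metrics such as logical consistency, a row needs only the input query; the output column can stay NULL. For example (the query text is illustrative):

INSERT INTO agent_evaluation_data (input_query)
  VALUES ('Summarize last week''s support ticket volume.');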

When running your first evaluation, you’ll have the option to create a new dataset from an existing table.

Start an agent evaluation

Cortex Code

To have Cortex Code run an evaluation, use the evaluate-cortex-agent sub-skill of the Cortex Code cortex-agent skill. For more information about Cortex Code skills, see Cortex Code CLI - Skills.

Snowsight

Note

Agent evaluations run as your currently selected role in Snowsight, not your default role. Make sure a role with the correct permissions is active before starting an evaluation.

Begin your evaluation of a Cortex Agent by doing the following:

  1. Sign in to Snowsight.

  2. In the navigation menu, select AI & ML » Agents.

  3. Select the agent you want to conduct an evaluation of.

  4. Select the Evaluations tab.

  5. Select New evaluation run.

    The New evaluation run modal opens.

  6. In the Name field, provide a name for your evaluation. This name should be unique for the agent being evaluated.

  7. Optional: In the Description field, provide any comments for the evaluation.

  8. Select Next.

    This advances to the Select dataset modal.

  9. Select the dataset used to evaluate your agent. You can choose either Existing dataset or Create new dataset.

    To use an existing dataset:

    1. From the Database and schema list, select the database and schema containing your dataset.

    2. From the Select dataset list, select your dataset.

    To create a new dataset:

    1. From the Source table - Database and schema list, select the database and schema containing the table you want to import to a dataset.

    2. From the Select source table list, select your source table.

    3. From the New dataset location - Database and schema list, select the database and schema to place your new dataset.

    4. In the Dataset name field, enter your dataset name. This name needs to be unique among the schema-level objects in your selected schema.

  10. Select Next.

    This advances to the Select metrics modal.

  11. From the Input query list, select the column of your dataset which contains the input queries.

  12. For each of the System metrics, change the toggle to active for any metric you want included in your evaluation. Select the column of your dataset containing the ground truth for your evaluation.

  13. (Optional) To conduct a custom evaluation, toggle on Custom metrics.

    1. Select the database and schema containing the stage where your custom evaluation configuration is stored.

    2. Select the stage where your custom evaluation configuration is stored.

    3. Select the YAML configuration file for your custom evaluation.

      Note

      In Snowsight, only the custom evaluation definitions are loaded from your YAML configuration. The rest of the YAML file must still be valid. For the evaluation YAML specification, see Agent Evaluation YAML specification.

    4. For each custom metric, change the toggle to active if you want it included in your evaluation. Select the column of your dataset containing the ground truth for this evaluation.

  14. Select Create to create the evaluation and begin the evaluation process.

At any point, you can select Cancel to cancel creating the evaluation, or select Prev to return to the previous modal.

SQL

To start or retrieve information on an evaluation with SQL, use the EXECUTE_AI_EVALUATION function. This function has the following required arguments:

  • evaluation_job: A string value of ‘START’ or ‘STATUS’.

  • run_parameters: A SQL OBJECT containing the key run_name, with a value of the name of your run.

  • config_file_path: A stage file path pointing to your run configuration YAML file. This path can’t be a signed URL. For the evaluation YAML specification, see Agent Evaluation YAML specification.

Use the evaluation_job value ‘START’ to start an evaluation. The following example starts a run called run-1 using the agent evaluation configuration from @eval_db.eval_schema.metrics/agent_evaluation_config.yaml:

CALL EXECUTE_AI_EVALUATION(
  'START',
  OBJECT_CONSTRUCT('run_name', 'run-1'),
  '@eval_db.eval_schema.metrics/agent_evaluation_config.yaml'
);

After a run starts, you can query its progress with the evaluation_job value ‘STATUS’. This call returns a table in the format used for AI Observability Runs. The following example queries the status of the agent evaluation started from the previous example:

CALL EXECUTE_AI_EVALUATION(
  'STATUS',
  OBJECT_CONSTRUCT('run_name', 'run-1'),
  '@eval_db.eval_schema.metrics/agent_evaluation_config.yaml'
);

Tip

You can call the EXECUTE_AI_EVALUATION function from a Task to regularly run an evaluation or check the status of one.
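For example, a task that checks the status of run-1 every hour might look like the following sketch; the task name and schedule are illustrative:

-- Hypothetical task; reuses the run name and configuration path from the examples above.
CREATE OR REPLACE TASK eval_db.eval_schema.check_eval_status
  SCHEDULE = '60 MINUTE'
AS
  CALL EXECUTE_AI_EVALUATION(
    'STATUS',
    OBJECT_CONSTRUCT('run_name', 'run-1'),
    '@eval_db.eval_schema.metrics/agent_evaluation_config.yaml'
  );

ALTER TASK eval_db.eval_schema.check_eval_status RESUME;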

Inspect evaluation results

Evaluation results include information about the requested metrics, details of the agent’s threads of reasoning, and information about the LLM planning stage for each executed trace in the thread.

Cortex Code

Cortex Code offers two sub-skills of the cortex-agent skill. Use the investigate-cortex-agent-evals sub-skill to inspect evaluations and find any issues in your configuration or data. Use the optimize-cortex-agent sub-skill to take results from completed evaluations and improve the performance of your agent.

Snowsight

The Evaluations tab for an agent in Snowsight gives you an overview of every evaluation run and its summary results.

To view evaluation results in Snowsight:

  1. Sign in to Snowsight.

  2. In the navigation menu, select AI & ML » Agents.

  3. Select the agent whose evaluation results you want to view.

  4. Select the Evaluations tab.

Evaluation runs listing

The summary of run information for each run includes:

  • RUN NAME – The name of the evaluation run.

  • # OF RECORDS – The number of queries performed and answered as part of the run.

  • STATUS – The status of the evaluation run, which is one of:

    • Success indicator – All inputs were evaluated and results are available.

    • A spinner is displayed – The run is in progress, with no information available yet.

    • Warning indicator – The run experienced an error at some point. Some or all metrics may be unavailable for the run.

  • DATASET – The name of the dataset used for the evaluation.

  • AVG DURATION – The average duration of time taken to execute an input query for the run.

  • LOGICAL CONSISTENCY – Average over all inputs of the logical consistency evaluation for the run, if requested.

  • DESCRIPTION – The description of the evaluation run.

  • CREATED – The time at which the run was created and started.

Each custom metric evaluated for this run also receives its own column, defined by the evaluation metric name value. For more information on custom metrics, see Defining a custom metric.

Evaluation run overview

When you select an individual run in Snowsight, you’re presented with the run overview. This overview includes summary averages for each metric evaluated during the run, and a summary of each input execution. The overview for each input execution includes:

  • STATUS – The status of this input’s evaluation, which is one of:

    • Success indicator – All inputs were evaluated and results are available.

    • A spinner is displayed – The run is in progress, with no information available yet.

    • Warning indicator – The run experienced an error at some point. Some or all metrics may be unavailable for the run.

  • INPUT – The input query used for the evaluation.

  • OUTPUT – The output produced by the agent.

  • DURATION – The length of time taken to process the input and produce output.

  • LOGICAL CONSISTENCY – The logical consistency evaluation for the input, if requested.

  • EVALUATED – The time at which the input was processed.

Each custom metric evaluated for this run also receives its own column, defined by the evaluation metric name value. For more information about custom metrics, see Defining a custom metric.

Record details

When you select an individual input in Snowsight, you’re presented with the Record details view. This view includes three panes: Evaluation results, Thread details, and Trace details.

Evaluation results

Your evaluation results are presented here in detail. Each metric has its own box showing its overall average across inputs; select a box to display a popover with more information. The popover breaks down the number of inputs that achieved high accuracy (80% or more accurate), medium accuracy (30% or more accurate, but below high accuracy), and the number that failed.

Thread details

The information logged during the execution of each agent thread. This includes planning and response generation by default, as well as a thread trace for each tool that the agent invoked during that thread.

Trace details

Each trace pane includes input, processing, and output information relevant to that stage of agent execution. This information is the same as that provided by agent monitoring.

SQL

To retrieve raw evaluation details, use the GET_AI_EVALUATION_DATA (SNOWFLAKE.LOCAL) function. This function has the following required arguments:

  • database: The database containing the agent.

  • schema: The schema containing the agent.

  • agent_name: The name of the agent.

  • agent_type: The string constant ‘CORTEX AGENT’. This value is case-insensitive.

  • run_name: The name of the evaluation run to retrieve.

This function returns a table of event data described in Evaluation results table format. The following example displays the full evaluation details for a run called run-1, where the agent is named evaluated_agent stored on the schema eval_db.eval_schema:

SELECT * FROM TABLE(SNOWFLAKE.LOCAL.GET_AI_EVALUATION_DATA(
  'eval_db',
  'eval_schema',
  'evaluated_agent',
  'CORTEX AGENT',
  'run-1')
);

Query traces for a single record

To access a single record from an evaluation trace, use the GET_AI_RECORD_TRACE (SNOWFLAKE.LOCAL) function. This function has the following required arguments:

  • database: The database containing the agent.

  • schema: The schema containing the agent.

  • agent_name: The name of the agent.

  • agent_type: The string constant ‘CORTEX AGENT’. This value is case-insensitive.

  • record_id: The record ID to filter by.

This function returns a table of event data described in Evaluation results table format. The following example displays the trace for the record 9346efc3-5dd6-4038-9b1a-72ca3d3b768c, where the agent is named evaluated_agent stored on the schema eval_db.eval_schema:

SELECT * FROM TABLE(SNOWFLAKE.LOCAL.GET_AI_RECORD_TRACE(
  'eval_db',
  'eval_schema',
  'evaluated_agent',
  'CORTEX AGENT',
  '9346efc3-5dd6-4038-9b1a-72ca3d3b768c'
));

Query evaluation errors and warnings for a run

To access logs for warnings and errors that happened during an evaluation run, use the GET_AI_OBSERVABILITY_LOGS (SNOWFLAKE.LOCAL) function. This function has the following required arguments:

  • database: The database containing the agent.

  • schema: The schema containing the agent.

  • agent_name: The name of the agent.

  • agent_type: The string constant ‘CORTEX AGENT’. This value is case-insensitive.

This function returns a table of event data described in Evaluation results table format. The following example checks for errors and warnings for a run called run-1, where the agent is named evaluated_agent stored on the schema eval_db.eval_schema:

SELECT * FROM TABLE(SNOWFLAKE.LOCAL.GET_AI_OBSERVABILITY_LOGS(
  'eval_db',
  'eval_schema',
  'evaluated_agent',
  'CORTEX AGENT')
)
  WHERE TRUE
  AND (record:"severity_text"='ERROR' or record:"severity_text"='WARN')
  AND record_attributes:"snow.ai.observability.run.name"='run-1';

Note

The fields of record and record_attributes are subject to change, but the fields record:"severity_text" and record_attributes:"snow.ai.observability.run.name" are guaranteed to be present in AI Observability logs.
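Building on the query above, the following sketch counts logged warnings and errors for run-1 by severity:

SELECT
  record:"severity_text"::VARCHAR AS severity,
  COUNT(*) AS event_count
FROM TABLE(SNOWFLAKE.LOCAL.GET_AI_OBSERVABILITY_LOGS(
  'eval_db', 'eval_schema', 'evaluated_agent', 'CORTEX AGENT'))
WHERE record:"severity_text" IN ('ERROR', 'WARN')
  AND record_attributes:"snow.ai.observability.run.name" = 'run-1'
GROUP BY severity;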

Agent Evaluation YAML specification

The YAML file that configures an Agent Evaluation, including any custom metric definitions, has three top-level keys:

  • (Optional) dataset: A definition of how to create a dataset for the evaluation. This value is optional when using a YAML specification to start an evaluation in Snowsight, or when using an existing dataset.

  • evaluation: Settings for the agent to be evaluated.

  • metrics: The metrics recorded during an evaluation run, including definitions for custom metrics.

Dataset definition

The dataset value defines a new dataset from existing table data, mapping columns for the input query and ground truth. For the structure required for your ground_truth column, see Dataset format. The keys for the dataset value are:

  • dataset_type: The string constant “CORTEX AGENT”. This value is case-insensitive.

  • table_name: The fully qualified name of the table to use for the dataset’s contents.

  • dataset_name: The name of the created dataset.

  • column_mapping: The mapping of the required evaluation input column query_text and output column ground_truth to columns of the source table.

The resulting dataset is stored in the same database and schema as the table it’s constructed from.

The following example dataset definition shows a dataset named evaluation_input created from the evals_db.evals_schema.evaluation_data table, using the user_question as input and expected_outcome to define ground truth:

dataset:
  dataset_type: "CORTEX AGENT"
  table_name: "evals_db.evals_schema.evaluation_data"
  dataset_name: "evaluation_input"
  column_mapping:
    query_text: "user_question"
    ground_truth: "expected_outcome"

Agent configuration

The evaluation value sets the configuration for the agent to conduct an evaluation against. The keys for the evaluation value are:

  • agent_params: A dictionary describing the agent to conduct the evaluation for. This value uses the keys:

    • agent_name: The name of the agent to evaluate.

    • agent_type: The string constant “CORTEX AGENT”. This value is case-insensitive.

  • (Optional) run_params: Metadata for identifying this evaluation run. This value uses the keys:

    • (Optional) label: The label for this evaluation.

    • (Optional) description: A detailed description of the evaluation.

  • source_metadata: A dictionary describing the dataset used for the evaluation. This value uses the keys:

    • type: The string constant “DATASET”. This value is case-insensitive.

    • dataset_name: The name of the dataset to use.

The following example agent configuration runs an agent named evaluated_agent with the label Basic evaluation, using the dataset evaluation_input:

evaluation:
  agent_params:
    agent_name: "evaluated_agent"
    agent_type: "CORTEX AGENT"
  run_params:
    label: "Basic evaluation"
  source_metadata:
    type: "DATASET"
    dataset_name: "evaluation_input"

Metrics selection

The metrics value is a sequence of metrics to evaluate, including your own custom metric definitions. The accepted values for pre-defined metrics are:

  • answer_correctness: Measure the agent’s response correctness against a ground truth output.

  • logical_consistency: Measure consistency across agent instructions, planning, and tool calls. This metric is reference-free and doesn’t use a dataset.

Defining a custom metric

You can define your own custom metric by providing an identifier, prompt, and score ranges. The prompt you provide is passed to an LLM judge along with run traces to conduct your custom evaluation. Custom metrics have the following required key-value pairs:

  • name: The name of the metric.

  • score_ranges: A mapping that defines low, medium, and high-quality score ranges. This mapping uses the keys:

    • min_score: The score range used to identify low-quality results, as a two-element sequence of the inclusive lower bound to exclusive upper bound.

    • median_score: The score range used to identify medium-quality results, as a two-element sequence of the inclusive lower bound to inclusive upper bound.

    • max_score: The score range used to identify high-quality results, as a two-element sequence of the exclusive lower bound to inclusive upper bound.

  • prompt: The prompt template to pass to the LLM judge along with the agent run trace data.

    Important

    This template must include a scoring mechanism which produces a numeric value represented in the ranges provided for score_ranges.

A custom metric’s prompt is able to reference the trace data generated by the agent during an evaluation run. Snowflake passes the entire trace as input to the LLM judge, but you can emphasize certain information by using a replacement string that references data in a GET_AI_RECORD_TRACE column directly. The following replacement strings are available:

  • {{input}} – INPUT

  • {{output}} – OUTPUT

  • {{ground_truth}} – GROUND_TRUTH

  • {{tool_info}} – TOOL

  • {{start_timestamp}} – START_TIMESTAMP

  • {{duration}} – DURATION_MS

  • {{span_id}} – SPAN_ID

  • {{span_type}} – SPAN_TYPE

  • {{span_name}} – SPAN_NAME

  • {{llm_model}} – LLM_MODEL

  • {{error}} – ERROR

  • {{status}} – STATUS

Metrics configuration example

The following example defines a metrics configuration that enables answer correctness and logical consistency checks, and also defines a custom relevance metric which returns a score between 1-10 based on how ground truth compares against agent output:

metrics:
  # Built-in metrics
  - "answer_correctness"
  - "logical_consistency"
  # Custom metric with prompt
  - name: "relevance"
    score_ranges:
      min_score: [1, 3]
      median_score: [4, 6]
      max_score: [7, 10]
    prompt: |
      Evaluate the relevance of the agent's response to the user's query.
      Rate from 1-10 where:
      1 = Completely irrelevant
      4 = Somewhat irrelevant
      6 = Neutral
      8 = Mostly relevant
      10 = Highly relevant and on-topic

      You can compare the {{output}} with the {{ground_truth}} to help you understand if the contents are relevant or not

      Consider:
      - Does the response address the user's question?
      - Is the information provided appropriate to the context?
      - Are there any tangential or off-topic elements?

Full example configuration

Combining all of the previous example sections gives a full Agent Evaluation configuration:

# Optional: Create dataset before running evaluation
dataset:
  dataset_type: "CORTEX AGENT"
  table_name: "EVALS_DB.EVALS_SCHEMA.EVALUATION_DATA"
  dataset_name: "EVALUATION_INPUT"
  column_mapping:
    query_text: "user_question"
    ground_truth: "expected_outcome"

# Evaluation task configuration
evaluation:
  agent_params:
    agent_name: "evaluated_agent"
    agent_type: "CORTEX AGENT"
  run_params:
    label: "Basic evaluation"
  source_metadata:
    type: "DATASET"
    dataset_name: "EVALUATION_INPUT"

# Metrics to evaluate
metrics:
  # Built-in metrics (simple strings)
  - "answer_correctness"
  - "logical_consistency"

  # Custom metric definition
  - name: "relevance"
    score_ranges:
      min_score: [1, 3]
      median_score: [4, 6]
      max_score: [7, 10]
    prompt: |
      Evaluate the relevance of the agent's response to the user's query.
      Rate from 1-10 where:
      1 = Completely irrelevant
      4 = Somewhat irrelevant
      6 = Neutral
      8 = Mostly relevant
      10 = Highly relevant and on-topic

      You can compare the {{output}} with the {{ground_truth}} to help you understand if the contents are relevant or not

      Consider:
      - Does the response address the user's question?
      - Is the information provided appropriate to the context?
      - Are there any tangential or off-topic elements?

Upload configuration to a stage

Agent Evaluation configurations must be stored with a specific file format for Snowflake to parse them. The following snippet creates the required yaml_file_format file format on the schema evals_db.evals_schema, then creates the stage evaluation_config to hold an agent evaluation configuration:

CREATE OR REPLACE FILE FORMAT evals_db.evals_schema.yaml_file_format
  TYPE = 'CSV'
  FIELD_DELIMITER = NONE
  RECORD_DELIMITER = '\n'
  SKIP_HEADER = 0
  FIELD_OPTIONALLY_ENCLOSED_BY = NONE
  ESCAPE_UNENCLOSED_FIELD = NONE;

CREATE OR REPLACE STAGE evals_db.evals_schema.evaluation_config
  FILE_FORMAT = evals_db.evals_schema.yaml_file_format;

Upload your configuration to a created stage through Snowsight: in the navigation menu, select Ingestion » Add Data, then select Load files into a Stage. You can also use the SQL PUT command to upload a local YAML file. The following example demonstrates copying the local file /Users/dev/evaluation_config.yaml to the stage evals_db.evals_schema.evaluation_config:

PUT file:///Users/dev/evaluation_config.yaml @evals_db.evals_schema.evaluation_config
  AUTO_COMPRESS = FALSE
  OVERWRITE = TRUE;
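After the upload, you can confirm the file is present on the stage with the LIST command:

LIST @evals_db.evals_schema.evaluation_config;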

If you create your YAML in a Workspace, you can copy it from your active workspace to a stage. The following example copies the file evaluation_config.yaml from your workspace to the stage evals_db.evals_schema.evaluation_config:

COPY FILES INTO @evals_db.evals_schema.evaluation_config
  FROM 'snow://workspace/USER$.PUBLIC.DEFAULT$/versions/live'
  FILES=('evaluation_config.yaml');

Tip

Snowflake recommends keeping your YAML file uncompressed.

Evaluation results table format

Functions which return information about a Cortex Agent evaluation all produce a table with the following columns:

  • RECORD_ID (VARCHAR) – The unique identifier assigned by Snowflake for this evaluation record.

  • INPUT_ID (VARCHAR) – The unique identifier assigned by Snowflake for this evaluation input.

  • REQUEST_ID (VARCHAR) – The unique identifier assigned by Snowflake for this request.

  • TIMESTAMP (TIMESTAMP_TZ) – The time (in UTC) at which the request was made.

  • DURATION_MS (INT) – The amount of time, in milliseconds, that it took for the agent to return a response.

  • INPUT (VARCHAR) – The query string used as input for this evaluation record.

  • OUTPUT (VARCHAR) – The response returned by the Cortex Agent for this evaluation record.

  • ERROR (VARCHAR) – Information about any errors that occurred during the request.

  • GROUND_TRUTH (VARCHAR) – The ground truth information used to evaluate this record’s Cortex Agent output.

  • METRIC_NAME (VARCHAR) – The name of the metric evaluated for this record.

  • EVAL_AGG_SCORE (NUMBER) – The evaluation score assigned for this record.

  • METRIC_TYPE (VARCHAR) – The type of metric being evaluated. For built-in metrics, the value is system. For custom metrics, the value is custom.

  • METRIC_STATUS (VARIANT) – A map containing information about the agent’s HTTP response for this record, with the following keys:

    • status: The HTTP status code of the response.

    • message: The HTTP message sent in the status response.

  • METRIC_CALLS (ARRAY) – An array of VARIANT values that contain information about the computed metric. Each array entry contains the metric’s criteria, an explanation of the metric score, and metadata. The keys of each entry are:

    • criteria: The criteria used by an LLM judge to evaluate response correctness.

    • explanation: An explanation of why the score was assigned.

    • full_metadata: A VARIANT value that contains metadata and information about this metric’s processing by the LLM judge. The keys of this map include:

      • completion_tokens: The number of output tokens generated by the LLM for this metric evaluation call.

      • guard_tokens: The number of tokens consumed by Cortex Guard for this metric evaluation call.

      • normalized_score: The original evaluation score normalized to the range [0.0, 1.0], rounded to two decimal places.

      • original_score: The original score assigned by this metric evaluation for the record.

      • prompt_tokens: The number of tokens taken up by the prompt provided to the LLM judge.

      • total_tokens: The total number of tokens used by the LLM judge for this computation.

  • TOTAL_INPUT_TOKENS (INT) – The total number of tokens used to process the input query.

  • TOTAL_OUTPUT_TOKENS (INT) – The total number of output tokens produced by the Cortex Agent.

  • LLM_CALL_COUNT (INT) – The number of times any LLM was called, either by the agent or an evaluation judge.
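As an example of working with this table, the following query sketch aggregates the average score per metric for the run run-1, reusing the agent and schema names from the earlier examples:

SELECT
  METRIC_NAME,
  AVG(EVAL_AGG_SCORE) AS avg_score,
  COUNT(*) AS record_count
FROM TABLE(SNOWFLAKE.LOCAL.GET_AI_EVALUATION_DATA(
  'eval_db', 'eval_schema', 'evaluated_agent', 'CORTEX AGENT', 'run-1'))
GROUP BY METRIC_NAME;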

Model availability

Agent Evaluations currently only supports the claude-4-sonnet and claude-3-5-sonnet models, using cross-region inference. Snowflake automatically chooses from these models based on your account settings.

The availability table covers the models claude-4-sonnet and claude-3-5-sonnet across the following regions: Cross Cloud (Any Region), AWS US, AWS US Commercial Gov, AWS EU, and AWS APJ.

Known limitations

Cortex Agent evaluations are subject to the following limitations:

  • Agent response times and throughput: The number of inputs that can be processed during an evaluation is constrained by agent response times and the amount of trace detail. If you experience timeouts or long delays in your evaluation, split your evaluation data. For example, if you have queries which are guaranteed to invoke many different tools, you can partition data by common tool invocation. If you have a custom evaluation that results in timeouts, refine or shorten your prompt. You may also want to consider splitting custom evaluations to only focus on one specific element of your agent’s output.

  • Ground truth staleness: Depending on how you word your input queries, answers may drift over time, making evaluation results less accurate. In particular, scope input queries to specific, absolute dates and times. For example, both of the input queries What was our revenue? and What was our revenue for the first quarter? will experience drift, while the query What was our revenue between January and March of 2025? is scoped to a specific window of time that can be consistently referenced in the evaluation data.

Cost Considerations

Agent Evaluations run a Cortex Agent to create output for evaluation, and LLM judges to compute the evaluation metrics. You’re charged for each run of the agent against a ground truth query. The evaluation’s LLM judges are run by the AI_COMPLETE function, and you incur charges based on the model Snowflake selects for judging. Additionally, you’re charged for the following:

  • Warehouse charges for tasks used to manage evaluation runs

  • Warehouse charges for queries used to compute evaluation metrics

  • Storage charges for datasets and evaluation results

  • Warehouse charges to retrieve evaluation results viewed in Snowsight

For more information on estimating costs, see Understanding overall cost. Refer to the Snowflake Service Consumption Table for full cost information.