Cortex Analyst evaluations

Cortex Analyst evaluations let you measure and improve the performance of the semantic views you use for SQL generation. Evaluations work by testing your semantic views against their own verified queries as ground truth. This gives you confidence that your semantic view can handle the queries users rely on, and it can also translate into higher accuracy for generated SQL overall.

Evaluations measure accuracy by executing the SQL generated by Cortex Analyst and comparing its results against the results of your verified queries. Regression metrics are aggregated to track verified queries that were previously answered correctly but are now failing. In addition to these correctness metrics, latency is recorded to track query performance. Use these metrics to identify weaknesses and iteratively refine your semantic views, improving SQL accuracy while preventing regressions.

Access control requirements

The ability to run a Cortex Analyst evaluation requires a role with the following:

  • The SNOWFLAKE.CORTEX_USER database role

  • The EXECUTE TASK global privilege (granted at the account level)

  • The CREATE TASK privilege on the schema containing your semantic view

  • The CREATE DATASET privilege on the schema containing your semantic view

  • The SELECT privilege on the semantic view and the tables referenced in the semantic view

  • The MONITOR privilege on the semantic view

Note

All of the above privileges must be granted under a single primary role. Evaluation runs are executed using Snowflake tasks, which do not consider secondary role privileges.
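
The grants above can be collected under one role. The following is a sketch; the role name ANALYST_EVAL_ROLE and the object names EVAL_DB.EVAL_SCHEMA.SEMANTIC_VIEW_EVAL are illustrative, and it assumes the referenced tables live in the same schema as the semantic view:

```sql
-- Illustrative grants; role, database, schema, and view names are assumptions.
GRANT DATABASE ROLE SNOWFLAKE.CORTEX_USER TO ROLE ANALYST_EVAL_ROLE;
GRANT EXECUTE TASK ON ACCOUNT TO ROLE ANALYST_EVAL_ROLE;
GRANT CREATE TASK ON SCHEMA EVAL_DB.EVAL_SCHEMA TO ROLE ANALYST_EVAL_ROLE;
GRANT CREATE DATASET ON SCHEMA EVAL_DB.EVAL_SCHEMA TO ROLE ANALYST_EVAL_ROLE;
GRANT SELECT ON SEMANTIC VIEW EVAL_DB.EVAL_SCHEMA.SEMANTIC_VIEW_EVAL TO ROLE ANALYST_EVAL_ROLE;
GRANT MONITOR ON SEMANTIC VIEW EVAL_DB.EVAL_SCHEMA.SEMANTIC_VIEW_EVAL TO ROLE ANALYST_EVAL_ROLE;
-- Grant SELECT on every table the semantic view references; this assumes they
-- are all in the same schema. Adjust if they span multiple schemas.
GRANT SELECT ON ALL TABLES IN SCHEMA EVAL_DB.EVAL_SCHEMA TO ROLE ANALYST_EVAL_ROLE;
```

Remember to grant all of these to the same primary role, since task-based evaluation runs ignore secondary roles.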

Prepare an evaluation set

Cortex Analyst evaluations use verified queries (VQs) as the evaluation set. Each verified query pairs a natural language question with its expected SQL answer. Before running an evaluation, you need at least one verified query associated with your semantic view.

If you don’t have any verified queries yet, add them through the semantic view editor in Snowsight. For more information, see Cortex Analyst Verified Query Repository.

How verified queries are used during evaluation

When you select verified queries for an evaluation run, Cortex Analyst creates a temporary copy of your semantic view with those selected queries removed. Cortex Analyst then generates SQL using this temporary copy, which does not contain the evaluation queries. This prevents the evaluation queries from influencing SQL generation, ensuring that the evaluation measures how well Cortex Analyst can answer questions without relying on exact matches from verified queries.

Verified queries that you do not select for evaluation remain in the temporary semantic view and continue to guide Cortex Analyst during the evaluation run, just as they would during normal usage.

Note

A verified query can either guide Cortex Analyst at runtime or be used as evaluation ground truth, but not both at the same time. Selecting a verified query for evaluation temporarily removes it from the semantic view so the evaluation result reflects genuine SQL generation ability.

Start a Cortex Analyst evaluation

Snowsight

Begin your evaluation of a semantic view by doing the following:

  1. Sign in to Snowsight.

  2. In the navigation menu, select AI & ML » Cortex Analyst.

  3. From the list, select the semantic view you want to run the evaluation on.

  4. Select the Evaluations tab.

  5. Select Create evaluation run.

  6. In the Name field, provide a name for your evaluation. This name should be unique for the semantic view being evaluated.

  7. Select Next.

    This advances to the Select verified queries modal.

  8. Select which verified queries to include in the evaluation. You can either select all verified queries or select a specific set by checking the corresponding boxes.

  9. Select Run evaluation.

SQL

Cortex Analyst evaluation runs can also be started with SQL using the EXECUTE_AI_EVALUATION function. This function accepts the following evaluation_job values:

  • 'START': Start an evaluation run.

  • 'STATUS': Query the progress of an evaluation run.

  • 'CANCEL': Cancel a running evaluation.

  • 'DELETE': Delete a completed evaluation run and its results.

Each call requires the following additional arguments:

  • run_parameters: A SQL OBJECT containing the key run_name, whose value is the name of your run.

  • config_file_path: A stage file path pointing to your run configuration YAML file. For the YAML specification, see Analyst evaluation YAML specification.

The following example starts an evaluation run called Evaluation run 1:

CALL EXECUTE_AI_EVALUATION(
  'START',
  OBJECT_CONSTRUCT('run_name', 'Evaluation run 1'),
  '@EVAL_DB.EVAL_SCHEMA.METRICS/analyst_evaluation_config.yaml'
);

After a run starts, you can query its progress:

CALL EXECUTE_AI_EVALUATION(
  'STATUS',
  OBJECT_CONSTRUCT('run_name', 'Evaluation run 1'),
  '@EVAL_DB.EVAL_SCHEMA.METRICS/analyst_evaluation_config.yaml'
);

To cancel or delete a run, replace 'STATUS' with 'CANCEL' or 'DELETE'.
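
For example, canceling the same run uses an identical call with the 'CANCEL' job value (the stage path is the one from the examples above):

```sql
CALL EXECUTE_AI_EVALUATION(
  'CANCEL',
  OBJECT_CONSTRUCT('run_name', 'Evaluation run 1'),
  '@EVAL_DB.EVAL_SCHEMA.METRICS/analyst_evaluation_config.yaml'
);
```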

Inspect evaluation results

Snowsight

The Evaluations tab for a semantic view in Snowsight gives an overview of every evaluation run and a summary of each, including the number of query regressions.

To view evaluation results:

  1. Sign in to Snowsight.

  2. In the navigation menu, select AI & ML » Cortex Analyst.

  3. From the list, select the semantic view you want to view evaluations for.

  4. Select the Evaluations tab.

  5. Select an individual run to see detailed results.

The run detail page shows:

  • Accuracy – The percentage of verified queries where the generated SQL was judged correct, with an option to Improve the semantic view.

  • Regressions – The number of verified queries that were previously correct but are now failing.

  • Latency – Average and per-query response times for Cortex Analyst.

  • Per-query results – For each verified query: the natural language question, the expected SQL, the generated SQL, and whether the result was correct or incorrect. Select a query to see the detailed comparison.

SQL

To retrieve the results of an evaluation run, use the GET_ANALYST_AI_EVALUATION_DATA function. This function has the following required arguments:

  • database: The database containing the semantic view.

  • schema: The schema containing the semantic view.

  • object_name: The name of the semantic view.

  • object_type: The string constant 'SEMANTIC VIEW'.

  • run_name: The name of the evaluation run to retrieve.

The following example displays the full evaluation details for a run called Evaluation run 1, where the semantic view is named SEMANTIC_VIEW_EVAL and is stored in the schema EVAL_DB.EVAL_SCHEMA:

SELECT * FROM TABLE(SNOWFLAKE.LOCAL.GET_ANALYST_AI_EVALUATION_DATA(
  'EVAL_DB',
  'EVAL_SCHEMA',
  'SEMANTIC_VIEW_EVAL',
  'SEMANTIC VIEW',
  'Evaluation run 1')
);

Evaluation results table format

The GET_ANALYST_AI_EVALUATION_DATA function returns a table with the following columns:

  • RECORD_ID (VARCHAR) – The unique identifier assigned by Snowflake for this evaluation record.

  • INPUT_ID (VARCHAR) – The unique identifier assigned by Snowflake for this evaluation input.

  • REQUEST_ID (VARCHAR) – The unique identifier assigned by Snowflake for this request.

  • TIMESTAMP (TIMESTAMP_LTZ) – The time at which the request was made.

  • DURATION_MS (INT) – The amount of time, in milliseconds, that it took for Cortex Analyst to return a response.

  • INPUT (VARCHAR) – The query string used as input for this evaluation record.

  • OUTPUT (VARCHAR) – The response returned by Cortex Analyst for this evaluation record.

  • ERROR (VARCHAR) – Information about any errors that occurred during the request.

  • GROUND_TRUTH (VARCHAR) – The ground truth information used to evaluate this record’s Cortex Analyst output.

  • METRIC_NAME (VARCHAR) – The name of the metric evaluated for this record.

  • EVAL_AGG_SCORE (NUMBER) – The evaluation score assigned for this record.

  • METRIC_TYPE (VARCHAR) – The type of metric being evaluated. For built-in metrics, the value is system; for custom metrics, the value is custom.

  • METRIC_STATUS (VARIANT) – A map containing information about the evaluation’s HTTP response for this record, with the following keys:

    • status: The HTTP status code of the response.

    • message: The HTTP message sent in the status response.

  • METRIC_CALLS (ARRAY) – An array of VARIANT values that contain information about the computed metric. Each array entry contains the metric’s criteria and an explanation of the metric score. The keys of each entry are:

    • criteria: The criteria used to compute SQL correctness.

    • explanation: Details of the compared result sets.
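
The columns above can be combined to isolate failing queries for review. The following is a sketch that assumes an EVAL_AGG_SCORE of 0 marks an incorrect result (the scoring scale is an assumption) and uses the same illustrative object and run names as the earlier example:

```sql
-- Sketch: list failing verified queries with the judge's explanation.
-- Assumes EVAL_AGG_SCORE = 0 marks an incorrect result.
SELECT
  INPUT,                                            -- natural language question
  GROUND_TRUTH,                                     -- expected SQL
  OUTPUT,                                           -- generated SQL
  METRIC_CALLS[0]:explanation::VARCHAR AS explanation
FROM TABLE(SNOWFLAKE.LOCAL.GET_ANALYST_AI_EVALUATION_DATA(
  'EVAL_DB', 'EVAL_SCHEMA', 'SEMANTIC_VIEW_EVAL', 'SEMANTIC VIEW', 'Evaluation run 1'))
WHERE METRIC_NAME = 'sql_correctness'
  AND EVAL_AGG_SCORE = 0;
```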

Analyst evaluation YAML specification

To trigger evaluation runs programmatically, you need a YAML configuration file uploaded to a Snowflake stage. This section describes the YAML format and how to upload it.

YAML format

evaluation:
  analyst_params:
    analyst_name: "SEMANTIC_VIEW_EVAL"
    analyst_type: "SEMANTIC VIEW"
  source_metadata:
    type: "verified_queries"
    # Optional: list specific verified queries by their question text.
    # If omitted, all verified queries are used.
    verified_queries:
      - "What are our top 10 customers from the last 30 days?"
      - "What is the total revenue by region for Q1 2025?"

metrics:
  - "sql_correctness"

analyst_params

  • analyst_name: The name of the semantic view to run the evaluation against.

  • analyst_type: The string constant SEMANTIC VIEW.

source_metadata

  • type: The type of source used as the evaluation data. For Cortex Analyst, the only supported source type is verified_queries.

  • verified_queries (optional): A list of questions matching the question field of each verified query that should be used as ground truth for the evaluation. If not provided, all verified queries are used.

metrics

The metrics to compute for the evaluation. sql_correctness is the only supported metric for Cortex Analyst evaluations.
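
Putting the fields together, a minimal configuration that evaluates against all verified queries simply omits the verified_queries list (the analyst_name value is illustrative):

```yaml
evaluation:
  analyst_params:
    analyst_name: "SEMANTIC_VIEW_EVAL"
    analyst_type: "SEMANTIC VIEW"
  source_metadata:
    type: "verified_queries"

metrics:
  - "sql_correctness"
```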

Upload configuration to a stage

Upload your YAML configuration to a Snowflake stage. The following example creates a file format, creates a stage, and uploads a local configuration file:

CREATE OR REPLACE FILE FORMAT evals_db.evals_schema.yaml_file_format
  TYPE = 'CSV'
  FIELD_DELIMITER = NONE
  RECORD_DELIMITER = '\n'
  SKIP_HEADER = 0
  FIELD_OPTIONALLY_ENCLOSED_BY = NONE
  ESCAPE_UNENCLOSED_FIELD = NONE;

CREATE OR REPLACE STAGE evals_db.evals_schema.metrics
  FILE_FORMAT = evals_db.evals_schema.yaml_file_format;

PUT file:///Users/dev/analyst_evaluation_config.yaml @evals_db.evals_schema.metrics
  AUTO_COMPRESS='false'
  OVERWRITE=TRUE;

Tip

Snowflake recommends keeping your YAML file uncompressed.

Improve your semantic view

Use evaluation results to iteratively improve your semantic view. The recommended workflow is:

  1. Run an evaluation to establish a baseline accuracy score.

  2. Inspect a completed run by selecting it from the Evaluations tab to review the expected vs generated SQL for each query.

  3. Optimize your semantic view by selecting Improve in the Accuracy summary box. This starts semantic view optimization, which analyzes the evaluation failures and automatically suggests changes to your semantic view. For more information, see Optimize an existing semantic view or model with verified queries.

  4. Re-run the evaluation to measure the impact of the changes.

Repeat this cycle to incrementally improve your semantic view’s accuracy. Tracking accuracy across runs lets you detect regressions if a change inadvertently breaks previously correct queries.
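
The re-run in step 4 can use the same SQL entry point with a fresh run name, since run names should be unique per semantic view (the stage path is the one from the earlier examples):

```sql
CALL EXECUTE_AI_EVALUATION(
  'START',
  OBJECT_CONSTRUCT('run_name', 'Evaluation run 2'),
  '@EVAL_DB.EVAL_SCHEMA.METRICS/analyst_evaluation_config.yaml'
);
```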

Known limitations

Cortex Analyst evaluations are subject to the following limitations:

  • Single semantic view per run: Each evaluation run evaluates one semantic view. Evaluating across multiple semantic views in a single run is not supported.

  • No multi-turn evaluation: Evaluation queries are processed independently. Follow-up or multi-turn conversation evaluation is not supported.

  • No auto-generated evaluation datasets: Evaluation sets must be manually curated from verified queries. Automatic generation from query history, dashboards, or synthetic generation is not available.

  • Ground truth staleness: If your verified queries reference time-relative concepts (for example, last quarter rather than Q1 2025), evaluation results may drift over time. Scope queries to specific, absolute dates and time ranges for consistent results.
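
As an illustration of the staleness point, a verified query whose SQL pins the window to fixed dates evaluates consistently, while a relative window shifts with each run (table and column names are hypothetical):

```sql
-- Time-relative: the correct answer changes as CURRENT_DATE advances.
SELECT SUM(revenue) FROM orders
WHERE order_date >= DATEADD('day', -30, CURRENT_DATE);

-- Absolute: stable ground truth across repeated evaluation runs.
SELECT SUM(revenue) FROM orders
WHERE order_date BETWEEN '2025-01-01' AND '2025-03-31';
```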

Cost considerations

Cortex Analyst evaluations run queries against your semantic view and use evaluation judges to score correctness. You are charged for:

  • Warehouse charges for running the evaluation queries against Cortex Analyst using the warehouse selected for the evaluation run.

  • Evaluation credits for the AI_COMPLETE function calls used to compute the sql_correctness metric.

  • Storage charges for datasets and evaluation results stored in your account.

For more information on estimating costs, see Understanding overall cost.