Cortex Analyst semantic model specification¶
Why use semantic models?¶
Cortex Analyst allows users to query Snowflake data using natural language. However, business users often use language that is incompatible with the schema. While users specify domain-specific business terms in their questions, the underlying data is often stored using technical abbreviations. For example, “CUST” is often used for customers. This disconnect, combined with the schema’s lack of semantic context, makes it difficult for Cortex Analyst to provide accurate answers.
Semantic models map business terminology to database schemas and add contextual meaning. For example, when a user asks about “total revenue last month”, the semantic model can define “revenue” as net revenue and “last month” as the previous calendar month. This mapping helps Cortex Analyst understand the user’s intent and provide accurate answers.
Note
Semantic models are considered metadata.
Key concepts¶
Note
In this topic, database artifacts are called “physical” objects, while semantic model artifacts are called “logical” objects.
The structure and concepts of the semantic model are similar to those in database schemas, but semantic models allow you to provide more semantic information about your data.
Semantic layer concepts¶
The structure and concepts that define the logical semantic layer are similar to those that define the physical database layer. The following are the concept types for a semantic layer:
Logical table-level
Model-level
Additional context
data:image/s3,"s3://crabby-images/3b962/3b96259f05ea6439fc1ab76fbe6b0474f7061ca7" alt="Semantic model concepts showing semantic model at the top."
Logical table-level concepts¶
A logical table is a foundational concept of Snowflake’s semantic model. It represents either a physical database table or a view. It typically corresponds to business entities (such as customers, orders, or suppliers) or dimensions (such as location or time). Each row in a logical table represents typically represents a unique instance of the entity, such as a customer ID.
Logical tables contain the following kinds of columns:
Facts (quantitative data about business events)
Dimensions (who, what, where, and how)
Time (when the event occurred)
Filters associated with the logical table to allow query results to be limited to specific data subsets.
You can define metrics by using aggregations or combinations of other logical objects.
The following examples use the TPC-H schema, which includes the LINEITEM fact table. For a complete YAML implementation, download the snow_tpch
file.
data:image/s3,"s3://crabby-images/b76c8/b76c81d24f2a32f5d62429dffee007bed8dcff63" alt="TPC-H schema showing the relationship between the LINEITEM fact table and the DIMENSIONS tables."
All of the following logical table-level examples belong to the order_lineitems
logical table.
tables:
- name: order_lineitems
description: >
The order line items table contains detailed information about each item within an
order, including quantities, pricing, and dates.
base_table:
database: SNOWFLAKE_SAMPLE_DATA
schema: TPCH_SF1
table: LINEITEM
primary_key:
columns:
- order_key
- order_lineitem_number
A dimension represents categorical data that provides context to facts, such as product, customer, or location information. Dimensions typically contain descriptive text values, such as product names or customer addresses. They are used to filter, group, and label facts in analyses and reports.
A time dimension provides temporal context for analyzing facts across different periods. It enables tracking metrics over specific time intervals (dates, months, years) and supports analyses such as trend identification and period-over-period comparisons.
time_dimensions:
- name: shipment_duration
synonyms:
- "shipping time"
- "shipment time"
description: The time it takes for items to be shipped.
expr: DATEDIFF(day, lineitem.L_SHIPDATE, lineitem.L_RECEIPTDATE)
data_type: NUMBER
unique: false
Facts are measurable, quantitative data that provide context for analyses. Facts represent numeric values related to business processes, such as sales, cost, or quantity. A fact is an unaggregated, row-level concept.
# Fact columns in the logical table.
facts:
- name: net_revenue
synonyms:
- "revenue after discount"
- "net sales"
description: Net revenue after applying discounts.
expr: lineitem.L_EXTENDEDPRICE * (1 - lineitem.L_DISCOUNT)
data_type: NUMBER
A filter is a condition that limits query results to specific data subsets based on criteria such as time period, location, or category.
- name: north_america
synonyms:
- "NA"
- "North America region"
description: >
Filter to restrict data to orders from North America.
comments: Used for analysis focusing on North American customers.
expr: nation.N_NAME IN ('Canada', 'Mexico', 'United States')
A metric is a quantifiable measure of business performance, typically calculated by aggregating facts across multiple rows. You express a metric as a SQL formula that can combine multiple logical objects and aggregate functions, such as SUM() or AVG(). You can use metrics as key performance indicators (KPIs) in reports and dashboards. Define metrics at their most granular level for aggregation at higher levels. For example, define total_revenue at the lineitem level to allow aggregation by customer, supplier, or region.
metrics:
# Simple metric referencing objects from the same logical table
- name: total_revenue
expr: SUM(lineitem.l_extendedprice * (1 - lineitem.l_discount))
# Complex metric referencing objects from multiple logical tables.
# The relationships between tables have been defined below.
- name: total_profit_margin
description: >
The profit margin from orders. This metric is not additive
and should always be calculated directly from the base tables.
expr: (SUM(order_lineitems.net_revenue) -
SUM(part_suppliers.part_supplier_cost * order_lineitems.lineitem_quantity))
/ SUM(order_lineitems.net_revenue)
Allowed references¶
Expressions for dimensions, facts, metrics, or filters can reference:
Physical columns from their own base table
Logical columns within the same logical table
Logical columns from other logical tables in the semantic model
Note
Expressions can’t reference physical columns from other physical tables.
Model level concepts¶
Relationships connect logical tables through joins on shared keys. For example, there could be a relationship between the customers and orders tables through a join on the customer_id column. You can use joins to analyze order data with customer attributes.
relationships:
# Relationship between orders and lineitems
- name: order_lineitems_to_orders
left_table: order_lineitems
right_table: orders
join_key:
- left_column: order_key
right_column: order_key
join_type: left_outer
relationship_type: many_to_one
# Relationship between lineitems and partsuppliers
- name: order_lineitems_to_part_suppliers
left_table: order_lineitems
right_table: part_suppliers
# The relationship requires equality of multiple columns from each table
relationship_columns:
- left_column: part_key
right_column: part_key
- left_column: supplier_key
right_column: supplier_key
join_type: left_outer
relationship_type: many_to_one
A verified query repository (VQR) is a collection of questions and corresponding SQL queries that are verified to be correct. You can use the queries to help improve the accuracy of Cortex Analyst’s’ results.
You can use Custom instructions in Cortex Analyst to have greater control over Cortex Analyst’s SQL query generation. Within the custom instructions, you provide the unique context of your business to the LLM.
Cortex Analyst semantic models are spefified in YAML. The model provides the necessary semantic information to answer natural language questions with high precision.
YAML format¶
YAML is balanced between human readability and precision. Business users can understand it while data engineers and analysts can define technical concepts clearly.
The general syntax of a Cortex Analyst semantic model is:
# Name and description of the semantic model.
name: <name>
description: <string>
comments: <string>
# Logical table-level concepts
# A semantic model can contain one or more logical tables.
tables:
# A logical table on top of a base table.
- name: <name>
description: <string>
# The fully qualified name of the base table.
base_table:
database: <database>
schema: <schema>
table: <base table name>
# Dimension columns in the logical table.
dimensions:
- name: <name>
synonyms: <array of strings>
description: <string>
expr: <SQL expression>
data_type: <data type>
unique: <boolean>
cortex_search_service:
service: <string>
literal_column: <string>
database: <string>
schema: <string>
is_enum: <boolean>
# Time dimension columns in the logical table.
time_dimensions:
- name: <name>
synonyms: <array of strings>
description: <string>
expr: <SQL expression>
data_type: <data type>
unique: <boolean>
# Fact columns in the logical table.
facts:
- name: <name>
synonyms: <array of strings>
description: <string>
expr: <SQL expression>
data_type: <data type>
# Business metrics across logical objects
metrics:
- name: <name>
synonyms: <array of strings>
description: <string>
expr: <SQL expression>
data_type: <data type>
# Commonly used filters over the logical table.
filters:
- name: <name>
synonyms: <array of strings>
description: <string>
expr: <SQL expression>
# Model-level concepts
# Relationships between logical tables
relationships:
- name: <string>
description: <string>
left_table: <table>
right_table: <table>
join_key:
- left_column: <column>
right_column: <column>
- left_column: <column>
right_column: <column>
join_type: <left_outer | inner>
relationship_type: < one_to_one | many_to_one>
# Additional context concepts
# Verified queries with example questions and queries that answer them
verified_queries:
# Verified Query (1 of n)
- name: # A descriptive name of the query.
question: # The natural language question that this query answers.
verified_at: # Optional: Time (in seconds since the UNIX epoch, January 1, 1970) when the query was verified.
verified_by: # Optional: Name of the person who verified the query.
use_as_onboarding_question: # Optional: Marks this question as an onboarding question for the end user.
expr: # The SQL query for answering the question
For an example specification, see the snow_tpch.yaml
file.
Create a semantic model using the model generator¶
Use the semantic model generator to create a semantic model from your tables. Instead of creating a semantic model manually with your own YAML specification, you can use the model generator within Snowsight to save time. The process of creating a semantic model involves doing the following:
Providing a description with basic information about the model.
Providing the data source that you’re using to create the model. You must provide at least one table or view.
Selecting the columns that you’re using to create the model.
To get started with creating the model, navigate to the Let’s Create a Semantic Model page:
In Snowsight, select AI & ML.
Next to Cortex Analyst, select Try.
Choose Create new.
You’ve opened the description page of the semantic model generator. To create a semantic model, do the following:
For Description, specify information about the semantic model. You must provide the file name, file location, and model name. You can also provide a description with context about the data.
(Optional) For User Questions, specify the types of questions that users can ask about the data. The model generator uses the information that you provide in the field to suggest tables, views, columns that you can use to create the model.
For Select Data (Table/View), provide the data source that you’re using to create the semantic model. You must provide at least one table or view. There’s no limit on the tables or views that you can specify, but we recommend not using more than 10 for the semantic model.
For Select Columns, select the columns that you’re using to create the semantic model. You can select all the columns or specific columns. For performance reasons, we recommend not using more than 50 columns.
After you’ve created the model, save it to a stage. To save, choose Save at the top-right corner of the screen. If you need to make further modifications, you can edit the model by either using Snowsight or by editing the YAML file directly.
Open an existing semantic model¶
After you’ve created a semantic model, you can open it in Snowsight. To open a semantic model, do the following:
Select Open semantic model.
Select Open.
Select Select from stage.
Select the database and schema.
Click outside of the dialog box.
Select the stage where you saved the file.
Select
Open
.
Note
If you aren’t seeing your semantic model within your stage, try refreshing the list of models, not the page.
Tips for creating a semantic model¶
Organize your YAML file by business domain or topic
Structure your YAML files to align with specific business domains or topics, keeping the scope focused. For instance, create separate semantic models for sales analytics and marketing analytics.
Tailor your use cases based on the target audience, expected questions or KPIs, and required data. Well-defined use cases lead to richer semantic models and more effective data retrieval.
Think from the end-user perspective
Identify key questions that users are likely to ask about the topic and include only the tables and columns needed to answer those questions.
Use names and synonyms that are similar to the vocabulary used by your end users.
Include important details in description fields that would be helpful to someone writing queries on this dataset for the first time, such as the timezone of a DATETIME column.
Capture complex calculations
Incorporate more difficult or business-specific queries into expressions.
Use wide tables instead of long tables
If you have a table with columns like “metric” and “value”, flatten the table so that each metric is a column. This approach provides the model with more semantic information on each metric.
Review auto-generated descriptions
If you’re using the semantic model generator, it attempts to automatically generate descriptions for your tables and columns. Always review these descriptions to ensure that they are reasonable and relevant; make modifications as needed.
Start simple and expand gradually
A well-scoped semantic file ensures higher precision and accuracy of results. Start with a small number of tables and columns and expand the semantic model YAML gradually to cover more kinds of questions. Remember that YAML building is a continuous process.
Include verified queries
A verified query repository (VQR), which is a collection of plain-English questions and queries that answer them, can help improve the accuracy and trustworthiness of results.
Known limitations¶
Cortex Analyst imposes a 1 MB size limit on the semantic model file to restrict the size of API inputs.
Cortex Analyst performs in-memory retrieval of sample values and verified queries added to the semantic YAML. After removing all sample values and verified queries, the semantic model cannot exceed 32K tokens (roughly 4 x 32K characters or approximately 128 KB). You can use the semantic model generator validation command to ensure your file stays within these limits. The limits might increase as support is added for models with larger context windows.
Specification¶
This section contains detailed specifications for the key concepts described in the preceding sections.
Semantic Model¶
A semantic model represents a collection of tables. The model contains descriptions of tables, each of which contains descriptions of specifics aspects of the table. Each table described in the model maps to a physical base table in Snowflake.
It has the following fields:
Required
name
A descriptive name for this semantic model.
Must be unique and follow the unquoted identifiers requirements. It also cannot conflict with Snowflake reserved keywords.
Optional
These fields are not required, but should be included whenever possible.
description
A description of this semantic model, including details of what kind of analysis it’s useful for.
tables
A list of logical tables in this semantic model.
relationships
A list of joins between logical tables.
Logical table¶
You can think of a logical table as a view over a physical database table or view. It has the following fields:
Required
name
A descriptive name for this table.
Must be unique and follow the unquoted identifiers requirements. It also cannot conflict with Snowflake reserved keywords.
base_table
A fully qualified name of the underlying base table in the database.
Optional
These fields are not required, but should be included whenever possible.
synonyms
A list of other terms/phrases used to refer to this table. It must be unique across synonyms within the logical table.
description
A description of this table.
primary_key
The primary key columns for this table. Required if you’re defining relationnships.
dimensions
A list of dimension columns in this table.
time_dimensions
A list of time dimension columns in this table.
facts
A list of fact columns in this table.
metrics
A list of metrics in this table.
filters
Predefined filters on this table, if any.
Dimension¶
A dimension describes categorical values such as state, user_type, platform, etc. It has the following fields:
Required
name
A descriptive name for this dimension.
Must be unique and follow the unquoted identifiers requirements. It also cannot conflict with Snowflake reserved keywords.
expr
The SQL expression for this dimension. This could be a reference to a physical column or a SQL expression with one or more columns from the underlying base table.
data_type
The data type of this dimension. For an overview of all data types in Snowflake see SQL data types reference. Note that
VARIANT
,OBJECT
,GEOGRAPHY
, andARRAY
are currently not supported.
Optional
These fields are not required, but should be included whenever possible.
synonyms
A list of other terms/phrases used to refer to this dimension. Must be unique across all synonyms in this semantic model.
description
A brief description about this dimension, including what data it has.
unique
A boolean value that indicates this dimension has unique values.
sample_values
Sample values of this column, if any. Add any value that is likely to be referenced in the user questions.
cortex_search_service
Specifies the Cortex Search Service to use for this dimension. It has the following fields:
service
: The name of the Cortex Search Service.literal_column
: (optional) The column in the Cortex Search Service that contains the literal values.database
: (optional) The database where the Cortex Search Service is located. Defaults tobase_table
’s database.schema
: (optional) The schema where the Cortex Search Service is located. Defaults tobase_table
’s schema.
This field replaces the
cortex_search_service_name
field, which could only specify the name.cortex_search_service_name
has been deprecated.is_enum
A Boolean value. If
True
, the values in thesample_values
field are taken to be the full list of possible values, and the model only chooses from those values when filtering on that column.
Time Dimension¶
A time dimension describes time values, such as sale_date, created_at, and year. It has the following fields:
Required
name
A descriptive name for this time dimension.
Must be unique and follow the unquoted identifiers requirements. It also cannot conflict with Snowflake reserved keywords.
expr
The SQL expression for this column. This could be a reference to a physical column or a SQL expression with one or more columns from the underlying base table.
data_type
The data type of this time dimension. For an overview of all data types in Snowflake see SQL data types reference. Note that
VARIANT
,OBJECT
,GEOGRAPHY
, andARRAY
are currently not supported.
Optional
These fields are not required, but should be included whenever possible.
synonyms
A list of other terms/phrases used to refer to this time dimension. Must be unique across all synonyms in this semantic model.
description
A brief description about this dimension, including what data it has. Provide information that’ll help someone writing queries using this table. For example, for DATETIME columns, specify the timezone of the data.
unique
:A boolean value that indicates this column has unique values.
sample_values
:Sample values of this column, if any. Add any values that are likely to be referenced in the user questions. This field is optional.
Fact¶
A fact describes numerical values, such as revenue, impressions, and salary. Facts used to be called measures in earlier releases of Cortex Analyst. Facts are backward compatible with measures. It has the following fields:
Required
name
A descriptive name for this fact.
Must be unique and follow the unquoted identifiers requirements. It also cannot conflict with Snowflake reserved keywords.
expr
This SQL expression can refer to either a physical column in the same logical table’s base physical table or a logical column (fact, dimension, or time dimension) within that logical table.
data_type
The data type of this fact. For an overview of all data types in Snowflake see SQL data types reference. Note that
VARIANT
,OBJECT
,GEOGRAPHY
, andARRAY
are currently not supported.
Optional)
These fields are not required, but should be included whenever possible.
synonyms
A list of other terms/phrases used to refer to this measure. Must be unique across all synonyms in this semantic model.
description
A brief description about this measure, including what data this column has.
unique
A boolean value that indicates this column has unique values.
default_aggregation
The default aggregation applied to this column in the context of a grouping. Some example values are sum, avg, min, max, median, count, count_distinct, etc.
sample_values
Sample values of this column, if any. Add any values that are likely to be referenced in the user questions.
Filter¶
A filter represents a SQL expression that’s used for filtering. It has the following fields:
Required
name
A descriptive name for this filter.
Must be unique and follow the unquoted identifiers requirements. It also cannot conflict with Snowflake reserved keywords.
expr
The SQL expression of this filter, referencing logical columns.
Optional
These fields are not required, but should be included whenever possible.
synonyms
A list of other terms/phrases used to refer to this filter. Must be unique across all synonyms in this semantic model.
description
A brief description about this filter, including details of what this filter is typically used for.
Metric¶
A metric describes quantifiable measures of business performance, such as total revenue, average order value, or customer count. It has the following fields:
Required
name
A descriptive name for this metric. It must be unique and follow the requirements of the unquoted identifiers. It also cannot conflict with Snowflake reserved keywords.
Must be unique and follow the unquoted identifiers requirements. It also cannot conflict with Snowflake reserved keywords.
expr
The SQL expression for this column. This could reference a logical column (fact, dimension, or time dimension) in the same logical table or a logical column from another logical table within the semantic model.
data_type
The data type of this metric. For an overview of all data types in Snowflake, see the reference for SQL data types. Note that
VARIANT
,OBJECT
,GEOGRAPHY
, andARRAY
are currently not supported.default_aggregation
The default aggregation applied to this column in the context of a grouping. Some example values are sum, avg, min, max, median, count, count_distinct, etc.
Optional
These fields are not required, but should be included whenever possible.
synonyms
A list of other terms/phrases used to refer to this metric. It must be unique across all synonyms in this semantic model.
description
A brief description of this metric, including what data this column has.
sample_values
Sample values of this column, if any. Add any values that are likely to be referenced in the user questions.
Base Table¶
A base table is used to represent fully qualified table names. This is the physical table that a logical table maps to. It has the following fields:
Required
database
Name of the database.
schema
Name of the schema.
table
Name of the table.
Primary Key¶
A primary key represents the columns that uniquely represent each row of the table. The primary key is required if the table is used in a relationship. It has the following fields:
Required
columns
A list of dimension columns uniquely representing the table.
Relationships¶
Defines join relationships between logical tables. To ensure proper join functionality, primary keys must be defined for the tables involved in relationships. It has the following fields:
Required
name
A unique identifier for the relationship.
left_table
Logical table name as defined earlier in your YAML file. For many-to-one relationships, the left table should be the many side of the relationship for optimal performance.
right_table
Logical table name as defined earlier in your YAML file. For many-to-one relationships, the right table must be the one side of the relationship for optimal performance.
relationship_columns
A list of equal columns from each of the left table and right table representing the join path.
join_type
Either
left_outer
orinner
.relationship_type
Either
many_to_one
orone_to_one
.
Verified Queries¶
See Cortex Analyst Verified Query Repository for information on the purpose and structure of this section of the YAML file.
Custom instructions¶
For infomation about custom instructions, see Custom instructions in Cortex Analyst.