Cortex Analyst semantic model specification¶
Why use semantic models?¶
Cortex Analyst allows users to ask questions about Snowflake data using natural language. Typically there is a gap between the vocabulary used by business users when asking data questions and the vocabulary used in the database schema. These users typically use business or domain specific terms, whereas the database schema may use abbreviations or terms often used in ETL pipelines. Database schemas also often lack the semantic details to help someone unfamiliar with a dataset to understand the data. This difference in vocabulary makes it challenging for Cortex Analyst to answer data questions with high precision.
A semantic model is a lightweight mechanism that addresses these issues by allowing for the specification of additional semantic details about a dataset. These additional semantic details, like more descriptive names or synonyms, enable Cortex Analyst to answer data questions much more reliably.
Note
Semantic models are considered metadata.
Key concepts¶
Note
Throughout this topic, the base database related artifacts are referred to as “physical” artifacts and semantic model related artifacts as “logical” artifacts.
The structure and concepts of the semantic model are similar to those in database schemas, but semantic models allow you to provide more semantic information about your data.
A core concept of Snowflake’s semantic models is the logical table. A logical table can be thought of as a simple view over a
physical database table or view. Each logical table is a collection of logical columns, which are categorized into dimensions
,
time_dimensions
, and measures
.
A logical column can be either a reference to an underlying physical column or an expression involving one or more
physical columns. The expression can be simple, such as a string concatenation or a mathematical operation, or it can
include aggregation expressions. For example, you can use an expression like SUM(click)/SUM(spend)
for calculating
cost-per-click (CPC).
The semantic model can also contain a verified query repository (VQR), which can help improve the accuracy and trustworthiness of results by providing a collection of questions and corresponding SQL queries to answer them.
The semantic model for Cortex Analyst specification is in YAML. It provides the necessary semantic information to answer natural language questions with high precision.
The general syntax is:
# Name and description of the semantic model.
name: <name>
description: <string>
# A semantic model can contain one or more logical tables.
tables:
# A logical table on top of a base table.
- name: <name>
description: <string>
# The fully qualified name of the base table.
base_table:
database: <database>
schema: <schema>
table: <base table name>
# Dimension columns in the logical table.
dimensions:
- name: <name>
synonyms: <array of strings>
description: <string>
expr: <SQL expression>
data_type: <data type>
unique: <boolean>
cortex_search_service_name: <string>
# Time dimension columns in the logical table.
time_dimensions:
- name: <name>
synonyms: <array of strings>
description: <string>
expr: <SQL expression>
data_type: <data type>
unique: <boolean>
# Measure columns in the logical table.
measures:
- name: <name>
synonyms: <array of strings>
description: <string>
expr: <SQL expression>
data_type: <data type>
default_aggregation: <aggregate function>
# Commonly used filters over the logical table.
filters:
- name: <name>
synonyms: <array of strings>
description: <string>
expr: <SQL expression>
# Example questions and queries that answer them
verified_queries:
# Verified Query (1 of n)
- name: # A descriptive name of the query.
question: # The natural language question that this query answers.
verified_at: # Optional: Time (in seconds since the UNIX epoch, January 1, 1970) when the query was verified.
verified_by: # Optional: Name of the person who verified the query.
use_as_onboarding_question: # Optional: Marks this question as an onboarding question for the end user.
sql: # The SQL query for answering the question.
Semantic model generator¶
You can use the semantic model generator tool to generate a semantic model for use with Cortex Analyst. The tool offers multiple options for generating semantic models: through a Streamlit app, via the command line, or directly in your Python code. The Streamlit app is particularly convenient, allowing you to create semantic models from scratch or iterate on existing ones uploaded to a Snowflake stage.
Defining joins¶
Cortex Analyst supports SQL joins, enabling more advanced data analysis across multiple tables, especially in star schema structures. This feature allows you to query data from fact tables and associated dimension tables with ease. Currently, Snowflake supports simple star schemas, with validation constraints in place to ensure the accuracy of queries.
A star schema is a widely used design in business intelligence (BI) that consists of:
Fact tables: contain quantitative data, such as sales or revenue figures.
Dimension tables: hold descriptive attributes related to the fact table, such as customer or product information.
Joins in star schemas typically occur between fact tables and dimension tables based on one-to-one or many-to-one relationships.
Defining joins in the semantic context YAML file¶
To enable joins in Cortex Analyst, update the semantic context YAML file by adding the relationship definitions after the list of tables. See the example:
relationships:
- name: <string>
left_table: <string>
right_table: <string>
relationship_columns:
- left_column: <string>
right_column: <string>
- left_column: <string>
right_column: <string>
join_type: <join_type>
relationship_type: <relationship_type>
Where:
name
: A unique identifier for the relationship.left_table
andright_table
: Logical table names as defined earlier in your YAML file.relationship_columns
: Specifies equality joins only. Non-equality joins are currently not supported. Must use logical column names (no expressions allowed).join_type
: Supports eitherleft_outer
orinner joins
.relationship_type
: Supportsmany_to_one
orone_to_one
relationships.
To ensure proper join functionality, primary keys must be defined for the tables involved in relationships. See the example:
tables:
- name: orders
primary_key:
columns:
- order_id
- location_id
Example¶
For an example of star schema joins defined, see jaffle_shop.yaml.
Known limitations¶
For complicated schema structures beyond star schema, queries that involve non-simple patterns face the risk of fan trap or chasm trap. Snowflake is working on handling such scenarios to support more complicated schema shapes.
Tips for creating a semantic model¶
Organize your YAML file by business domain or topic
Structure your YAML files to align with specific business domains or topics, keeping the scope focused. For instance, create separate semantic models for sales analytics and marketing analytics.
Tailor your use cases based on the target audience, expected questions or KPIs, and required data. Well-defined use cases lead to richer semantic models and more effective data retrieval.
Think from the end-user perspective
Identify key questions that users are likely to ask about the topic and include only the tables and columns needed to answer those questions.
Use names and synonyms that are similar to the vocabulary used by your end users.
Include important details in description fields that would be helpful to someone writing queries on this dataset for the first time, such as the timezone of a DATETIME column.
Capture complex calculations
Incorporate more difficult or business-specific queries into expressions.
Use wide tables instead of long tables
If you have a table with columns like “metric” and “value”, flatten the table so that each metric is a column. This approach provides the model with more semantic information on each metric.
Review auto-generated descriptions
If you’re using the semantic model generator, it attempts to automatically generate descriptions for your tables and columns. Always review these descriptions to ensure that they are reasonable and relevant; make modifications as needed.
Start simple and expand gradually
A well-scoped semantic file ensures higher precision and accuracy of results. Start with a small number of tables and columns and expand the semantic model YAML gradually to cover more kinds of questions. Remember that YAML building is a continuous process.
Include verified queries
A verified query repository (VQR), which is a collection of plain-English questions and queries that answer them, can help improve the accuracy and trustworthiness of results.
Known limitations¶
Cortex Analyst imposes a 1MB size limit on the semantic model file to restrict the size of API inputs.
Cortex Analyst performs in-memory retrieval of sample values and verified queries added to the semantic YAML. After removing all sample values and verified queries, the semantic model cannot exceed 32K tokens (roughly 4 x 32K characters or approximately 128 KB). You can use the semantic model generator validation command to ensure your file stays within these limits. The limits might increase as support is added for models with larger context windows.
Specification¶
Semantic Model¶
A semantic model represents a collection of tables. The model contains descriptions of tables, each of which contains descriptions of specifics aspects of the table. Each table described in the model maps to a physical base table in Snowflake.
It has the following fields:
Required
name
A descriptive name for this semantic model.
Must be unique and follow the unquoted identifiers requirements. It also cannot conflict with Snowflake reserved keywords.
Optional (recommended)
description
A description of this semantic model, including details of what kind of analysis it’s useful for.
tables
A list of logical tables in this semantic model.
Table¶
A logical table can be thought of as a view over a physical database table or view. It has the following fields:
Required
name
A descriptive name for this table.
Must be unique and follow the unquoted identifiers requirements. It also cannot conflict with Snowflake reserved keywords.
base_table
A fully qualified name of the underlying base table in the database.
Optional (recommended)
synonyms
A list of other terms/phrases used to refer to this table. Must be unique across synonyms within the logical table.
description
A description of this table.
dimensions
A list of dimension columns in this table.
time_dimensions
A list of time dimension columns in this table.
measures
A list of measures in this table.
filters
Predefined filters on this table, if any.
Dimension¶
A dimension describes categorical values such as state, user_type, platform, etc. It has the following fields:
Required
name
A descriptive name for this dimension.
Must be unique and follow the unquoted identifiers requirements. It also cannot conflict with Snowflake reserved keywords.
expr
The SQL expression for this dimension. This could be a reference to a physical column or a SQL expression with one or more columns from the underlying base table.
data_type
The data type of this dimension. For an overview of all data types in Snowflake see SQL data types reference. Note that
VARIANT
,OBJECT
,GEOGRAPHY
, andARRAY
are currently not supported.
Optional (recommended)
synonyms
A list of other terms/phrases used to refer to this dimension. Must be unique across all synonyms in this semantic model.
description
A brief description about this dimension, including what data it has.
unique
A boolean value that indicates this dimension has unique values.
sample_values
Sample values of this column, if any. Add any value that are likely to be referenced in the user questions.
cortex_search_service_name
Name of the Cortex Search Service that searches the literal values in the column. If both a service name and
sample_values
are specified, the values in thesample_values
field are ignored.
Time Dimension¶
A time dimension describes time values, such as sale_date, created_at, and year. It has the following fields:
Required
name
A descriptive name for this time dimension.
Must be unique and follow the unquoted identifiers requirements. It also cannot conflict with Snowflake reserved keywords.
expr
The SQL expression for this column. This could be a reference to a physical column or a SQL expression with one or more columns from the underlying base table.
data_type
The data type of this time dimension. For an overview of all data types in Snowflake see SQL data types reference. Note that
VARIANT
,OBJECT
,GEOGRAPHY
, andARRAY
are currently not supported.
Optional (recommended)
synonyms
A list of other terms/phrases used to refer to this time dimension. Must be unique across all synonyms in this semantic model.
description
A brief description about this dimension, including what data it has. Provide information that’ll help someone writing queries using this table. For example, for DATETIME columns, specify the timezone of the data.
unique
:A boolean value that indicates this column has unique values.
sample_values
:Sample values of this column, if any. Add any values that are likely to be referenced in the user questions. This field is optional.
Measure¶
A measure describes numerical values, such as revenue, impressions, and salary. It has the following fields:
Required
name
A descriptive name for this measure.
Must be unique and follow the unquoted identifiers requirements. It also cannot conflict with Snowflake reserved keywords.
expr
The SQL expression for this column. This could be a reference to a physical column or a SQL expression with one or more columns from the underlying base table.
data_type
The data type of this measure. For an overview of all data types in Snowflake see SQL data types reference. Note that
VARIANT
,OBJECT
,GEOGRAPHY
, andARRAY
are currently not supported.
Optional (recommended)
synonyms
A list of other terms/phrases used to refer to this measure. Must be unique across all synonyms in this semantic model.
description
A brief description about this measure, including what data this column has.
unique
A boolean value that indicates this column has unique values.
default_aggregation
The default aggregation applied to this column in the context of a grouping. Some example values are sum, avg, min, max, median, count, count_distinct, etc.
sample_values
Sample values of this column, if any. Add any values that are likely to be referenced in the user questions.
Filter¶
A filter represents a SQL expression that’s used for filtering. It has the following fields:
Required
name
A descriptive name for this filter.
Must be unique and follow the unquoted identifiers requirements. It also cannot conflict with Snowflake reserved keywords.
expr
The SQL expression of this filter, referencing logical columns.
Optional (recommended)
synonyms
A list of other terms/phrases used to refer to this filter. Must be unique across all synonyms in this semantic model.
description
A brief description about this filter, including details of what this filter is typically used for.
Base Table¶
A base table is used to represent fully qualified table names. This is the physical table that a logical table maps to. It has the following fields:
Required
database
Name of the database.
schema
Name of the schema.
table
Name of the table.
Verified Queries¶
See Cortex Analyst Verified Query Repository for information on the purpose and structure of this section of the YAML file.
Example YAML¶
This section shows an example semantic model YAML.
This example assumes you have the following table in your schema:
CREATE TABLE sales.public.sd_data (
id INT PRIMARY KEY,
dt DATETIME,
cat VARCHAR(255),
loc VARCHAR(255),
cntry VARCHAR(255),
chn VARCHAR(50),
amt DECIMAL(10, 2),
unts INT,
cst DECIMAL(10, 2)
);
This table has cryptic column names and not much semantic information about the columns. The following is a semantic model YAML that gives more semantic information about this table:
# Name and description of the semantic model.
name: Sales Data
description: This semantic model can be used for asking questions over the sales data.
# A semantic model can contain one or more tables.
tables:
# A logical table on top of the 'sd_data' base table.
- name: sales_data
description: A logical table capturing daily sales information across different store locations and product categories.
# The fully qualified name of the base table.
base_table:
database: sales
schema: public
table: sd_data
# Dimension columns in the logical table.
dimensions:
- name: product_category
synonyms:
- "item_category"
- "product_type"
description: The category of the product sold.
expr: cat
data_type: NUMBER
unique: false
sample_values:
- "501"
- "544"
- name: store_country
description: The country where the sale took place.
expr: cntry
data_type: TEXT
unique: false
sample_values:
- "USA"
- "GBR"
- name: sales_channel
synonyms:
- "channel"
- "distribution_channel"
description: The channel through which the sale was made.
expr: chn
data_type: TEXT
unique: false
sample_values:
- "FB"
- "GOOGLE"
# Time dimension columns in the logical table.
time_dimensions:
- name: sale_timestamp
synonyms:
- "time_of_sale"
- "transaction_time"
description: The time when the sale occurred. In UTC.
expr: dt
data_type: TIMESTAMP
unique: false
# Measure columns in the logical table.
measures:
- name: sales_amount
synonyms:
- "revenue"
- "total_sales"
description: The total amount of money generated from the sale.
expr: amt
data_type: NUMBER
default_aggregation: sum
- name: sales_tax
description: The sales tax paid for this sale.
expr: amt * 0.0975
data_type: NUMBER
default_aggregation: sum
- name: units_sold
synonyms:
- "quantity_sold"
- "number_of_units"
description: The number of units sold in the transaction.
expr: unts
data_type: NUMBER
default_aggregation: sum
- name: cost
description: The cost of the product sold.
expr: cst
data_type: NUMBER
default_aggregation: sum
- name: profit
synonyms:
- "earnings"
- "net income"
description: The profit generated from a sale.
expr: amt - cst
data_type: NUMBER
default_aggregation: sum
# A table can define commonly used filters over it. These filters can then be referenced in user questions directly.
filters:
- name: north_america
synonyms:
- "North America"
- "N.A."
- "NA"
description: "A filter to restrict only to north american countries"
expr: cntry IN ('canada', 'mexico', 'usa')