Cortex Analyst semantic model specification

Why use semantic models?

Cortex Analyst allows users to ask questions about Snowflake data using natural language. Typically there is a gap between the vocabulary used by business users when asking data questions and the vocabulary used in the database schema. These users typically use business or domain specific terms, whereas the database schema may use abbreviations or terms often used in ETL pipelines. Database schemas also often lack the semantic details to help someone unfamiliar with a dataset to understand the data. This difference in vocabulary makes it challenging for Cortex Analyst to answer data questions with high precision.

A semantic model is a lightweight mechanism that addresses these issues by allowing for the specification of additional semantic details about a dataset. These additional semantic details, like more descriptive names or synonyms, enable Cortex Analyst to answer data questions much more reliably.

Note

Semantic models are considered metadata.

Key concepts

Note

Throughout this topic, the base database related artifacts are referred to as “physical” artifacts and semantic model related artifacts as “logical” artifacts.

The structure and concepts of the semantic model are similar to those in database schemas, but semantic models allow you to provide more semantic information about your data.

A core concept of Snowflake’s semantic models is the logical table. A logical table can be thought of as a simple view over a physical database table or view. Each logical table is a collection of logical columns, which are categorized into dimensions, time_dimensions, and measures.

A logical column can be either a reference to an underlying physical column or an expression involving one or more physical columns. The expression can be simple, such as a string concatenation or a mathematical operation, or it can include aggregation expressions. For example, you can use an expression like SUM(click)/SUM(spend) for calculating cost-per-click (CPC).

The semantic model can also contain a verified query repository (VQR), which can help improve the accuracy and trustworthiness of results by providing a collection of questions and corresponding SQL queries to answer them.

The semantic model for Cortex Analyst specification is in YAML. It provides the necessary semantic information to answer natural language questions with high precision.

The general syntax is:

# Name and description of the semantic model.
name: <name>
description: <string>

# A semantic model can contain one or more logical tables.
tables:

  # A logical table on top of a base table.
  - name: <name>
    description: <string>

    # The fully qualified name of the base table.
    base_table:
      database: <database>
      schema: <schema>
      table: <base table name>

    # Dimension columns in the logical table.
    dimensions:
      - name: <name>
        synonyms: <array of strings>
        description: <string>
        expr: <SQL expression>
        data_type: <data type>
        unique: <boolean>
        cortex_search_service_name: <string>

    # Time dimension columns in the logical table.
    time_dimensions:
      - name: <name>
        synonyms:  <array of strings>
        description: <string>
        expr: <SQL expression>
        data_type: <data type>
        unique: <boolean>

    # Measure columns in the logical table.
    measures:
      - name: <name>
        synonyms: <array of strings>
        description: <string>
        expr: <SQL expression>
        data_type: <data type>
        default_aggregation: <aggregate function>

    # Commonly used filters over the logical table.
    filters:
      - name: <name>
        synonyms: <array of strings>
        description: <string>
        expr: <SQL expression>

# Example questions and queries that answer them
verified_queries:

# Verified Query (1 of n)
- name:                         # A descriptive name of the query.
  question:                     # The natural language question that this query answers.
  verified_at:                  # Optional: Time (in seconds since the UNIX epoch, January 1, 1970) when the query was verified.
  verified_by:                  # Optional: Name of the person who verified the query.
  use_as_onboarding_question:   # Optional: Marks this question as an onboarding question for the end user.
  sql:                          # The SQL query for answering the question.
Copy

Semantic model generator

You can use the semantic model generator tool to generate a semantic model for use with Cortex Analyst. The tool offers multiple options for generating semantic models: through a Streamlit app, via the command line, or directly in your Python code. The Streamlit app is particularly convenient, allowing you to create semantic models from scratch or iterate on existing ones uploaded to a Snowflake stage.

Defining joins

Cortex Analyst supports SQL joins, enabling more advanced data analysis across multiple tables, especially in star schema structures. This feature allows you to query data from fact tables and associated dimension tables with ease. Currently, Snowflake supports simple star schemas, with validation constraints in place to ensure the accuracy of queries.

A star schema is a widely used design in business intelligence (BI) that consists of:

  • Fact tables: contain quantitative data, such as sales or revenue figures.

  • Dimension tables: hold descriptive attributes related to the fact table, such as customer or product information.

Joins in star schemas typically occur between fact tables and dimension tables based on one-to-one or many-to-one relationships.

Defining joins in the semantic context YAML file

To enable joins in Cortex Analyst, update the semantic context YAML file by adding the relationship definitions after the list of tables. See the example:

relationships:
  - name: <string>
    left_table: <string>
    right_table: <string>
    relationship_columns:
      - left_column: <string>
        right_column: <string>
      - left_column: <string>
        right_column: <string>
    join_type: <join_type>
    relationship_type: <relationship_type>
Copy

Where:

  • name: A unique identifier for the relationship.

  • left_table and right_table: Logical table names as defined earlier in your YAML file.

  • relationship_columns: Specifies equality joins only. Non-equality joins are currently not supported. Must use logical column names (no expressions allowed).

  • join_type: Supports either left_outer or inner joins.

  • relationship_type: Supports many_to_one or one_to_one relationships.

To ensure proper join functionality, primary keys must be defined for the tables involved in relationships. See the example:

tables:
  - name: orders
    primary_key:
      columns:
        - order_id
  - location_id
Copy

Example

For an example of star schema joins defined, see jaffle_shop.yaml.

Known limitations

For complicated schema structures beyond star schema, queries that involve non-simple patterns face the risk of fan trap or chasm trap. Snowflake is working on handling such scenarios to support more complicated schema shapes.

Tips for creating a semantic model

  • Organize your YAML file by business domain or topic

    • Structure your YAML files to align with specific business domains or topics, keeping the scope focused. For instance, create separate semantic models for sales analytics and marketing analytics.

    • Tailor your use cases based on the target audience, expected questions or KPIs, and required data. Well-defined use cases lead to richer semantic models and more effective data retrieval.

  • Think from the end-user perspective

    • Identify key questions that users are likely to ask about the topic and include only the tables and columns needed to answer those questions.

    • Use names and synonyms that are similar to the vocabulary used by your end users.

    • Include important details in description fields that would be helpful to someone writing queries on this dataset for the first time, such as the timezone of a DATETIME column.

  • Capture complex calculations

    Incorporate more difficult or business-specific queries into expressions.

  • Use wide tables instead of long tables

    If you have a table with columns like “metric” and “value”, flatten the table so that each metric is a column. This approach provides the model with more semantic information on each metric.

  • Review auto-generated descriptions

    If you’re using the semantic model generator, it attempts to automatically generate descriptions for your tables and columns. Always review these descriptions to ensure that they are reasonable and relevant; make modifications as needed.

  • Start simple and expand gradually

    A well-scoped semantic file ensures higher precision and accuracy of results. Start with a small number of tables and columns and expand the semantic model YAML gradually to cover more kinds of questions. Remember that YAML building is a continuous process.

  • Include verified queries

    A verified query repository (VQR), which is a collection of plain-English questions and queries that answer them, can help improve the accuracy and trustworthiness of results.

Known limitations

  • Cortex Analyst imposes a 1MB size limit on the semantic model file to restrict the size of API inputs.

  • Cortex Analyst performs in-memory retrieval of sample values and verified queries added to the semantic YAML. After removing all sample values and verified queries, the semantic model cannot exceed 32K tokens (roughly 4 x 32K characters or approximately 128 KB). You can use the semantic model generator validation command to ensure your file stays within these limits. The limits might increase as support is added for models with larger context windows.

Specification

Semantic Model

A semantic model represents a collection of tables. The model contains descriptions of tables, each of which contains descriptions of specifics aspects of the table. Each table described in the model maps to a physical base table in Snowflake.

It has the following fields:

Required

name

A descriptive name for this semantic model.

Must be unique and follow the unquoted identifiers requirements. It also cannot conflict with Snowflake reserved keywords.

Optional (recommended)

description

A description of this semantic model, including details of what kind of analysis it’s useful for.

tables

A list of logical tables in this semantic model.

Table

A logical table can be thought of as a view over a physical database table or view. It has the following fields:

Required

name

A descriptive name for this table.

Must be unique and follow the unquoted identifiers requirements. It also cannot conflict with Snowflake reserved keywords.

base_table

A fully qualified name of the underlying base table in the database.

Optional (recommended)

synonyms

A list of other terms/phrases used to refer to this table. Must be unique across synonyms within the logical table.

description

A description of this table.

dimensions

A list of dimension columns in this table.

time_dimensions

A list of time dimension columns in this table.

measures

A list of measures in this table.

filters

Predefined filters on this table, if any.

Dimension

A dimension describes categorical values such as state, user_type, platform, etc. It has the following fields:

Required

name

A descriptive name for this dimension.

Must be unique and follow the unquoted identifiers requirements. It also cannot conflict with Snowflake reserved keywords.

expr

The SQL expression for this dimension. This could be a reference to a physical column or a SQL expression with one or more columns from the underlying base table.

data_type

The data type of this dimension. For an overview of all data types in Snowflake see SQL data types reference. Note that VARIANT, OBJECT, GEOGRAPHY, and ARRAY are currently not supported.

Optional (recommended)

synonyms

A list of other terms/phrases used to refer to this dimension. Must be unique across all synonyms in this semantic model.

description

A brief description about this dimension, including what data it has.

unique

A boolean value that indicates this dimension has unique values.

sample_values

Sample values of this column, if any. Add any value that are likely to be referenced in the user questions.

cortex_search_service_name

Name of the Cortex Search Service that searches the literal values in the column. If both a service name and sample_values are specified, the values in the sample_values field are ignored.

Time Dimension

A time dimension describes time values, such as sale_date, created_at, and year. It has the following fields:

Required

name

A descriptive name for this time dimension.

Must be unique and follow the unquoted identifiers requirements. It also cannot conflict with Snowflake reserved keywords.

expr

The SQL expression for this column. This could be a reference to a physical column or a SQL expression with one or more columns from the underlying base table.

data_type

The data type of this time dimension. For an overview of all data types in Snowflake see SQL data types reference. Note that VARIANT, OBJECT, GEOGRAPHY, and ARRAY are currently not supported.

Optional (recommended)

synonyms

A list of other terms/phrases used to refer to this time dimension. Must be unique across all synonyms in this semantic model.

description

A brief description about this dimension, including what data it has. Provide information that’ll help someone writing queries using this table. For example, for DATETIME columns, specify the timezone of the data.

unique:

A boolean value that indicates this column has unique values.

sample_values:

Sample values of this column, if any. Add any values that are likely to be referenced in the user questions. This field is optional.

Measure

A measure describes numerical values, such as revenue, impressions, and salary. It has the following fields:

Required

name

A descriptive name for this measure.

Must be unique and follow the unquoted identifiers requirements. It also cannot conflict with Snowflake reserved keywords.

expr

The SQL expression for this column. This could be a reference to a physical column or a SQL expression with one or more columns from the underlying base table.

data_type

The data type of this measure. For an overview of all data types in Snowflake see SQL data types reference. Note that VARIANT, OBJECT, GEOGRAPHY, and ARRAY are currently not supported.

Optional (recommended)

synonyms

A list of other terms/phrases used to refer to this measure. Must be unique across all synonyms in this semantic model.

description

A brief description about this measure, including what data this column has.

unique

A boolean value that indicates this column has unique values.

default_aggregation

The default aggregation applied to this column in the context of a grouping. Some example values are sum, avg, min, max, median, count, count_distinct, etc.

sample_values

Sample values of this column, if any. Add any values that are likely to be referenced in the user questions.

Filter

A filter represents a SQL expression that’s used for filtering. It has the following fields:

Required

name

A descriptive name for this filter.

Must be unique and follow the unquoted identifiers requirements. It also cannot conflict with Snowflake reserved keywords.

expr

The SQL expression of this filter, referencing logical columns.

Optional (recommended)

synonyms

A list of other terms/phrases used to refer to this filter. Must be unique across all synonyms in this semantic model.

description

A brief description about this filter, including details of what this filter is typically used for.

Base Table

A base table is used to represent fully qualified table names. This is the physical table that a logical table maps to. It has the following fields:

Required

database

Name of the database.

schema

Name of the schema.

table

Name of the table.

Verified Queries

See Cortex Analyst Verified Query Repository for information on the purpose and structure of this section of the YAML file.

Example YAML

This section shows an example semantic model YAML.

This example assumes you have the following table in your schema:

CREATE TABLE sales.public.sd_data (
    id INT PRIMARY KEY,
    dt DATETIME,
    cat VARCHAR(255),
    loc VARCHAR(255),
    cntry VARCHAR(255),
    chn VARCHAR(50),
    amt DECIMAL(10, 2),
    unts INT,
    cst DECIMAL(10, 2)
);
Copy

This table has cryptic column names and not much semantic information about the columns. The following is a semantic model YAML that gives more semantic information about this table:

# Name and description of the semantic model.
name: Sales Data
description: This semantic model can be used for asking questions over the sales data.

# A semantic model can contain one or more tables.
tables:

  # A logical table on top of the 'sd_data' base table.
  - name: sales_data
    description: A logical table capturing daily sales information across different store locations and product categories.

    # The fully qualified name of the base table.
    base_table:
      database: sales
      schema: public
      table: sd_data

    # Dimension columns in the logical table.
    dimensions:
      - name: product_category
        synonyms:
          - "item_category"
          - "product_type"
        description: The category of the product sold.
        expr: cat
        data_type: NUMBER
        unique: false
        sample_values:
          - "501"
          - "544"

      - name: store_country
        description: The country where the sale took place.
        expr: cntry
        data_type: TEXT
        unique: false
        sample_values:
          - "USA"
          - "GBR"

      - name: sales_channel
        synonyms:
          - "channel"
          - "distribution_channel"
        description: The channel through which the sale was made.
        expr: chn
        data_type: TEXT
        unique: false
        sample_values:
          - "FB"
          - "GOOGLE"

    # Time dimension columns in the logical table.
    time_dimensions:
      - name: sale_timestamp
        synonyms:
          - "time_of_sale"
          - "transaction_time"
        description: The time when the sale occurred. In UTC.
        expr: dt
        data_type: TIMESTAMP
        unique: false

    # Measure columns in the logical table.
    measures:
      - name: sales_amount
        synonyms:
          - "revenue"
          - "total_sales"
        description: The total amount of money generated from the sale.
        expr: amt
        data_type: NUMBER
        default_aggregation: sum

      - name: sales_tax
        description: The sales tax paid for this sale.
        expr: amt * 0.0975
        data_type: NUMBER
        default_aggregation: sum

      - name: units_sold
        synonyms:
          - "quantity_sold"
          - "number_of_units"
        description: The number of units sold in the transaction.
        expr: unts
        data_type: NUMBER
        default_aggregation: sum

      - name: cost
        description: The cost of the product sold.
        expr: cst
        data_type: NUMBER
        default_aggregation: sum

      - name: profit
        synonyms:
          - "earnings"
          - "net income"
        description: The profit generated from a sale.
        expr: amt - cst
        data_type: NUMBER
        default_aggregation: sum

    # A table can define commonly used filters over it. These filters can then be referenced in user questions directly.
    filters:
      - name: north_america
        synonyms:
          - "North America"
          - "N.A."
          - "NA"
        description: "A filter to restrict only to north american countries"
        expr: cntry IN ('canada', 'mexico', 'usa')
Copy