Introduction to Classification

This topic provides information on how classification works.

Overview

Classification is a multi-step process that associates Snowflake-defined tags (i.e. system tags) to columns by analyzing the cells and metadata for personal data; this data can now be tracked by a data engineer.

Based on the tracking information and related audit processes, the data engineer can protect the column containing personal or sensitive data with a masking policy or the table containing this column with a row access policy.

The overall result of the classification and data protection steps is to facilitate compliance with data privacy regulations.

Process

This section explains how the Classification three-step process works. You can repeat this process as often as needed, such as after inserting new rows into a table or adding new columns to a table.

Classification simplifies to a three-step process: analyze, review, and apply. Each of these steps have different operations:

Analyze

The data engineer calls the EXTRACT_SEMANTIC_CATEGORIES function in the Snowflake account. This function analyzes columns in a table and outputs the possible categories and associated probabilities. The probability value indicates the likelihood that the column contains personal data as defined by the possible categories. For details, refer to Interpret the Classification Output (in this topic).

Note

Complex view definitions, such as views that require JOIN operations on multiple objects, can slow the analyze process. Prior to classifying a complex view, consider the warehouse in use. For details, refer to Compute Costs (in this topic).

Review

In the review step, the data engineer reviews the category results to ensure the results of the analyze step operations make sense. If no revisions are necessary, the data engineer can proceed to the apply step.

If revisions are necessary the data engineer can revise the output of the analyze step before moving to the apply step.

For details, refer to Interpret the Classification Output (in this topic).

Apply

The apply step has one operation: assign the system tag.

The data engineer can manually set a system tag on a column or call the ASSOCIATE_SEMANTIC_CATEGORY_TAGS stored procedure.

After assigning the system tags, the data engineer can then track the system tag and protect the personal or sensitive data with a masking policy or a row access policy.

For detailed examples of this process, refer to Using Data Classification.

System Tags & Categories

System tags are tags that Snowflake creates, maintains, and makes available in the shared SNOWFLAKE database. There are two Classification system tags, both of which exist in the SNOWFLAKE.CORE schema:

  • SNOWFLAKE.CORE.SEMANTIC_CATEGORY

  • SNOWFLAKE.CORE.PRIVACY_CATEGORY

The data engineer assigns these tags to a column containing personal or sensitive data.

String values

Snowflake stores the assignment of a system tag on a column as a key-value pair, where the value is a string. Snowflake defines the allowed string values for each classification system tag because Snowflake maintains each of these system tags.

Call the SYSTEM$GET_TAG_ALLOWED_VALUES function in your Snowflake account to view the list of supported Classification system tag string values. For example:

select system$get_tag_allowed_values('snowflake.core.semantic_category');

select system$get_tag_allowed_values('snowflake.core.privacy_category');
Copy

The tag names, SEMANTIC_CATEGORY and PRIVACY_CATEGORY, correspond to the Classification categories that Snowflake assigns to the column data during the column sampling process (i.e. the tag names and category names use the same words):

Semantic category

The semantic category identifies personal attributes.

A non-exhaustive list of personal attributes Classification supports include name, age, and gender. These three attributes are possible string values when assigning the SEMANTIC_CATEGORY tag to a column.

Privacy category

If the analysis determines that the column data corresponds to a semantic category, Snowflake further classifies the column to a privacy category. The privacy category has three values: identifier, quasi-identifier, or sensitive. These three values are the string values that can be specified when assigning the PRIVACY_CATEGORY Classification system tag to a column.

  • Identifier: These attributes uniquely identify an individual. Example attributes include name, social security number, and phone number.

    Identifier attributes are synonymous with direct identifiers.

  • Quasi-identifier: These attributes can uniquely identify an individual when two or more or these attributes are in combination. Example attributes include age and gender.

    Quasi-identifiers are synonymous with indirect identifiers.

  • Sensitive: These attributes are not considered enough to identify an individual but are information that the individual would rather not disclose for privacy reasons.

    Currently, the only attribute that Snowflake evaluates as sensitive is salary.

The following table summarizes the relationship between each classification category and system tag, the allowed string values for each classification system tag, and the system tag string values for each system tag:

PRIVACY_CATEGORY Tag Values

SEMANTIC_CATEGORY Tag Values

IDENTIFIER

  • EMAIL

  • IBAN

  • IMEI

  • IP_ADDRESS

  • VIN

  • NAME

  • PAYMENT_CARD

  • PHONE_NUMBER (US numbers only)

  • URL

  • US_BANK_ACCOUNT

  • US_DRIVERS_LICENSE

  • US_PASSPORT

  • US_SSN

  • US_STREET_ADDRESS

QUASI_IDENTIFIER

  • AGE

  • GENDER

  • COUNTRY

  • DATE_OF_BIRTH

  • ETHNICITY

  • LATITUDE

  • LAT_LONG

  • LONGITUDE

  • MARITAL_STATUS

  • OCCUPATION

  • US_POSTAL_CODE

  • US_STATE_OR_TERRITORY

  • US_COUNTY

  • US_CITY

  • YEAR_OF_BIRTH

SENSITIVE

  • SALARY

Note

Multiple semantic tag string values from all three privacy categories can be considered “Sensitive Personal Data”, “Special Categories of Data”, or similar terms under laws and regulations, and might require additional protections or controls.

Currently, Classification does not tag column data as both sensitive and identifying. In other words, when setting the system tag on a given column, you must choose either the SEMANTIC_CATEGORY or PRIVACY_CATEGORY tag.

Interpret the Classification Output

The EXTRACT_SEMANTIC_CATEGORIES function returns a VARIANT in JSON format, which is a semi-structured format, containing:

  • Possible semantic and privacy categories greater than 0.8 will be listed outside of extra_info (refer to the EXTRACT_SEMANTIC_CATEGORIES function) and are also the ones that get applied by the ASSOCIATE_SEMANTIC_CATEGORY_TAGS stored procedure.

  • The probability that the classification process derived the correct semantic category.

  • A list of alternate semantic categories with which the column can be tagged if the probability is below the 0.80 threshold and the process identifies other possible semantic categories with a probability greater than 0.15. However, to use these values requires the data engineer to either manually set the system tag or revise the output.

The probability values can help a data engineer make an informed decision about whether the column contains personal or sensitive data: the higher the probability, the greater the likelihood that the column.

The following is a simplified example of Classify a Single Table example shown in Using Data Classification:

  1. Create a table:

    CREATE TABLE hr_data (
        age INTEGER,
        email_address VARCHAR,
        fname VARCHAR,
        lname VARCHAR
        );
    
    Copy
  2. Load data into the table and then call the EXTRACT_SEMANTIC_CATEGORIES function:

    USE ROLE data_engineer;
    
    SELECT EXTRACT_SEMANTIC_CATEGORIES('hr_data');
    
    Copy
  3. Evaluate the output. For example:

    Default output:

    SELECT EXTRACT_SEMANTIC_CATEGORIES('hr_data');
    +---------------------------------------------+
    | EXTRACT_SEMANTIC_CATEGORIES('HR_DATA')      |
    |---------------------------------------------|
    | {                                           |
    |   "AGE": {                                  |
    |     "extra_info": {                         |
    |       "alternates": [],                     |
    |       "probability": "1.00"                 |
    |     },                                      |
    |     "privacy_category": "QUASI_IDENTIFIER", |
    |     "semantic_category": "AGE"              |
    |   },                                        |
    |   "EMAIL_ADDRESS": {                        |
    |     "extra_info": {                         |
    |       "alternates": [],                     |
    |       "probability": "1.00"                 |
    |     },                                      |
    |     "privacy_category": "IDENTIFIER",       |
    |     "semantic_category": "EMAIL"            |
    |   },                                        |
    |   "FNAME": {                                |
    |     "extra_info": {                         |
    |       "alternates": [],                     |
    |       "probability": "1.00"                 |
    |     },                                      |
    |     "privacy_category": "IDENTIFIER",       |
    |     "semantic_category": "NAME"             |
    |   },                                        |
    |   "LNAME": {                                |
    |     "extra_info": {                         |
    |       "alternates": [],                     |
    |       "probability": "0.97"                 |
    |     },                                      |
    |     "privacy_category": "IDENTIFIER",       |
    |     "semantic_category": "NAME"             |
    |   }                                         |
    | }                                           |
    +---------------------------------------------+
    
    Copy

    Use the FLATTEN table function to enhance readability:

    SELECT
        f.key::varchar as column_name,
        f.value:"privacy_category"::varchar as privacy_category,  
        f.value:"semantic_category"::varchar as semantic_category,
        f.value:"extra_info":"probability"::number(10,2) as probability,
        f.value:"extra_info":"alternates"::variant as alternates
      FROM
      TABLE(FLATTEN(EXTRACT_SEMANTIC_CATEGORIES('hr_data')::VARIANT)) AS f;
    +---------------+------------------+-------------------+-------------+------------+
    | COLUMN_NAME   | PRIVACY_CATEGORY | SEMANTIC_CATEGORY | PROBABILITY | ALTERNATES |
    |---------------+------------------+-------------------+-------------+------------|
    | AGE           | QUASI_IDENTIFIER | AGE               |        1.00 | []         |
    | EMAIL_ADDRESS | IDENTIFIER       | EMAIL             |        1.00 | []         |
    | FNAME         | IDENTIFIER       | NAME              |        1.00 | []         |
    | LNAME         | IDENTIFIER       | NAME              |        0.97 | []         |
    +---------------+------------------+-------------------+-------------+------------+
    
    Copy

    Note that the values in some fields (e.g. the PROBABILITY field) vary depending upon the characteristics of the data and the sample size.

Supported Objects and Column Data Types

Snowflake supports classifying data stored in all types of tables and views, including external tables, materialized views, and secure views.

You can classify table and view columns for all supported data types except for the following data types:

  • GEOGRAPHY

  • BINARY

  • VARIANT

    Note that you can classify a column with the VARIANT data type when the column data type can be cast to a NUMBER or STRING data type. Snowflake does not classify the column if the column contains JSON, XML, or other semi-structured data.

If a table contains columns that are not of a supported data type or the column contains all NULL values, the classification process ignores the columns and does not include them in the output.

Important

If your data represents NULL values with a value other than NULL, the accuracy of the classification results may be impacted.

Compute Costs

The classification process requires compute resources, which are provided by the virtual warehouse that is in use and running when classification is performed.

The amount of time needed to classify the data in a table/view (and, therefore, the number of credits consumed by the warehouse) is a function of the amount of data to be classified.

In particular, if a table/view has a large number of columns that support classification, the processing time can be impacted. However, as a general rule, the processing speed scales linearly with the warehouse size. In other words, each size increase for a warehouse (e.g. X-small to Small) typically reduces the processing time by half.

Use the following general guidelines to select a warehouse size:

  • No concern for processing time: x-small warehouse.

  • Up to 100 columns in a table: small warehouse.

  • 101 to 300 columns in a table: medium warehouse.

  • 301 columns or more in a table: large warehouse.

Benefits

Classification provides the following benefits to data privacy and data governance administrators:

Data access

The results of classifying column data can inform identity and access management administrators to evaluate and maintain their Snowflake role hierarchies to ensure the Snowflake roles have the appropriate access to sensitive or PII data.

Data sharing

The classification process can help to identify and confirm the storage location of PII data. Subsequently, a data sharing provider can use the classification results to determine whether to share data and how to make the PII data available to a data sharing consumer.

Policy application

The usage of columns containing PII data, such as referencing columns in base tables to create a view or materialized view, can help to determine the best approach to protect the data with either a masking policy or a row access policy.

Recommendations

To capitalize on the Classification feature and optimize your PII data tracking capabilities, do the following:

Validation

Query Account Usage views first:

  • ACCESS_HISTORY: determine the table and view objects that are accessed most frequently.

  • OBJECT_DEPENDENCIES: determine metadata references between two or more objects.

Use the query results to prioritize schema-level or database-level assignment of the Classification system tags.

Column names

Use sensible column names in your table objects and train table creators to adhere to internal table creation guidelines.

Data types

Use sensible data types for columns. For example, an AGE column should have the NUMBER data type.

VARIANT

If a column has a VARIANT data type, use the FLATTEN command on the column prior to classifying the table.

Verify

Save the results of calling the EXTRACT_SEMANTIC_CATEGORIES function to a temporary table and double-check the results before assigning the Classification system tags.

After assigning the system tags, query the TAG_REFERENCES and COLUMNS Account Usage views to inspect columns that were not classified.

For example, grant the GOVERNANCE_VIEWER database role to the data_engineer custom role and run these statements:

use role accountadmin;

grant database role governance_viewer to role data_engineer;

use role data_engineer;

select table_name,
       table_catalog
from snowflake.account_usage.columns c
  where not exists(
    select * from snowflake.account_usage.tag_references t
    where c.column_id = t.column_id
  )
  and deleted is null
order by table_catalog;
Copy

Use the query results to determine which table catalogs (i.e. databases) are most important to double-check and prioritize classifying the tables.

Records

Create a historical record of the results of calling the EXTRACT_SEMANTIC_CATEGORIES function and the results of assigning the Classification system tags.

Warehouse

Use the proper warehouse size when classifying data. For details, refer to Compute Costs (in this topic).

Managing Classification

Privilege Reference

This table summarizes the minimum necessary privileges for the analyze and review steps for Classification:

Privilege

Object

Notes

USAGE

Database & Schema

This privilege is necessary for the parent database and schema of the table containing the columns to classify.

SELECT , UPDATE

Table

The SELECT privilege allows querying a table, which is necessary to classify column data and to call the EXTRACT_SEMANTIC_CATEGORIES function.

The UPDATE privilege allows calling the ASSOCIATE_SEMANTIC_CATEGORY_TAGS stored procedure to automatically apply the classification results to columns in a table.

The necessary privileges for the apply step, particularly setting the system tag, in Classification depend on the chosen approach. Currently, there are three options:

  • Use the GOVERNANCE_ADMIN database role.

  • Grant IMPORTED PRIVILEGES to a custom role.

  • Use the ACCOUNTADMIN role.

Database role

The database role approach uses the GOVERNANCE_ADMIN database role to set a Classification system tag on a table column. Additionally, the data engineer has the global APPLY TAG privilege to set tags on objects that they do not own.

To do this:

  1. Grant the GOVERNANCE_ADMIN database role to a custom role of your choice (e.g. data_engineer):

    use role accountadmin;
    grant database role snowflake.governance_admin to role data_engineer;
    grant apply tag on account to role data_engineer;
    
    Copy
  2. Assign a system tag to a table column. For example, the EMAIL column in the table named hr.tables.empl_info (i.e. a table of employee information in the human resources database):

    use role data_engineer;
    
    alter table hr.tables.empl_info
      modify column email
      set tag snowflake.core.semantic_category = 'email';
    
    Copy

Using the database role approach can help to satisfy a requirement for the principal of least privilege.

Custom role

The custom role (i.e. account-level role that you create) approach requires granting two privileges to a role of your choice. For example:

use role accountadmin;
grant imported privileges on database snowflake to role data_engineer;
grant apply tag on account to role data_engineer;
Copy

The result of these two grant statements allows access to the shared SNOWFLAKE database and applying the Classification system tags to any table column in the account. A user with the data_engineer role can then set the Classification system tag on a table as shown in the database role approach.

However, the two grant statements in this approach also allow:

  • Full access to query all views in the shared SNOWFLAKE database.

  • Set any tag in the account on any object that can be tagged in the account.

If the custom role approach does not satisfy the requirement for the principle of least privilege, use the database role approach.

ACCOUNTADMIN

Using the ACCOUNTADMIN role does not require any additional grant statements, however the ACCOUNTADMIN role is the most privileged role in Snowflake.

If there are restrictions to limit the usage of the ACCOUNTADMIN role in your production environments, use either the database role approach or the custom role approach.

Regarding the protection of column data, the privileges to assign a masking policy or a row access policy can be granted to the data_engineer role or a different role can be responsible for assigning the policies. The approach you take depends on how your administrators prefer to grant privileges and administer roles.

For representative examples, refer to Using Data Classification.

Tracking Classification System Tags

Snowflake provides built-in views and functions to track Classification system tag usage:

  • To find the columns with a system tag in your account, query the Account Usage TAG_REFERENCES view:

    SELECT * FROM SNOWFLAKE.ACCOUNT_USAGE.TAG_REFERENCES
        WHERE TAG_NAME = 'PRIVACY_CATEGORY'
        ORDER BY OBJECT_DATABASE, OBJECT_SCHEMA, OBJECT_NAME, COLUMN_NAME;
    
    Copy
  • To find the columns with a system tag for a table or view in a specific database, call the TAG_REFERENCES Information Schema table function:

    SELECT * FROM
      TABLE(
        my_db.information_schema.tag_references(
          'my_db.my_schema.hr_data.fname',
          'COLUMN'
        )
      )
    ;
    
    Copy
  • To find every tag set on every column in a table or view within a specific database, call the Information Schema TAG_REFERENCES_ALL_COLUMNS table function:

    select * from
      table(
        my_db.information_schema.tag_references_all_columns(
          'my_db.my_schema.hr_data',
          'table'
        )
      )
    ;
    
    Copy
  • To find a specific tag value for a column, call the SYSTEM$GET_TAG system function:

    select system$get_tag(
      'snowflake.core.privacy_category',
      'hr_data.fname',
      'COLUMN'
      )
    ;
    
    Copy

Next Topics: