Introduction to Classification

This topic provides information on how classification works.

Overview

Classification is a multi-step process that associates Snowflake-defined tags (i.e. system tags) to columns by analyzing the cells and metadata for personal data; this data can now be tracked by a data engineer.

Based on the tracking information and related audit processes, the data engineer can protect the column containing personal or sensitive data with a masking policy or the table containing this column with a row access policy.

The overall result of the classification and data protection steps is to facilitate compliance with data privacy regulations.

Process

This section explains how the Classification three-step process works. You can repeat this process as often as needed, such as after inserting new rows into a table or adding new columns to a table.

Classification simplifies to a three-step process: analyze, review, and apply. Each of these steps have different operations:

Analyze

The data engineer calls the EXTRACT_SEMANTIC_CATEGORIES function in the Snowflake account. This function analyzes columns in a table and outputs the possible categories and associated metadata. For details, refer to Interpret the output (in this topic).

Note

Complex view definitions, such as views that require JOIN operations on multiple objects, can slow the analyze process. Prior to classifying a complex view, consider the warehouse in use. For details, refer to Compute costs (in this topic).

Review

In the review step, the data engineer reviews the category results to ensure the results of the analyze step operations make sense. If no revisions are necessary, the data engineer can proceed to the apply step.

If revisions are necessary the data engineer can revise the output of the analyze step before moving to the apply step.

For details, refer to Interpret the output (in this topic).

Apply

The apply step has one operation: assign the system tag.

The data engineer can manually set a system tag on a column or call the ASSOCIATE_SEMANTIC_CATEGORY_TAGS stored procedure.

After assigning the system tags, the data engineer can then track the system tag and protect the personal or sensitive data with a masking policy or a row access policy.

For detailed examples of this process, refer to Use Data Classification.

System tags and categories

System tags are tags that Snowflake creates, maintains, and makes available in the shared SNOWFLAKE database. There are two Classification system tags, both of which exist in the SNOWFLAKE.CORE schema:

  • SNOWFLAKE.CORE.SEMANTIC_CATEGORY

  • SNOWFLAKE.CORE.PRIVACY_CATEGORY

The data engineer assigns these tags to a column containing personal or sensitive data.

String values

Snowflake stores the assignment of a system tag on a column as a key-value pair, where the value is a string. Snowflake defines the allowed string values for each classification system tag because Snowflake maintains each of these system tags.

The tag names, SEMANTIC_CATEGORY and PRIVACY_CATEGORY, correspond to the Classification categories that Snowflake assigns to the column data during the column sampling process (i.e. the tag names and category names use the same words):

Semantic category

The semantic category identifies personal attributes.

A non-exhaustive list of personal attributes Classification supports include name, age, and gender. These three attributes are possible string values when assigning the SEMANTIC_CATEGORY tag to a column.

Classification can detect information from different countries, such as Australia, Canada, and the United Kingdom. For example, if your table column contains phone number information, the analysis process can differentiate the different phone number values from each of these countries. For a complete list of supported international values, see System tags with international tag values (in this topic).

Privacy category

If the analysis determines that the column data corresponds to a semantic category, Snowflake further classifies the column to a privacy category. The privacy category has three values: identifier, quasi-identifier, or sensitive. These three values are the string values that can be specified when assigning the PRIVACY_CATEGORY Classification system tag to a column.

  • Identifier: These attributes uniquely identify an individual. Example attributes include name, social security number, and phone number.

    Identifier attributes are synonymous with direct identifiers.

  • Quasi-identifier: These attributes can uniquely identify an individual when two or more or these attributes are in combination. Example attributes include age and gender.

    Quasi-identifiers are synonymous with indirect identifiers.

  • Sensitive: These attributes are not considered enough to identify an individual but are information that the individual would rather not disclose for privacy reasons.

    Currently, the only attribute that Snowflake evaluates as sensitive is salary.

The following table summarizes the relationship between each classification category and system tag, and the string values for each classification system tag. The rows in this table indicate support for categories that do not have international values.

PRIVACY_CATEGORY Tag Values

SEMANTIC_CATEGORY Tag Values

IDENTIFIER

  • EMAIL

  • IBAN

  • IMEI

  • IP_ADDRESS

  • VIN

  • NAME

  • PAYMENT_CARD

  • URL

QUASI_IDENTIFIER

  • AGE

  • GENDER

  • COUNTRY

  • DATE_OF_BIRTH

  • ETHNICITY

  • LATITUDE

  • LAT_LONG

  • LONGITUDE

  • MARITAL_STATUS

  • OCCUPATION

  • YEAR_OF_BIRTH

SENSITIVE

  • SALARY

Note

Multiple semantic tag string values from all three privacy categories can be considered “Sensitive Personal Data”, “Special Categories of Data”, or similar terms under laws and regulations, and might require additional protections or controls.

Currently, Classification does not tag column data as both sensitive and identifying. In other words, when setting the system tag on a given column, you must choose either the SEMANTIC_CATEGORY or PRIVACY_CATEGORY tag.

System tags with international tag values

Snowflake supports international SEMANTIC_CATEGORY tag values that pertain to Australia, Canada, New Zealand, Switzerland, the United Kingdom, and the United States. To support these countries, the tag values correspond to certain parent category groups. A parent category contains information about the classification result, including whether the column consists of values largely from one country or another.

The following table summarizes the relationship between the classification tags, category groups and group members, and supported countries. The country codes are based on the ISO-3166-1 alpha-2 standard. Other semantic categories, such as EMAIL and GENDER, are not affected. To track international information, the data engineer uses the value in the SEMANTIC_CATEGORY Tag Values column when setting a system tag on a column.

PRIVACY_CATEGORY Tag Values

SEMANTIC_CATEGORY Tag Values (Parent Group)

Group Members

Country Code

IDENTIFIER

BANK_ACCOUNT

  • CA_BANK_ACCOUNT

  • NZ_BANK_ACCOUNT

  • US_BANK_ACCOUNT

  • IBAN

  • CA

  • NZ

  • US

ORGANIZATION_IDENTIFIER

  • AU_BUSINESS_NUMBER

  • AU_COMPANY_NUMBER

  • SG_UNIQUE_ENTITY_NUMBER

  • AU

  • AU

  • SG

DRIVERS_LICENSE

  • AU_DRIVERS_LICENSE

  • CA_DRIVERS_LICENSE

  • US_DRIVERS_LICENSE

  • AU

  • CA

  • US

MEDICARE_NUMBER

  • AU_MEDICARE_NUMBER

  • NZ_NHI_NUMBER

  • AU

  • NZ

PASSPORT

  • AU_PASSPORT

  • CA_PASSPORT

  • NZ_PASSPORT

  • SG_PASSPORT

  • US_PASSPORT

  • AU

  • CA

  • NZ

  • SG

  • US

PHONE_NUMBER

  • AU_PHONE_NUMBER

  • CA_PHONE_NUMBER

  • UK_PHONE_NUMBER

  • US_PHONE_NUMBER

  • AU

  • CA

  • GB

  • US

STREET_ADDRESS

  • CA_STREET_ADDRESS

  • US_STREET_ADDRESS

  • CA

  • US

TAX_IDENTIFIER

  • AU_TAX_NUMBER

  • NZ_INLAND_REVENUE_NUMBER

  • AU

  • NZ

NATIONAL_IDENTIFIER

  • CA_SOCIAL_INSURANCE_NUMBER

  • UK_NATIONAL_INSURANCE_NUMBER

  • SG_NATIONAL_REGISTRATION_IDENTITY_CARD

  • US_SSN

  • CA

  • GB

  • SG

  • US

QUASI_IDENTIFIER

CITY

  • CA_CITY

  • NZ_CITY

  • US_CITY

  • CA

  • NZ

  • US

POSTAL_CODE

  • AU_POSTAL_CODE

  • CA_POSTAL_CODE

  • CH_POSTAL_CODE

  • NZ_POSTAL_CODE

  • UK_POSTAL_CODE

  • US_POSTAL_CODE

  • AU

  • CA

  • CH

  • NZ

  • GB

  • US

ADMINISTRATIVE_AREA_1

  • CA_PROVINCE_OR_TERRITORY

  • NZ_REGION

  • US_STATE_OR_TERRITORY

  • CA

  • NZ

  • US

ADMINISTRATIVE_AREA_2

  • US_COUNTY

  • US

Interpret the output

The following is a simplified example of the Classify a single table example shown in Use Data Classification:

  1. Create a table:

    CREATE OR REPLACE TABLE hr_data (
        age INTEGER,
        email_address VARCHAR,
        fname VARCHAR,
        lname VARCHAR
        );
    
    Copy
  2. Load data into the table and then call the EXTRACT_SEMANTIC_CATEGORIES function:

    INSERT INTO hr_data (fname, lname, age, email_address) 
        VALUES 
            ('Thomas',    'Smith',   44, 'TS_NoAdvertising@SpamFilter.com'),
            ('Katherine', 'Lee',     29, 'KL_JunkMailFolder@GatingEmail.com'),
            ('Lisa',      'Curie',   29, 'LC_03@GatingEmail.com'),
            ('Albert',    'Meitner', 14, 'AM_bucket@GatingEmail.com'),
            ('Horace',    'Jones',   53, 'HJ_track@SpamFilter.com'),
            ('Natalia',   'Chavez',  38, 'NataliaC@DoNotTrack.org');
    
    Copy
    USE ROLE data_engineer;
    
    SELECT EXTRACT_SEMANTIC_CATEGORIES('hr_data');
    
    Copy
  3. Evaluate the default output. For example:

    +-----------------------------------------------+                               
    | EXTRACT_SEMANTIC_CATEGORIES('HR_DATA')        |
    |-----------------------------------------------|
    | {                                             |
    |   "AGE": {                                    |
    |     "alternates": [],                         |
    |     "recommendation": {                       |
    |       "confidence": "HIGH",                   |
    |       "coverage": 1,                          |
    |       "details": [],                          |
    |       "privacy_category": "QUASI_IDENTIFIER", |
    |       "semantic_category": "AGE"              |
    |     },                                        |
    |     "valid_value_ratio": 1                    |
    |   },                                          |
    |   "EMAIL_ADDRESS": {                          |
    |     "alternates": [],                         |
    |     "recommendation": {                       |
    |       "confidence": "HIGH",                   |
    |       "coverage": 1,                          |
    |       "details": [],                          |
    |       "privacy_category": "IDENTIFIER",       |
    |       "semantic_category": "EMAIL"            |
    |     },                                        |
    |     "valid_value_ratio": 1                    |
    |   },                                          |
    |   "FNAME": {                                  |
    |     "alternates": [],                         |
    |     "recommendation": {                       |
    |       "confidence": "HIGH",                   |
    |       "coverage": 1,                          |
    |       "details": [],                          |
    |       "privacy_category": "IDENTIFIER",       |
    |       "semantic_category": "NAME"             |
    |     },                                        |
    |     "valid_value_ratio": 1                    |
    |   },                                          |
    |   "LNAME": {                                  |
    |     "alternates": [],                         |
    |     "recommendation": {                       |
    |       "confidence": "HIGH",                   |
    |       "coverage": 1,                          |
    |       "details": [],                          |
    |       "privacy_category": "IDENTIFIER",       |
    |       "semantic_category": "NAME"             |
    |     },                                        |
    |     "valid_value_ratio": 1                    |
    |   }                                           |
    | }                                             |
    +-----------------------------------------------+
    

    Use the FLATTEN table function to enhance readability:

    SELECT
        f.key::varchar as column_name,
        f.value:"recommendation":"privacy_category"::varchar as privacy_category,  
        f.value:"recommendation":"semantic_category"::varchar as semantic_category,
        f.value:"recommendation":"confidence"::varchar as confidence,
        f.value:"recommendation":"coverage"::number(10,2) as coverage,
        f.value:"details"::variant as details,
        f.value:"alternates"::variant as alternates
      FROM
      TABLE(FLATTEN(EXTRACT_SEMANTIC_CATEGORIES('hr_data')::VARIANT)) AS f;
    
    Copy
    +---------------+------------------+-------------------+------------+----------+---------+------------+
    | COLUMN_NAME   | PRIVACY_CATEGORY | SEMANTIC_CATEGORY | CONFIDENCE | COVERAGE | DETAILS | ALTERNATES |
    |---------------+------------------+-------------------+------------+----------+---------+------------|
    | AGE           | QUASI_IDENTIFIER | AGE               | HIGH       |     1.00 | NULL    | []         |
    | EMAIL_ADDRESS | IDENTIFIER       | EMAIL             | HIGH       |     1.00 | NULL    | []         |
    | FNAME         | IDENTIFIER       | NAME              | HIGH       |     1.00 | NULL    | []         |
    | LNAME         | IDENTIFIER       | NAME              | HIGH       |     1.00 | NULL    | []         |
    +---------------+------------------+-------------------+------------+----------+---------+------------+
    

Supported objects and column data types

Snowflake supports classifying data stored in all types of tables and views, including external tables, materialized views, and secure views.

You can classify table and view columns for all supported data types except for the following data types:

  • GEOGRAPHY

  • BINARY

  • VARIANT

    Note that you can classify a column with the VARIANT data type when the column data type can be cast to a NUMBER or STRING data type. Snowflake does not classify the column if the column contains JSON, XML, or other semi-structured data.

If a table contains columns that are not of a supported data type or the column contains all NULL values, the classification process ignores the columns and does not include them in the output.

Important

If your data represents NULL values with a value other than NULL, the accuracy of the classification results may be impacted.

Compute costs

The classification process requires compute resources, which are provided by the virtual warehouse that is in use and running when classification is performed.

The amount of time needed to classify the data in a table/view (and, therefore, the number of credits consumed by the warehouse) is a function of the amount of data to be classified.

In particular, if a table/view has a large number of columns that support classification, the processing time can be impacted. However, as a general rule, the processing speed scales linearly with the warehouse size. In other words, each size increase for a warehouse (e.g. X-small to Small) typically reduces the processing time by half.

Use the following general guidelines to select a warehouse size:

  • No concern for processing time: x-small warehouse.

  • Up to 100 columns in a table: small warehouse.

  • 101 to 300 columns in a table: medium warehouse.

  • 301 columns or more in a table: large warehouse.

Benefits

Classification provides the following benefits to data privacy and data governance administrators:

Data access

The results of classifying column data can inform identity and access management administrators to evaluate and maintain their Snowflake role hierarchies to ensure the Snowflake roles have the appropriate access to sensitive or PII data.

Data sharing

The classification process can help to identify and confirm the storage location of PII data. Subsequently, a data sharing provider can use the classification results to determine whether to share data and how to make the PII data available to a data sharing consumer.

Policy application

The usage of columns containing PII data, such as referencing columns in base tables to create a view or materialized view, can help to determine the best approach to protect the data with either a masking policy or a row access policy.

Recommendations

To capitalize on the Classification feature and optimize your PII data tracking capabilities, do the following:

Validation

Query Account Usage views first:

  • ACCESS_HISTORY: determine the table and view objects that are accessed most frequently.

  • OBJECT_DEPENDENCIES: determine metadata references between two or more objects.

Use the query results to prioritize schema-level or database-level assignment of the Classification system tags.

Column names

Use sensible column names in your table objects and train table creators to adhere to internal table creation guidelines.

Data types

Use sensible data types for columns. For example, an AGE column should have the NUMBER data type.

VARIANT

If a column has a VARIANT data type, use the FLATTEN command on the column prior to classifying the table.

Verify

Save the results of calling the EXTRACT_SEMANTIC_CATEGORIES function to a temporary table and double-check the results before assigning the Classification system tags.

After assigning the system tags, query the TAG_REFERENCES and COLUMNS Account Usage views to inspect columns that were not classified.

For example, grant the GOVERNANCE_VIEWER database role to the data_engineer custom role and run these statements:

use role accountadmin;

grant database role governance_viewer to role data_engineer;

use role data_engineer;

select table_name,
       table_catalog
from snowflake.account_usage.columns c
  where not exists(
    select * from snowflake.account_usage.tag_references t
    where c.column_id = t.column_id
  )
  and deleted is null
order by table_catalog;
Copy

Use the query results to determine which table catalogs (i.e. databases) are most important to double-check and prioritize classifying the tables.

Records

Create a historical record of the results of calling the EXTRACT_SEMANTIC_CATEGORIES function and the results of assigning the Classification system tags.

Warehouse

Use the proper warehouse size when classifying data. For details, refer to Compute costs (in this topic).

Manage Classification

Privilege reference

This table summarizes the minimum necessary privileges for the analyze and review steps for Classification:

Privilege

Object

Notes

USAGE

Database & Schema

This privilege is necessary for the parent database and schema of the table containing the columns to classify.

SELECT , UPDATE

Table

The SELECT privilege allows querying a table, which is necessary to classify column data and to call the EXTRACT_SEMANTIC_CATEGORIES function.

The UPDATE privilege allows calling the ASSOCIATE_SEMANTIC_CATEGORY_TAGS stored procedure to automatically apply the classification results to columns in a table.

The necessary privileges for the apply step, particularly setting the system tag, in Classification depend on the chosen approach. Currently, there are three options:

  • Use the GOVERNANCE_ADMIN database role.

  • Grant IMPORTED PRIVILEGES to a custom role.

  • Use the ACCOUNTADMIN role.

Database role

The database role approach uses the GOVERNANCE_ADMIN database role to set a Classification system tag on a table column. Additionally, the data engineer has the global APPLY TAG privilege to set tags on objects that they do not own.

To do this:

  1. Grant the GOVERNANCE_ADMIN database role to a custom role of your choice (e.g. data_engineer):

    use role accountadmin;
    grant database role snowflake.governance_admin to role data_engineer;
    grant apply tag on account to role data_engineer;
    
    Copy
  2. Assign a system tag to a table column. For example, the EMAIL column in the table named hr.tables.empl_info (i.e. a table of employee information in the human resources database):

    use role data_engineer;
    
    alter table hr.tables.empl_info
      modify column email
      set tag snowflake.core.semantic_category = 'email';
    
    Copy

Using the database role approach can help to satisfy a requirement for the principal of least privilege.

Custom role

The custom role (i.e. account-level role that you create) approach requires granting two privileges to a role of your choice. For example:

use role accountadmin;
grant imported privileges on database snowflake to role data_engineer;
grant apply tag on account to role data_engineer;
Copy

The result of these two grant statements allows access to the shared SNOWFLAKE database and applying the Classification system tags to any table column in the account. A user with the data_engineer role can then set the Classification system tag on a table as shown in the database role approach.

However, the two grant statements in this approach also allow:

  • Full access to query all views in the shared SNOWFLAKE database.

  • Set any tag in the account on any object that can be tagged in the account.

If the custom role approach does not satisfy the requirement for the principle of least privilege, use the database role approach.

ACCOUNTADMIN

Using the ACCOUNTADMIN role does not require any additional grant statements, however the ACCOUNTADMIN role is the most privileged role in Snowflake.

If there are restrictions to limit the usage of the ACCOUNTADMIN role in your production environments, use either the database role approach or the custom role approach.

Regarding the protection of column data, the privileges to assign a masking policy or a row access policy can be granted to the data_engineer role or a different role can be responsible for assigning the policies. The approach you take depends on how your administrators prefer to grant privileges and administer roles.

For representative examples, refer to Use Data Classification.

Tracking system tags

Snowflake provides built-in views and functions to track Classification system tag usage:

  • To find the columns with a system tag in your account, query the Account Usage TAG_REFERENCES view:

    SELECT * FROM SNOWFLAKE.ACCOUNT_USAGE.TAG_REFERENCES
        WHERE TAG_NAME = 'PRIVACY_CATEGORY'
        ORDER BY OBJECT_DATABASE, OBJECT_SCHEMA, OBJECT_NAME, COLUMN_NAME;
    
    Copy
  • To find the columns with a system tag for a table or view in a specific database, call the TAG_REFERENCES Information Schema table function:

    SELECT * FROM
      TABLE(
        my_db.information_schema.tag_references(
          'my_db.my_schema.hr_data.fname',
          'COLUMN'
        )
      )
    ;
    
    Copy
  • To find every tag set on every column in a table or view within a specific database, call the Information Schema TAG_REFERENCES_ALL_COLUMNS table function:

    select * from
      table(
        my_db.information_schema.tag_references_all_columns(
          'my_db.my_schema.hr_data',
          'table'
        )
      )
    ;
    
    Copy
  • To find a specific tag value for a column, call the SYSTEM$GET_TAG system function:

    select system$get_tag(
      'snowflake.core.privacy_category',
      'hr_data.fname',
      'COLUMN'
      )
    ;
    
    Copy

Next Topics: