Custom Data Classification

This topic provides concepts on Custom Data Classification in Snowflake.

Overview

Snowflake provides the CUSTOM_CLASSIFIER class in the SNOWFLAKE.DATA_PRIVACY schema to enable data engineers to extend their data classification capabilities based on their own knowledge of their data. After you create an instance of the class, you can call a method on the instance to define your own semantic category, specify the privacy category, and specify regular expressions to match column value patterns while optionally matching the column name.

By creating and using custom classification instances, you can:

  • Accelerate your data classification efforts.

  • Define industry- and domain-specific tags for columns containing sensitive data.

  • Leverage Snowflake to have more control over your efforts to track PII data.

About the custom classification algorithm

Snowflake uses an algorithm that is unique for custom classification compared to the algorithm for Data Classification. The reason for having different classification algorithms is to ensure stable results depending on how you choose to classify your data.

The custom classification algorithm uses a scoring rule to determine which semantic category system tag to recommend and which semantic category tags, if any, to suggest as alternatives. The scoring logic evaluates the regular expressions that you add to your instance, which you specify by calling the custom_classifier!ADD_REGEX method on your instance.

The scoring rule uses a default threshold value that equates to high confidence in terms of what the recommended tag should be. The algorithm compares the score for a column against the threshold value and recommends a tag that corresponds to one of the following:

The following table summarizes the scoring algorithm and the recommended tag:

Name Matcher Provided

Value matches >= threshold

Name matches

Recommendation

True

True

True

Custom category

False

True

Snowflake category

True

False

Snowflake category

False

False

Snowflake category

False

True

Not applicable

Custom category

False

Not applicable

Snowflake category

Limitations

Currently, you can only use SQL to run custom classifiers; you cannot use Snowsight to run custom classifiers.

Working with custom classifiers

Custom classification uses the CUSTOM_CLASSIFIER class. You can create an instance of the CUSTOM_CLASSIFIER class and call instance methods to manage regular expressions associated with the instance. The following methods and SQL commands are supported:

Additionally, you can use the following roles with custom classification:

  • SNOWFLAKE.CLASSIFICATION_ADMIN: database role that enables you to create a custom classification instance.

  • custom_classifier!PRIVACY_USER: instance role that enables you to do the following actions:

    • Add a category to the instance by calling the ADD_REGEX method on the instance.

    • Delete a category from an instance by calling the DELETE_CATEGORY method on the instance.

  • The account role with the OWNERSHIP privilege on the instance can:

    • Drop the instance with a DROP CUSTOM_CLASSIFIER command.

    • List instances with a SHOW CUSTOM_CLASSIFIER command.

You can grant the instance roles to account roles and database roles to enable other users to work with custom classification instances. For example:

GRANT SNOWFLAKE.DATA_PRIVACY.CUSTOM_CLASSIFIER ROLE <name>!USER
  TO ROLE <role_name>

REVOKE SNOWFLAKE.DATA_PRIVACY.CUSTOM_CLASSIFIER ROLE <name>!USER
  FROM ROLE <role_name>

GRANT SNOWFLAKE.DATA_PRIVACY.CUSTOM_CLASSIFIER ROLE <name>!USER
  TO DATABASE ROLE <database_role_name>

REVOKE SNOWFLAKE.DATA_PRIVACY.CUSTOM_CLASSIFIER ROLE <name>!USER
  FROM DATABASE ROLE <database_role_name>
Copy

Where:

name

Specifies the name of the custom classification instance that you create.

role_name

Specifies the name of an account role.

database_role_name

Specifies the name of a database role.

Similar to other Snowflake objects, the role that creates the custom classification instance is automatically granted the OWNERSHIP privilege on the instance.

The high-level approach to classify data with custom classifiers is as follows:

  1. Identify a table to classify.

  2. Use SQL to do the following:

    1. Create a custom classification instance.

    2. Add the system tag categories and regular expressions to the instance.

    3. Classify the table.

Example

Complete these steps to create a custom classifier to classify a table:

  1. Consider a table, data.tables.patient_diagnosis, in which one of its columns contains diagnostic codes, such as ICD-10 codes.

    +-------------+----------------------------------------------------------+
    | ICD_10_CODE | DESCRIPTION                                              |
    +-------------+----------------------------------------------------------+
    | G30.9       | Alzheimer's disease, unspecified                         |
    | G80.8       | Other cerebral palsy                                     |
    | S13.4XXA    | Sprain of ligaments of cervical spine, initial encounter |
    +-------------+----------------------------------------------------------+
    
    Copy

    This table might also include columns to identify patients, such as first and last name, unique health insurance identifiers, and date of birth, that were treated at a medical facility. The data owner can classify the table to ensure that the columns are tagged correctly so the table can be monitored.

    In this example, the data owner already has these privileges granted to their role:

    • OWNERSHIP on the table to classify.

    • OWNERSHIP on the schema that contains the table.

    • USAGE on the database that contains the schema and table.

  2. Enable the data owner to classify the table by granting the SNOWFLAKE.CLASSIFICATION_ADMIN database role to the data owner role:

    USE ROLE ACCOUNTADMIN;
    GRANT DATABASE ROLE SNOWFLAKE.CLASSIFICATION_ADMIN
      TO ROLE data_owner;
    
    Copy
  3. As the data owner, create a schema to store your custom classification instances:

    USE ROLE data_owner;
    CREATE SCHEMA data.classifiers;
    
    Copy
  4. Use the CREATE CUSTOM_CLASSIFIER command to create a custom classification instance in the data.classifiers schema:

    USE SCHEMA data.classifiers;
    CREATE OR REPLACE SNOWFLAKE.DATA_PRIVACY.CUSTOM_CLASSIFIER medical_codes();
    
    Copy

    Update your search path to make it easier to use the instance.

  5. Use a SHOW CUSTOM_CLASSIFIER command to list each instance that you create. For example:

    SHOW SNOWFLAKE.DATA_PRIVACY.CUSTOM_CLASSIFIER;
    
    Copy

    Returns:

    +----------------------------------+---------------+---------------+-------------+-----------------+---------+-------------+
    | created_on                       | name          | database_name | schema_name | current_version | comment | owner       |
    +----------------------------------+---------------+---------------+-------------+-----------------+---------+-------------+
    | 2023-09-08 07:00:00.123000+00:00 | MEDICAL_CODES | DATA          | CLASSIFIERS | 1.0             | None    | DATA_OWNER  |
    +----------------------------------+---------------+---------------+-------------+-----------------+---------+-------------+
    
  6. Call the custom_classifier!ADD_REGEX method on the instance to specify the system tags and regular expression to identify ICD-10 codes in a column. The regular expression in this example matches all possible ICD-10 codes. The regular expression to match the column name, ICD*, and the comment are optional:

    CALL medical_codes!ADD_REGEX(
      'ICD_10_CODES',
      'IDENTIFIER',
      '[A-TV-Z][0-9][0-9AB]\.?[0-9A-TV-Z]{0,4}',
      'ICD*',
      'Add a regex to identify ICD-10 medical codes in a column'
    );
    
    Copy

    Returns:

    +---------------+
    |   ADD_REGEX   |
    +---------------+
    | ICD_10_CODES  |
    +---------------+
    

    Tip

    Test the regular expression before adding a regular expression to the custom classification instance. For example:

    SELECT icd_10_code
    FROM medical_codes
    WHERE icd_10_code REGEXP('[A-TV-Z][0-9][0-9AB]\.?[0-9A-TV-Z]{0,4}');
    
    Copy
    +-------------+
    | ICD-10-CODE |
    +-------------+
    | G30.9       |
    | G80.8       |
    | S13.4XXA    |
    +-------------+
    

    In this query, only valid values that match the regular expression are returned. The query does not return invalid values such as xyz.

    For details, see String Functions (Regular Expressions).

  7. Call the custom_classifier!LIST method on the instance to verify the regular expression that you added to the instance:

    SELECT medical_codes!LIST();
    
    Copy

    Returns:

    +--------------------------------------------------------------------------------+
    | MEDICAL_CODES!LIST()                                                           |
    +--------------------------------------------------------------------------------+
    | {                                                                              |
    |   "ICD-10-CODES": {                                                            |
    |     "col_name_regex": "ICD*",                                                  |
    |     "description": "Add a regex to identify ICD-10 medical codes in a column", |
    |     "privacy_category": "IDENTIFIER",                                          |
    |     "value_regex": "[A-TV-Z][0-9][0-9AB]\.?[0-9A-TV-Z]{0,4}"                   |
    | }                                                                              |
    +--------------------------------------------------------------------------------+
    

    To remove a category, call the custom_classifier!DELETE_CATEGORY method on the instance.

  8. Use SQL to classify the table. For details, see Using SQL to classify tables in a schema.

  9. If the instance is no longer needed, use the DROP CUSTOM_CLASSIFIER command to remove a custom classification instance from the system:

    DROP SNOWFLAKE.DATA_PRIVACY.CUSTOM_CLASSIFIER data.classifiers.medical_codes;
    
    Copy