Using custom classifiers

Custom classification uses the CUSTOM_CLASSIFIER class to enable data engineers to extend their data classification capabilities based on their own knowledge of their data. You can create an instance of the CUSTOM_CLASSIFIER class in the database and schema of your choice and call instance methods to manage regular expressions associated with the instance.

Commands and methods

The following methods and SQL commands are supported:

Access control

These sections summarize the roles and grants on various objects that you need to use an instance.

Roles

You can use the following roles with custom classification:

  • SNOWFLAKE.CLASSIFICATION_ADMIN: database role that enables you to create a custom classification instance.

  • custom_classifier!PRIVACY_USER: instance role that enables you to call the following methods on the instance:

    • ADD_REGEX

    • LIST

    • DELETE_CATEGORY

  • The account role with the OWNERSHIP privilege on the instance can run these commands:

    • DROP CUSTOM_CLASSIFIER

    • SHOW CUSTOM_CLASSIFIER

Grants

To create and manage instances, you can choose to either grant the CREATE SNOWFLAKE.DATA_PRIVACY.CUSTOM_CLASSIFIER privilege to a role or grant the PRIVACY_USER instance role to a role.

You can grant the instance roles to account roles and database roles to enable other users to work with custom classification instances:

GRANT SNOWFLAKE.DATA_PRIVACY.CUSTOM_CLASSIFIER ROLE <name>!PRIVACY_USER
  TO ROLE <role_name>

REVOKE SNOWFLAKE.DATA_PRIVACY.CUSTOM_CLASSIFIER ROLE <name>!PRIVACY_USER
  FROM ROLE <role_name>

GRANT SNOWFLAKE.DATA_PRIVACY.CUSTOM_CLASSIFIER ROLE <name>!PRIVACY_USER
  TO DATABASE ROLE <database_role_name>

REVOKE SNOWFLAKE.DATA_PRIVACY.CUSTOM_CLASSIFIER ROLE <name>!PRIVACY_USER
  FROM DATABASE ROLE <database_role_name>
Copy

Where:

name

Specifies the name of the custom classification instance.

role_name

Specifies the name of an account role.

database_role_name

Specifies the name of a database role.

You must use a warehouse to call methods on the instance. However, you cannot use the GRANT OWNERSHIP command to transfer ownership on the instance to a different role or database role.

To grant the custom role my_classification_role the required instance role and privileges to create and use an instance of the CUSTOM_CLASSIFIER class, execute the following statements:

USE ROLE ACCOUNTADMIN;
GRANT DATABASE ROLE SNOWFLAKE.CLASSIFICATION_ADMIN
  TO ROLE my_classification_role;
GRANT USAGE ON DATABASE mydb TO ROLE my_classification_role;
GRANT USAGE ON SCHEMA mydb.instances TO ROLE my_classification_role;
GRANT USAGE ON WAREHOUSE wh_classification TO ROLE my_classification_role;
Copy

If you would like to enable a specific role, such as data_analyst to use a specific instance, do the following:

GRANT SNOWFLAKE.DATA_PRIVACY.CUSTOM_CLASSIFIER ROLE
  mydb.sch.my_instance!PRIVACY_USER TO ROLE data_analyst;
Copy

Example

The high-level approach to classify data with custom classifiers is as follows:

  1. Identify a table to classify.

  2. Use SQL to do the following:

    1. Create a custom classification instance.

    2. Add the system tag categories and regular expressions to the instance.

    3. Classify the table.

Complete these steps to create a custom classifier to classify a table:

  1. Consider a table, data.tables.patient_diagnosis, in which one of its columns contains diagnostic codes, such as ICD-10 codes.

    +-------------+----------------------------------------------------------+
    | ICD_10_CODE | DESCRIPTION                                              |
    +-------------+----------------------------------------------------------+
    | G30.9       | Alzheimer's disease, unspecified                         |
    | G80.8       | Other cerebral palsy                                     |
    | S13.4XXA    | Sprain of ligaments of cervical spine, initial encounter |
    +-------------+----------------------------------------------------------+
    
    Copy

    This table might also include columns to identify patients, such as first and last name, unique health insurance identifiers, and date of birth, that were treated at a medical facility. The data owner can classify the table to ensure that the columns are tagged correctly so the table can be monitored.

    In this example, the data owner already has these privileges granted to their role:

    • OWNERSHIP on the table to classify.

    • OWNERSHIP on the schema that contains the table.

    • USAGE on the database that contains the schema and table.

  2. Enable the data owner to classify the table by granting the SNOWFLAKE.CLASSIFICATION_ADMIN database role to the data owner role:

    USE ROLE ACCOUNTADMIN;
    GRANT DATABASE ROLE SNOWFLAKE.CLASSIFICATION_ADMIN
      TO ROLE data_owner;
    
    Copy
  3. As the data owner, create a schema to store your custom classification instances:

    USE ROLE data_owner;
    CREATE SCHEMA data.classifiers;
    
    Copy
  4. Use the CREATE CUSTOM_CLASSIFIER command to create a custom classification instance in the data.classifiers schema:

    CREATE OR REPLACE SNOWFLAKE.DATA_PRIVACY.CUSTOM_CLASSIFIER medical_codes();
    
    Copy

    You can optionally update your search path as follows:

    • Add SNOWFLAKE.DATA_PRIVACY so that you don’t have to specify the fully qualified name of the class when creating a new instance of the class.

    • Add DATA.CLASSIFIERS so that you don’t have to specify the fully qualified name of the instance when calling a method on the instance or using a command with the instance.

  5. Use a SHOW CUSTOM_CLASSIFIER command to list each instance that you create. For example:

    SHOW SNOWFLAKE.DATA_PRIVACY.CUSTOM_CLASSIFIER;
    
    Copy

    Returns:

    +----------------------------------+---------------+---------------+-------------+-----------------+---------+-------------+
    | created_on                       | name          | database_name | schema_name | current_version | comment | owner       |
    +----------------------------------+---------------+---------------+-------------+-----------------+---------+-------------+
    | 2023-09-08 07:00:00.123000+00:00 | MEDICAL_CODES | DATA          | CLASSIFIERS | 1.0             | None    | DATA_OWNER  |
    +----------------------------------+---------------+---------------+-------------+-----------------+---------+-------------+
    
  6. Call the custom_classifier!ADD_REGEX method on the instance to specify the system tags and regular expression to identify ICD-10 codes in a column. The regular expression in this example matches all possible ICD-10 codes. The regular expression to match the column name, ICD.*, and the comment are optional:

    CALL medical_codes!ADD_REGEX(
      'ICD_10_CODES',
      'IDENTIFIER',
      '[A-TV-Z][0-9][0-9AB]\.?[0-9A-TV-Z]{0,4}',
      'ICD.*',
      'Add a regex to identify ICD-10 medical codes in a column',
      0.8,
    );
    
    Copy

    Returns:

    +---------------+
    |   ADD_REGEX   |
    +---------------+
    | ICD_10_CODES  |
    +---------------+
    

    Tip

    Test the regular expression before adding a regular expression to the custom classification instance. For example:

    SELECT icd_10_code
    FROM medical_codes
    WHERE icd_10_code REGEXP('[A-TV-Z][0-9][0-9AB]\.?[0-9A-TV-Z]{0,4}');
    
    Copy
    +-------------+
    | ICD-10-CODE |
    +-------------+
    | G30.9       |
    | G80.8       |
    | S13.4XXA    |
    +-------------+
    

    In this query, only valid values that match the regular expression are returned. The query does not return invalid values such as xyz.

    For details, see String functions (regular expressions).

  7. Call the custom_classifier!LIST method on the instance to verify the regular expression that you added to the instance:

    SELECT medical_codes!LIST();
    
    Copy

    Returns:

    +--------------------------------------------------------------------------------+
    | MEDICAL_CODES!LIST()                                                           |
    +--------------------------------------------------------------------------------+
    | {                                                                              |
    |   "ICD-10-CODES": {                                                            |
    |     "col_name_regex": "ICD.*",                                                 |
    |     "description": "Add a regex to identify ICD-10 medical codes in a column", |
    |     "privacy_category": "IDENTIFIER",                                          |
    |     "threshold": 0.8,                                                          |
    |     "value_regex": "[A-TV-Z][0-9][0-9AB]\.?[0-9A-TV-Z]{0,4}"                   |
    |   }                                                                            |
    | }                                                                              |
    +--------------------------------------------------------------------------------+
    

    To remove a category, call the custom_classifier!DELETE_CATEGORY method on the instance.

  8. Use SQL to classify the table. For details, see Using SQL to asynchronously classify tables in a schema.

  9. If the instance is no longer needed, use the DROP CUSTOM_CLASSIFIER command to remove a custom classification instance from the system:

    DROP SNOWFLAKE.DATA_PRIVACY.CUSTOM_CLASSIFIER data.classifiers.medical_codes;
    
    Copy

Auditing custom classifiers

You can use the following queries to audit the creation of custom classification instances, adding regular expressions to instances, and dropping the instance.

  • To audit the creation of custom classification instances, use the following query:

    SELECT * FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
    WHERE query_text ILIKE 'create % snowflake.data_privacy.custom_classifier%';
    
    Copy
  • To audit adding regular expressions to a specific instance, use the following query and replace DB.SCH.MY_INSTANCE with the name of the instance that you want to audit:

    SELECT
        QUERY_HISTORY.user_name,
        QUERY_HISTORY.role_name,
        QUERY_HISTORY.query_text,
        QUERY_HISTORY.query_id
      FROM
        SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY query_history,
        SNOWFLAKE.ACCOUNT_USAGE.ACCESS_HISTORY access_history,
          TABLE(FLATTEN(input => access_history.direct_objects_accessed)) flattened_value
    WHERE flattened_value.value:"objectName" = 'DB.SCH.MY_INSTANCE!ADD_REGEX'
    AND QUERY_HISTORY.query_id = ACCESS_HISTORY.query_id;
    
    Copy
  • To audit dropping a custom classification instance, use the following query:

    SELECT * FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
    WHERE query_text ILIKE 'drop % snowflake.data_privacy.custom_classifier%';
    
    Copy