Classify sensitive data automatically

Automatic sensitive data classification is a serverless feature that enables the automatic detection and tagging of sensitive data. The feature continuously monitors tables within a specific schema and classifies their columns using native and custom classification categories.

Automatic sensitive data classification lets data engineers and stewards do the following:

  • Demonstrate how automatically classifying tables meets internal governance and compliance needs.

  • Ensure sensitive data is properly tagged.

  • Ensure the right access controls are in place to protect the sensitive data.

Get started

The basic workflow to automatically classify sensitive data consists of the following:

  1. Create a classification profile that controls how often sensitive data in a schema is automatically classified, including whether system tags should be automatically applied after classification.

  2. Optionally, use the classification profile to map user-defined tags to system tags so a column with sensitive data can be associated with a user-defined tag based on its classification.

  3. Optionally, add a custom classifier to the classification profile so sensitive data can be automatically classified with user-defined semantic and privacy categories.

  4. Set the classification profile on a schema so that tables in the schema get automatically classified.

For end-to-end examples of this workflow, see Examples.

About classification profiles

A data engineer creates a classification profile by creating an instance of the CLASSIFICATION_PROFILE class to define the criteria that are used to automatically classify tables in a schema. These criteria include:

  • How long a table should exist before automatically classifying it.

  • How long before previously classified tables should be reclassified.

  • Whether system and custom tags are automatically set on columns after the classification. You can decide whether you want Snowflake to automatically apply suggested tags or prefer to review proposed tag assignments, then apply them yourself.

  • A mapping between system classification tags and user-defined object tags so the user-defined tags can be applied automatically.

When the data engineer assigns the classification profile to a schema, sensitive data in the tables of the schema is automatically classified on the schedule defined by the profile. A data engineer can assign the same classification profile to multiple schemas, or can create multiple classification profiles if different schemas need different classification criteria.
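The two timing criteria can be read as a simple eligibility check. The following Python sketch is only an interpretation of the criteria listed above, not Snowflake's scheduler: the function name and its arguments are hypothetical, and the two day counts stand in for the profile's minimum object age and maximum classification validity settings.

```python
from datetime import datetime, timedelta
from typing import Optional


def is_due_for_classification(
    created_at: datetime,
    last_classified_at: Optional[datetime],
    now: datetime,
    minimum_object_age_days: int,
    maximum_validity_days: int,
) -> bool:
    """Return True if a table would be picked up under this simplified model."""
    # A table that is younger than the minimum age is skipped for now.
    if now - created_at < timedelta(days=minimum_object_age_days):
        return False
    # A table that has never been classified is due as soon as it is old enough.
    if last_classified_at is None:
        return True
    # A previously classified table is due again once its result has expired.
    return now - last_classified_at >= timedelta(days=maximum_validity_days)


now = datetime(2025, 1, 31)
# New table, no minimum age: due immediately.
print(is_due_for_classification(datetime(2025, 1, 30), None, now, 0, 30))
# Classified 16 days ago with a 30-day validity: not due yet.
print(is_due_for_classification(datetime(2024, 12, 1), datetime(2025, 1, 15), now, 0, 30))
# Classified 42 days ago with a 30-day validity: due again.
print(is_due_for_classification(datetime(2024, 12, 1), datetime(2024, 12, 20), now, 0, 30))
```

In this model, a minimum age of 0 makes new tables eligible right away, and a shorter validity period triggers reclassification more often.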

Automatically classifying data requires access to the raw data in a table, including data in columns that have a masking policy assigned. Snowflake still preserves the intent of restricting access to protected data because it uses an internal role to classify data automatically. This internal role can access data protected by a masking policy, but it is not accessible to users.

For an example of using the CREATE CLASSIFICATION_PROFILE command to create a classification profile, see Examples.

About tag mapping

You can use the classification profile to map SEMANTIC_CATEGORY system tags to one or more user-defined tags. This tag mapping allows a column with sensitive data to be automatically assigned a user-defined tag based on its classification. The tag map can be added while creating the classification profile or later by calling the <classification_profile_name>!SET_TAG_MAP method.

Because user-defined object tags can have a masking policy associated with them, you can use a tag map to enable automatic tag-based masking. If you choose to automatically apply tags after classification, you can automate the entire process of protecting columns with a masking policy based on the classification of data. As new data is added to a schema, the tag-based masking policies will be automatically assigned to the columns that contain sensitive data.

Regardless of whether you are defining the tag map while creating the classification profile or after, the contents of the map are specified as a JSON object. This JSON object contains the 'column_tag_map' key, which is an array of objects that specify a user-defined tag, the string value of that tag, and the semantic categories to which the tag is being mapped. After the tag map is associated with a classification profile and you automatically classify tables in a schema, the tag is assigned to the columns that correspond to the semantic categories.

The following is an example of a tag map:

'tag_map': {
  'column_tag_map': [
    {
      'tag_name':'tag_db.sch.pii',
      'tag_value':'Highly Confidential',
      'semantic_categories':[
        'NAME',
        'NATIONAL_IDENTIFIER'
      ]
    },
    {
      'tag_name': 'tag_db.sch.pii',
      'tag_value':'Confidential',
      'semantic_categories': [
        'EMAIL'
      ]
    }
  ]
}

Based on this mapping, if the classification process determines that a column contains email addresses, the tag_db.sch.pii = 'Confidential' tag is set on that column.

If your tag map includes multiple JSON objects that map tags, tag values, and category values, the order of the JSON objects determines which tag and value to set on the column if there is a conflict. Specify the JSON objects in the desired assignment order from left to right, or top to bottom if you are formatting JSON.

Tip

Each object in the column_tag_map field has only one required key: tag_name. If you omit the tag_value and semantic_categories keys, the user-defined tag gets applied to every column to which the SEMANTIC_CATEGORY system tag is applied, and the value of the user-defined tag will match the value of the SEMANTIC_CATEGORY tag for a given column.

If a manually assigned tag conflicts with a tag applied by automatic classification, an error occurs. For information about tracking these errors, see Troubleshooting.
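The matching rules in this section can be modeled in a few lines of Python. This is a sketch under stated assumptions rather than Snowflake's implementation: it assumes that the first matching entry in column_tag_map wins when entries conflict, and that an omitted tag_value or semantic_categories key falls back to the defaults described in the tip. The function name resolve_user_tag is hypothetical.

```python
from typing import Optional

TAG_MAP = {
    "column_tag_map": [
        {
            "tag_name": "tag_db.sch.pii",
            "tag_value": "Highly Confidential",
            "semantic_categories": ["NAME", "NATIONAL_IDENTIFIER"],
        },
        {
            "tag_name": "tag_db.sch.pii",
            "tag_value": "Confidential",
            "semantic_categories": ["EMAIL"],
        },
    ]
}


def resolve_user_tag(tag_map: dict, semantic_category: str) -> Optional[tuple]:
    """Return (tag_name, tag_value) for a column's semantic category, or None."""
    # Entries are checked in order, so earlier entries take precedence (assumption).
    for entry in tag_map["column_tag_map"]:
        categories = entry.get("semantic_categories")
        # An entry without semantic_categories applies to every classified column.
        if categories is None or semantic_category in categories:
            # An entry without tag_value reuses the semantic category as the value.
            return entry["tag_name"], entry.get("tag_value", semantic_category)
    return None


print(resolve_user_tag(TAG_MAP, "EMAIL"))
print(resolve_user_tag(TAG_MAP, "NAME"))
print(resolve_user_tag(TAG_MAP, "PHONE_NUMBER"))
```

With the example map, EMAIL resolves to the 'Confidential' value, NAME resolves to 'Highly Confidential', and a category that no entry lists resolves to nothing.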

Implementing automatic custom classification

Snowflake lets you define custom classifiers that use custom logic to identify and classify sensitive data. For example, you can create a custom classifier that uses a regular expression to identify ICD-10 codes and classify them as belonging to the semantic category ICD_10_CODES.
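To make the regular-expression idea concrete, the following Python sketch scores a column's values against a simplified ICD-10 pattern. The pattern, the 0.8 threshold, and the function names are illustrative assumptions, not the logic or defaults that Snowflake uses for custom classifiers.

```python
import re
from typing import Optional

# Simplified ICD-10 shape: a letter (except U), a digit, a digit or A/B, then an
# optional dot followed by up to four alphanumerics. Illustrative only.
ICD_10_PATTERN = re.compile(r"[A-TV-Z][0-9][0-9AB]\.?[0-9A-TV-Z]{0,4}")


def match_ratio(values: list) -> float:
    """Fraction of non-null values that fully match the ICD-10 pattern."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return 0.0
    hits = sum(1 for v in non_null if ICD_10_PATTERN.fullmatch(v))
    return hits / len(non_null)


def classify_column(values: list, threshold: float = 0.8) -> Optional[str]:
    """Return the semantic category if enough values match, otherwise None."""
    return "ICD_10_CODES" if match_ratio(values) >= threshold else None


print(classify_column(["E11.9", "I10", "J45.909", "S72.001A"]))
print(classify_column(["E11.9", "I10", "J45.909", "not a code", None]))
```

The first column matches completely and is labeled ICD_10_CODES. In the second column only three of the four non-null values match, which falls below the assumed threshold.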

After you’ve created a custom classifier, you can add it to the classification profile so that Snowflake automatically classifies data based on its logic. You can add the custom classifier when creating the classification profile or by calling the <classification_profile_name>!SET_CUSTOM_CLASSIFIERS method.

Adding both custom classifiers and a tag map in your classification profile provides a powerful governance solution. It allows you to automatically classify data based on your knowledge of what is sensitive and apply a user-defined tag that you can track. If you use this user-defined tag to implement tag-based masking, your domain-specific sensitive data is automatically protected by a masking policy as data is added to a schema.

Important

Automatic classification stores the definition of a custom classifier, not a reference. If you change the custom classifier, you must use the SET_CUSTOM_CLASSIFIERS method to update the classification profile with the new definition.
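The difference between storing a definition and storing a reference can be illustrated with a small Python analogy. Everything here is a hypothetical stand-in: the dictionaries model a classifier and a profile, and set_custom_classifiers only mimics the copy-on-set behavior described above.

```python
import copy


def set_custom_classifiers(profile: dict, classifiers: dict) -> None:
    """Store a deep copy of each classifier definition in the profile."""
    profile["custom_classifiers"] = copy.deepcopy(classifiers)


# Hypothetical classifier definition: category name mapped to a value regex.
medical_codes = {"ICD_10_CODES": "[A-Z][0-9]{2}"}
profile = {"custom_classifiers": {}}

set_custom_classifiers(profile, {"medical_codes": medical_codes})

# Editing the classifier afterwards leaves the profile's stored copy untouched.
medical_codes["ICD_10_CODES"] = "[A-TV-Z][0-9][0-9AB]"
stale = profile["custom_classifiers"]["medical_codes"]["ICD_10_CODES"]

# Setting the classifiers again refreshes the stored definition.
set_custom_classifiers(profile, {"medical_codes": medical_codes})
fresh = profile["custom_classifiers"]["medical_codes"]["ICD_10_CODES"]

print(stale)
print(fresh)
```

Until the classifiers are set again, the profile keeps using the old definition, which is why the update step is required.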

View results of automatic classification

You can view the results of automatic classification in the following ways:

  • Call the SYSTEM$GET_CLASSIFICATION_RESULT stored procedure. For example:

    CALL SYSTEM$GET_CLASSIFICATION_RESULT('mydb.sch.t1');
    

    You cannot return results until the classification process completes. The automatic classification process does not start until one hour after setting the classification profile on the schema.

  • Use a role that is granted the SNOWFLAKE.GOVERNANCE_VIEWER database role to query the DATA_CLASSIFICATION_LATEST view. For example:

    SELECT * FROM snowflake.account_usage.data_classification_latest;
    

    Results might not appear until three hours after classification completes.
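Classification results are returned as JSON, so a small client-side helper can condense them into one row per column. The following Python sketch assumes the result has the same shape as the SYSTEM$CLASSIFY output shown in the Examples section; the shape returned by other procedures or views might differ. The sample document and the summarize function are illustrative.

```python
import json

RESULT_JSON = """
{
  "classification_result": {
    "EMAIL": {
      "recommendation": {
        "confidence": "HIGH",
        "privacy_category": "IDENTIFIER",
        "semantic_category": "EMAIL",
        "tags": [
          {"tag_applied": true, "tag_name": "snowflake.core.semantic_category", "tag_value": "EMAIL"},
          {"tag_applied": true, "tag_name": "tag_db.sch.pii", "tag_value": "sensitive"}
        ]
      }
    }
  }
}
"""


def summarize(result_json: str) -> list:
    """Return one row per column: (column, semantic category, confidence, applied tags)."""
    result = json.loads(result_json)
    rows = []
    for column, details in result.get("classification_result", {}).items():
        rec = details.get("recommendation", {})
        # Keep only the tags that were actually set on the column.
        applied = {
            tag["tag_name"]: tag["tag_value"]
            for tag in rec.get("tags", [])
            if tag.get("tag_applied")
        }
        rows.append((column, rec.get("semantic_category"), rec.get("confidence"), applied))
    return rows


for row in summarize(RESULT_JSON):
    print(row)
```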

Limitations

  • Classification profiles cannot be set on a reader account.

  • You cannot automatically classify views.

  • Only one classification profile can be set on a schema.

  • The same classification profile cannot be set on more than 10,000 schemas.

  • A maximum of 100 million tables can be classified in a schema.

  • You cannot automatically classify a table if it has any of the following characteristics:

    • Has more than 10,000 columns.

    • Has a column with a name that has more than 255 characters.

    • Has a column with a name that includes the $ character.

    • Is from a share.
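Before you set a profile on a schema, it can be useful to check which tables fall outside these limits. The following Python sketch encodes the table-level limitations listed above. The TableInfo class is a hypothetical container; how you collect its values (for example, from your own metadata queries) is up to you.

```python
from dataclasses import dataclass, field

MAX_COLUMNS = 10_000
MAX_COLUMN_NAME_LENGTH = 255


@dataclass
class TableInfo:
    """Hypothetical table metadata used only for this check."""
    name: str
    column_names: list = field(default_factory=list)
    is_view: bool = False
    is_from_share: bool = False


def blocking_reasons(table: TableInfo) -> list:
    """Return the limitations that prevent automatic classification of the table."""
    reasons = []
    if table.is_view:
        reasons.append("views cannot be automatically classified")
    if table.is_from_share:
        reasons.append("table is from a share")
    if len(table.column_names) > MAX_COLUMNS:
        reasons.append("more than 10,000 columns")
    if any(len(c) > MAX_COLUMN_NAME_LENGTH for c in table.column_names):
        reasons.append("column name longer than 255 characters")
    if any("$" in c for c in table.column_names):
        reasons.append("column name contains $")
    return reasons


print(blocking_reasons(TableInfo("t1", ["ID", "EMAIL"])))
print(blocking_reasons(TableInfo("t2", ["ID", "AMOUNT$USD"])))
```

An empty list means that none of the listed table-level limitations apply.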

Access control

This section describes the privileges and roles that let you work with classification profiles and enable automatic sensitive data classification.

The following lists each task with the privileges or roles it requires, followed by any notes.

Task: Create a classification profile

  • SNOWFLAKE.CLASSIFICATION_ADMIN database role. For information about granting this database role to other roles, see Using SNOWFLAKE database roles.

  • CREATE SNOWFLAKE.DATA_PRIVACY.CLASSIFICATION_PROFILE on schema. You need this privilege on the schema where you want to create the classification profile instance.

Task: Set the classification profile on a schema

  • One of the following (by default, the owner of the schema has the EXECUTE AUTO CLASSIFICATION privilege on it):

    • EXECUTE AUTO CLASSIFICATION on account

    • EXECUTE AUTO CLASSIFICATION on schema

  • Any privilege on Database. You need at least one privilege on the database that contains the schema on which you are setting the classification profile.

  • Any privilege on Schema. You need at least one privilege on the schema that contains the table that you want to automatically classify. The EXECUTE AUTO CLASSIFICATION privilege meets this requirement.

  • One of the following (for information about granting the PRIVACY_USER instance role to other roles, see Instance roles):

    • OWNERSHIP on classification profile instance

    • <classification_profile>!PRIVACY_USER instance role on the classification profile

  • APPLY TAG on Account

Task: Call methods on a classification profile instance

  • <classification_profile>!PRIVACY_USER instance role. For information about granting this instance role to other roles, see Instance roles.

Task: List classification profiles

  • <classification_profile>!PRIVACY_USER instance role

Task: Drop classification profiles

  • OWNERSHIP on classification profile instance

For an example of granting these privileges and database roles to the role of a data engineer, see Basic example: Automatically classifying tables in a schema.
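The requirements for setting a classification profile on a schema combine several conditions, which the following Python sketch expresses as a single check. It reflects one reading of the requirements in this section. The representation of grants as (privilege, scope) pairs and the scope labels are hypothetical simplifications, not a Snowflake API.

```python
def can_set_profile_on_schema(grants: set) -> bool:
    """Check a role's (privilege, scope) pairs against the listed requirements."""
    has_execute = (
        ("EXECUTE AUTO CLASSIFICATION", "ACCOUNT") in grants
        or ("EXECUTE AUTO CLASSIFICATION", "SCHEMA") in grants
    )
    # Any privilege on the database and on the schema is sufficient.
    has_db_privilege = any(scope == "DATABASE" for _, scope in grants)
    has_schema_privilege = any(scope == "SCHEMA" for _, scope in grants)
    has_profile_access = (
        ("OWNERSHIP", "PROFILE") in grants or ("PRIVACY_USER", "PROFILE") in grants
    )
    has_apply_tag = ("APPLY TAG", "ACCOUNT") in grants
    return all(
        [has_execute, has_db_privilege, has_schema_privilege, has_profile_access, has_apply_tag]
    )


data_engineer = {
    ("USAGE", "DATABASE"),
    ("EXECUTE AUTO CLASSIFICATION", "SCHEMA"),
    ("OWNERSHIP", "PROFILE"),
    ("APPLY TAG", "ACCOUNT"),
}
print(can_set_profile_on_schema(data_engineer))
print(can_set_profile_on_schema(data_engineer - {("APPLY TAG", "ACCOUNT")}))
```

Note that the schema-level EXECUTE AUTO CLASSIFICATION privilege also satisfies the "any privilege on schema" condition in this model.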

Cost of automatically classifying sensitive data

Automatic sensitive data classification consumes credits because it uses serverless compute resources to classify tables in the schema. For more information about pricing for this consumption, see Table 5 in the Snowflake Service Consumption Table.

You can query views in the ACCOUNT_USAGE and ORGANIZATION_USAGE schemas to determine how much was spent on automatically classifying sensitive data. To monitor credit consumption, query the following views:

METERING_HISTORY view (ACCOUNT_USAGE)

Lets you retrieve the hourly cost of automatic classification by filtering on SENSITIVE_DATA_CLASSIFICATION in the SERVICE_TYPE column. For example:

SELECT
  service_type,
  start_time,
  end_time,
  entity_id,
  name,
  credits_used_compute,
  credits_used_cloud_services,
  credits_used,
  budget_id
  FROM snowflake.account_usage.metering_history
  WHERE service_type = 'SENSITIVE_DATA_CLASSIFICATION';

METERING_DAILY_HISTORY view (ACCOUNT_USAGE and ORGANIZATION_USAGE)

Lets you retrieve the daily cost of automatic classification by filtering on SENSITIVE_DATA_CLASSIFICATION in the SERVICE_TYPE column. For example:

SELECT
  service_type,
  usage_date,
  credits_used_compute,
  credits_used_cloud_services,
  credits_used
  FROM snowflake.account_usage.metering_daily_history
  WHERE service_type = 'SENSITIVE_DATA_CLASSIFICATION';

USAGE_IN_CURRENCY_DAILY (ORGANIZATION_USAGE)

Lets you retrieve the daily cost of automatic classification by filtering on SENSITIVE_DATA_CLASSIFICATION in the SERVICE_TYPE column. Use this view to determine the cost in currency, not credits.
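If you export hourly rows from the METERING_HISTORY view, you can roll them up by day on the client side. The following Python sketch assumes each row is a dictionary with service_type, start_time, and credits_used keys, which mirror columns used in the queries in this section. The sample rows are made-up values for illustration only.

```python
from collections import defaultdict
from datetime import datetime

SERVICE_TYPE = "SENSITIVE_DATA_CLASSIFICATION"

# Made-up hourly rows shaped like the query output.
rows = [
    {"service_type": SERVICE_TYPE, "start_time": datetime(2025, 1, 1, 9), "credits_used": 0.25},
    {"service_type": SERVICE_TYPE, "start_time": datetime(2025, 1, 1, 10), "credits_used": 0.5},
    {"service_type": "WAREHOUSE_METERING", "start_time": datetime(2025, 1, 1, 10), "credits_used": 3.0},
    {"service_type": SERVICE_TYPE, "start_time": datetime(2025, 1, 2, 9), "credits_used": 0.5},
]


def daily_classification_credits(rows: list) -> dict:
    """Sum credits per calendar day for automatic classification only."""
    totals = defaultdict(float)
    for row in rows:
        # Ignore rows that belong to other services.
        if row["service_type"] != SERVICE_TYPE:
            continue
        totals[row["start_time"].date()] += row["credits_used"]
    return dict(totals)


for day, credits in sorted(daily_classification_credits(rows).items()):
    print(day, credits)
```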

Examples

Basic example: Automatically classifying tables in a schema

Complete these steps to automatically classify a table in the schema:

  1. As an administrator, give the data engineer the roles and privileges they need to automatically classify tables in a schema.

    USE ROLE ACCOUNTADMIN;
    
    GRANT USAGE ON DATABASE mydb TO ROLE data_engineer;
    GRANT EXECUTE AUTO CLASSIFICATION ON SCHEMA mydb.sch TO ROLE data_engineer;
    
    GRANT DATABASE ROLE SNOWFLAKE.CLASSIFICATION_ADMIN TO ROLE data_engineer;
    GRANT CREATE SNOWFLAKE.DATA_PRIVACY.CLASSIFICATION_PROFILE ON SCHEMA mydb.sch TO ROLE data_engineer;
    
    GRANT APPLY TAG ON ACCOUNT TO ROLE data_engineer;
    
  2. Switch to the data engineer role:

    USE ROLE data_engineer;
    
  3. Create the classification profile as an instance of the CLASSIFICATION_PROFILE class:

    CREATE OR REPLACE SNOWFLAKE.DATA_PRIVACY.CLASSIFICATION_PROFILE
      my_classification_profile(
        {
          'minimum_object_age_for_classification_days': 0,
          'maximum_classification_validity_days': 30,
          'auto_tag': true
        });
    
  4. Call the DESCRIBE method on the instance to confirm its properties:

    SELECT my_classification_profile!DESCRIBE();
    
  5. Set the classification profile instance on the schema, which starts the background process of monitoring tables in the schema and automatically classifying them for sensitive data.

    ALTER SCHEMA mydb.sch
     SET CLASSIFICATION_PROFILE = 'mydb.sch.my_classification_profile';
    

    Note

    There is a one-hour delay between setting the classification profile on the schema and Snowflake beginning to classify the schema.

  6. After waiting one hour, call the SYSTEM$GET_CLASSIFICATION_RESULT stored procedure to obtain the results of the automatic classification.

    CALL SYSTEM$GET_CLASSIFICATION_RESULT('mydb.sch.t1');
    
  7. If you no longer need to automatically classify tables in a schema, unset the classification profile from the schema:

    ALTER SCHEMA mydb.sch UNSET CLASSIFICATION_PROFILE;
    
  8. Drop any classification profiles that are not needed using the DROP CLASSIFICATION_PROFILE command.

Example: Using a tag map and custom classifiers

  1. As an administrator, give the data engineer the roles and privileges they need to automatically classify tables in a schema and set tags on columns.

  2. Create the classification profile.

    CREATE OR REPLACE SNOWFLAKE.DATA_PRIVACY.CLASSIFICATION_PROFILE
      my_classification_profile(
        {
          'minimum_object_age_for_classification_days': 0,
          'maximum_classification_validity_days': 30,
          'auto_tag': true
        });
    
  3. Call the SET_TAG_MAP method on the instance to add a tag map to the classification profile. This allows custom tags to be automatically applied on columns that contain sensitive data.

    CALL my_classification_profile!SET_TAG_MAP(
      {'column_tag_map':[
        {
          'tag_name':'tag_db.sch.pii',
          'tag_value':'sensitive',
          'semantic_categories':['NAME']
        }]});
    

    Alternatively, you could have added this tag map when you created the classification profile.

  4. Call the SET_CUSTOM_CLASSIFIERS method to add custom classifiers to the classification profile. This allows sensitive data to be automatically classified with user-defined semantic and privacy categories.

    CALL my_classification_profile!set_custom_classifiers(
      {
        'medical_codes': medical_codes!list(),
        'finance_codes': finance_codes!list()
      });
    

    Alternatively, you could have added the custom classifiers when you created the classification profile.

  5. Call the DESCRIBE method on the instance to confirm that the tag map and custom classifiers have been added to the classification profile.

    SELECT my_classification_profile!DESCRIBE();
    
  6. Set the classification profile instance on the schema.

    ALTER SCHEMA mydb.sch
     SET CLASSIFICATION_PROFILE = 'mydb.sch.my_classification_profile';
    
  7. Attach a masking policy to the tag_db.sch.pii tag to enable tag-based masking.

    ALTER TAG tag_db.sch.pii SET MASKING POLICY pii_mask;
    

Example: Testing a classification profile before enabling automatic classification

  1. As an administrator, give the data engineer the roles and privileges they need to automatically classify tables in a schema and set tags on columns.

  2. Create the classification profile with a tag map and custom classifiers:

    CREATE OR REPLACE SNOWFLAKE.DATA_PRIVACY.CLASSIFICATION_PROFILE my_classification_profile(
      {
        'minimum_object_age_for_classification_days':0,
        'auto_tag':true,
        'tag_map': {
          'column_tag_map':[
            {
              'tag_name':'tag_db.sch.pii',
              'tag_value':'highly sensitive',
              'semantic_categories':['NAME','NATIONAL_IDENTIFIER']
            },
            {
              'tag_name':'tag_db.sch.pii',
              'tag_value':'sensitive',
              'semantic_categories':['EMAIL','MEDICAL_CODE']
            }
          ]
        },
        'custom_classifiers': {
          'medical_codes': medical_codes!list(),
          'finance_codes': finance_codes!list()
        }
      }
    );
    
  3. Call the SYSTEM$CLASSIFY stored procedure to test the tag mappings on the table1 table before enabling automatic classification.

    CALL SYSTEM$CLASSIFY(
     'db.sch.table1',
     'db.sch.my_classification_profile'
    );
    

    The tags key in the output contains the details about whether the tag was set (true if set, false otherwise), the name of the tag that was set, and the value of the tag:

    {
      "classification_profile_config": {
        "classification_profile_name": "db.schema.my_classification_profile"
      },
      "classification_result": {
        "EMAIL": {
          "alternates": [],
          "recommendation": {
            "confidence": "HIGH",
            "coverage": 1,
            "details": [],
            "privacy_category": "IDENTIFIER",
            "semantic_category": "EMAIL",
            "tags": [
              {
                "tag_applied": true,
                "tag_name": "snowflake.core.semantic_category",
                "tag_value": "EMAIL"
              },
              {
                "tag_applied": true,
                "tag_name": "snowflake.core.privacy_category",
                "tag_value": "IDENTIFIER"
              },
              {
                "tag_applied": true,
                "tag_name": "tag_db.sch.pii",
                "tag_value": "sensitive"
              }
            ]
          },
          "valid_value_ratio": 1
        },
        "FIRST_NAME": {
          "alternates": [],
          "recommendation": {
            "confidence": "HIGH",
            "coverage": 1,
            "details": [],
            "privacy_category": "IDENTIFIER",
            "semantic_category": "NAME",
            "tags": [
              {
                "tag_applied": true,
                "tag_name": "snowflake.core.semantic_category",
                "tag_value": "NAME"
              },
              {
                "tag_applied": true,
                "tag_name": "snowflake.core.privacy_category",
                "tag_value": "IDENTIFIER"
              },
              {
                "tag_applied": true,
                "tag_name": "tag_db.sch.pii",
                "tag_value": "highly sensitive"
              }
            ]
          },
          "valid_value_ratio": 1
        }
      }
    }
    
  4. Having verified that automatic classification based on the classification profile will have the desired result, set the classification profile instance on the schema.

    ALTER SCHEMA mydb.sch
     SET CLASSIFICATION_PROFILE = 'mydb.sch.my_classification_profile';
    

Troubleshooting

By default, Snowflake uses the user event table to log events related to the automatic classification of sensitive data. If you want to prevent classification events from being logged, set the ENABLE_AUTOMATIC_SENSITIVE_DATA_CLASSIFICATION_LOG account parameter to FALSE.

You can use the following query to access error messages from the event table:

SELECT
  record_type,
  record:severity_text::string log_level,
  parse_json(value) error_message
  FROM log_db.log_schema.log_table
  WHERE record_type='LOG' and scope:name ='snow.automatic_sensitive_data_classification'
  ORDER BY log_level;

The following are possible error messages, where the output is truncated to contain the "failure_reason" key and its value (the error message):

Error

"failure_reason":"NO_TAGGING_PRIVILEGE"

Cause

The role that was used for automatic classification does not have the correct privileges to set tags.

Solution

Grant the necessary privileges to the role used for automatic classification. For more information, see Tag privileges.

Error

"failure_reason":"MANUALLY_APPLIED_VALUE_PRESENT"

Cause

Another tag is manually set on the column.

Solution

Determine whether you want to keep the tag that was manually set on the column. If not, unset the tag before classifying the table using automatic classification or the SYSTEM$CLASSIFY stored procedure.

Error

"failure_reason":"TAG_NOT_ACCESSIBLE_OR_AUTHORIZED"

Cause

The role that was used for classification cannot access the tag.

Solution

  • If the tag does not exist, create the tag.

  • If the tag exists, grant privileges on the tag, or on the database and schema that contain the tag, to the role that was used to classify the schema.

For more information about event table messages, see Viewing log messages.
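When you review many logged events, a small lookup can turn each failure reason into the matching remedy from this section. The following Python sketch assumes that each error message is a JSON object with a "failure_reason" key, as in the truncated examples above. The function name and the fallback text are hypothetical.

```python
import json

# Remedies condensed from the Cause and Solution entries in this section.
REMEDIES = {
    "NO_TAGGING_PRIVILEGE": (
        "Grant the privileges needed to set tags to the role used for automatic classification."
    ),
    "MANUALLY_APPLIED_VALUE_PRESENT": (
        "Keep the manually set tag, or unset it before classifying the table again."
    ),
    "TAG_NOT_ACCESSIBLE_OR_AUTHORIZED": (
        "Create the tag if it does not exist, or grant the classifying role access to it."
    ),
}


def suggest_remedy(message: str) -> str:
    """Map a logged error message to a suggested next step."""
    payload = json.loads(message)
    reason = payload.get("failure_reason")
    return REMEDIES.get(reason, "No known remedy; inspect the full message.")


print(suggest_remedy('{"failure_reason":"NO_TAGGING_PRIVILEGE"}'))
print(suggest_remedy('{"failure_reason":"SOMETHING_ELSE"}'))
```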