Classify sensitive data automatically¶
Automatic sensitive data classification is a serverless feature that enables the automatic detection and tagging of sensitive data. The feature continuously monitors tables within a specific schema and classifies their columns using native and custom classification categories.
Automatic sensitive data classification lets data engineers and stewards do the following:
Demonstrate how automatically classifying tables meets internal governance and compliance needs.
Ensure sensitive data is properly tagged.
Ensure the right access controls are in place to protect the sensitive data.
Get started¶
The basic workflow to automatically classify sensitive data consists of the following:
Create a classification profile that controls how often sensitive data in a schema is automatically classified, including whether system tags should be automatically applied after classification.
Optionally, use the classification profile to map user-defined tags to system tags so a column with sensitive data can be associated with a user-defined tag based on its classification.
Optionally, add a custom classifier to the classification profile so sensitive data can be automatically classified with user-defined semantic and privacy categories.
Set the classification profile on a schema so that tables in the schema get automatically classified.
For end-to-end examples of this workflow, see Examples.
About classification profiles¶
A data engineer creates a classification profile by creating an instance of the CLASSIFICATION_PROFILE class to define the criteria that are used to automatically classify tables in a schema. These criteria include:
How long a table should exist before automatically classifying it.
How long before previously classified tables should be reclassified.
Whether system and custom tags are automatically set on columns after the classification. You can decide whether you want Snowflake to automatically apply suggested tags or prefer to review proposed tag assignments, then apply them yourself.
A mapping between system classification tags and user-defined object tags so the user-defined tags can be applied automatically.
When the data engineer assigns the classification profile to a schema, sensitive data in the tables of the schema is automatically classified on the schedule defined by the profile. A data engineer can assign the same classification profile to multiple schemas, or can create multiple classification profiles if different schemas need different classification criteria.
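The two age-based criteria can be pictured with a small Python sketch. This is only a model of the behavior described above, not Snowflake's implementation: the function name `is_due_for_classification` and its parameters are hypothetical, and the real service evaluates these rules internally.

```python
from datetime import datetime, timedelta
from typing import Optional


def is_due_for_classification(
    created_at: datetime,
    last_classified_at: Optional[datetime],
    now: datetime,
    minimum_object_age_days: int,
    maximum_validity_days: int,
) -> bool:
    """Hypothetical model: would a table be picked up under the profile's age criteria?"""
    # Criterion 1: the table must have existed for at least the minimum age.
    if now - created_at < timedelta(days=minimum_object_age_days):
        return False
    # A table that has never been classified is due as soon as it is old enough.
    if last_classified_at is None:
        return True
    # Criterion 2: reclassify once the previous result is older than the validity window.
    return now - last_classified_at >= timedelta(days=maximum_validity_days)
```

With a minimum age of 0 days and a validity window of 30 days, a new table is due immediately, and a table classified 12 days ago is not due again yet.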
The process of automatically classifying data requires access to the raw data in the table, including data in columns that have a masking policy assigned. However, Snowflake preserves the intention of regulating access to protected data by using an internal role to automatically classify data. The internal role can access data protected by a masking policy, but this role is not accessible to users.
For an example of using the CREATE CLASSIFICATION_PROFILE command to create a classification profile, see Examples.
About tag mapping¶
You can use the classification profile to map SEMANTIC_CATEGORY system tags to one or more user-defined tags. This tag mapping allows a column with sensitive data to be automatically assigned a user-defined tag based on its classification. The tag map can be added while creating the classification profile or later by calling the <classification_profile_name>!SET_TAG_MAP method.
Because user-defined object tags can have a masking policy associated with them, you can use a tag map to enable automatic tag-based masking. If you choose to automatically apply tags after classification, you can automate the entire process of protecting columns with a masking policy based on the classification of data. As new data is added to a schema, the tag-based masking policies will be automatically assigned to the columns that contain sensitive data.
Regardless of whether you are defining the tag map while creating the classification profile or after, the contents of the map are specified as a JSON object. This JSON object contains the 'column_tag_map' key, which is an array of objects that specify a user-defined tag, the string value of that tag, and the semantic categories to which the tag is being mapped. After the tag map is associated with a classification profile and you automatically classify tables in a schema, the tag is assigned to the columns that correspond to the semantic categories.
The following is an example of a tag map:
'tag_map': {
  'column_tag_map': [
    {
      'tag_name': 'tag_db.sch.pii',
      'tag_value': 'Highly Confidential',
      'semantic_categories': [
        'NAME',
        'NATIONAL_IDENTIFIER'
      ]
    },
    {
      'tag_name': 'tag_db.sch.pii',
      'tag_value': 'Confidential',
      'semantic_categories': [
        'EMAIL'
      ]
    }
  ]
}
Based on this mapping, if the classification process determines that a column contains email addresses, the tag_db.sch.pii = 'Confidential' tag is set on that column.
If your tag map includes multiple JSON objects that map tags, tag values, and category values, the order of the JSON objects determines which tag and value to set on the column if there is a conflict. Specify the JSON objects in the desired assignment order from left to right, or top to bottom if you are formatting JSON.
Tip
Each object in the column_tag_map field has only one required key: tag_name. If you omit the tag_value and semantic_categories keys, the user-defined tag is applied to every column to which the SEMANTIC_CATEGORY system tag is applied, and the value of the user-defined tag matches the value of the SEMANTIC_CATEGORY tag for a given column.
If there is a conflict between a manually assigned tag and a tag applied by automatic classification, an error occurs. For information about tracking these errors, see Troubleshooting.
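If you generate tag maps from a script rather than writing them by hand, a small helper can enforce the shape described in this section. The following Python sketch is illustrative only: `make_tag_map_entry` and `make_tag_map` are hypothetical helper names, not part of any Snowflake library. It encodes the rule that tag_name is the only required key and preserves the order of the entries.

```python
import json


def make_tag_map_entry(tag_name, tag_value=None, semantic_categories=None):
    """Build one column_tag_map object; only tag_name is required."""
    if not tag_name:
        raise ValueError("tag_name is required")
    entry = {"tag_name": tag_name}
    # Optional keys are left out entirely when they are not provided.
    if tag_value is not None:
        entry["tag_value"] = tag_value
    if semantic_categories is not None:
        entry["semantic_categories"] = list(semantic_categories)
    return entry


def make_tag_map(entries):
    """Wrap the entries in the column_tag_map key, keeping their order."""
    return {"column_tag_map": list(entries)}


tag_map = make_tag_map([
    make_tag_map_entry("tag_db.sch.pii", "Highly Confidential",
                       ["NAME", "NATIONAL_IDENTIFIER"]),
    make_tag_map_entry("tag_db.sch.pii", "Confidential", ["EMAIL"]),
])
print(json.dumps(tag_map, indent=2))
```

Python prints JSON with double quotes, whereas the examples in this topic write the object with single quotes, so adapt the quoting before pasting the result into a SQL statement.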
Implementing automatic custom classification¶
Snowflake lets you define custom classifiers that use custom logic to identify and classify sensitive data. For example, you can create a custom classifier that uses a regular expression to identify ICD-10 codes and classify them as belonging to the semantic category ICD_10_CODES.
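Before registering a regular expression with a custom classifier, it can help to try the pattern against sample values. The Python sketch below does that offline. The pattern is a simplified, illustrative approximation of the ICD-10 code shape (an assumption, not an official definition), and `match_ratio` is a hypothetical helper, not a Snowflake function.

```python
import re

# Simplified assumption: a letter, a digit, one alphanumeric character, and an
# optional decimal point followed by one to four alphanumeric characters.
ICD10_PATTERN = re.compile(r"[A-Z][0-9][0-9A-Z](\.[0-9A-Z]{1,4})?")


def match_ratio(values):
    """Return the fraction of non-null values that fully match the pattern."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return 0.0
    hits = sum(1 for v in non_null if ICD10_PATTERN.fullmatch(v.strip().upper()))
    return hits / len(non_null)


sample_column = ["E11.9", "J45", "S72.001A", "hello"]
print(match_ratio(sample_column))
```

A low ratio on a column you know contains codes suggests the pattern is too strict. A high ratio on unrelated columns suggests it is too loose.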
After you’ve created a custom classifier, you can add it to the classification profile so that Snowflake automatically classifies data based on its logic. You can add the custom classifier when creating the classification profile or by calling the <classification_profile_name>!SET_CUSTOM_CLASSIFIERS method.
Adding both custom classifiers and a tag map in your classification profile provides a powerful governance solution. It allows you to automatically classify data based on your knowledge of what is sensitive and apply a user-defined tag that you can track. If you use this user-defined tag to implement tag-based masking, your domain-specific sensitive data is automatically protected by a masking policy as data is added to a schema.
Important
Automatic classification stores the definition of a custom classifier, not a reference. If you change the custom classifier, you must use the SET_CUSTOM_CLASSIFIERS method to update the classification profile with the new definition.
View results of automatic classification¶
You can view the results of automatic classification in the following ways:
Call the SYSTEM$GET_CLASSIFICATION_RESULT stored procedure. For example:
CALL SYSTEM$GET_CLASSIFICATION_RESULT('mydb.sch.t1');
You cannot return results until the classification process completes. The automatic classification process does not start until one hour after setting the classification profile on the schema.
Use a role that is granted the SNOWFLAKE.GOVERNANCE_VIEWER database role to query the DATA_CLASSIFICATION_LATEST view. For example:
SELECT * FROM snowflake.account_usage.data_classification_latest;
Results might not appear until three hours after classification completes.
Limitations¶
Classification profiles cannot be set on a reader account.
You cannot automatically classify views.
Only one classification profile can be set on a schema.
The same classification profile cannot be set on more than 10,000 schemas.
A maximum of 100 million tables can be classified in a schema.
You cannot automatically classify a table if it has any of the following characteristics:
Has more than 10,000 columns.
Has a column with a name that has more than 255 characters.
Has a column with a name that includes the $ character.
Is from a share.
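The table-level limitations above can be checked ahead of time if you already have a table's column names. The following Python sketch is a minimal illustration: `classification_blockers` is a hypothetical helper, and how you obtain the column names and the share flag is left to you.

```python
def classification_blockers(column_names, is_from_share=False):
    """Return the documented reasons a table cannot be automatically classified."""
    reasons = []
    if len(column_names) > 10_000:
        reasons.append("more than 10,000 columns")
    if any(len(name) > 255 for name in column_names):
        reasons.append("column name longer than 255 characters")
    if any("$" in name for name in column_names):
        reasons.append("column name contains '$'")
    if is_from_share:
        reasons.append("table is from a share")
    return reasons


# An empty list means none of the listed limitations apply.
print(classification_blockers(["ID", "EMAIL"]))
print(classification_blockers(["ID", "AMOUNT$"], is_from_share=True))
```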
Access control¶
This section describes the privileges and roles that let you work with classification profiles and enable automatic sensitive data classification.
| Task | Required privileges/roles | Notes |
|---|---|---|
| Create a classification profile | SNOWFLAKE.CLASSIFICATION_ADMIN database role | For information about granting this database role to other roles, see Using SNOWFLAKE database roles. |
| | CREATE SNOWFLAKE.DATA_PRIVACY.CLASSIFICATION_PROFILE on schema | You need this privilege on the schema where you want to create the classification profile instance. |
| Set the classification profile on a schema | One of the following: | By default, the owner of the schema has the EXECUTE AUTO CLASSIFICATION privilege on it. |
| | Any privilege on Database | You need at least one privilege on the database that contains the schema on which you are setting the classification profile. |
| | Any privilege on Schema | You need at least one privilege on the schema that contains the table that you want to automatically classify. The EXECUTE AUTO CLASSIFICATION privilege meets this requirement. |
| | One of the following: | For information about granting the PRIVACY_USER instance role to other roles, see Instance roles. |
| | APPLY TAG on Account | |
| Call methods on a classification profile instance | <classification_profile>!PRIVACY_USER instance role | For information about granting this instance role to other roles, see Instance roles. |
| List classification profiles | <classification_profile>!PRIVACY_USER instance role | |
| Drop classification profiles | OWNERSHIP on classification profile instance | |
For an example of granting these privileges and database roles to the role of a data engineer, see Basic example: Automatically classifying tables in a schema.
Cost of automatically classifying sensitive data¶
Automatic sensitive data classification consumes credits as it uses serverless compute resources to classify tables in the schema. For more information about pricing for this consumption, see Table 5 in the Snowflake Service Consumption Table.
You can query views in the ACCOUNT_USAGE and ORGANIZATION_USAGE schemas to determine how much was spent on automatically classifying sensitive data. To monitor credit consumption, query the following views:
- METERING_HISTORY view (ACCOUNT_USAGE)
Lets you retrieve the hourly cost of automatic classification by focusing on SENSITIVE_DATA_CLASSIFICATION in the SERVICE_TYPE column. For example:
SELECT
  service_type,
  start_time,
  end_time,
  entity_id,
  name,
  credits_used_compute,
  credits_used_cloud_services,
  credits_used,
  budget_id
FROM snowflake.account_usage.metering_history
WHERE service_type = 'SENSITIVE_DATA_CLASSIFICATION';
- METERING_DAILY_HISTORY view (ACCOUNT_USAGE and ORGANIZATION_USAGE)
Lets you retrieve the daily cost of automatic classification by focusing on SENSITIVE_DATA_CLASSIFICATION in the SERVICE_TYPE column. For example:
SELECT
  service_type,
  usage_date,
  credits_used_compute,
  credits_used_cloud_services,
  credits_used
FROM snowflake.account_usage.metering_daily_history
WHERE service_type = 'SENSITIVE_DATA_CLASSIFICATION';
- USAGE_IN_CURRENCY_DAILY (ORGANIZATION_USAGE)
Lets you retrieve the daily cost of automatic classification by focusing on SENSITIVE_DATA_CLASSIFICATION in the SERVICE_TYPE column. Use this view to determine the cost in currency, not credits.
Examples¶
Basic example: Automatically classifying tables in a schema¶
Complete these steps to automatically classify a table in the schema:
As an administrator, give the data engineer the roles and privileges they need to automatically classify tables in a schema.
USE ROLE ACCOUNTADMIN;
GRANT USAGE ON DATABASE mydb TO ROLE data_engineer;
GRANT EXECUTE AUTO CLASSIFICATION ON SCHEMA mydb.sch TO ROLE data_engineer;
GRANT DATABASE ROLE SNOWFLAKE.CLASSIFICATION_ADMIN TO ROLE data_engineer;
GRANT CREATE SNOWFLAKE.DATA_PRIVACY.CLASSIFICATION_PROFILE ON SCHEMA mydb.sch TO ROLE data_engineer;
GRANT APPLY TAG ON ACCOUNT TO ROLE data_engineer;
Switch to the data engineer role:
USE ROLE data_engineer;
Create the classification profile as an instance of the CLASSIFICATION_PROFILE class:
CREATE OR REPLACE SNOWFLAKE.DATA_PRIVACY.CLASSIFICATION_PROFILE my_classification_profile(
  {
    'minimum_object_age_for_classification_days': 0,
    'maximum_classification_validity_days': 30,
    'auto_tag': true
  });
Call the DESCRIBE method on the instance to confirm its properties:
SELECT my_classification_profile!DESCRIBE();
Set the classification profile instance on the schema, which starts the background process of monitoring tables in the schema and automatically classifying them for sensitive data.
ALTER SCHEMA mydb.sch SET CLASSIFICATION_PROFILE = 'mydb.sch.my_classification_profile';
Note
There is a one-hour delay between setting the classification profile on the schema and Snowflake beginning to classify the schema.
After waiting one hour, call the SYSTEM$GET_CLASSIFICATION_RESULT stored procedure to obtain the results of the automatic classification.
CALL SYSTEM$GET_CLASSIFICATION_RESULT('mydb.sch.t1');
If you no longer need to automatically classify tables in a schema, unset the classification profile from the schema:
ALTER SCHEMA mydb.sch UNSET CLASSIFICATION_PROFILE;
Drop any classification profiles that are not needed using the DROP CLASSIFICATION_PROFILE command.
Example: Using a tag map and custom classifiers¶
As an administrator, give the data engineer the roles and privileges they need to automatically classify tables in a schema and set tags on columns.
Create the classification profile.
CREATE OR REPLACE SNOWFLAKE.DATA_PRIVACY.CLASSIFICATION_PROFILE my_classification_profile(
  {
    'minimum_object_age_for_classification_days': 0,
    'maximum_classification_validity_days': 30,
    'auto_tag': true
  });
Call the SET_TAG_MAP method on the instance to add a tag map to the classification profile. This allows custom tags to be automatically applied on columns that contain sensitive data.
CALL my_classification_profile!SET_TAG_MAP(
  {
    'column_tag_map': [
      {
        'tag_name': 'tag_db.sch.pii',
        'tag_value': 'sensitive',
        'semantic_categories': ['NAME']
      }
    ]
  });
Alternatively, you could have added this tag map when you created the classification profile.
Call the SET_CUSTOM_CLASSIFIERS method to add custom classifiers to the classification profile. This allows sensitive data to be automatically classified with user-defined semantic and privacy categories.
CALL my_classification_profile!set_custom_classifiers(
  {
    'medical_codes': medical_codes!list(),
    'finance_codes': finance_codes!list()
  });
Alternatively, you could have added the custom classifiers when you created the classification profile.
Call the DESCRIBE method on the instance to confirm that the tag map and custom classifiers have been added to the classification profile.
SELECT my_classification_profile!DESCRIBE();
Set the classification profile instance on the schema.
ALTER SCHEMA mydb.sch SET CLASSIFICATION_PROFILE = 'mydb.sch.my_classification_profile';
Attach a masking policy to the tag_db.sch.pii tag to enable tag-based masking.
ALTER TAG tag_db.sch.pii SET MASKING POLICY pii_mask;
Example: Testing a classification profile before enabling automatic classification¶
As an administrator, give the data engineer the roles and privileges they need to automatically classify tables in a schema and set tags on columns.
Create the classification profile with a tag map and custom classifiers:
CREATE OR REPLACE SNOWFLAKE.DATA_PRIVACY.CLASSIFICATION_PROFILE my_classification_profile(
  {
    'minimum_object_age_for_classification_days': 0,
    'auto_tag': true,
    'tag_map': {
      'column_tag_map': [
        {
          'tag_name': 'tag_db.sch.pii',
          'tag_value': 'highly sensitive',
          'semantic_categories': ['NAME', 'NATIONAL_IDENTIFIER']
        },
        {
          'tag_name': 'tag_db.sch.pii',
          'tag_value': 'sensitive',
          'semantic_categories': ['EMAIL', 'MEDICAL_CODE']
        }
      ]
    },
    'custom_classifiers': {
      'medical_codes': medical_codes!list(),
      'finance_codes': finance_codes!list()
    }
  }
);
Call the SYSTEM$CLASSIFY stored procedure to test the tag mappings on the table1 table before enabling automatic classification.
CALL SYSTEM$CLASSIFY(
  'db.sch.table1',
  'db.sch.my_classification_profile'
);
The tags key in the output contains the details about whether the tag was set (true if set, false otherwise), the name of the tag that was set, and the value of the tag:
{
  "classification_profile_config": {
    "classification_profile_name": "db.schema.my_classification_profile"
  },
  "classification_result": {
    "EMAIL": {
      "alternates": [],
      "recommendation": {
        "confidence": "HIGH",
        "coverage": 1,
        "details": [],
        "privacy_category": "IDENTIFIER",
        "semantic_category": "EMAIL",
        "tags": [
          {
            "tag_applied": true,
            "tag_name": "snowflake.core.semantic_category",
            "tag_value": "EMAIL"
          },
          {
            "tag_applied": true,
            "tag_name": "snowflake.core.privacy_category",
            "tag_value": "IDENTIFIER"
          },
          {
            "tag_applied": true,
            "tag_name": "tag_db.sch.pii",
            "tag_value": "sensitive"
          }
        ]
      },
      "valid_value_ratio": 1
    },
    "FIRST_NAME": {
      "alternates": [],
      "recommendation": {
        "confidence": "HIGH",
        "coverage": 1,
        "details": [],
        "privacy_category": "IDENTIFIER",
        "semantic_category": "NAME",
        "tags": [
          {
            "tag_applied": true,
            "tag_name": "snowflake.core.semantic_category",
            "tag_value": "NAME"
          },
          {
            "tag_applied": true,
            "tag_name": "snowflake.core.privacy_category",
            "tag_value": "IDENTIFIER"
          },
          {
            "tag_applied": true,
            "tag_name": "tag_db.sch.pii",
            "tag_value": "highly sensitive"
          }
        ]
      },
      "valid_value_ratio": 1
    }
  }
}
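If you capture the SYSTEM$CLASSIFY output as a JSON string, a short script can summarize which tags were applied to each column. The Python sketch below assumes only the output structure shown in this example. `applied_tags` is a hypothetical helper, and `sample` is a trimmed copy of the output for the EMAIL column.

```python
import json


def applied_tags(result_json):
    """Map each column to the (tag_name, tag_value) pairs that were applied."""
    result = json.loads(result_json)
    summary = {}
    for column, info in result.get("classification_result", {}).items():
        tags = info.get("recommendation", {}).get("tags", [])
        # Keep only the tags whose tag_applied flag is true.
        summary[column] = [
            (t["tag_name"], t["tag_value"]) for t in tags if t.get("tag_applied")
        ]
    return summary


sample = """
{
  "classification_result": {
    "EMAIL": {
      "recommendation": {
        "semantic_category": "EMAIL",
        "tags": [
          {"tag_applied": true, "tag_name": "snowflake.core.semantic_category", "tag_value": "EMAIL"},
          {"tag_applied": true, "tag_name": "snowflake.core.privacy_category", "tag_value": "IDENTIFIER"},
          {"tag_applied": true, "tag_name": "tag_db.sch.pii", "tag_value": "sensitive"}
        ]
      }
    }
  }
}
"""

for column, tags in applied_tags(sample).items():
    print(column, tags)
```

Comparing this summary with the tag map is a quick way to confirm that each semantic category received the user-defined tag value you expected.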
Having verified that automatic classification based on the classification profile will have the desired result, set the classification profile instance on the schema.
ALTER SCHEMA mydb.sch SET CLASSIFICATION_PROFILE = 'mydb.sch.my_classification_profile';
Troubleshooting¶
By default, Snowflake uses the user event table to log events related to the automatic classification of sensitive data. If you want to prevent classification events from being logged, set the ENABLE_AUTOMATIC_SENSITIVE_DATA_CLASSIFICATION_LOG account parameter to FALSE.
You can use the following query to access error messages from the event table:
SELECT
record_type,
record:severity_text::string log_level,
parse_json(value) error_message
FROM log_db.log_schema.log_table
WHERE record_type='LOG' and scope:name ='snow.automatic_sensitive_data_classification'
ORDER BY log_level;
The following are possible error messages, where the output is truncated to contain the "failure_reason" key and its value (the error message):
| Error | "failure_reason":"NO_TAGGING_PRIVILEGE" |
|---|---|
| Cause | The role that was used for automatic classification does not have the correct privileges to set tags. |
| Solution | Grant the necessary privileges to the role used for automatic classification. For more information, see Tag privileges. |

| Error | "failure_reason":"MANUALLY_APPLIED_VALUE_PRESENT" |
|---|---|
| Cause | Another tag is manually set on the column. |
| Solution | Determine whether you want to keep the tag that was manually set on the column. If not, unset the tag before classifying the table using automatic classification or the SYSTEM$CLASSIFY stored procedure. |

| Error | "failure_reason":"TAG_NOT_ACCESSIBLE_OR_AUTHORIZED" |
|---|---|
| Cause | The role that was used for classification cannot access the tag. |
| Solution | |
For more information about event table messages, see Viewing log messages.
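When you export error messages from the event table, a small lookup can attach the documented cause to each failure. The Python sketch below assumes each message is a JSON object with a "failure_reason" key, as in the truncated output above. `explain_failure` and the lookup table are hypothetical, and the causes are copied from the tables in this section.

```python
import json

# Causes as documented in this section, keyed by failure_reason.
FAILURE_CAUSES = {
    "NO_TAGGING_PRIVILEGE": (
        "The role that was used for automatic classification does not have "
        "the correct privileges to set tags."
    ),
    "MANUALLY_APPLIED_VALUE_PRESENT": "Another tag is manually set on the column.",
    "TAG_NOT_ACCESSIBLE_OR_AUTHORIZED": (
        "The role that was used for classification cannot access the tag."
    ),
}


def explain_failure(message):
    """Return (failure_reason, cause) for a message given as a JSON string or dict."""
    payload = json.loads(message) if isinstance(message, str) else message
    reason = payload.get("failure_reason")
    return reason, FAILURE_CAUSES.get(reason, "Unknown failure reason")


print(explain_failure('{"failure_reason": "MANUALLY_APPLIED_VALUE_PRESENT"}'))
```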