Introduction to Classification
Classification is a process that analyzes and categorizes information stored in the columns of database tables and views.
Once the process completes, classification uses object tags to label the data; the tags can then support analysis and compliance with privacy regulations.
What is Classification?
Classification enables answering questions about the data stored in tables and views, such as:
Does the table/view contain PII (Personally Identifiable Information) or sensitive data?
Where is the data stored and how long has it been stored?
How can the data be protected from exposure while still deriving insights?
The classification process samples all the supported columns in a table or view and uses the column names and values to classify the data into system categories provided by Snowflake. The categories can be assigned to the columns as tags, which can be set manually or using the provided stored procedure.
Classification Use Cases
Once the tags produced by classification have been assigned to a table, view, or column, they can be used to enable a variety of data governance, sharing, and privacy use cases, including:
- PII Classification
You can use classification to identify PII (Personally Identifiable Information) in your data to mitigate risk and meet compliance requirements.
- Data Access
You can use classification tags to configure security controls to prevent unauthorized access to personal data.
- Policy Management
You can use classification tags to determine how to set masking policies to protect the privacy of the data.
- Anonymization
You can use classification to streamline anonymization of personal data. Anonymization relies on classification privacy categories to protect the identity of the associated subjects while still making their data available for analysis.
Supported Table / View Types and Column Data Types
Snowflake supports classifying data stored in all types of tables and views, including:
Classification can be performed on table/view columns of all supported data types except for the following data types:
If a table/view contains columns that are not of a supported data type or that contain only NULL values, the classification process ignores those columns and does not include them in the output.
If your data represents NULL values with a value other than NULL, the accuracy of the classification results may be impacted.
The classification process requires compute resources, which are provided by the virtual warehouse that is in use and running when classification is performed.
The amount of time needed to classify the data in a table/view (and, therefore, the number of credits consumed by the warehouse) is a function of the amount of data to be classified.
In particular, a table/view with a large number of columns that support classification can take longer to process. However, as a general rule, processing speed scales linearly with warehouse size. In other words, each size increase for a warehouse (e.g. X-Small to Small) typically reduces the processing time by half.
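The halving rule above can be turned into a rough estimate. The following is a minimal sketch, not a Snowflake API: the function name, the size list, and the 40-minute baseline are illustrative assumptions, and actual run times depend on the data being classified.

```python
# Warehouse sizes in ascending order. Assumption (from the general rule above):
# each step up halves the processing time, each step down doubles it.
WAREHOUSE_SIZES = ["X-Small", "Small", "Medium", "Large", "X-Large"]


def estimate_minutes(baseline_minutes: float, size: str,
                     baseline_size: str = "X-Small") -> float:
    """Estimate classification time on `size`, given a time measured on `baseline_size`."""
    steps = WAREHOUSE_SIZES.index(size) - WAREHOUSE_SIZES.index(baseline_size)
    return baseline_minutes / (2 ** steps)


# Hypothetical baseline: a table that takes 40 minutes to classify on X-Small.
for size in WAREHOUSE_SIZES:
    print(f"{size}: {estimate_minutes(40, size)} minutes")
```

Treat the result as an approximation only; the rule is described as typical, not guaranteed.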
Snowflake utilizes two category types for classifying data in table/view columns:
A semantic category identifies a column as storing personal attributes. Some of the semantic categories supported by Snowflake include:
Phone number (currently US numbers only)
For a complete list of the semantic categories supported in the current release, see Category Tag Values and Mappings. Additional semantic categories will be added in future releases.
If a column is determined to have a semantic category, the column is further classified according to one of the following privacy categories:
- Identifier
Also known as direct identifiers, these attributes uniquely identify an individual (e.g. name, social security number, or phone number).
- Quasi-identifier
Also known as indirect identifiers, these attributes, when combined with other attributes, can be used to uniquely identify an individual (e.g. age + gender + zip).
- Sensitive
Personal attributes that are not identifying, but are information that individuals do not want disclosed for privacy reasons (e.g. salary or medical/healthcare status).
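To illustrate why indirect identifiers matter only in combination, the sketch below measures how many rows are uniquely pinned down by a given set of columns. It is a plain-Python illustration with made-up sample rows and a hypothetical helper name (`unique_fraction`); it is not part of the classification process itself.

```python
from collections import Counter


def unique_fraction(rows, columns):
    """Return the fraction of rows whose values in `columns` occur exactly once."""
    keys = [tuple(row[c] for c in columns) for row in rows]
    if not keys:
        return 0.0
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(keys)


# Made-up sample data, for illustration only.
people = [
    {"age": 34, "gender": "F", "zip": "94105"},
    {"age": 34, "gender": "M", "zip": "94105"},
    {"age": 34, "gender": "F", "zip": "10001"},
    {"age": 51, "gender": "M", "zip": "94105"},
]

print(unique_fraction(people, ["gender"]))                # 0.0
print(unique_fraction(people, ["age"]))                   # 0.25
print(unique_fraction(people, ["age", "gender", "zip"]))  # 1.0
```

In this sample, no single attribute identifies more than one person in four, but the combination of age, gender, and zip singles out every row.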
Multiple semantic categories from all three privacy categories may be considered “Sensitive Personal Data”, “Special Categories of Data”, or similar terms under laws and regulations, and may require additional protections or controls.
Currently, classification does not tag data as both sensitive and identifying. In other words, classification is an “either-or” operation, which you must consider when creating rules to govern access to data identified as sensitive.
Semantic Category Probabilities and Alternates
In addition to identifying the semantic category and privacy category for a column, Snowflake also returns the following information about the semantic category for the column:
The probability that the classification process derived the correct semantic category.
A list of alternate semantic categories with which the column can be tagged (if the probability is below the 0.80 threshold and the process identified other possible semantic categories with a probability greater than
For more details, see the EXTRACT_SEMANTIC_CATEGORIES function.
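The relationship between the probability and the list of alternates can be sketched as follows. This is a hedged illustration of the logic described above, not the implementation or output format of EXTRACT_SEMANTIC_CATEGORIES: the function name, the dictionary layout, and the placeholder category names are assumptions. Only the 0.80 threshold comes from the text; the lower bound for alternates is not given here, so it is passed in as a parameter.

```python
from typing import Dict, List, Optional

PRIMARY_THRESHOLD = 0.80  # stated in the text above


def summarize_column(candidates: Dict[str, float],
                     alternate_threshold: float) -> Optional[dict]:
    """Pick the most probable category and, if it is below 0.80, list alternates.

    `candidates` maps a category name to its probability (hypothetical input shape).
    `alternate_threshold` is an assumed parameter; its real value is not stated here.
    """
    if not candidates:
        return None
    ranked = sorted(candidates.items(), key=lambda item: item[1], reverse=True)
    top_category, top_probability = ranked[0]
    alternates: List[dict] = []
    if top_probability < PRIMARY_THRESHOLD:
        alternates = [
            {"semantic_category": name, "probability": prob}
            for name, prob in ranked[1:]
            if prob > alternate_threshold
        ]
    return {
        "semantic_category": top_category,
        "probability": top_probability,
        "alternates": alternates,
    }


# Placeholder category names and a placeholder alternate threshold of 0.2.
print(summarize_column({"CATEGORY_A": 0.6, "CATEGORY_B": 0.3, "CATEGORY_C": 0.05}, 0.2))
```

With these placeholder inputs, the top category falls below 0.80, so the one remaining candidate above the assumed lower bound is reported as an alternate.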