Introduction to Classification¶
This topic provides information on how classification works.
For information on how to use custom classifiers, see Custom Data Classification.
Overview¶
Classification is a multi-step process that associates Snowflake-defined system tags with columns by analyzing the fields and metadata for personal data; a data engineer can then track this data using SQL and Snowsight. A data engineer can classify columns in a table to determine whether a column contains certain kinds of data that need to be tracked or protected, such as a unique identifier (passport or bank account data), a quasi-identifier (the city in which the individual lives), or a sensitive value (the salary of an individual).
By tracking the data with a system tag and protecting the data by using a masking or row access policy, the data engineer can improve the governance posture associated with the data. The overall result of the classification and data protection steps is to facilitate compliance with data privacy regulations.
You can classify a single table or tables in a schema. Snowflake provides predefined system tags to enable you to classify and tag columns, or you can use custom classifiers to define your own semantic category based on your knowledge of your data. You can also choose an approach that uses both Snowflake system tags and custom classifiers, depending on the governance posture that you wish to adopt.
Classification provides the following benefits to data privacy and data governance administrators:
- Data access:
The results of classifying column data can help identity and access management administrators evaluate and maintain their Snowflake role hierarchies so that Snowflake roles have the appropriate access to sensitive or PII data.
- Data sharing:
The classification process can help to identify and confirm the storage location of PII data. Subsequently, a data sharing provider can use the classification results to determine whether to share data and how to make the PII data available to a data sharing consumer.
- Policy application:
Understanding how columns containing PII data are used, such as which columns in base tables are referenced to create a view or materialized view, can help to determine the best approach to protect the data with either a masking policy or a row access policy.
Supported objects and data types¶
Snowflake supports classifying data stored in all types of tables and views, including external tables, materialized views, and secure views.
You can classify table and view columns for all supported data types except for the following data types:
ARRAY
BINARY
GEOGRAPHY
OBJECT
VARIANT
Note that you can classify a column with the VARIANT data type when the column data can be cast to a NUMBER or STRING data type. Snowflake does not classify the column if it contains JSON, XML, or other semi-structured data.
VECTOR
If a table contains columns that are not of a supported data type, or columns that contain only NULL values, the classification process ignores those columns and does not include them in the output.
Important
If your data represents NULL values with a value other than NULL, the accuracy of the classification results may be impacted.
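The filtering rules in this section can be sketched in Python. This is an illustrative model only, not Snowflake's implementation: the `Column` class, its fields, and the `is_classifiable` helper are hypothetical names, and the `castable_to_scalar` flag is an assumed stand-in for the check on whether a VARIANT column can be cast to NUMBER or STRING.

```python
from dataclasses import dataclass

# Data types this topic lists as unsupported for classification.
# VARIANT is handled separately because it is conditionally supported.
UNSUPPORTED_TYPES = {"ARRAY", "BINARY", "GEOGRAPHY", "OBJECT", "VECTOR"}


@dataclass
class Column:
    """Hypothetical description of a column; not a Snowflake API object."""
    name: str
    data_type: str
    all_null: bool = False            # True if every value in the column is NULL
    castable_to_scalar: bool = False  # VARIANT only: castable to NUMBER or STRING


def is_classifiable(col: Column) -> bool:
    """Return True if the column would be considered under the rules above."""
    # Drop any type parameters, e.g. "NUMBER(38,0)" -> "NUMBER".
    base_type = col.data_type.split("(")[0].strip().upper()
    if col.all_null:
        return False
    if base_type == "VARIANT":
        return col.castable_to_scalar
    return base_type not in UNSUPPORTED_TYPES


columns = [
    Column("EMAIL", "VARCHAR"),
    Column("LOCATION", "GEOGRAPHY"),
    Column("PAYLOAD", "VARIANT"),
    Column("LEGACY_ID", "NUMBER(38,0)", all_null=True),
]
classifiable = [c.name for c in columns if is_classifiable(c)]
```

In this example only `EMAIL` remains in `classifiable`; the other three columns are skipped for the reasons listed in this section.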
Compute costs¶
The classification process requires compute resources, which are provided by the virtual warehouse that is in use and running when classification is performed.
The amount of time needed to classify the data in a table or view (and, therefore, the number of credits consumed by the warehouse) depends on the number of columns to be classified.
In particular, if a table or view has a large number of columns that support classification, the processing time can increase. However, as a general rule, the processing speed scales linearly with the warehouse size. In other words, each size increase for a warehouse (for example, X-Small to Small) typically reduces the processing time by half.
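As a rough illustration of the halving rule, the following Python sketch estimates processing time for each warehouse size from a measured X-Small baseline. The function name and the baseline figure are hypothetical, and the result is only the rule of thumb stated above, not a guaranteed runtime.

```python
# Warehouse sizes in increasing order; each step up is assumed to halve the time.
WAREHOUSE_SIZES = ["X-Small", "Small", "Medium", "Large"]


def estimated_minutes(baseline_minutes: float, size: str) -> float:
    """Estimate classification time, given the time measured on X-Small."""
    steps = WAREHOUSE_SIZES.index(size)  # raises ValueError for an unknown size
    return baseline_minutes / (2 ** steps)


# Hypothetical baseline: classification took 40 minutes on X-Small.
estimates = {size: estimated_minutes(40, size) for size in WAREHOUSE_SIZES}
```

With a 40-minute baseline, the sketch yields 20 minutes for Small, 10 for Medium, and 5 for Large.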
Use the following general guidelines to select a warehouse size:
No concern for processing time: X-Small warehouse.
Up to 100 columns in a table: Small warehouse.
101 to 300 columns in a table: Medium warehouse.
301 columns or more in a table: Large warehouse.
For details, see Warehouse considerations.
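The sizing guidelines above can be expressed as a small lookup. This Python sketch is a convenience for planning only; `recommend_warehouse_size` and its `time_sensitive` parameter are hypothetical names, and the thresholds are the general guidelines from this section rather than hard limits.

```python
def recommend_warehouse_size(column_count: int, time_sensitive: bool = True) -> str:
    """Map a table's column count to the warehouse size suggested above."""
    if column_count < 0:
        raise ValueError("column_count must be non-negative")
    if not time_sensitive:
        # No concern for processing time: the smallest warehouse is enough.
        return "X-Small"
    if column_count <= 100:
        return "Small"
    if column_count <= 300:
        return "Medium"
    return "Large"
```

For example, `recommend_warehouse_size(250)` returns `"Medium"`, and `recommend_warehouse_size(250, time_sensitive=False)` returns `"X-Small"`.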
Recommendations¶
To capitalize on the Classification feature and optimize your PII data tracking capabilities, do the following:
- Validation:
Query Account Usage views first:
ACCESS_HISTORY: determine the table and view objects that are accessed most frequently.
OBJECT_DEPENDENCIES: determine metadata references between two or more objects.
Use the query results to prioritize schema-level or database-level assignment of the Classification system tags.
- Column names:
Use sensible column names in your table objects and train table creators to adhere to internal table creation guidelines.
- Data types:
Use sensible data types for columns. For example, an AGE column should have the NUMBER data type.
- VARIANT:
If a column has a VARIANT data type, use the FLATTEN function on the column prior to classifying the table.
- Warehouse:
Use the proper warehouse size when classifying data. For details, refer to Compute costs (in this topic).
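The Validation recommendation above amounts to ranking objects by how often they are accessed. The following Python sketch shows that ranking step only. It assumes the object names have already been retrieved from ACCESS_HISTORY into a plain list, one entry per access; the sample names and the `rank_objects_by_access` helper are hypothetical.

```python
from collections import Counter


def rank_objects_by_access(accessed_objects, top_n=3):
    """Return the top_n most frequently accessed objects with their counts."""
    return Counter(accessed_objects).most_common(top_n)


# Hypothetical sample: one fully qualified object name per recorded access.
accessed = [
    "SALES.PUBLIC.CUSTOMERS",
    "SALES.PUBLIC.ORDERS",
    "SALES.PUBLIC.CUSTOMERS",
    "HR.PUBLIC.EMPLOYEES",
    "SALES.PUBLIC.CUSTOMERS",
    "HR.PUBLIC.EMPLOYEES",
]
priorities = rank_objects_by_access(accessed)
```

Here `priorities` lists `SALES.PUBLIC.CUSTOMERS` first with three accesses, which suggests classifying that table's schema before the others.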
Manage Classification¶
Privilege reference¶
The privilege model for Data Classification enables the data privacy administrator to determine which personas can classify tables and tag columns. For example, a single role can have all of the necessary privileges, or the data privacy administrator can delegate grants to different roles to satisfy separation of duties (SoD) requirements. One example of a viable grant combination is shown in the Get started classifying data section of Use Data Classification.
As an administrator, you have different options depending on how you want to manage which roles or personas are involved. The options provide flexibility for the governance posture that you wish to adopt. For example:
The table owner (the role with the OWNERSHIP privilege on the table) can classify the table and set system tags on the columns.
A custom role that has the SELECT privilege on the table and the APPLY TAG privilege on the account can classify the table and set system tags on the columns.
If you want different roles or personas to be involved with classifying and tagging columns, you could grant the SELECT privilege on the table to one role and the APPLY TAG privilege on the account to a different role.
The following table summarizes the different grant options to classify a table, set the Data Classification system tags on columns, and do both of these tasks:
Privilege or role | Classify table(s) | Set system tags on columns |
---|---|---|
SELECT on the table or view. | ✔ | |
OWNERSHIP on the table. | ✔ | ✔ |
APPLY TAG on the account. | | ✔ |
ACCOUNTADMIN role. | | ✔ |
OWNERSHIP on the database or schema. | | |
Important
Classifying tables requires a running warehouse. The role that is used to classify a table must have the USAGE privilege on a warehouse at a minimum.
You can grant the SNOWFLAKE.GOVERNANCE_VIEWER database role to an account role to enable users with that account role to query the DATA_CLASSIFICATION_LATEST view to see the most recent results of a classified table.
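The grant combinations described in this section can be modeled as a small check. The following Python sketch is a simplified reading of the rules stated in the prose (table ownership, SELECT on the table, APPLY TAG on the account, and USAGE on a warehouse); it does not query Snowflake, and the `Grants` class and both helper functions are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grants:
    """Hypothetical summary of the privileges held by one role."""
    select_on_table: bool = False
    ownership_on_table: bool = False
    apply_tag_on_account: bool = False
    usage_on_warehouse: bool = False


def can_classify(g: Grants) -> bool:
    # Classifying needs a running warehouse plus SELECT or OWNERSHIP on the table.
    return g.usage_on_warehouse and (g.select_on_table or g.ownership_on_table)


def can_set_system_tags(g: Grants) -> bool:
    # Tagging columns needs OWNERSHIP on the table or APPLY TAG on the account.
    return g.ownership_on_table or g.apply_tag_on_account


# Separation of duties: one role classifies, a different role applies the tags.
classifier_role = Grants(select_on_table=True, usage_on_warehouse=True)
tagger_role = Grants(apply_tag_on_account=True)
```

In this split, `classifier_role` can classify but not tag, and `tagger_role` can tag but not classify, which mirrors the separation of duties option described above.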