Introduction to Classification

Classification is a process that analyzes and categorizes the information stored in the columns of database tables and views.

When the process completes, classification uses object tags to label the data. You can then use these tags to analyze the data and to support compliance with privacy regulations.


What is Classification?

Classification enables answering questions about the data stored in tables and views, such as:

  • Does the table/view contain PII (Personally Identifiable Information) or sensitive data?

  • Where is the data stored and how long has it been stored?

  • How can the data be protected from exposure while still deriving insights?

The classification process samples all the supported columns in a table or view and uses the column names and values to classify the data into system categories provided by Snowflake. The categories can be assigned to the columns as tags, which can be set manually or using the provided stored procedure.
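The flow described above can be sketched as follows. This is a minimal illustration, not a definitive procedure: the table name my_db.my_schema.hr_data and the column fname are hypothetical, and the text above does not name the stored procedure, so the sketch assumes it is ASSOCIATE_SEMANTIC_CATEGORY_TAGS.

```sql
-- Hypothetical table and column names; replace them with your own objects.

-- Step 1: sample the supported columns and return the suggested
-- categories for each column as JSON.
SELECT EXTRACT_SEMANTIC_CATEGORIES('my_db.my_schema.hr_data');

-- Step 2 (option A): apply the suggested categories as system tags
-- using the stored procedure (assumed here to be
-- ASSOCIATE_SEMANTIC_CATEGORY_TAGS).
CALL ASSOCIATE_SEMANTIC_CATEGORY_TAGS(
  'my_db.my_schema.hr_data',
  EXTRACT_SEMANTIC_CATEGORIES('my_db.my_schema.hr_data')
);

-- Step 2 (option B): set a system tag manually on a single column.
ALTER TABLE my_db.my_schema.hr_data
  MODIFY COLUMN fname
  SET TAG SNOWFLAKE.CORE.SEMANTIC_CATEGORY = 'NAME';
```

Option A suits cases where the suggested categories are accepted as-is. Option B is useful when you want to review the suggestions and tag columns selectively.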

Classification Use Cases

Once the tags produced by classification have been assigned to a table, view, or column, they can be used to enable a variety of data governance, sharing, and privacy use cases, including:

PII Classification

You can use classification to identify PII (Personally Identifiable Information) in your data to mitigate risk and meet compliance requirements.

Data Access

You can use classification tags to configure security controls to prevent unauthorized access to personal data.

Policy Management

You can use classification tags to determine how to set masking policies to protect the privacy of the data.

Anonymization

You can use classification to streamline anonymization of personal data. Anonymization relies on classification privacy categories to protect the identity of the associated subjects while still making their data available for analysis.
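As a rough illustration of the policy management use case listed above, the following sketch masks a column that classification has tagged as an identifier. The policy name, role name, table, and column are all hypothetical, and the masking rule itself is only one possible choice.

```sql
-- Hypothetical policy, role, table, and column names.

-- Show the real value only to a privileged role; mask it for everyone else.
CREATE MASKING POLICY my_db.my_schema.mask_identifier AS (val STRING)
  RETURNS STRING ->
  CASE
    WHEN CURRENT_ROLE() IN ('HR_ADMIN') THEN val
    ELSE '***MASKED***'
  END;

-- Attach the policy to a column whose PRIVACY_CATEGORY tag is IDENTIFIER.
ALTER TABLE my_db.my_schema.hr_data
  MODIFY COLUMN fname
  SET MASKING POLICY my_db.my_schema.mask_identifier;
```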

Supported Table / View Types and Column Data Types

Snowflake supports classifying data stored in all types of tables and views, including:

  • External tables

  • Materialized views

  • Secure views

Classification can be performed on table/view columns of all data types except the following:

  • GEOGRAPHY

  • BINARY

  • VARIANT

If a table/view contains columns that are not of a supported data type, or columns that contain only NULL values, the classification process ignores those columns and does not include them in the output.

Important

If your data represents missing values with something other than NULL (for example, a placeholder string), the accuracy of the classification results may be affected.
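One possible way to limit this effect is to classify a view that converts the placeholder values back to NULL. The sketch below is an assumption, not a documented recommendation: the object names, the column names, and the placeholder values 'N/A' and -1 are all hypothetical.

```sql
-- Hypothetical names and placeholder values; adjust them to match your data.
CREATE OR REPLACE VIEW my_db.my_schema.hr_data_clean AS
SELECT
  fname,
  NULLIF(phone, 'N/A') AS phone,  -- treat the 'N/A' placeholder as NULL
  NULLIF(age, -1)      AS age     -- treat the -1 placeholder as NULL
FROM my_db.my_schema.hr_data;

-- Classify the view instead of the base table.
SELECT EXTRACT_SEMANTIC_CATEGORIES('my_db.my_schema.hr_data_clean');
```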

Compute Costs

The classification process requires compute resources, which are provided by the virtual warehouse that is in use and running when classification is performed.

The amount of time needed to classify the data in a table/view (and, therefore, the number of credits consumed by the warehouse) is a function of the amount of data to be classified.

In particular, a table/view with a large number of columns that support classification can take longer to process. However, as a general rule, processing speed scales linearly with warehouse size. In other words, each size increase for a warehouse (e.g. X-Small to Small) typically halves the processing time.
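If classification of a wide table is too slow, one option is to resize the warehouse for the duration of the run. The following is a minimal sketch, assuming a hypothetical warehouse named classify_wh and a hypothetical table name.

```sql
-- Hypothetical warehouse and table names.

-- Move up one size before classifying a wide table.
ALTER WAREHOUSE classify_wh SET WAREHOUSE_SIZE = 'SMALL';
USE WAREHOUSE classify_wh;

SELECT EXTRACT_SEMANTIC_CATEGORIES('my_db.my_schema.hr_data');

-- Return to the original size once classification completes.
ALTER WAREHOUSE classify_wh SET WAREHOUSE_SIZE = 'XSMALL';
```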

Classification Categories

Snowflake utilizes two category types for classifying data in table/view columns:

  • Semantic categories

  • Privacy categories

Semantic Categories

A semantic category identifies a column as storing personal attributes. Some of the semantic categories supported by Snowflake include:

  • Name

  • Address

  • Zip code

  • Phone number (currently US numbers only)

  • Age

  • Gender

For a complete list of the semantic categories supported in the current release, see Category Tag Values and Mappings. Additional semantic categories will be added in future releases.

Privacy Categories

If a column is determined to have a semantic category, the column is further classified according to one of the following privacy categories:

Identifier

Also known as direct identifiers, these attributes uniquely identify an individual (e.g. name, social security number, or phone number).

Quasi-identifier

Also known as indirect identifiers, these attributes, when combined with other attributes, can be used to uniquely identify an individual (e.g. age + gender + zip).

Sensitive

Personal attributes that are not identifying, but are information that individuals do not want disclosed for privacy reasons (e.g. salary or medical/healthcare status).

Note

Multiple semantic categories from all three privacy categories may be considered “Sensitive Personal Data”, “Special Categories of Data”, or similar terms under laws and regulations, and may require additional protections or controls.

Currently, classification does not tag data as both sensitive and identifying. In other words, classification is an “either-or” operation, which you must consider when creating rules to govern access to data identified as sensitive.

Semantic Category Probabilities and Alternates

In addition to identifying the semantic category and privacy category for a column, Snowflake also returns the following information about the semantic category for the column:

  • The probability that the classification process derived the correct semantic category.

  • A list of alternate semantic categories with which the column can be tagged. Alternates are returned only if the probability is below the 0.80 threshold and the process identified other possible semantic categories with a probability greater than 0.15.

For more details, see the EXTRACT_SEMANTIC_CATEGORIES function.
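To review the probabilities and alternates column by column, the JSON output can be flattened into rows. This sketch assumes the output uses the keys "semantic_category", "privacy_category", and "extra_info" (the last containing "probability" and "alternates"). The table name is hypothetical.

```sql
-- Hypothetical table name; the JSON key names are assumed from the function's output.
SELECT
  f.key::VARCHAR                                   AS column_name,
  f.value:"semantic_category"::VARCHAR             AS semantic_category,
  f.value:"privacy_category"::VARCHAR              AS privacy_category,
  f.value:"extra_info":"probability"::NUMBER(10,2) AS probability,
  f.value:"extra_info":"alternates"::VARIANT       AS alternates
FROM TABLE(
  FLATTEN(EXTRACT_SEMANTIC_CATEGORIES('my_db.my_schema.hr_data')::VARIANT)
) AS f
ORDER BY probability;
```

Sorting by probability puts the least certain classifications first, which are the ones most likely to need manual review before tags are applied.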

System Tags

Classification utilizes pre-defined system tags for the semantic and privacy categories:

  • For the SEMANTIC_CATEGORY tag, the possible tag values are the semantic categories (NAME, AGE, etc.). For the complete list of possible semantic category values, see Category Tag Values and Mappings.

  • For the PRIVACY_CATEGORY tag, the possible tag values are the privacy categories (IDENTIFIER, QUASI_IDENTIFIER, or SENSITIVE).

The system tags are stored in the CORE schema in the SNOWFLAKE read-only shared database. To view the tag names, use the SHOW TAGS command.

For example:

USE SCHEMA SNOWFLAKE.CORE;

SHOW TAGS;

To view the values assigned to the system tags after the tags have been extracted, see Viewing and Tracking Classification Data.
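As a quick check on a single column, the tag values can also be read with the SYSTEM$GET_TAG function. The fully qualified column name below is hypothetical.

```sql
-- Hypothetical table and column names.
SELECT
  SYSTEM$GET_TAG('SNOWFLAKE.CORE.SEMANTIC_CATEGORY',
                 'my_db.my_schema.hr_data.fname', 'COLUMN') AS semantic_category,
  SYSTEM$GET_TAG('SNOWFLAKE.CORE.PRIVACY_CATEGORY',
                 'my_db.my_schema.hr_data.fname', 'COLUMN') AS privacy_category;
```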
