Custom Data Classification

This topic introduces the concepts behind custom data classification in Snowflake.

Overview

Snowflake provides the CUSTOM_CLASSIFIER class in the SNOWFLAKE.DATA_PRIVACY schema to enable data engineers to extend their data classification capabilities based on their own knowledge of their data. After you create an instance of the class, you can call a method on the instance to define your own semantic category, specify the privacy category, and specify regular expressions to match column value patterns while optionally matching the column name.

By creating and using custom classification instances, you can:

  • Accelerate your data classification efforts.

  • Define industry- and domain-specific tags for columns containing sensitive data.

  • Gain more control over how you track PII in Snowflake.

Considerations

Choose a warehouse that matches the size of the data you are classifying. For more information, see Compute costs.

About the custom classification algorithm

Custom classification uses its own algorithm, which is distinct from the algorithm that Data Classification uses. Keeping the two algorithms separate helps ensure stable results for whichever approach you use to classify your data.

The custom classification algorithm uses a scoring rule to determine which semantic category system tag to recommend and which semantic category tags, if any, to suggest as alternatives. The scoring logic evaluates the regular expressions that you add to your instance, which you specify by calling the custom_classifier!ADD_REGEX method on your instance.

The scoring rule uses a default threshold of 0.8, which corresponds to high confidence in the recommended tag: at least 80 percent of the sampled values must match the regular expressions that you add to the instance. The algorithm compares the score for a column against the threshold and recommends either your custom category or a Snowflake category, as summarized in the table in this section.

Note

It is possible for two custom classifiers to have the same score. In this case, a tie is resolved by evaluating the following:

  • Match percentage between respective custom categories.

  • Alphabetical order between the names of the custom categories.

In such a case, the winning category is the recommended category, and the remaining categories are listed as alternates.
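The tie-breaking order in this note can be sketched in Python. This is an illustration only, not Snowflake's implementation: the Candidate type, the field names, and the sample category names are hypothetical, and the sketch assumes that a higher match percentage wins and that the alphabetically first name wins when the match percentages are also equal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one custom category competing for a column."""
    name: str         # custom category name
    score: float      # score produced by the scoring rule
    match_pct: float  # share of sampled values that matched


def pick(candidates):
    """Return (recommended, alternates) for a non-empty list of candidates.

    Assumed ordering: highest score first, then highest match percentage,
    then category name in ascending alphabetical order.
    """
    ranked = sorted(candidates, key=lambda c: (-c.score, -c.match_pct, c.name))
    return ranked[0], ranked[1:]


# Three categories with the same score; two also share a match percentage.
candidates = [
    Candidate("ICD_10_CODES", 0.9, 0.85),
    Candidate("BILLING_CODES", 0.9, 0.85),
    Candidate("CLAIM_IDS", 0.9, 0.92),
]
recommended, alternates = pick(candidates)
print(recommended.name, [c.name for c in alternates])
```

Under these assumptions, CLAIM_IDS wins on match percentage, and the two remaining categories become alternates in alphabetical order.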

The following table summarizes the scoring algorithm and the recommended tag:

Name matcher provided   Value matches >= threshold   Name matches     Recommendation
---------------------   --------------------------   --------------   ------------------
True                    True                         True             Custom category
True                    False                        True             Snowflake category
True                    True                         False            Snowflake category
True                    False                        False            Snowflake category
False                   True                         Not applicable   Custom category
False                   False                        Not applicable   Snowflake category
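The table can be read as a small decision function. The following Python sketch is a simplified model under stated assumptions, not Snowflake's implementation: match_fraction and recommend are hypothetical helper names, the score is assumed to be the fraction of non-null sampled values that match the value regular expression, and both regular expressions are assumed to require a full match.

```python
import re

DEFAULT_THRESHOLD = 0.8  # default threshold described in this topic


def match_fraction(values, value_regex):
    """Fraction of non-null sampled values that fully match value_regex."""
    sample = [v for v in values if v is not None]
    if not sample:
        return 0.0
    pattern = re.compile(value_regex)
    hits = sum(1 for v in sample if pattern.fullmatch(str(v)))
    return hits / len(sample)


def recommend(values, value_regex, column_name, name_regex=None,
              threshold=DEFAULT_THRESHOLD):
    """Apply the table: the custom category is recommended only when the
    value check passes and the name check passes or is not applicable."""
    value_ok = match_fraction(values, value_regex) >= threshold
    if name_regex is None:
        name_ok = True  # no name matcher provided: "Not applicable" rows
    else:
        name_ok = re.fullmatch(name_regex, column_name) is not None
    return "Custom category" if value_ok and name_ok else "Snowflake category"


# Four of the five sampled values match, so the fraction equals the threshold.
values = ["A01", "B12", "C34", "D56", "n/a"]
value_regex = r"[A-Z][0-9]{2}"

print(recommend(values, value_regex, "DIAGNOSIS_CODE", name_regex=r"DIAG.*"))
print(recommend(values, value_regex, "DIAGNOSIS_CODE", name_regex=r"ICD.*"))
print(recommend(values, value_regex, "DIAGNOSIS_CODE"))
```

In this model, the three calls correspond to the first, third, and fifth rows of the table: the custom category is recommended in the first and last calls, and the Snowflake category in the second because the column name does not match.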