Custom Data Classification¶
This topic introduces the concepts behind custom data classification in Snowflake.
Overview¶
Snowflake provides the CUSTOM_CLASSIFIER class in the SNOWFLAKE.DATA_PRIVACY schema to enable data engineers to extend their data classification capabilities based on their own knowledge of their data. After you create an instance of the class, you can call a method on the instance to define your own semantic category, specify the privacy category, and specify regular expressions to match column value patterns while optionally matching the column name.
By creating and using custom classification instances, you can:
Accelerate your data classification efforts.
Define industry- and domain-specific tags for columns containing sensitive data.
Use Snowflake to gain more control over how you track PII.
Considerations¶
Choose a warehouse that matches the size of the data you are classifying. For more information, see Compute costs.
About the custom classification algorithm¶
Custom classification uses a different algorithm from the one used for Data Classification. Keeping the two algorithms separate ensures stable results whichever way you choose to classify your data.
The custom classification algorithm uses a scoring rule to determine which semantic category system tag to recommend and which semantic category tags, if any, to suggest as alternatives. The scoring logic evaluates the regular expressions that you add to your instance, which you specify by calling the custom_classifier!ADD_REGEX method on your instance.
The scoring rule uses a default threshold value of 0.8, which equates to high confidence in the recommended tag: eighty percent of the data in the sample must match the regular expressions that you add to the instance. The algorithm compares the score for a column against the threshold value and recommends a tag that corresponds to one of the following:
Custom classifier tag.
You can specify the threshold value for a custom classification instance by calling the custom_classifier!ADD_REGEX method on the instance.
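The threshold check described above can be sketched in Python. This is only an illustration of the idea, not Snowflake's implementation: the helper names `match_fraction` and `meets_threshold`, the sample values, and the pattern are hypothetical, and the sketch assumes that null values are ignored and that a value must match the regular expression in full.

```python
import re


def match_fraction(values, value_regex):
    """Return the fraction of non-null sample values that fully match value_regex."""
    pattern = re.compile(value_regex)
    # Assumption: null values are excluded from the sample before scoring.
    candidates = [v for v in values if v is not None]
    if not candidates:
        return 0.0
    hits = sum(1 for v in candidates if pattern.fullmatch(v))
    return hits / len(candidates)


def meets_threshold(values, value_regex, threshold=0.8):
    """True when the match fraction reaches the threshold (default 0.8)."""
    return match_fraction(values, value_regex) >= threshold


# Hypothetical column sample and a hypothetical pattern for code-like values.
sample = ["A01.1", "B20", "C34.90", "J45", "not-a-code"]
icd_like = r"[A-Z][0-9]{2}(\.[0-9]{1,2})?"

print(match_fraction(sample, icd_like))
print(meets_threshold(sample, icd_like))
print(meets_threshold(sample, icd_like, threshold=0.9))
```

Four of the five sample values match, so the column just reaches the default threshold of 0.8 and would fall short of a stricter threshold such as 0.9.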
Note
It is possible for two custom classifiers to have the same score. In this case, a tie is resolved by evaluating the following:
Match percentage between respective custom categories.
Alphabetical order between the names of the custom categories.
The winning category becomes the recommended category, and the remaining categories are listed as alternates.
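The tie-breaking order in this note can be pictured as a multi-key sort. The sketch below is hypothetical: the `Candidate` type and its fields are invented for illustration, and it assumes that a higher match percentage wins and that names are compared in ascending alphabetical order, neither of which is stated explicitly above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str         # custom category name
    score: float      # score assigned by the scoring rule
    match_pct: float  # share of sample values that matched


def rank(candidates):
    # Assumed order: higher score, then higher match percentage, then name A-Z.
    return sorted(candidates, key=lambda c: (-c.score, -c.match_pct, c.name))


def recommend(candidates):
    """Return (recommended category name, list of alternate names)."""
    ordered = rank(candidates)
    if not ordered:
        return None, []
    return ordered[0].name, [c.name for c in ordered[1:]]


# Three hypothetical categories that all share the same score.
tied = [
    Candidate("ZIP_PLUS_4", 1.0, 0.90),
    Candidate("ACCOUNT_ID", 1.0, 0.90),
    Candidate("MEMBER_ID", 1.0, 0.95),
]
recommended, alternates = recommend(tied)
print(recommended, alternates)
```

Under these assumptions, MEMBER_ID wins on match percentage, and the remaining two categories are ordered alphabetically as alternates.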
The following table summarizes the scoring algorithm and the recommended tag:
Name Matcher Provided | Value matches >= threshold | Name matches | Recommendation |
---|---|---|---|
True | True | True | Custom category |
True | False | True | Snowflake category |
True | True | False | Snowflake category |
True | False | False | Snowflake category |
False | True | Not applicable | Custom category |
False | False | Not applicable | Snowflake category |
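The table above can be restated as a small decision function. This is a sketch of the table's logic only, not Snowflake's code. The function name `recommend_category` and its parameters are hypothetical, and `name_matches` is left as `None` when no name matcher is provided (the "Not applicable" rows).

```python
from typing import Optional


def recommend_category(
    name_matcher_provided: bool,
    value_matches: bool,
    name_matches: Optional[bool] = None,
) -> str:
    """Return the recommendation for one row of the scoring table."""
    # Values below the threshold always fall back to the Snowflake category.
    if not value_matches:
        return "Snowflake category"
    # If a name matcher was provided, the column name must match as well.
    if name_matcher_provided and not name_matches:
        return "Snowflake category"
    return "Custom category"


# Walk through the six rows of the table.
rows = [
    (True, True, True),
    (True, False, True),
    (True, True, False),
    (True, False, False),
    (False, True, None),
    (False, False, None),
]
for row in rows:
    print(row, "->", recommend_category(*row))
```

In short, the custom category is recommended only when the value matches reach the threshold and, where a name matcher exists, the column name matches too.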
Replication and cloning¶
Instances of the CUSTOM_CLASSIFIER class are replicated when you replicate a database.
Instances of the CUSTOM_CLASSIFIER class are cloned when you clone the schema that contains the instances.