Using Classification

This topic provides a general overview of the classification workflow, as well as detailed examples of using classification to generate and apply tags to database tables, views, and columns.

In this Topic:

Classification Workflow

Classifying data in a table or view involves the following tasks:

  1. Delegate classification and tagging privileges:

    An account administrator (i.e. user with the ACCOUNTADMIN role) grants the necessary privileges to one or more custom roles (e.g. data_engineer) to apply tags.

  2. Analyze columns in the table/view:

    A data engineer (i.e. user with the data_engineer role) uses the EXTRACT_SEMANTIC_CATEGORIES function to analyze a table/view with no previous classification results. Snowflake generates semantic and privacy categories for the supported columns in the table/view.

  3. Classify the table/view columns by applying tags:

    The data engineer annotates the columns according to their semantic and privacy categories:

    Tip

    You can combine steps 2 and 3 by calling ASSOCIATE_SEMANTIC_CATEGORY_TAGS and passing it the output from EXTRACT_SEMANTIC_CATEGORIES.

    You can also flatten the output of EXTRACT_SEMANTIC_CATEGORIES and store it in a database table to make it easier to modify before applying and/or reuse.

  4. Monitor and maintain the classification tags for updated and new data:

    The tasks performed depend on whether data or columns are added/modified for the table/view:

    • If data is added or removed, the data engineer executes the EXTRACT_SEMANTIC_CATEGORIES function again on the table/view. Snowflake generates new semantic and privacy category results.

      The data engineer then compares the new results to existing results and manually modifies the tags accordingly or can use the ASSOCIATE_SEMANTIC_CATEGORY_TAGS stored procedure to apply the new results which will overwrite the current tags.

    • If new columns are added, the data engineer executes the EXTRACT_SEMANTIC_CATEGORIES function again on the table/view.

      The data engineer then manually annotates the new columns according to their semantic and privacy categories or uses the ASSOCIATE_SEMANTIC_CATEGORY_TAGS stored procedure.

Note

As noted earlier, users can choose not to use (or only selectively use) the results returned by the EXTRACT_SEMANTIC_CATEGORIES function.

In addition, users can choose to manually apply the classification categories provided by Snowflake, or define and apply their own set of classification tags.

For an example of manually applying tags, see Manually Apply Classification Tags to a Table / View.

Using Classification Tags

Create Tables and Roles Used in These Examples

Most examples in this topic use the table and custom roles shown below:

  • Tables

    • hr_data, which contains data about employees.

  • Roles

    • data_engineer, which can generate and apply classification tags.

    • policy_admin, which is responsible for ensuring personal information is masked.

    • analyst, an example of a role for which you might want to restrict access. (The sample code below assumes this role already exists.)

Create the table:

CREATE TABLE hr_data (
    age INTEGER,
    email_address VARCHAR,
    fname VARCHAR,
    lname VARCHAR
    );

Load the table (details not shown).

Create the custom roles and grant the necessary privileges to those roles:

USE ROLE ACCOUNTADMIN;

CREATE ROLE data_engineer;

GRANT IMPORTED PRIVILEGES ON DATABASE snowflake TO ROLE data_engineer;

GRANT USAGE ON DATABASE my_db TO ROLE data_engineer;

GRANT USAGE ON SCHEMA my_db.my_schema TO ROLE data_engineer;

GRANT SELECT, UPDATE ON TABLE my_db.my_schema.hr_data TO ROLE data_engineer;

GRANT APPLY TAG ON ACCOUNT TO ROLE data_engineer;

CREATE ROLE policy_admin;

Use Classification Tags to Apply a Masking Policy

This example illustrates using classification to find personally identifiable data in columns in a table, tag the columns, and then use the tags to apply a masking policy to the data.

In this example:

  • The data_engineer custom role performs the following tagging-related tasks:

  • The policy_admin custom role performs the following masking-related tasks:

    • Uses the TAG_REFERENCES view to find the columns that have the IDENTIFIER privacy tag applied.

    • Applies the identifier_mask masking policy to the fname column.

Note

The latency for the TAG_REFERENCES view may be up to 120 minutes. If you need results sooner and know the name of the column for which you are querying the classification tags, you can use the TAG_REFERENCES or TAG_REFERENCES_ALL_COLUMNS table functions instead.

For an example, see Viewing and Tracking Classification Data (in this topic).

Use the data_engineer role to generate the classification tags:

USE ROLE data_engineer;

SELECT EXTRACT_SEMANTIC_CATEGORIES('my_db.my_schema.hr_data');

If the tags are acceptable, the data_engineer role can apply them to the table:

CALL ASSOCIATE_SEMANTIC_CATEGORY_TAGS('my_db.my_schema.hr_data',
                                      EXTRACT_SEMANTIC_CATEGORIES('my_db.my_schema.hr_data'));

If the stored procedure executes successfully, it returns a message similar to the following: Applied tag semantic_category to <n> columns. Applied tag privacy_category to <n> columns.

Use the policy_admin role to find the columns that have the IDENTIFIER privacy tag applied:

USE ROLE policy_admin;

SELECT * FROM SNOWFLAKE.ACCOUNT_USAGE.TAG_REFERENCES
    WHERE TAG_NAME = 'PRIVACY_CATEGORY'
    AND TAG_VALUE = 'IDENTIFIER';

Use the policy_admin role to apply masking policies to the appropriate columns. For example, the following statement applies the identifier_mask policy to the fname column:

ALTER TABLE my_db.my_schema.hr_data MODIFY COLUMN fname SET MASKING POLICY identifier_mask;

Use Classification Tags to Revoke Access

This example continues the example from Use Classification Tags to Apply a Masking Policy (in this topic), but uses the SECURITYADMIN system role to manage access privileges based on classification tags.

In this example:

  • The SECURITYADMIN role uses the TAG_REFERENCES view to find all the columns with the IDENTIFIER privacy tag.

  • The SECURITYADMIN role revokes SELECT access on the tables (containing the columns) from the analyst custom role.

Note

You do not need to use the SECURITYADMIN system role to perform these tasks; you can use any custom role that has been assigned the necessary privileges.

USE ROLE SECURITYADMIN;

SELECT * FROM SNOWFLAKE.ACCOUNT_USAGE.TAG_REFERENCES
    WHERE TAG_NAME = 'PRIVACY_CATEGORY'
    AND TAG_VALUE= 'IDENTIFIER';

REVOKE SELECT ON TABLE my_db.my_schema.hr_data FROM ROLE analyst;

Manually Apply Classification Tags to a Table / View

This example continues the example from Use Classification Tags to Apply a Masking Policy (in this topic), but assumes the generated classification categories were not applied to the table.

In this example:

  • After generating classification categories for the table, the data_engineer role reviews the results and decides not to use the stored procedure to apply the categories.

  • Instead, the data_engineer role uses the ALTER TABLE … MODIFY COLUMN command to manually apply the semantic and privacy category tags to the fname column in the table.

USE ROLE data_engineer;

ALTER TABLE my_db.my_schema.hr_data MODIFY COLUMN fname SET TAG SNOWFLAKE.CORE.SEMANTIC_CATEGORY='NAME';

ALTER TABLE my_db.my_schema.hr_data MODIFY COLUMN fname SET TAG SNOWFLAKE.CORE.PRIVACY_CATEGORY='IDENTIFIER';

Classify All Tables in a Schema or Database

Use Stored Procedures to Classify All Tables in a Schema

This example shows how to classify all the tables in a schema using two stored procedures that you create:

  • classify_schema: Lists all the tables in the schema, creates a table to store the classification results, and then extracts the classification tags from each table and stores them in the results table.

  • associate_tag_batch: Associates the tags in the results table with all the tables in the schema.

Important

The stored procedure named classify_schema creates a temporary table to store results. The temporary table exists for the duration of the user session for the user that calls this stored procedure. When the user session expires, Snowflake drops the temporary table and the user needs to call the stored procedure again to recreate the temporary table.

If the temporary table needs to be preserved, remove the temp keyword from the sqlText command to create the table.

For more details, see the TEMP[ORARY] option in the CREATE TABLE command.

Create the first procedure (classify_schema):

create or replace procedure classify_schema(schema_name string, result_table string)
returns object language JavaScript
as $$
// get list of tables in schema
const table_names = snowflake.execute({
  sqlText: `show terse tables in schema identifier(?)`,
  binds: [SCHEMA_NAME],
});

// helper function for quoted table names
function quote(name) {
  return '"'+ name.replace(/"/g,'""') + '"';
}

// create table to store results in. if it already exists, we will add to it rather than overwrite
snowflake.execute({
    sqlText: `create temp table if not exists identifier(?) (table_name string, result variant)`,
    binds: [RESULT_TABLE],
})
// loop through tables
while (table_names.next()) {
  let name = table_names.getColumnValue('name');
  // add schema to table name
  name = SCHEMA_NAME + "." + quote(name);
  // insert qualified table name and result into result table
  const results = snowflake.execute({
    sqlText: `insert into identifier(?) select ?, extract_semantic_categories(?)`,
    binds: [RESULT_TABLE, name, name],
  });
}
// return the number of tables classified
return {tables_classified: table_names.getRowCount()};
$$;

Create the second procedure (associate_tag_batch):

create or replace procedure associate_tag_batch(result_table string)
returns Object language JavaScript
as $$
// get table names and classification results to loop through
const tags_to_apply = snowflake.execute({
  sqlText: `select table_name, result from identifier(?)`,
  binds: [RESULT_TABLE],
});

const out = {};
while (tags_to_apply.next()) {
  // get table name
  const name = tags_to_apply.getColumnValue('TABLE_NAME');
  // get classification result
  const classification_results = tags_to_apply.getColumnValue('RESULT');
  // call associate semantic category tags with table name and classification result
  const results = snowflake.execute({
    sqlText: `call associate_semantic_category_tags(?, parse_json(?))`,
    binds: [name, JSON.stringify(classification_results)],
  });
  results.next();
  out[name] = results.getColumnValue(1).split('\n');
}
// return number of tags applied per table
return out;
$$;

Then, perform the following steps:

  1. Call the classify_schema stored procedure with the name of the schema you would like to classify and the name of a temporary table to hold the results of EXTRACT_SEMANTIC_CATEGORY for each table:

    call classify_schema('my_db.my_schema','my_temporary_classification_table');
    
  2. Review the results in the temporary table and modify as necessary.

  3. When satisfied with the results, call the associate_tag_batch stored procedure to apply the tags:

    call associate_tag_batch('my_temporary_classification_table');
    

Use Stored Procedures to Classify All Tables in a Database

This example shows how to classify all the tables in a database by using two stored procedures:

  • classify_database (new for this example).

  • associate_tag_batch (created in the example in the previous section).

Create the classify_database stored procedure:

create or replace procedure classify_database(database_name string, result_table string)
returns Object language JavaScript
as $$
// get list of schemas in database
const schema_names = snowflake.execute({
  sqlText: `show terse schemas in database identifier(?)`,
  binds: [DATABASE_NAME],
});

// helper function for quoted schema names
function quote(name) {
  return '"'+ name.replace(/"/g,'""') + '"';
}

// counter for tables. will use result from classify_schema to increment
let table_count = 0
while (schema_names.next()) {
  let name = schema_names.getColumnValue('name');
  // skip the information schema
  if (name == "INFORMATION_SCHEMA") {
    continue;
  }
  // add database name to schema
  name = DATABASE_NAME + "." + quote(name);
  // call classify_schema on each schema. This will loop over tables in schema
  const results = snowflake.execute({
    sqlText: `call classify_schema(?, ?)`,
    binds: [name, RESULT_TABLE],
  });
  results.next();
  // increment total number of tables by the number of tables in the schema
  table_count += results.getColumnValue(1).tables_classified ;
}

return {
    tables_classified: table_count,
    // subtract one from number of schemas because we skip the information schema
    schemas_classified: schema_names.getRowCount() - 1,
};
$$;

Then, perform the following steps:

  1. Call the classify_database stored procedure with the name of the database you wish to classify and the name of the temporary table to store the results in each schema of the database:

    call classify_database('my_db','my_temporary_classification_table');
    
  2. Navigate to each schema and review the temporary table and revise if necessary.

  3. When satisfied with the results, call the associate_tag_batch stored procedure once for each schema to apply the tags to the tables in that schema:

    call associate_tag_batch('my_temporary_classification_table');
    

Converting Classification Output from Semi-structured to Structured Format

The semantic classification function EXTRACT_SEMANTIC_CATEGORIES returns a VARCHAR in JSON format, which is a semi-structured format. If you want to convert the output to a structured (tabular) format, use the FLATTEN table function, as shown below.

These examples use the table created in Create Tables and Roles Used in These Examples (in this topic).

The following shows the output of the EXTRACT_SEMANTIC_CATEGORIES function without flattening:

SELECT EXTRACT_SEMANTIC_CATEGORIES('hr_data');
+---------------------------------------------+
| EXTRACT_SEMANTIC_CATEGORIES('HR_DATA')      |
|---------------------------------------------|
| {                                           |
|   "AGE": {                                  |
|     "extra_info": {                         |
|       "alternates": [],                     |
|       "probability": "1.00"                 |
|     },                                      |
|     "privacy_category": "QUASI_IDENTIFIER", |
|     "semantic_category": "AGE"              |
|   },                                        |
|   "EMAIL_ADDRESS": {                        |
|     "extra_info": {                         |
|       "alternates": [],                     |
|       "probability": "1.00"                 |
|     },                                      |
|     "privacy_category": "IDENTIFIER",       |
|     "semantic_category": "EMAIL"            |
|   },                                        |
|   "FNAME": {                                |
|     "extra_info": {                         |
|       "alternates": [],                     |
|       "probability": "1.00"                 |
|     },                                      |
|     "privacy_category": "IDENTIFIER",       |
|     "semantic_category": "NAME"             |
|   },                                        |
|   "LNAME": {                                |
|     "extra_info": {                         |
|       "alternates": [],                     |
|       "probability": "0.97"                 |
|     },                                      |
|     "privacy_category": "IDENTIFIER",       |
|     "semantic_category": "NAME"             |
|   }                                         |
| }                                           |
+---------------------------------------------+

The following shows the output of the EXTRACT_SEMANTIC_CATEGORIES function with flattening:

SELECT
    f.key::varchar as column_name,
    f.value:"privacy_category"::varchar as privacy_category,  
    f.value:"semantic_category"::varchar as semantic_category,
    f.value:"extra_info":"probability"::number(10,2) as probability,
    f.value:"extra_info":"alternates"::variant as alternates
  FROM
  TABLE(FLATTEN(EXTRACT_SEMANTIC_CATEGORIES('hr_data')::VARIANT)) AS f;
+---------------+------------------+-------------------+-------------+------------+
| COLUMN_NAME   | PRIVACY_CATEGORY | SEMANTIC_CATEGORY | PROBABILITY | ALTERNATES |
|---------------+------------------+-------------------+-------------+------------|
| AGE           | QUASI_IDENTIFIER | AGE               |        1.00 | []         |
| EMAIL_ADDRESS | IDENTIFIER       | EMAIL             |        1.00 | []         |
| FNAME         | IDENTIFIER       | NAME              |        1.00 | []         |
| LNAME         | IDENTIFIER       | NAME              |        0.97 | []         |
+---------------+------------------+-------------------+-------------+------------+

Note that the values in some fields, such as the PROBABILITY field, vary depending upon the characteristics of the data and the sample size.

Viewing and Tracking Classification Data

Snowflake provides Account Usage and Information Schema views and functions that can be used for viewing the semantic and privacy tags generated by classification:

  • To obtain a list of all the classification tags that have been associated with columns, you can query the TAG_REFERENCES view (in Account Usage); however, note the latency on the view may be up to 120 minutes.

    For example:

    SELECT * FROM SNOWFLAKE.ACCOUNT_USAGE.TAG_REFERENCES
        WHERE TAG_NAME = 'PRIVACY_CATEGORY'
        ORDER BY OBJECT_DATABASE, OBJECT_SCHEMA, OBJECT_NAME, COLUMN_NAME;
    
  • If you know the name of the specific table/view, you can use the TAG_REFERENCES_ALL_COLUMNS table function (in Information Schema), which does not have any latency.

    The function returns a list of all the classification tags on the columns in the specified table/view.

  • If you know the name of the specific table/view column for which you wish to view the classification tags, you can use the TAG_REFERENCES table function (in Information Schema), which also does not have any latency.

    For example:

    SELECT * FROM
        TABLE(my_db.information_schema.tag_references('my_db.my_schema.hr_data.fname', 'COLUMN'));
    
Back to top