Use Data Classification with the classic APIs

This topic provides examples of classifying data with the classic APIs in Snowflake: EXTRACT_SEMANTIC_CATEGORIES and ASSOCIATE_SEMANTIC_CATEGORY_TAGS. These APIs are no longer being updated to coincide with updates to Data Classification.

You can continue to use these APIs. However, you should update your workflows to use the approaches shown in Use Data Classification.

Overview

The following examples for classifying data address different use cases based on scope:

  • Classify a single table.

  • Classify all tables in a schema.

  • Classify all tables in a database.

If you are new to Classification, start with classifying a single table and then proceed to the other two examples.

Common table and roles

The examples in this topic use the table and custom roles shown below:

  • Table: my_db.my_schema.hr_data, which contains data about employees.

  • Roles:

    • data_engineer: classifies tables.

    • policy_admin: protects personal information with a masking policy.

    • analyst: serves as an example of a custom role for which you might want to restrict access. The examples assume this role exists.

    • Database roles:

      • GOVERNANCE_ADMIN: assigns the Classification system tags.

      • GOVERNANCE_VIEWER: queries the Account Usage views relating to tags and masking policies.

  1. Create the table:

    CREATE OR REPLACE TABLE hr_data (
        age INTEGER,
        email_address VARCHAR,
        fname VARCHAR,
        lname VARCHAR
        );
    
  2. Load the table (details not shown).
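    For example, you might load a few rows of sample data. The values below are purely illustrative:

```sql
INSERT INTO my_db.my_schema.hr_data (age, email_address, fname, lname) VALUES
  (42, 'jdoe@example.com', 'John', 'Doe'),
  (35, 'asmith@example.com', 'Alice', 'Smith');
```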

  3. Create the custom roles and grant the necessary privileges to those roles.

  4. The data_engineer role needs access to the table to run the Classification process. Note that this example grants the GOVERNANCE_ADMIN database role to the data_engineer role because you cannot set a database role as your current role:

    use role accountadmin;
    
    create role data_engineer;
    
    grant usage on database my_db to role data_engineer;
    
    grant usage on schema my_db.my_schema to role data_engineer;
    
    grant select, update on table my_db.my_schema.hr_data to role data_engineer;
    
    grant database role snowflake.governance_admin to role data_engineer;
    
  5. The policy_admin custom role needs to assign a masking policy to any column containing PII data:

    use role accountadmin;
    create role policy_admin;
    grant apply masking policy on account to role policy_admin;
    grant database role snowflake.governance_viewer to role policy_admin;
    
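The examples assume that the analyst custom role already exists. If you need to create it for testing, a minimal sketch might look like the following; the specific grants are illustrative, and the SELECT grant assumes the hr_data table has already been created:

```sql
USE ROLE accountadmin;

CREATE ROLE analyst;

-- give the analyst role read access to the example table
GRANT USAGE ON DATABASE my_db TO ROLE analyst;
GRANT USAGE ON SCHEMA my_db.my_schema TO ROLE analyst;
GRANT SELECT ON TABLE my_db.my_schema.hr_data TO ROLE analyst;
```

The single-table example later revokes this SELECT privilege after the table columns are classified.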

Classify a single table

The single-table example extends the three-step Classification process (analyze, review, and apply) by using the results to revoke the analyst custom role's access to the table.

Step 1: Classify the table columns

In this step, the data_engineer runs the Classification process and the policy_admin protects the column data with a masking policy.

  1. Analyze: Use the data_engineer custom role to call the EXTRACT_SEMANTIC_CATEGORIES function to classify the columns in the table named my_db.my_schema.hr_data:

    USE ROLE data_engineer;
    
    SELECT EXTRACT_SEMANTIC_CATEGORIES('my_db.my_schema.hr_data');
    
  2. Review: The data engineer reviews the results to ensure they make sense.

  3. Apply: The data engineer assigns a Classification system tag to the columns.

    The data engineer has two options: automated assignment or manual assignment.

    To assign the system tags automatically, call the ASSOCIATE_SEMANTIC_CATEGORY_TAGS stored procedure. Note:

    • The fully-qualified name of the table and the function from the first step are arguments for the stored procedure.

    • The stored procedure reruns the EXTRACT_SEMANTIC_CATEGORIES function. If you want to preserve the results from the first step, save the results to a table prior to calling the stored procedure.

      CALL ASSOCIATE_SEMANTIC_CATEGORY_TAGS(
         'my_db.my_schema.hr_data',
          EXTRACT_SEMANTIC_CATEGORIES('my_db.my_schema.hr_data')
      );
      

      If the stored procedure executes successfully, it returns a message similar to the following: Applied tag semantic_category to <n> columns. Applied tag privacy_category to <n> columns.

    If the stored procedure fails, or if you prefer to assign the Classification system tags to individual columns manually, use an ALTER TABLE … MODIFY COLUMN statement. For example, assign either system tag to the fname column (that is, first name):

    USE ROLE data_engineer;
    
    ALTER TABLE my_db.my_schema.hr_data
      MODIFY COLUMN fname
      SET TAG SNOWFLAKE.CORE.SEMANTIC_CATEGORY='NAME';
    

    or

    ALTER TABLE my_db.my_schema.hr_data
      MODIFY COLUMN fname
      SET TAG SNOWFLAKE.CORE.PRIVACY_CATEGORY='IDENTIFIER';
    
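    To spot-check a manual assignment, you can look up the tag value on a single column with the SYSTEM$GET_TAG function. For example, to see the privacy tag currently set on the fname column:

```sql
SELECT SYSTEM$GET_TAG(
  'SNOWFLAKE.CORE.PRIVACY_CATEGORY',
  'my_db.my_schema.hr_data.fname',
  'COLUMN'
);
```

    The function returns the tag value (for example, IDENTIFIER) or NULL if the tag is not set on the column.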

Step 2: Protect the table columns

This step assumes that the PRIVACY_CATEGORY = 'IDENTIFIER' tag is assigned to several columns in the my_db.my_schema.hr_data table and that these columns need to be protected with a masking policy.

To protect these columns:

  1. Use the policy_admin role to find the columns that have the IDENTIFIER privacy tag applied:

    USE ROLE policy_admin;
    
    SELECT * FROM SNOWFLAKE.ACCOUNT_USAGE.TAG_REFERENCES
      WHERE TAG_NAME = 'PRIVACY_CATEGORY'
      AND TAG_VALUE = 'IDENTIFIER';
    

    The latency for the TAG_REFERENCES view can be up to 120 minutes. If you need results sooner and know the name of the column for which you are querying the classification tags, you can use the TAG_REFERENCES or TAG_REFERENCES_ALL_COLUMNS table functions instead.
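    For example, if you already know the column, you can query the INFORMATION_SCHEMA table function for fresher results. A sketch for the fname column might look like this:

```sql
SELECT *
  FROM TABLE(
    my_db.INFORMATION_SCHEMA.TAG_REFERENCES('my_db.my_schema.hr_data.fname', 'COLUMN')
  );
```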

  2. Use the policy_admin role to apply masking policies to the appropriate columns. For example, the following statement applies the identifier_mask masking policy to the fname column:

    ALTER TABLE my_db.my_schema.hr_data
      MODIFY COLUMN fname
      SET MASKING POLICY governance.policies.identifier_mask;
    
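    The statement above assumes that the governance.policies.identifier_mask masking policy already exists. This topic does not define the policy; a minimal sketch of such a policy might look like the following, with the role list purely illustrative:

```sql
CREATE MASKING POLICY governance.policies.identifier_mask AS
  (val STRING) RETURNS STRING ->
  CASE
    -- roles allowed to see unmasked values (illustrative)
    WHEN CURRENT_ROLE() IN ('POLICY_ADMIN') THEN val
    ELSE '*** MASKED ***'
  END;
```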

Step 3: Use system tags to revoke access

Finally, the security administrator (i.e. a user with the SECURITYADMIN role):

  • Queries the TAG_REFERENCES view to find all the columns with the IDENTIFIER privacy tag value.

  • Revokes the SELECT privilege for the analyst custom role on tables for which the PRIVACY_CATEGORY = 'IDENTIFIER' tag is set on a column:

Tip

You do not need to use the SECURITYADMIN system role to perform these tasks; you can use any custom role that has been assigned the necessary privileges.

USE ROLE SECURITYADMIN;

SELECT * FROM SNOWFLAKE.ACCOUNT_USAGE.TAG_REFERENCES
    WHERE TAG_NAME = 'PRIVACY_CATEGORY'
    AND TAG_VALUE = 'IDENTIFIER';

REVOKE SELECT ON TABLE my_db.my_schema.hr_data FROM ROLE analyst;

Classify all tables in a schema

This example shows how to classify all the tables in a schema using two user-defined stored procedures:

  • classify_schema: Lists all the tables in the schema, creates a table to store the classification results, and then extracts the classification tags from each table and stores them in the results table.

  • associate_tag_batch: Uses the results of the classify_schema stored procedure to assign the Classification system tags to all table columns in the schema automatically and returns the number of tags assigned to each table.

Important

The stored procedure named classify_schema creates a temporary table to store results. The temporary table persists for the duration of the session of the user who calls the stored procedure. When the session ends, Snowflake drops the temporary table, and the user must call the stored procedure again to recreate it.

If you need to preserve the results table across sessions, remove the temp keyword from the sqlText statement that creates the table.

For more details, see the TEMP[ORARY] option in the CREATE TABLE command.

  1. Create the first procedure named classify_schema:

    create or replace procedure classify_schema(schema_name string, result_table string)
    returns object language JavaScript
    as $$
    // get the list of tables in the schema
    const table_names = snowflake.execute({
      sqlText: `show terse tables in schema identifier(?)`,
      binds: [SCHEMA_NAME],
    });
    
    // helper function for quoted table names
    function quote(name) {
      return '"'+ name.replace(/"/g,'""') + '"';
    }
    
    // create table to store results in. if it already exists, we will add to it rather than overwrite
    snowflake.execute({
        sqlText: `create temp table if not exists identifier(?) (table_name string, result variant)`,
        binds: [RESULT_TABLE],
    })
    // loop through tables
    while (table_names.next()) {
      let name = table_names.getColumnValue('name');
      // add schema to table name
      name = SCHEMA_NAME + "." + quote(name);
      // insert qualified table name and result into result table
      const results = snowflake.execute({
        sqlText: `insert into identifier(?) select ?, extract_semantic_categories(?)`,
        binds: [RESULT_TABLE, name, name],
      });
    }
    // return the number of tables classified
    return {tables_classified: table_names.getRowCount()};
    $$;
    
  2. Create the second procedure named associate_tag_batch:

    create or replace procedure associate_tag_batch(result_table string)
    returns Object language JavaScript
    as $$
    // get table names and classification results to loop through
    const tags_to_apply = snowflake.execute({
      sqlText: `select table_name, result from identifier(?)`,
      binds: [RESULT_TABLE],
    });
    
    const out = {};
    while (tags_to_apply.next()) {
      // get table name
      const name = tags_to_apply.getColumnValue('TABLE_NAME');
      // get classification result
      const classification_results = tags_to_apply.getColumnValue('RESULT');
      // call associate semantic category tags with table name and classification result
      const results = snowflake.execute({
        sqlText: `call associate_semantic_category_tags(?, parse_json(?))`,
        binds: [name, JSON.stringify(classification_results)],
      });
      results.next();
      out[name] = results.getColumnValue(1).split('\n');
    }
    // return number of tags applied per table
    return out;
    $$;
    
  3. Call the classify_schema stored procedure, passing as arguments the name of the schema that you want to classify and the name of a temporary table to hold the results of EXTRACT_SEMANTIC_CATEGORIES for each table:

    call classify_schema('my_db.my_schema','my_temporary_classification_table');
    
  4. Review the results in the temporary table and modify as necessary.

  5. When satisfied with the results, call the associate_tag_batch stored procedure to assign the Classification system tags to the table columns:

    call associate_tag_batch('my_temporary_classification_table');
    

Classify all tables in a database

This example shows how to classify all the tables in a database by using two stored procedures:

  • classify_database: Classifies all tables within a schema for each schema in the database, and returns the number of tables and the number of schemas that are classified.

  • associate_tag_batch: Performs the same operations as defined in Classify all tables in a schema (in this topic).

  1. Create the classify_database stored procedure:

    create or replace procedure classify_database(database_name string, result_table string)
    returns Object language JavaScript
    as $$
    // get list of schemas in database
    const schema_names = snowflake.execute({
      sqlText: `show terse schemas in database identifier(?)`,
      binds: [DATABASE_NAME],
    });
    
    // helper function for quoted schema names
    function quote(name) {
      return '"'+ name.replace(/"/g,'""') + '"';
    }
    
    // counter for tables. will use result from classify_schema to increment
    let table_count = 0;
    while (schema_names.next()) {
      let name = schema_names.getColumnValue('name');
      // skip the information schema
      if (name == "INFORMATION_SCHEMA") {
        continue;
      }
      // add database name to schema
      name = DATABASE_NAME + "." + quote(name);
      // call classify_schema on each schema. This will loop over tables in schema
      const results = snowflake.execute({
        sqlText: `call classify_schema(?, ?)`,
        binds: [name, RESULT_TABLE],
      });
      results.next();
      // increment total number of tables by the number of tables in the schema
      table_count += results.getColumnValue(1).tables_classified;
    }
    
    return {
        tables_classified: table_count,
        // subtract one from number of schemas because we skip the information schema
        schemas_classified: schema_names.getRowCount() - 1,
    };
    $$;
    
  2. Call the classify_database stored procedure with the name of the database that you want to classify and the name of the temporary table that stores the results for each schema in the database:

    call classify_database('my_db','my_temporary_classification_table');
    
  3. For each schema, review the results in the temporary table and revise them if necessary.

  4. When satisfied with the results, call the associate_tag_batch stored procedure once for each schema to apply the tags to the tables in that schema. For example, if the database contains three schemas, call the stored procedure three times:

    call associate_tag_batch('my_temporary_classification_table');
    