EXTRACT_SEMANTIC_CATEGORIES

Returns a set of categories (semantic and privacy) for each supported column in the specified table or view. To return the categories for a column, the column must use a data type that supports classification and does not contain all NULL values.

The categories are derived from the metadata and data contained in the columns, as well as the metadata about the columns and data. The privacy categories rely on the generated semantic categories, if any.

Syntax

EXTRACT_SEMANTIC_CATEGORIES( '<object_name>' [ , <max_rows_to_scan> ] )

Arguments

Required:

object_name

The name of the table, external table, view, or materialized view containing the columns to be classified. If a database and schema is not in use in the current session, the name must be fully-qualified.

The name must be specified exactly as it is stored in the database. If the name contains special characters, capitalization, or blank spaces, the name must be enclosed first in double-quotes and then in single quotes.

Optional:

max_rows_to_scan

The sample size of rows to use for determining the classification categories in the specified table/view.

Valid values: 1 to 10000

Default: 10000

Returns

The function returns a single JSON object containing a tag name/value pair for each column for which classification categories were generated. If a column is an unsupported data type or contains all NULL values, the column is not included in the output.

The JSON object has the following structure:

{
  "<col1_name>": {
    "extra_info" : {
      "alternates" : [<semantic_categories>],
      "probability" : "<number>"
    },
    "privacy_category" : "<value>",
    "semantic_category" : "<value>"
  },
 ...
 ...
  "<colN_name>": {
    "extra_info" : {
      "alternates" : [<semantic_categories>],
      "probability" : "<number>"
    },
    "privacy_category" : "<value>",
    "semantic_category" : "<value>"
  }
}

Where:

extra_info

Object containing the following key-value pairs:

  • alternates is an array of other possible semantic categories if probability is less than 0.80. To be listed as an alternate, the semantic category must have a probability of 0.15 (or higher), but less than 0.80.

  • probability is a number indicating the likelihood that the classification algorithm derived the correct value for semantic_category. The maximum probability is 1.00 and the minimum threshold is 0.15 (i.e. if all semantic categories have a probability lower than 0.15, no value is returned for probability or semantic_category).

privacy_category

Privacy category tag for the column, with IDENTIFIER, QUASI_IDENTIFIER, or SENSITIVE as the tag value, based on the semantic category for the column.

semantic_category

Semantic category tag for the column, with the tag value derived by the classification algorithm. For a list of the possible values, see Category Tag Values and Mappings.

For example, if a table has columns named fname, lname, age, and email_address, the function might return output similar to:

{
  "AGE": {
    "extra_info" : {
      "alternates" : [],
      "probability" : "`1.00`"
    },
    "privacy_category" : "QUASI_IDENTIFIER",
    "semantic_category" : "AGE"
  },
"EMAIL_ADDRESS": {
  "extra_info": {
    "alternates": [],
    "probability": "1.00"
  },
  "privacy_category": "IDENTIFIER",
  "semantic_category": "EMAIL"
},
  "FNAME": {
    "extra_info" : {
      "alternates" : [],
      "probability" : "1.00"
    },
    "privacy_category" : "IDENTIFIER",
    "semantic_category" : "NAME"
  },
  "LNAME": {
    "extra_info" : {
      "alternates" : [],
      "probability" : "1.00"
    },
    "privacy_category" : "IDENTIFIER",
    "semantic_category" : "NAME"
  }
}

Usage Notes

  • The function requires an active, running warehouse in the session in which the function is executed.

    The size of the warehouse can impact the amount of time needed to classify the data in the table/view, particularly for tables with a large number of columns containing data to classify. As a general rule, a larger warehouse results in reduced time.

  • Serverless tasks cannot call this function.

    As a workaround, have a user-managed task call this function.

  • For each column that classification identifies as having multiple possible semantic categories:

    • For all categories with a probability of 0.80 (or higher), the category with the highest probability is listed as the value for semantic_category.

    • All categories with a probability of 0.15 (or higher) are listed as alternates.

    • If no categories have a probability of 0.80 (or higher), no value is listed for semantic_category and probability. Likewise, if no categories have a probability of 0.15 (or higher), no values are listed for alternates.

Examples

Extract the semantic and privacy categories for the my_db.my_schema.hr_data table using the default (10000) for the number of rows to scan:

USE ROLE data_engineer;

USE WAREHOUSE classification_wh;

SELECT EXTRACT_SEMANTIC_CATEGORIES( 'my_db.my_schema.hr_data' );

Same as the previous example, but limited to scanning only 5000 rows in the table:

USE ROLE data_engineer;

SELECT EXTRACT_SEMANTIC_CATEGORIES( 'my_db.my_schema.hr_data', 5000 );

Same as the first example, but stores results in a table:

USE ROLE data_engineer;

CREATE OR REPLACE TABLE classification_results(v VARIANT) AS
  SELECT EXTRACT_SEMANTIC_CATEGORIES( 'my_db.my_schema.hr_data');

Once the results are stored in a table, you can revise them before using ASSOCIATE_SEMANTIC_CATEGORY_TAGS to apply them.

Back to top