- Categories:
EXTRACT_SEMANTIC_CATEGORIES¶
Returns a set of categories (semantic and privacy) for each supported column in the specified table or view. To return the categories for a column, the column must use a data type that supports classification and does not contain all NULL values.
The categories are derived from the metadata and data contained in the columns, as well as the metadata about the columns and data. The privacy categories rely on the generated semantic categories, if any.
Syntax¶
EXTRACT_SEMANTIC_CATEGORIES( '<object_name>' [ , <max_rows_to_scan> ] )
Arguments¶
Required:
object_name
The name of the table, external table, view, or materialized view containing the columns to be classified. If a database and schema is not in use in the current session, the name must be fully-qualified.
The name must be specified exactly as it is stored in the database. If the name contains special characters, capitalization, or blank spaces, the name must be enclosed first in double-quotes and then in single quotes.
Optional:
max_rows_to_scan
The sample size of rows to use for determining the classification categories in the specified table/view.
Valid values:
1
to10000
Default:
10000
Returns¶
The function returns a single JSON object containing a tag name/value pair for each column for which classification categories were generated. If a column is an unsupported data type or contains all NULL values, the column is not included in the output.
The JSON object has the following structure:
{
"<col1_name>": {
"extra_info" : {
"alternates" : [<semantic_categories>],
"probability" : "<number>"
},
"privacy_category" : "<value>",
"semantic_category" : "<value>"
},
...
...
"<colN_name>": {
"extra_info" : {
"alternates" : [<semantic_categories>],
"probability" : "<number>"
},
"privacy_category" : "<value>",
"semantic_category" : "<value>"
}
}
Where:
extra_info
Object containing the following key-value pairs:
alternates
is an array of other possible semantic categories ifprobability
is less than0.80
. To be listed as an alternate, the semantic category must have a probability of0.15
(or higher), but less than0.80
.probability
is a number indicating the likelihood that the classification algorithm derived the correct value forsemantic_category
. The maximum probability is1.00
and the minimum threshold is0.15
(i.e. if all semantic categories have a probability lower than0.15
, no value is returned forprobability
orsemantic_category
).
privacy_category
Privacy category tag for the column, with
IDENTIFIER
,QUASI_IDENTIFIER
, orSENSITIVE
as the tag value, based on the semantic category for the column.semantic_category
Semantic category tag for the column, with the tag value derived by the classification algorithm. For a list of the possible values, see System Tags & Categories.
For example, if a table has columns named fname
, lname
, age
, and email_address
, the function might return output similar to:
{
"AGE": {
"extra_info" : {
"alternates" : [],
"probability" : "1.00"
},
"privacy_category" : "QUASI_IDENTIFIER",
"semantic_category" : "AGE"
},
"EMAIL_ADDRESS": {
"extra_info": {
"alternates": [],
"probability": "1.00"
},
"privacy_category": "IDENTIFIER",
"semantic_category": "EMAIL"
},
"FNAME": {
"extra_info" : {
"alternates" : [],
"probability" : "1.00"
},
"privacy_category" : "IDENTIFIER",
"semantic_category" : "NAME"
},
"LNAME": {
"extra_info" : {
"alternates" : [],
"probability" : "1.00"
},
"privacy_category" : "IDENTIFIER",
"semantic_category" : "NAME"
}
}
Usage Notes¶
The function requires an active, running warehouse in the session in which the function is executed.
The size of the warehouse can impact the amount of time needed to classify the data in the table/view, particularly for tables with a large number of columns containing data to classify. As a general rule, a larger warehouse results in reduced time.
For each column that classification identifies as having multiple possible semantic categories:
For all categories with a probability of
0.80
(or higher), the category with the highest probability is listed as the value forsemantic_category
.All categories with a probability of
0.15
(or higher) are listed asalternates
.If no categories have a probability of
0.80
(or higher), no value is listed forsemantic_category
andprobability
. Likewise, if no categories have a probability of0.15
(or higher), no values are listed foralternates
.
Examples¶
Extract the semantic and privacy categories for the my_db.my_schema.hr_data
table using the default (10000
) for the number of
rows to scan:
USE ROLE data_engineer; USE WAREHOUSE classification_wh; SELECT EXTRACT_SEMANTIC_CATEGORIES( 'my_db.my_schema.hr_data' );
Same as the previous example, but limited to scanning only 5000 rows in the table:
USE ROLE data_engineer; SELECT EXTRACT_SEMANTIC_CATEGORIES( 'my_db.my_schema.hr_data', 5000 );
Same as the first example, but stores results in a table:
USE ROLE data_engineer; CREATE OR REPLACE TABLE classification_results(v VARIANT) AS SELECT EXTRACT_SEMANTIC_CATEGORIES( 'my_db.my_schema.hr_data');Once the results are stored in a table, you can revise them before using ASSOCIATE_SEMANTIC_CATEGORY_TAGS to apply them.