Redact personally identifiable information (PII)¶
PII includes names, addresses, phone numbers, email addresses, tax identification numbers, and other data that can be used (alone or with other information) to identify an individual. Most organizations have regulatory and compliance requirements around handling PII data. AI_REDACT is a fully-managed AISQL function that helps you help redact PII from unstructured text data, using a large language model (LLM) hosted by Snowflake to identify PII and replace with placeholder values.
AI_REDACT can help you prepare text for call center coaching, sentiment analysis, insurance and medical analysis, and ML model training, among other use cases.
Tip
Use AI_PARSE_DOCUMENT or AI_TRANSCRIBE to convert document or speech data into text before applying AI_REDACT.
AI_REDACT¶
The AI_REDACT function replaces personally identifiable information (PII) in the input text with placeholder values.
Important
AI_REDACT performs redaction in a best-effort manner using AI models. Always review the output to ensure compliance with your organization’s data privacy policies. Please let Snowflake know if AI_REDACT fails to redact any PII in your data.
Regional availability¶
Limitations¶
Redaction is performed using AI models and may not find all personally identifiable information. Always review output to ensure compliance with your organization’s data privacy policies. Please reach out to Snowflake support if AI_REDACT fails to redact certain PII.
The COUNT_TOKENS and AI_COUNT_TOKENS functions do not yet support AI_REDACT.
At this time, AI_REDACT works best with well-formed English text. Performance may vary with other languages or text with many spelling, punctuation, or grammatical errors.
AI_REDACT currently redacts only US PII and some UK and Canadin PII, where noted in Detected PII categories.
AI_REDACT is currently limited in the number of tokens it can input and output. Input and output together can be up to 4,096 tokens. Output is limited to 1,024 tokens. If the input text is longer, split it into smaller chunks and redact each chunk separately, perhaps using SPLIT_TEXT_RECURSIVE_CHARACTER.
Note
A token is the smallest unit of data processed by the AI model. For English text, industry guidelines consider one token to be approximately four characters, or 0.75 words.
Detected PII categories¶
AI_REDACT supports redacting the following categorise of PII. The values in the Category column are the strings supported
in the optional categories argument. Subcategories are automatically detected when the parent category is specified.
Category
Notes
NAME
Also identifies name parts FIRST_NAME, MIDDLE_NAME, and LAST_NAME
PHONE_NUMBER
DATE_OF_BIRTH
GENDER
Recognizes values MALE, FEMALE, NONBINARY
AGE
ADDRESS
Also identifies address parts:
STREET_ADDRESS (US, UK, CA)
POSTAL_CODE (US, UK, CA)
CITY (US, UK, CA)
ADMINISTRATIVE_AREA_1 (US state, CA province)
ADMINISTRATIVE_AREA_2 (US county, borough, or township)
NATIONAL_ID
Identifies US social security numbers
PASSPORT
Identifies US, UK, CA passport numbers
TAX_IDENTIFIER
Identifies Individual Taxpayer Numbers (ITNs)
PAYMENT_CARD_DATA
Also identifies payment card parts PAYMENT_CARD_NUMBER, PAYMENT_CARD_EXPIRATION_DATE, PAYMENT_CARD_CVV
DRIVERS_LICENSE
Identifies US, UK, CA driver’s license numbers
IP_ADDRESS
Note
When AI_REDACT identifies a subcategory, such as a first name, the output contains the main category placeholder, such as [NAME].
Cost considerations¶
AI_REDACT incurs costs based on the number of input and output tokens processed, as with other AISQL functions. See the Snowflake Pricing Guide for details.
Examples¶
Basic examples¶
The following example redacts a name and an address from the input text.
SELECT AI_REDACT(
input => 'My name is John Smith and I live at twenty third street, San Francisco.'
);
Output:
My name is [NAME] and I live at [ADDRESS]
The following example redacts only names and email addresses from the input text. Note that the text only contains a first name, but this (being a subcategory of name) is still recognized and redacted as [NAME]. The input text does not contain an email address, so no email placeholder appears in the output.
SELECT AI_REDACT(
input => 'My name is John and I live at twenty third street, San Francisco.',
categories => ['NAME', 'EMAIL']
);
Output:
My name is [FIRST_NAME] and I live at twenty third street, San Francisco.
End-to-end example¶
The following example processes rows from one table and inserts the redacted output into another table. You could use a similar approach to store the redacted data in a column in an existing table.
After redaction, the text is passed to AI_SENTIMENT to extract overall sentiment information.
-- Create a table with unredacted text
CREATE OR REPLACE TABLE raw_table AS
SELECT 'My previous manager, Washington, used to live in Kirkland. His first name was Mike.' AS my_column
UNION ALL
SELECT 'My name is William and I live in San Francisco. You can reach me at (415).450.0973';
-- view unredacted data
SELECT * FROM raw_table;
-- Create a redaction table
CREATE OR REPLACE TABLE redaction_table (
value VARCHAR
);
-- Redact PII from raw_table and insert into redaction_table
INSERT INTO redaction_table
SELECT AI_REDACT(my_column) AS value FROM raw_table;
-- view redacted results
SELECT * FROM redaction_table;
-- Run AI_SENTIMENT on redacted text
SELECT
value AS redacted_text,
AI_SENTIMENT(value) AS summary_sentiment
FROM redaction_table;
Error handling example¶
Ordinarily, AI_REDACT raises an error if it cannot process the input text. When processing multiple rows, an error causes the entire operation to fail. To avoid this, you can set the session parameter AI_SQL_ERROR_HANDLING_USE_FAIL_ON_ERROR, to FALSE, which causes error conditions to return NULL rather than raising an error.
ALTER SESSION SET AI_SQL_ERROR_HANDLING_USE_FAIL_ON_ERROR=FALSE;
With this session parameter set, by passing TRUE as the last argument to AI_REDACT, the return value is an OBJECT that contains separate fields for the redacted text and any error message. One or the other is NULL depending on whether the AI_REDACT call processed successfully. You can use these fields to record the error message for any row that failed to process.
For example, the predecding example can be modified to record the error messages as follows, replacing the indicated blocks with the modified code:
ALTER SESSION SET AI_SQL_ERROR_HANDLING_USE_FAIL_ON_ERROR=FALSE;
-- Create a redaction table with columns for value and error message
CREATE OR REPLACE TABLE redaction_table (
value VARCHAR,
error VARCHAR
);
-- Redact PII from raw_table and insert into redaction_table
-- Both the redacted text and any error message are stored
INSERT INTO redaction_table
SELECT
result:value::STRING AS value,
result:error::STRING AS error
FROM (SELECT AI_REDACT(my_column, TRUE) AS result FROM raw_table);
Legal notices¶
The data classification of inputs and outputs are as set forth in the following table.
Input data classification |
Output data classification |
Designation |
|---|---|---|
Usage Data |
Customer Data |
Generally available functions are Covered AI Features. Preview functions are Preview AI Features. [1] |
For additional information, refer to Snowflake AI and ML.