GENERATE_SYNTHETIC_DATA

The procedure generates synthetic data from one or more tables, based on data from input tables, and returns a table that contains metrics about the generated data, such as the coefficient of difference (similarity) between the source data and the generated data.

This stored procedure uses the caller’s rights to generate the output table.

Read the requirements for running this procedure. If any requirements are not met, the request will fail before it starts generating data.

Learn more about synthetic data usage.

Syntax

SNOWFLAKE.DATA_PRIVACY.GENERATE_SYNTHETIC_DATA(<configuration_object>)
Copy

Arguments

configuration_object

An OBJECT that specifies the details of the request. You can use an OBJECT constant to specify this object.

The OBJECT value has the following structure:

{
  'datasets': [
    {
      'input_table': '<input_table_name>',
      'output_table' : '<output_table_name>',
      'columns': {
        '<column_name>': {
          <property_name>: <property_value>
        }
        , ...
      }
    }
    , ...
  ],
  'similarity_filter': <boolean>,
  'replace_output_tables': <boolean>,
  'consistency_secret': '<string_literal>'
}
Copy

The OBJECT value contains the following key-value pairs:

datasets

An array specifying the data to generate. Each element in the array is an OBJECT value that defines a single input-output table pair. You can specify a maximum of five table pairs.

A child OBJECT value representing a single input-output table pair has the following properties:

input_table

The fully-qualified name of the input table from which to generate synthetic data. If the table does not exist or cannot be accessed, Snowflake returns an error message. See Using synthetic data in Snowflake for more input table requirements.

If the identifier contains spaces or special characters, the entire string must be enclosed in double quotes. Identifiers enclosed in double quotes are also case-sensitive.

For more information, see Identifier requirements.

output_table

The fully-qualified name of the output table to store the synthetic data generated from input_table. The generated table will have the same permissions and policies as if the user had called CREATE TABLE with default values. If the table already exists and replace_output_tables=TRUE, the existing table will be overwritten.

In addition, the identifier must start with an alphabetic character and cannot contain spaces or special characters unless the entire identifier string is enclosed in double quotes (for example, "My object"). Identifiers enclosed in double quotes are also case-sensitive.

For more information, see Identifier requirements.

columns

(Optional) An OBJECT value specifying additional properties for specific columns. Each field in the OBJECT defines properties for a single column. You do not need to define properties for all columns, or any columns at all. For each field:

  • The key is the column name. Properties in the value should be applied to this column.

  • The value is an OBJECT value containing any of the following key-value pairs:

    • join_key: BOOLEAN, where TRUE indicates that this is a join key column. This cannot be used in a column labeled categorical. Column must be a string, numeric, or Boolean value. Learn about join keys.

      Default: FALSE.

    • categorical: BOOLEAN, used to specify whether the column is a categorical string. Set to TRUE to enable the output to mark the data as non-sensitive, and able to be used in the output. Set to FALSE to redact values from the output. If not specified, will be determined by examining the data. Can be specified only for STRING columns. If set to TRUE, you cannot specify the replace or join_key fields for this column

      Default: Inferred based on the column data.

    • replace: Specifies an output format for STRING values. Can be used only on categorical string columns. The only values that can be used with join_key columns are uuid and email. Cannot be used when categorical is TRUE. If specified, you must provide a SECRET in consistency_secret. The following values are supported:

      replace values

      Value

      Description

      uuid

      A UUID. Example: 88d99a35-c4be-4022-b06a-41fb4629b46d

      name

      A first and last name in US locale style. Example: George Washington

      first_name

      A first name in US locale style. Example: George

      last_name

      A last name in US locale style. Example: Washington

      address

      An abbreviated address in US locale style. Example: 1600 Pennsylvania Ave

      full_address

      A detailed street address in US locale style. Example: 1600 Pennsylvania Ave NW, Washington DC 20500

      email

      An email address. Example: bdbQ6OPBS5ScOdJx8bVpFw@example.com

      phone

      A US-style 10-digit phone number in US locale style. Example: 212-555-1234

      ssn

      A US-style Social Security number. Example: 123-45-6789

      Default:

      • For join_key columns, uuid

      • For non-join-key columns, the value will be redacted.

similarity_filter

(Optional) Specifies whether to use a similarity filter when creating the synthetic data. Set this to TRUE to use the built-in privacy filter to remove rows from the target table that are too similar to rows in the input table. If FALSE, each output table will have the same number of rows as its input table; if TRUE, an output table might have fewer rows than its input table.

Default: FALSE

For more information, see Enhancing privacy.

replace_output_tables

(Optional) Specifies whether to overwrite the output synthetic data table when creating the synthetic data. Set this to TRUE to overwrite the output table.

Default: FALSE

consistency_secret

Name of a symmetric key SECRET object. Required if replace is specified, optional otherwise. If present, the procedure generates consistent values for STRING join keys across multiple runs that reuse the same consistency secret. If not present, join keys will be consistent between tables in the same run, but not between runs. Learn more.

Default: No consistency

Output

Column Name

Data Type

Description

created_on

TIMESTAMP

Time the synthetic data was generated.

table_name

VARCHAR

Name of the synthetic table.

table_schema

VARCHAR

Schema name of the synthetic table.

table_database

VARCHAR

Database name of the synthetic table.

columns

VARCHAR

A pair of columns in the synthetic table.

source_table_id

NUMBER

Internal/system-generated identifier of the input table.

source_table_name

VARCHAR

Name of the input table.

source_table_schema

VARCHAR

Schema name of the input table.

source_table_database

VARCHAR

Database name of the input table.

source_columns

VARCHAR

Names of the source columns.

metric_type

ENUM

correlation_coefficient_difference - Calculated as the absolute value of the correlation coefficient between two non-join columns in the source table and the same two columns in the generated data.

Currently, correlation_coefficient_difference is the only supported metric. This is the difference between the correlation coefficient of every combination of columns in the input table and the same coefficient in the generated data. Each row represents the correlation coefficient difference between one combination of columns. The column name pair is found in these columns: columns and source_columns.

metric_value

NUMBER

Value of the metric.

Access control requirements

To generate synthetic data, you must use a role with each the following grants:

  • USAGE on the warehouse that you want to use for queries.

  • SELECT on the input table from which you want to generate synthetic data.

  • USAGE on the database and schema that contain the input table, and on the database that contains the output table.

  • CREATE TABLE on the schema that contains the output table.

  • OWNERSHIP on the output tables. The simplest way to do this is by granting OWNERSHIP to the schema where the output table is generated. (However, if someone has applied a FUTURE GRANT on this schema, table ownership will be silently overridden – that is, GRANT OWNERSHIP ON FUTURE TABLES IN SCHEMA db.my_schema TO ROLE some_role will automatically grant OWNERSHIP to some_role on any new tables created in schema my_schema.)

All users can access the SNOWFLAKE.DATA_PRIVACY.GENERATE_SYNTHETIC_DATA stored procedure. Access is made available using the SNOWFLAKE.CORE_VIEWER database role, which is granted to the PUBLIC role.

Usage notes

Examples

This example generates synthetic data from an input table containing medical information (blood type, gender, age, and ethnicity). The response shows the closeness in data between the source and generated tables. The generated synthetic data table is not shown.

Two columns designated as join keys

CALL SNOWFLAKE.DATA_PRIVACY.GENERATE_SYNTHETIC_DATA({
    'datasets':[
        {
          'input_table': 'syndata_db.sch.faker_source_t',
          'output_table': 'syndata_db.sch.faker_synthetic_t',
          'columns': { 'blood_type': {'join_key': TRUE} , 'ethnicity': {'join_key': TRUE}}
        }
      ]
  });
Copy

No columns designated as join keys

CALL SNOWFLAKE.DATA_PRIVACY.GENERATE_SYNTHETIC_DATA({
  'datasets':[
      {
        'input_table': 'syndata_db.sch.faker_source_t',
        'output_table': 'syndata_db.sch.faker_synthetic_t'
      }
    ]
});
Copy

Use consistency key to generate consistent values across multiple runs

CREATE OR REPLACE SECRET my_db.public.consistency_secret
  TYPE = SYMMETRIC_KEY
  ALGORITHM = GENERIC;

CALL SNOWFLAKE.DATA_PRIVACY.GENERATE_SYNTHETIC_DATA({
  'datasets':[
      {
        'input_table': 'CLINICAL_DB.PUBLIC.BASE_TABLE',
        'output_table': 'my_db.public.test_syndata',
        'columns': { 'patient_id': {'join_key': TRUE, 'replace': 'uuid'}}
      }
    ],
    'consistency_secret': 'consistency_secret',
    'replace_output_tables': TRUE
});
Copy

Output from calling the function

+---------------------------+-------------------+--------------+----------------+------------------------+-------------------+---------------------+-----------------------+------------------------+------------------------------------+----------------+
| CREATED_ON                | TABLE_NAME        | TABLE_SCHEMA | TABLE_DATABASE | COLUMNS                | SOURCE_TABLE_NAME | SOURCE_TABLE_SCHEMA | SOURCE_TABLE_DATABASE | SOURCE_COLUMNS         | METRIC_TYPE                        | METRIC_VALUE   |
+---------------------------+-------------------+--------------+----------------+------------------------+-------------------+---------------------+-----------------------+------------------------+------------------------------------+----------------+
| 2024-07-30 09:53:28.439 Z | faker_synthetic_t | sch          | syndata_db     | "BLOOD_TYPE,GENDER"    | faker_source_t    | sch                 | syndata_db            | "BLOOD_TYPE,GENDER"    | CORRELATION_COEFFICIENT_DIFFERENCE | 0.02430214616  |
| 2024-07-30 09:53:28.439 Z | faker_synthetic_t | sch          | syndata_db     | "BLOOD_TYPE,AGE"       | faker_source_t    | sch                 | syndata_db            | "BLOOD_TYPE,AGE"       | CORRELATION_COEFFICIENT_DIFFERENCE | 0.001919343586 |
| 2024-07-30 09:53:28.439 Z | faker_synthetic_t | sch          | syndata_db     | "BLOOD_TYPE,ETHNICITY" | faker_source_t    | sch                 | syndata_db            | "BLOOD_TYPE,ETHNICITY" | CORRELATION_COEFFICIENT_DIFFERENCE | 0.003720197046 |
| 2024-07-30 09:53:28.439 Z | faker_synthetic_t | sch          | syndata_db     | "GENDER,AGE"           | faker_source_t    | sch                 | syndata_db            | "GENDER,AGE"           | CORRELATION_COEFFICIENT_DIFFERENCE | 0.004348586645 |
| 2024-07-30 09:53:28.439 Z | faker_synthetic_t | sch          | syndata_db     | "GENDER,ETHNICITY"     | faker_source_t    | sch                 | syndata_db            | "GENDER,ETHNICITY"     | CORRELATION_COEFFICIENT_DIFFERENCE | 0.001171535243 |
| 2024-07-30 09:53:28.439 Z | faker_synthetic_t | sch          | syndata_db     | "AGE,ETHNICITY"        | faker_source_t    | sch                 | syndata_db            | "AGE,ETHNICITY"        | CORRELATION_COEFFICIENT_DIFFERENCE | 0.004265938158 |
+---------------------------+-------------------+--------------+----------------+------------------------+-------------------+---------------------+-----------------------+------------------------+------------------------------------+----------------+