Using synthetic data in Snowflake

This preview introduces a new stored procedure, GENERATE_SYNTHETIC_DATA, to generate synthetic data.

Overview

Snowflake can generate synthetic data from a source table, producing a table with the same number of columns as the source table but with artificial data. The synthetic data set preserves key characteristics of the source data set, such as the names, number, and data types of columns, and contains the same or a smaller number of rows. You can use synthetic data to test and validate workloads in Snowflake, particularly when the original data is too sensitive, confidential, or otherwise restricted to share with others or expose to unauthorized users.

Benefits

Statistical consistency:

A synthetic data set represents the statistical properties of the original data set, which helps data engineers understand the statistical properties of the real data set. Data engineers can then test and validate solutions that are based on the real data set.

Production validation:

A synthetic data set that is similar to a production data set enables production engineers to test and validate their production environment, resulting in a more robust production environment.

About the synthetic data algorithm

Snowflake generates synthetic data with an algorithm that leverages copulas to capture the distribution of data both within and across columns of the original data set. Once this distribution is captured, the algorithm produces synthetic data that statistically resembles the original data but has no direct reference or link to any row of the original data.

Snowflake categorizes source columns as one of the following types, which affects how synthetic data is generated:

  • Join key: Can be any data type. Join key columns are declared explicitly by the user. A consistent synthetic value is generated in the output data for the same value in the source data for all join keys in all tables during a single run. This enables you to run join queries and get results similar to those you would get when running the same query against the source data.

  • Statistical data: Data of type number, boolean, date, time, or timestamp. Artificial values of the same type are generated, with a similar statistical distribution in the output data.

  • Categorical string: A string column with few unique values*. Actual values are used in the generated data with a statistical distribution similar to the source data.

  • Non-categorical string: A string column with many unique values*. Values are redacted in the output.

* Few unique values means that the number of unique values is less than half the row count. Many unique values means that the number of unique values is more than half the row count.
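Whether a string column will be treated as categorical can be checked ahead of time with a simple aggregate. The following query sketches that check; the table and column names are hypothetical placeholders:

```sql
-- Hypothetical table and column; compares the distinct-value count of a
-- STRING column against half the row count. If is_categorical is FALSE,
-- the column's values will be redacted in the synthetic output.
SELECT
    COUNT(DISTINCT region)                AS distinct_values,
    COUNT(*)                              AS total_rows,
    COUNT(DISTINCT region) < COUNT(*) / 2 AS is_categorical
FROM my_db.my_schema.customers;
```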

Generating synthetic data

Call the SNOWFLAKE.DATA_PRIVACY.GENERATE_SYNTHETIC_DATA stored procedure to generate synthetic data from one or more tables. Snowflake automatically creates tables with ownership granted to the role that calls the stored procedure. The output tables have the same number of columns as the input tables, with the same column names and data types. The output generally has the same number of rows, unless you enable the privacy filter or the input table has null values, in which case the output tables might have fewer rows.
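As a minimal sketch, a single-table call looks like the following; the database, schema, and table names are hypothetical placeholders:

```sql
-- Generates a synthetic copy of one table with default options.
-- All object names below are placeholders.
CALL SNOWFLAKE.DATA_PRIVACY.GENERATE_SYNTHETIC_DATA({
    'datasets': [
        {
            'input_table': 'my_db.my_schema.transactions',
            'output_table': 'my_db.my_schema.transactions_synthetic'
        }
    ]
});
```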

Running join queries on synthetic data

If you plan to run join queries on your synthetic data, designate every column that you will join on as a join key. The synthetic generator will produce the same synthetic value from the same source value in all columns marked as join keys. That is, in all join key columns, the source value “Europe” will result in the same synthetic value in the generated tables.

If you plan to run multi-table join queries on the synthetic data, run GENERATE_SYNTHETIC_DATA with all tables that you plan to use in your query, specifying which columns will be used as join keys. Columns designated as join keys will have a consistent synthetic value for the same source value in all tables, which makes these queries behave similarly on synthetic data and the original data.

Join keys generate consistent synthetic values only within a single run. Running the procedure a second time will not reproduce the same synthetic values for join keys, even for an identical request.

Filtering the synthetic data

When you call the GENERATE_SYNTHETIC_DATA stored procedure, you can optionally set the 'privacy_filter': True configuration option to apply a privacy filter to the output table. The privacy filter removes rows from the output table that are too similar to rows in the input data set. The filter uses the nearest neighbor distance ratio (NNDR) and distance to closest record (DCR) of each row to determine whether the row should be removed from the output table.
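A call with the privacy filter enabled might look like this sketch, assuming hypothetical table names and assuming the option is passed at the top level of the configuration object, alongside options such as 'replace_output_tables':

```sql
-- 'privacy_filter': True removes output rows that are too similar
-- (by NNDR and DCR) to rows in the input data set, so the output
-- may contain fewer rows than the input. Table names are placeholders.
CALL SNOWFLAKE.DATA_PRIVACY.GENERATE_SYNTHETIC_DATA({
    'datasets': [
        {
            'input_table': 'my_db.my_schema.transactions',
            'output_table': 'my_db.my_schema.transactions_synthetic'
        }
    ],
    'privacy_filter': True
});
```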

Requirements

Input table requirements

Both tables and views are supported as source data. You can specify up to five input tables per procedure call.

To generate synthetic data, each input table or view must meet the following requirements:

  • Minimum 20 distinct rows

  • Maximum 100 columns

  • Maximum 2.3M rows

  • The following data types are supported. Columns of an unsupported data type return NULL for all values in the column.

    • All numeric types (NUMBER, DECIMAL, FLOAT, INTEGER, and so on)

    • BOOLEAN

    • All date and time types (DATE, DATETIME, TIME, TIMESTAMP, and so on) except TIMESTAMP_TZ

    • STRING, VARCHAR, CHAR, CHARACTER, TEXT

      If more than half of the values in a STRING column are unique, Snowflake replaces the values with a redacted value in the output table due to privacy concerns.

Access control requirements

To generate synthetic data, you must use a role with each of the following grants:

  • USAGE on the warehouse that you want to use for queries.

  • SELECT on the input table from which you want to generate synthetic data.

  • USAGE on the database and schema that contain the input table, and on the database that contains the output table.

  • CREATE TABLE on the schema that contains the output table.

  • OWNERSHIP on the output tables. The simplest way to achieve this is to grant the role OWNERSHIP on the schema where the output tables are created. (Note, however, that a future grant on this schema silently overrides table ownership: GRANT OWNERSHIP ON FUTURE TABLES IN SCHEMA db.my_schema TO ROLE some_role automatically grants OWNERSHIP to some_role on any new tables created in schema my_schema.)
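The grants above can be sketched as follows for a hypothetical role and set of objects (all names are placeholders):

```sql
-- Run as a role that can grant these privileges, such as ACCOUNTADMIN.
GRANT USAGE ON WAREHOUSE my_wh TO ROLE synth_role;
GRANT USAGE ON DATABASE my_db TO ROLE synth_role;
GRANT USAGE ON SCHEMA my_db.src_schema TO ROLE synth_role;   -- input schema
GRANT SELECT ON TABLE my_db.src_schema.transactions TO ROLE synth_role;
GRANT USAGE ON SCHEMA my_db.out_schema TO ROLE synth_role;   -- output schema
GRANT CREATE TABLE ON SCHEMA my_db.out_schema TO ROLE synth_role;
```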

All users can access the SNOWFLAKE.DATA_PRIVACY.GENERATE_SYNTHETIC_DATA stored procedure. Access is made available using the SNOWFLAKE.CORE_VIEWER database role, which is granted to the PUBLIC role.

Other requirements

You must accept the Anaconda terms and conditions in your Snowflake account in order to enable this feature.

Example: Synthetic data from multiple tables

This example uses the Snowflake sample database, SNOWFLAKE_SAMPLE_DATA. If you don’t see it in your account, you can create it from the shared data with the following commands:

USE ROLE ACCOUNTADMIN;
CREATE OR REPLACE DATABASE SNOWFLAKE_SAMPLE_DATA FROM SHARE SFC_SAMPLES.SAMPLE_DATA;

Follow these steps to generate synthetic data from multiple input tables:

  1. Create and configure the access control for the data_engineer role to allow them to create all of the necessary objects:

    USE ROLE ACCOUNTADMIN;
    CREATE OR REPLACE ROLE data_engineer;
    CREATE OR REPLACE DATABASE syndata_db;
    CREATE OR REPLACE WAREHOUSE syndata_wh;
    
    GRANT OWNERSHIP ON DATABASE syndata_db TO ROLE data_engineer;
    GRANT USAGE ON WAREHOUSE syndata_wh TO ROLE data_engineer;
    GRANT ROLE data_engineer TO USER jsmith; -- Or whoever you want to run this example. Or skip this line to run it yourself.
    
  2. Create two views from the Snowflake Sample Data database:

    -- Sign in as a user with the data_engineer role, then create the views:
    CREATE SCHEMA syndata_db.sch;
    CREATE OR REPLACE VIEW syndata_db.sch.TPC_ORDERS_5K AS (
        SELECT * FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.ORDERS
        LIMIT 5000
    );
    CREATE OR REPLACE VIEW syndata_db.sch.TPC_CUSTOMERS_5K AS (
        SELECT * FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.CUSTOMER
        LIMIT 5000
    );
    
  3. Query the input tables to view the data and confirm that each table has 5,000 rows:

    USE WAREHOUSE syndata_wh;
    SELECT TOP 20 * FROM syndata_db.sch.TPC_ORDERS_5K;
    SELECT COUNT(*) FROM syndata_db.sch.TPC_ORDERS_5K;
    -- Check whether O_CLERK will be treated as categorical
    -- (distinct values less than half the row count):
    SELECT COUNT(DISTINCT o_clerk), COUNT(*) FROM syndata_db.sch.TPC_ORDERS_5K;
    
    SELECT TOP 20 * FROM syndata_db.sch.TPC_CUSTOMERS_5K;
    SELECT COUNT(*) FROM syndata_db.sch.TPC_CUSTOMERS_5K;
    
  4. Call the GENERATE_SYNTHETIC_DATA stored procedure to generate the synthetic data into two output tables. Designate join keys, because you will join on those keys later.

    CALL SNOWFLAKE.DATA_PRIVACY.GENERATE_SYNTHETIC_DATA({
        'datasets': [
            {
                'input_table': 'syndata_db.sch.TPC_ORDERS_5K',
                'output_table': 'syndata_db.sch.TPC_ORDERS_5K_SYNTHETIC',
                'columns': {'O_CUSTKEY': {'join_key': True}}
            },
            {
                'input_table': 'syndata_db.sch.TPC_CUSTOMERS_5K',
                'output_table': 'syndata_db.sch.TPC_CUSTOMERS_5K_SYNTHETIC',
                'columns': {'C_CUSTKEY': {'join_key': True}}
            }
        ],
        'replace_output_tables': True
    });
    
  5. Query the output tables to view the synthetic data:

    SELECT TOP 20 * FROM syndata_db.sch.TPC_ORDERS_5K_SYNTHETIC;
    SELECT COUNT(*) FROM syndata_db.sch.TPC_ORDERS_5K_SYNTHETIC;
    
    SELECT TOP 20 * FROM syndata_db.sch.TPC_CUSTOMERS_5K_SYNTHETIC;
    SELECT COUNT(*) FROM syndata_db.sch.TPC_CUSTOMERS_5K_SYNTHETIC;
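
    -- Illustrative extra query (not part of the original example): because
    -- O_CUSTKEY and C_CUSTKEY were designated as join keys in the same run,
    -- matching source values map to the same synthetic value, so the
    -- synthetic tables can be joined much like the originals:
    SELECT c.C_NAME, COUNT(o.O_ORDERKEY) AS order_count
    FROM syndata_db.sch.TPC_CUSTOMERS_5K_SYNTHETIC c
    JOIN syndata_db.sch.TPC_ORDERS_5K_SYNTHETIC o
        ON o.O_CUSTKEY = c.C_CUSTKEY
    GROUP BY c.C_NAME
    ORDER BY order_count DESC
    LIMIT 10;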
    
  6. Clean up all of the objects:

    USE ROLE ACCOUNTADMIN;
    DROP DATABASE syndata_db;
    DROP ROLE data_engineer;
    DROP WAREHOUSE syndata_wh;
    