Using synthetic data in Snowflake

This preview introduces a new stored procedure, GENERATE_SYNTHETIC_DATA, to generate synthetic data.

Overview

Synthetic data is data that is generated programmatically to serve as a substitute for real data. You can use synthetic data to represent data that is sensitive, confidential, or otherwise restricted. The synthetic data set has characteristics of the data set it represents, such as numbers of columns, column names, the data type of each column, and synthetically generated data values for each column. A data engineer can use synthetic data to test and validate workloads in Snowflake, particularly when the original data is sensitive and should not be accessible to unauthorized users.

In Snowflake, you can call the SNOWFLAKE.DATA_PRIVACY.GENERATE_SYNTHETIC_DATA stored procedure to generate synthetic data from up to five input tables. You specify an output table to store the synthetic data. Snowflake automatically creates and grants access to the output table to the role that calls the stored procedure. If you need to generate the output table again, you can overwrite the existing output table by modifying the stored procedure parameters.

Benefits

Statistical consistency:

A data set that represents the statistical properties of the original data set helps data engineers to understand the statistical properties of the real data set. Subsequently, the data engineer can test and validate solutions that are based on the real data set.

Production validation:

A synthetic data set that is similar to a production data set enables production engineers to test and validate their production environment. The result is a more robust production environment.

Filtering the synthetic data

When you call the GENERATE_SYNTHETIC_DATA stored procedure, you can optionally set the 'privacy_filter': True configuration option to apply a privacy filter to the output table. The privacy filter removes rows from the output table if the rows are too similar to the input data set. The privacy threshold uses the nearest neighbor distance ratio (NNDR) and distance to closest record (DCR) values to determine whether a row should be removed from the output table.

About the synthetic data algorithm

Snowflake uses an algorithm to generate synthetic data that is similar to the original data set. The algorithm uses the original data set to generate synthetic data that has the same statistical properties as the original data set. The algorithm leverages copulas to capture the distribution of data both within and across columns. Once this distribution is captured, the synthetic data resembles the original data statistically but does not have a direct reference or link to any row from the original data.

If you have multiple input tables, you can specify join keys, which are columns in the input table. The algorithm collects the specified join keys across the input tables, converts them to integers, casts them back to their original format, and reattaches them to the synthesized tables.

Requirements

Input table requirements

To generate synthetic data, each input table that you specify must meet the following requirements:

  • Tables and views.

  • A minimum of 20 distinct rows.

  • The columns in the input table can have the following data types. Columns with an unsupported data type return null for all values in the column:

    • All numeric types (NUMBER, DECIMAL, FLOAT, INTEGER, and so on)

    • BOOLEAN

    • All date and time types (DATE, DATETIME, TIME, TIMESTAMP, and so on) except TIMESTAMP_TZ

    • STRING, VARCHAR, CHAR, CHARACTER, TEXT

      If more than half of the values in a STRING column are unique values, Snowflake replaces the value with a redacted value in the output table due to privacy concerns.

Additionally, make sure that your input table does not exceed these limits:

  • 5 total input tables.

  • 100 columns per input table.

  • 2.3M rows

Access control requirements

To generate synthetic data, you must use a role with each the following grants:

  • USAGE on the warehouse that you want to use for queries.

  • SELECT on the input table from which you want to generate synthetic data.

  • USAGE on the database and schema that contain the input table, and on the database that contains the output table.

  • CREATE TABLE on the schema that contains the output table.

  • OWNERSHIP on the output tables. The simplest way to do this is by granting OWNERSHIP to the schema where the output table is generated. (However, if someone has applied a FUTURE GRANT on this schema, table ownership will be silently overridden – that is, GRANT OWNERSHIP ON FUTURE TABLES IN SCHEMA db.my_schema TO ROLE some_role will automatically grant OWNERSHIP to some_role on any new tables created in schema my_schema.)

All users can access the SNOWFLAKE.DATA_PRIVACY.GENERATE_SYNTHETIC_DATA stored procedure. Access is made available using the SNOWFLAKE.CORE_VIEWER database role, which is granted to the PUBLIC role.

Other requirements

You must accept the Anaconda terms and conditions in your Snowflake account in order to enable this feature.

Recommendations

  • Use a medium Snowpark-optimized warehouse. While GENERATE_SYNTHETIC_DATA is running, do not run any other queries in that warehouse.

  • Specify up to 5 input tables in a single call of the SNOWFLAKE.DATA_PRIVACY.GENERATE_SYNTHETIC_DATA stored procedure.

  • Choose proper input tables. For more information, see Requirements (in this topic).

Example: synthetic data from multiple tables

This example uses the Snowflake Sample Data database, which has been shared with nearly all Snowflake customers.

Follow these steps to generate synthetic data from multiple input table:

  1. Configure the access control for the data_engineer role and allow them to create all of the necessary objects:

    USE ROLE ACCOUNTADMIN;
    CREATE OR REPLACE ROLE data_engineer;
    CREATE OR REPLACE DATABASE syndata_db;
    CREATE OR REPLACE WAREHOUSE syndata_wh;
    
    GRANT OWNERSHIP ON DATABASE syndata_db TO ROLE data_engineer;
    GRANT USAGE ON WAREHOUSE syndata_wh TO ROLE data_engineer;
    GRANT ROLE data_engineer TO USER jsmith;
    
    Copy
  2. Create views from the Snowflake Sample Data database:

    CREATE SCHEMA syndata_db.sch;
    CREATE OR REPLACE VIEW syndata_db.sch.TPC_ORDERS_5K as (
        SELECT * from SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.ORDERS
        LIMIT 5000
    );
    CREATE OR REPLACE VIEW syndata_db.sch.TPC_CUSTOMERS_5K as (
        SELECT * from SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.CUSTOMER
        LIMIT 5000
    );
    
    Copy
  3. Query the input tables to view the data and confirm each table has 5000 rows:

    USE WAREHOUSE syndata_wh;
    SELECT * FROM syndata_db.sch.TPC_ORDERS_1K;
    
    Copy
  4. Call the GENERATE_SYNTHETIC_DATA stored procedure to generate the synthetic data and store it in an output table:

    CALL SNOWFLAKE.DATA_PRIVACY.GENERATE_SYNTHETIC_DATA({
        'datasets':[
            {
              'input_table': 'syndata_db.sch.TPC_ORDERS_5K',
              'output_table': 'syndata_db.sch.TPC_ORDERS_5K_SYNTHETIC',
              'columns': {'O_ORDERKEY': {'join_key': True}, 'O_CUSTKEY': {'join_key': True}}
            },
            {
              'input_table': 'syndata_db.sch.TPC_CUSTOMERS_5K',
              'output_table': 'syndata_db.sch.TPC_CUSTOMERS_5K_SYNTHETIC',
              'columns' : {'C_CUSTKEY': {'join_key': True}}
    
            }
          ],
          'replace_output_tables':True
      });
    
    Copy
  5. Query the output table to view the synthetic data:

    SELECT * FROM syndata_db.sch.TPC_ORDERS_5K_SYNTHETIC;
    
    Copy