Using synthetic data in Snowflake

This preview introduces a new stored procedure, GENERATE_SYNTHETIC_DATA, to generate synthetic data.

Overview

Snowflake can generate synthetic data from a source table, producing a table with the same number of columns as the source table but with artificial data. The synthetic data set preserves key characteristics of the source data set, such as the names, number, and data types of columns, and contains the same or a smaller number of rows. You can use synthetic data to test and validate workloads in Snowflake, particularly when the original data is too sensitive, confidential, or otherwise restricted to share with others or expose to unauthorized users.

Benefits

Statistical consistency:

A synthetic data set represents the statistical properties of the original data set, which helps data engineers understand the statistical properties of the real data set. Data engineers can then test and validate solutions that are based on the real data set.

Production validation:

A synthetic data set that is similar to a production data set enables production engineers to test and validate their production environment, resulting in a more robust production environment.

About the synthetic data algorithm

Snowflake generates synthetic data with an algorithm that leverages copulas to capture the distribution of data both within and across columns of the original data set. Once this distribution is captured, the algorithm produces synthetic data that statistically resembles the original data but has no direct reference or link to any row of the original data.

Snowflake categorizes source columns as one of the following types, which affects how synthetic data is generated:

  • Join key: Can be any data type. Join key columns are declared explicitly by the user. A consistent synthetic value is generated in the output data for the same value in the source data for all join keys in all tables during a single run. This enables you to run join queries and get results similar to those you would get when running the same query against the source data.

  • Statistical data: Data of type number, boolean, date, time, or timestamp. Artificial values of the same type are generated, with a similar statistical distribution in the output data.

  • Categorical string: A string column with few unique values*. Actual values are used in the generated data with a statistical distribution similar to the source data.

  • Non-categorical string: A string column with many unique values*. Values are redacted in the output.

* Few unique values means that the number of unique values is less than half the row count. Many unique values means that the number of unique values is more than half the row count.
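Whether a string column will be treated as categorical can be checked ahead of time with a simple aggregate. The following query sketches that check; the table and column names are hypothetical placeholders:

```sql
-- Hypothetical table and column; compares the distinct-value count of a
-- STRING column against half the row count. If is_categorical is FALSE,
-- the column's values will be redacted in the synthetic output.
SELECT
    COUNT(DISTINCT region)                AS distinct_values,
    COUNT(*)                              AS total_rows,
    COUNT(DISTINCT region) < COUNT(*) / 2 AS is_categorical
FROM my_db.my_schema.customers;
```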

Generating synthetic data

Call the SNOWFLAKE.DATA_PRIVACY.GENERATE_SYNTHETIC_DATA stored procedure to generate synthetic data from one or more tables. Snowflake automatically creates tables with ownership granted to the role that calls the stored procedure. The output tables have the same number of columns as the input tables, with the same column names and data types. The output generally has the same number of rows, unless you enable the privacy filter or the input table has null values, in which case the output tables might have fewer rows.
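As a minimal sketch, a single-table call looks like the following; the database, schema, and table names are hypothetical placeholders:

```sql
-- Generates a synthetic copy of one table with default options.
-- All object names below are placeholders.
CALL SNOWFLAKE.DATA_PRIVACY.GENERATE_SYNTHETIC_DATA({
    'datasets': [
        {
            'input_table': 'my_db.my_schema.transactions',
            'output_table': 'my_db.my_schema.transactions_synthetic'
        }
    ]
});
```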

Running join queries on synthetic data

If you plan to run join queries on your synthetic data, designate every column that you will join on as a join key. The synthetic generator will produce the same synthetic value from the same source value in all columns marked as join keys. That is, in all join key columns, the source value “Europe” will result in the same synthetic value in the generated tables.

If you plan to run multi-table join queries on the synthetic data, run GENERATE_SYNTHETIC_DATA with all tables that you plan to use in your query, specifying which columns will be used as join keys. Columns designated as join keys will have a consistent synthetic value for the same source value in all tables, which makes these queries behave similarly on synthetic data and the original data.

Join keys generate consistent synthetic values only within a single run. Running the procedure a second time will not reproduce the same synthetic values for join keys, even for an identical request.

Filtering the synthetic data

When you call the GENERATE_SYNTHETIC_DATA stored procedure, you can optionally set the 'privacy_filter': True configuration option to apply a privacy filter to the output table. The privacy filter removes rows from the output table that are too similar to rows in the input data set. The filter uses the nearest neighbor distance ratio (NNDR) and distance to closest record (DCR) of each row to determine whether the row should be removed from the output table.
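A call with the privacy filter enabled might look like this sketch, assuming hypothetical table names and assuming the option is passed at the top level of the configuration object, alongside options such as 'replace_output_tables':

```sql
-- 'privacy_filter': True removes output rows that are too similar
-- (by NNDR and DCR) to rows in the input data set, so the output
-- may contain fewer rows than the input. Table names are placeholders.
CALL SNOWFLAKE.DATA_PRIVACY.GENERATE_SYNTHETIC_DATA({
    'datasets': [
        {
            'input_table': 'my_db.my_schema.transactions',
            'output_table': 'my_db.my_schema.transactions_synthetic'
        }
    ],
    'privacy_filter': True
});
```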

Requirements

Input table requirements

Both tables and views are supported as source data. You can specify up to five input tables per procedure call.

To generate synthetic data, each input table or view must meet the following requirements:

  • Minimum 20 distinct rows

  • Maximum 100 columns

  • Maximum 2.3M rows

  • The following data types are supported. Columns of an unsupported data type return NULL for all values in the column.

    • All numeric types (NUMBER, DECIMAL, FLOAT, INTEGER, and so on)

    • BOOLEAN

    • All date and time types (DATE, DATETIME, TIME, TIMESTAMP, and so on) except TIMESTAMP_TZ

    • STRING, VARCHAR, CHAR, CHARACTER, TEXT

      If more than half of the values in a STRING column are unique, Snowflake replaces the values with a redacted value in the output table due to privacy concerns.

Access control requirements

To generate synthetic data, you must use a role with each of the following grants:

  • USAGE on the warehouse that you want to use for queries.

  • SELECT on the input table from which you want to generate synthetic data.

  • USAGE on the database and schema that contain the input table, and on the database that contains the output table.

  • CREATE TABLE on the schema that contains the output table.

  • OWNERSHIP on the output tables. The simplest way to achieve this is to grant the role OWNERSHIP on the schema where the output tables are created. (Note, however, that a future grant on this schema silently overrides table ownership: GRANT OWNERSHIP ON FUTURE TABLES IN SCHEMA db.my_schema TO ROLE some_role automatically grants OWNERSHIP to some_role on any new tables created in schema my_schema.)
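The grants above can be sketched as follows for a hypothetical role and set of objects (all names are placeholders):

```sql
-- Run as a role that can grant these privileges, such as ACCOUNTADMIN.
GRANT USAGE ON WAREHOUSE my_wh TO ROLE synth_role;
GRANT USAGE ON DATABASE my_db TO ROLE synth_role;
GRANT USAGE ON SCHEMA my_db.src_schema TO ROLE synth_role;   -- input schema
GRANT SELECT ON TABLE my_db.src_schema.transactions TO ROLE synth_role;
GRANT USAGE ON SCHEMA my_db.out_schema TO ROLE synth_role;   -- output schema
GRANT CREATE TABLE ON SCHEMA my_db.out_schema TO ROLE synth_role;
```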

All users can access the SNOWFLAKE.DATA_PRIVACY.GENERATE_SYNTHETIC_DATA stored procedure. Access is made available using the SNOWFLAKE.CORE_VIEWER database role, which is granted to the PUBLIC role.

Other requirements

You must accept the Anaconda terms and conditions in your Snowflake account in order to enable this feature.

Example: Synthetic data from multiple tables

This example uses the Snowflake sample database, SNOWFLAKE_SAMPLE_DATA. If you don’t see it in your account, you can create it from the shared data with the following commands:

USE ROLE ACCOUNTADMIN;
CREATE OR REPLACE DATABASE SNOWFLAKE_SAMPLE_DATA FROM SHARE SFC_SAMPLES.SAMPLE_DATA;

Follow these steps to generate synthetic data from multiple input tables:

  1. Create and configure the access control for the data_engineer role to allow them to create all of the necessary objects:

    USE ROLE ACCOUNTADMIN;
    CREATE OR REPLACE ROLE data_engineer;
    CREATE OR REPLACE DATABASE syndata_db;
    CREATE OR REPLACE WAREHOUSE syndata_wh;
    
    GRANT OWNERSHIP ON DATABASE syndata_db TO ROLE data_engineer;
    GRANT USAGE ON WAREHOUSE syndata_wh TO ROLE data_engineer;
    GRANT ROLE data_engineer TO USER jsmith; -- Or whoever you want to run this example. Or skip this line to run it yourself.
    
  2. Create two views from the Snowflake Sample Data database:

    -- Sign in as a user with the data_engineer role, then create the views:
    CREATE SCHEMA syndata_db.sch;
    CREATE OR REPLACE VIEW syndata_db.sch.TPC_ORDERS_5K AS (
        SELECT * FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.ORDERS
        LIMIT 5000
    );
    CREATE OR REPLACE VIEW syndata_db.sch.TPC_CUSTOMERS_5K AS (
        SELECT * FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.CUSTOMER
        LIMIT 5000
    );
    
  3. Query the input tables to view the data and confirm that each table has 5,000 rows:

    USE WAREHOUSE syndata_wh;
    SELECT TOP 20 * FROM syndata_db.sch.TPC_ORDERS_5K;
    SELECT COUNT(*) FROM syndata_db.sch.TPC_ORDERS_5K;
    -- Check whether O_CLERK will be treated as categorical
    -- (distinct values less than half the row count):
    SELECT COUNT(DISTINCT o_clerk), COUNT(*) FROM syndata_db.sch.TPC_ORDERS_5K;
    
    SELECT TOP 20 * FROM syndata_db.sch.TPC_CUSTOMERS_5K;
    SELECT COUNT(*) FROM syndata_db.sch.TPC_CUSTOMERS_5K;
    
  4. Call the GENERATE_SYNTHETIC_DATA stored procedure to generate the synthetic data into two output tables. Designate join keys, because you will join on those keys later.

    CALL SNOWFLAKE.DATA_PRIVACY.GENERATE_SYNTHETIC_DATA({
        'datasets': [
            {
                'input_table': 'syndata_db.sch.TPC_ORDERS_5K',
                'output_table': 'syndata_db.sch.TPC_ORDERS_5K_SYNTHETIC',
                'columns': {'O_CUSTKEY': {'join_key': True}}
            },
            {
                'input_table': 'syndata_db.sch.TPC_CUSTOMERS_5K',
                'output_table': 'syndata_db.sch.TPC_CUSTOMERS_5K_SYNTHETIC',
                'columns': {'C_CUSTKEY': {'join_key': True}}
            }
        ],
        'replace_output_tables': True
    });
    
  5. Query the output tables to view the synthetic data:

    SELECT TOP 20 * FROM syndata_db.sch.TPC_ORDERS_5K_SYNTHETIC;
    SELECT COUNT(*) FROM syndata_db.sch.TPC_ORDERS_5K_SYNTHETIC;
    
    SELECT TOP 20 * FROM syndata_db.sch.TPC_CUSTOMERS_5K_SYNTHETIC;
    SELECT COUNT(*) FROM syndata_db.sch.TPC_CUSTOMERS_5K_SYNTHETIC;
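
    -- Illustrative extra query (not part of the original example): because
    -- O_CUSTKEY and C_CUSTKEY were designated as join keys in the same run,
    -- matching source values map to the same synthetic value, so the
    -- synthetic tables can be joined much like the originals:
    SELECT c.C_NAME, COUNT(o.O_ORDERKEY) AS order_count
    FROM syndata_db.sch.TPC_CUSTOMERS_5K_SYNTHETIC c
    JOIN syndata_db.sch.TPC_ORDERS_5K_SYNTHETIC o
        ON o.O_CUSTKEY = c.C_CUSTKEY
    GROUP BY c.C_NAME
    ORDER BY order_count DESC
    LIMIT 10;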
    
  6. Clean up all of the objects:

    USE ROLE ACCOUNTADMIN;
    DROP DATABASE syndata_db;
    DROP ROLE data_engineer;
    DROP WAREHOUSE syndata_wh;
    