GENERATE_SYNTHETIC_DATA¶
Generates synthetic data in an output table based on sample data from an input table. You can generate more than one output table per run.
The procedure returns a table that contains metrics about the generated data, such as the coefficient of difference (similarity) between the source data and the generated data.
This stored procedure uses caller’s rights to generate the output table.
Syntax¶
SNOWFLAKE.DATA_PRIVACY.GENERATE_SYNTHETIC_DATA(
{
'datasets': [
{
'input_table': '<input_table_name>',
'output_table' : '<output_table_name>',
[ 'columns': { '<column_name>': {join_key:[TRUE | FALSE]} [, '<column_name>': {join_key:[TRUE | FALSE]} , ... ]]
}
[ ,
'input_table': '<input_table_name>',
'output_table' : '<output_table_name>',
[ 'columns': { '<column_name>': {join_key:[TRUE | FALSE]} [, '<column_name>': {join_key:[TRUE | FALSE]} , ... ]]
, ... ]
]
[ , 'privacy_filter': <boolean> ]
[ , 'replace_output_tables': <boolean> ]
}
)
Required parameters¶
'input_table_name'
The name of the input table from which to generate synthetic data.
If the identifier contains spaces or special characters, the entire string must be enclosed in double quotes. Identifiers enclosed in double quotes are also case-sensitive.
For more information, see Identifier requirements.
You can specify the fully-qualified name or the relative name of the table. If using a relative name, check the database and schema that are in use for the session. External and Iceberg tables are not supported.
If the table does not exist or cannot be accessed, Snowflake returns an error message. See Using synthetic data in Snowflake for more input table requirements.
'output_table_name'
The fully-qualified name of the output table to store the synthetic data. The generated table will have the same permissions and policies as if the user had called CREATE TABLE with default values.
In addition, the identifier must start with an alphabetic character and cannot contain spaces or special characters unless the entire identifier string is enclosed in double quotes (for example,
"My object"
). Identifiers enclosed in double quotes are also case-sensitive.For more information, see Identifier requirements.
You can specify the fully qualified name or the relative name of the output table. If using a relative name, check the database and schema that are in use for the session prior to calling this stored procedure.
If the output table already exists, specify
'replace_output_tables': True
to overwrite the output table, otherwise an error message occurs.
Optional parameters¶
[ 'column_name' ] [ , 'column_name' , ... ]
A list of column names in the input table that serve as join keys. Data in columns specified as join keys be label encoded: it will be converted into a simple label, such as “1”, “2”, “3”. The format and cardinality (number of values) of join keys will be consistent for all joined columns so that you can execute joins on the specified columns in the generated tables.
Every specified column must exist in the input table.
Default:
[]
(empty list)'privacy_filter': {boolean}
Specifies whether to use a privacy filter when creating the synthetic data.
True
Uses the built-in privacy filter to remove rows from the target table that are too similar to the input table.
False
Does not use the built-in privacy filter to remove rows from the output table.
Default:
False
For more information, see Input table requirements.
'replace_output_tables': {boolean}
Specifies whether to overwrite the output synthetic data table when creating the synthetic data.
True
Overwrites the output table.
False
Does not overwrite the output table.
Default:
False
Output¶
Column Name |
Data Type |
Description |
---|---|---|
|
TIMESTAMP |
Time the synthetic data was generated. |
|
VARCHAR |
Name of the synthetic table. |
|
VARCHAR |
Schema name of the synthetic table. |
|
VARCHAR |
Database name of the synthetic table. |
|
VARCHAR |
A pair of columns in the synthetic table. |
|
NUMBER |
Internal/system-generated identifier of the input table. |
|
VARCHAR |
Name of the input table. |
|
VARCHAR |
Schema name of the input table. |
|
VARCHAR |
Database name of the input table. |
|
VARCHAR |
Names of the source columns. |
|
ENUM |
Currently, |
|
NUMBER |
Value of the metric. |
Access control requirements¶
To generate synthetic data, you must use a role with each the following grants:
USAGE on the warehouse that you want to use for queries.
SELECT on the input table from which you want to generate synthetic data.
USAGE on the database and schema that contain the input table, and on the database that contains the output table.
CREATE TABLE on the schema to contain the output table.
All users can access the SNOWFLAKE.DATA_PRIVACY.GENERATE_SYNTHETIC_DATA stored procedure. Access is made available using the SNOWFLAKE.CORE_VIEWER database role, which is granted to the PUBLIC role.
Usage notes¶
The JSON key values must be lowercase.
You must accept the Anaconda terms and conditions in your Snowflake account in order to enable this feature.
For additional requirements, see Example: synthetic data from multiple tables.
Example¶
This example generates synthetic data from an input table containing medical information (blood type, gender, age, and ethnicity). The response shows the closeness in data between the source and generated tables. The generated synthetic data table is not shown.
CALL SNOWFLAKE.DATA_PRIVACY.GENERATE_SYNTHETIC_DATA({
'datasets':[
{
'input_table': 'syndata_db.sch.faker_source_t',
'output_table': 'syndata_db.sch.faker_synthetic_t',
'columns': {}
}
]
});
(The previous example also could have omitted the columns
field entirely.)
+---------------------------+-------------------+--------------+----------------+------------------------+-------------------+---------------------+-----------------------+------------------------+------------------------------------+----------------+
| CREATED_ON | TABLE_NAME | TABLE_SCHEMA | TABLE_DATABASE | COLUMNS | SOURCE_TABLE_NAME | SOURCE_TABLE_SCHEMA | SOURCE_TABLE_DATABASE | SOURCE_COLUMNS | METRIC_TYPE | METRIC_VALUE |
+---------------------------+-------------------+--------------+----------------+------------------------+-------------------+---------------------+-----------------------+------------------------+------------------------------------+----------------+
| 2024-07-30 09:53:28.439 Z | faker_synthetic_t | sch | syndata_db | "BLOOD_TYPE,GENDER" | faker_source_t | sch | syndata_db | "BLOOD_TYPE,GENDER" | CORRELATION_COEFFICIENT_DIFFERENCE | 0.02430214616 |
| 2024-07-30 09:53:28.439 Z | faker_synthetic_t | sch | syndata_db | "BLOOD_TYPE,AGE" | faker_source_t | sch | syndata_db | "BLOOD_TYPE,AGE" | CORRELATION_COEFFICIENT_DIFFERENCE | 0.001919343586 |
| 2024-07-30 09:53:28.439 Z | faker_synthetic_t | sch | syndata_db | "BLOOD_TYPE,ETHNICITY" | faker_source_t | sch | syndata_db | "BLOOD_TYPE,ETHNICITY" | CORRELATION_COEFFICIENT_DIFFERENCE | 0.003720197046 |
| 2024-07-30 09:53:28.439 Z | faker_synthetic_t | sch | syndata_db | "GENDER,AGE" | faker_source_t | sch | syndata_db | "GENDER,AGE" | CORRELATION_COEFFICIENT_DIFFERENCE | 0.004348586645 |
| 2024-07-30 09:53:28.439 Z | faker_synthetic_t | sch | syndata_db | "GENDER,ETHNICITY" | faker_source_t | sch | syndata_db | "GENDER,ETHNICITY" | CORRELATION_COEFFICIENT_DIFFERENCE | 0.001171535243 |
| 2024-07-30 09:53:28.439 Z | faker_synthetic_t | sch | syndata_db | "AGE,ETHNICITY" | faker_source_t | sch | syndata_db | "AGE,ETHNICITY" | CORRELATION_COEFFICIENT_DIFFERENCE | 0.004265938158 |
+---------------------------+-------------------+--------------+----------------+------------------------+-------------------+---------------------+-----------------------+------------------------+------------------------------------+----------------+