GENERATE_SYNTHETIC_DATA¶
The procedure generates synthetic data from one or more tables, based on data from input tables, and returns a table that contains metrics about the generated data, such as the coefficient of difference (similarity) between the source data and the generated data.
This stored procedure uses the caller’s rights to generate the output table.
Read the requirements for running this procedure. If any requirements are not met, the request will fail before it starts generating data.
Learn more about synthetic data usage.
Syntax¶
SNOWFLAKE.DATA_PRIVACY.GENERATE_SYNTHETIC_DATA(<configuration_object>)
Arguments¶
configuration_object
An OBJECT value that specifies the details of the request. You can use an OBJECT constant to specify this object.
The OBJECT value has the following structure:
{ 'datasets': [ { 'input_table': '<input_table_name>', 'output_table' : '<output_table_name>', 'columns': { '<column_name>': { <property_name>: <property_value> } , ... } } , ... ] 'privacy_filter': <boolean>, 'replace_output_tables': <boolean>, }
The OBJECT value contains the following key-value pairs:
datasets
An ARRAY value specifying the data to generate. Each element in the array is an OBJECT value that describes information about a single source/generated table pair. You can specify up to five table pairs. The request will fail if you provide more than five
datasets`
objects.Each OBJECT value has the following properties:
input_table
The fully-qualified name of the input table from which to generate synthetic data. If the table does not exist or cannot be accessed, Snowflake returns an error message. See Using synthetic data in Snowflake for more input table requirements.
If the identifier contains spaces or special characters, the entire string must be enclosed in double quotes. Identifiers enclosed in double quotes are also case-sensitive.
For more information, see Identifier requirements.
output_table
The fully-qualified name of the output table to store the synthetic data generated from
input_table
. The generated table will have the same permissions and policies as if the user had called CREATE TABLE with default values. If the table already exists andreplace_output_tables=TRUE
, the existing table will be overwritten.In addition, the identifier must start with an alphabetic character and cannot contain spaces or special characters unless the entire identifier string is enclosed in double quotes (for example,
"My object"
). Identifiers enclosed in double quotes are also case-sensitive.For more information, see Identifier requirements.
columns
(Optional) An OBJECT specifying additional properties for specific columns. Each field in the OBJECT defines properties for a single column. You do not need to define properties for all columns, or any columns at all. For each field:
The key is the column name. Properties in the value should be applied to this column.
The value is an OBJECT containing the following key/value pair:
join_key
: Set to TRUE or FALSE, where TRUE indicates that this is a join key column. Learn about join keysDefault: FALSE.
privacy_filter
(Optional) Specifies whether to use a similarity filter when creating the synthetic data. Set this to TRUE to use the built-in privacy filter to remove rows from the target table that are too similar to rows in the input table. If FALSE, each output table will have the same number of rows as its input table; if TRUE, an output table might have fewer rows than its input table.
Default: FALSE
For more information, see Filtering the synthetic data.
replace_output_tables
(Optional) Specifies whether to overwrite the output synthetic data table when creating the synthetic data. Set this to TRUE to overwrite the output table.
Default: FALSE
Output¶
Column Name |
Data Type |
Description |
---|---|---|
|
TIMESTAMP |
Time the synthetic data was generated. |
|
VARCHAR |
Name of the synthetic table. |
|
VARCHAR |
Schema name of the synthetic table. |
|
VARCHAR |
Database name of the synthetic table. |
|
VARCHAR |
A pair of columns in the synthetic table. |
|
NUMBER |
Internal/system-generated identifier of the input table. |
|
VARCHAR |
Name of the input table. |
|
VARCHAR |
Schema name of the input table. |
|
VARCHAR |
Database name of the input table. |
|
VARCHAR |
Names of the source columns. |
|
ENUM |
Currently, |
|
NUMBER |
Value of the metric. |
Access control requirements¶
To generate synthetic data, you must use a role with each the following grants:
USAGE on the warehouse that you want to use for queries.
SELECT on the input table from which you want to generate synthetic data.
USAGE on the database and schema that contain the input table, and on the database that contains the output table.
CREATE TABLE on the schema that contains the output table.
OWNERSHIP on the output tables. The simplest way to do this is by granting OWNERSHIP to the schema where the output table is generated. (However, if someone has applied a FUTURE GRANT on this schema, table ownership will be silently overridden – that is,
GRANT OWNERSHIP ON FUTURE TABLES IN SCHEMA db.my_schema TO ROLE some_role
will automatically grant OWNERSHIP tosome_role
on any new tables created in schemamy_schema
.)
All users can access the SNOWFLAKE.DATA_PRIVACY.GENERATE_SYNTHETIC_DATA stored procedure. Access is made available using the SNOWFLAKE.CORE_VIEWER database role, which is granted to the PUBLIC role.
Usage notes¶
The JSON key values must be lowercase.
You must accept the Anaconda terms and conditions in your Snowflake account in order to enable this feature.
For additional requirements, see Requirements.
Examples¶
This example generates synthetic data from an input table containing medical information (blood type, gender, age, and ethnicity). The response shows the closeness in data between the source and generated tables. The generated synthetic data table is not shown.
Two columns designated as join keys
CALL SNOWFLAKE.DATA_PRIVACY.GENERATE_SYNTHETIC_DATA({
'datasets':[
{
'input_table': 'syndata_db.sch.faker_source_t',
'output_table': 'syndata_db.sch.faker_synthetic_t',
'columns': { 'blood_type': {'join_key': TRUE} , 'ethnicity': {'join_key': TRUE}}
}
]
});
No columns designated as join keys
CALL SNOWFLAKE.DATA_PRIVACY.GENERATE_SYNTHETIC_DATA({
'datasets':[
{
'input_table': 'syndata_db.sch.faker_source_t',
'output_table': 'syndata_db.sch.faker_synthetic_t',
}
]
});
Output from calling the function
+---------------------------+-------------------+--------------+----------------+------------------------+-------------------+---------------------+-----------------------+------------------------+------------------------------------+----------------+
| CREATED_ON | TABLE_NAME | TABLE_SCHEMA | TABLE_DATABASE | COLUMNS | SOURCE_TABLE_NAME | SOURCE_TABLE_SCHEMA | SOURCE_TABLE_DATABASE | SOURCE_COLUMNS | METRIC_TYPE | METRIC_VALUE |
+---------------------------+-------------------+--------------+----------------+------------------------+-------------------+---------------------+-----------------------+------------------------+------------------------------------+----------------+
| 2024-07-30 09:53:28.439 Z | faker_synthetic_t | sch | syndata_db | "BLOOD_TYPE,GENDER" | faker_source_t | sch | syndata_db | "BLOOD_TYPE,GENDER" | CORRELATION_COEFFICIENT_DIFFERENCE | 0.02430214616 |
| 2024-07-30 09:53:28.439 Z | faker_synthetic_t | sch | syndata_db | "BLOOD_TYPE,AGE" | faker_source_t | sch | syndata_db | "BLOOD_TYPE,AGE" | CORRELATION_COEFFICIENT_DIFFERENCE | 0.001919343586 |
| 2024-07-30 09:53:28.439 Z | faker_synthetic_t | sch | syndata_db | "BLOOD_TYPE,ETHNICITY" | faker_source_t | sch | syndata_db | "BLOOD_TYPE,ETHNICITY" | CORRELATION_COEFFICIENT_DIFFERENCE | 0.003720197046 |
| 2024-07-30 09:53:28.439 Z | faker_synthetic_t | sch | syndata_db | "GENDER,AGE" | faker_source_t | sch | syndata_db | "GENDER,AGE" | CORRELATION_COEFFICIENT_DIFFERENCE | 0.004348586645 |
| 2024-07-30 09:53:28.439 Z | faker_synthetic_t | sch | syndata_db | "GENDER,ETHNICITY" | faker_source_t | sch | syndata_db | "GENDER,ETHNICITY" | CORRELATION_COEFFICIENT_DIFFERENCE | 0.001171535243 |
| 2024-07-30 09:53:28.439 Z | faker_synthetic_t | sch | syndata_db | "AGE,ETHNICITY" | faker_source_t | sch | syndata_db | "AGE,ETHNICITY" | CORRELATION_COEFFICIENT_DIFFERENCE | 0.004265938158 |
+---------------------------+-------------------+--------------+----------------+------------------------+-------------------+---------------------+-----------------------+------------------------+------------------------------------+----------------+