Data unloading considerations¶
This topic provides best practices, general guidelines, and important considerations for unloading data from a table. It is intended to help simplify exporting data from Snowflake tables into files in stages using the COPY INTO <location> command.
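At a high level, unloading is a two-step flow: a COPY INTO <location> statement writes data files to a stage, and a GET command (run from a client such as SnowSQL) downloads them. The following is a minimal sketch of that flow; the stage, table, and local directory names are hypothetical:

-- Unload table rows into files in a named stage
copy into @mystage
  from mytable
  file_format = (type = csv);

-- Download the staged files to a local directory (run from SnowSQL)
get @mystage file:///tmp/unload/;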
Empty strings and NULL values¶
An empty string is a string with zero length or no characters, whereas NULL values represent an absence of data. In CSV files, a NULL value is typically represented by two successive delimiters (e.g. ,,) to indicate that the field contains no data; however, you can use string values to denote NULL (e.g. null) or any other unique string. An empty string is typically represented by a quoted empty string (e.g. '') to indicate that the string contains zero characters.
The following file format options enable you to differentiate between empty strings and NULL values when unloading or loading data. For more information about these file format options, see CREATE FILE FORMAT:
FIELD_OPTIONALLY_ENCLOSED_BY = 'character' | NONE

Use this option to enclose strings in the specified character: single quote ('), double quote ("), or NONE.

Enclosing string values in quotes while unloading data is not required. The COPY INTO <location> command can unload empty string values without enclosing quotes when the EMPTY_FIELD_AS_NULL option is set to FALSE. If the EMPTY_FIELD_AS_NULL option were TRUE (a combination that is prohibited when fields are unenclosed), empty strings and NULL values would be indistinguishable in the output file.

When a field contains this character, escape it using the same character. For example, if the enclosing character is the double quote and a field contains the string "A", escape the double quotes as follows: ""A"".

Default: NONE
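For example, the following sketch (the file format name quoted_csv_format is a hypothetical choice; @mystage is a named stage) encloses fields in double quotes, so an embedded double quote in the data is escaped by doubling it in the output:

-- Enclose string fields in double quotes when unloading
create or replace file format quoted_csv_format
  type = csv field_optionally_enclosed_by = '"';

copy into @mystage
  from (select 'contains "A" in the middle' as v)
  file_format = (format_name = 'quoted_csv_format');

-- The unloaded field reads: "contains ""A"" in the middle"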
EMPTY_FIELD_AS_NULL = TRUE | FALSE

When unloading empty string data from tables, choose one of the following options:

Preferred: Enclose strings in quotes by setting the FIELD_OPTIONALLY_ENCLOSED_BY option, to distinguish empty strings from NULLs in output CSV files.

Alternatively, leave string fields unenclosed by setting the FIELD_OPTIONALLY_ENCLOSED_BY option to NONE (the default), and set the EMPTY_FIELD_AS_NULL option to FALSE to unload empty strings as empty fields.

Important

If you choose this option, make sure to specify a replacement string for NULL data using the NULL_IF option, to distinguish NULL values from empty strings in the output file. If you later choose to load data from the output files, specify the same NULL_IF value to identify the NULL values in the data files.

When loading data into tables, use this option to specify whether to insert SQL NULL for empty fields in an input file. If set to FALSE, Snowflake attempts to cast an empty field to the corresponding column type. An empty string is inserted into columns of data type STRING. For other column types, the COPY command produces an error.

Default: TRUE
NULL_IF = ( 'string1' [ , 'string2' ... ] )

When unloading data from tables, Snowflake converts SQL NULL values to the first value in the list. Be careful to specify a value that you want interpreted as NULL. For example, if you are unloading data to a file that will be read by another system, make sure to specify a value that that system interprets as NULL.

Default: \\N (i.e. NULL, which assumes the ESCAPE_UNENCLOSED_FIELD value is \\ (the default))
Example: Unloading and loading data with enclosing quotes¶
In the following example, a set of data is unloaded from the null_empty1 table to the user's stage. The output data file is then used to load data into the null_empty2 table:
-- Source table (null_empty1) contents
+---+------+--------------+
| i | V | D |
|---+------+--------------|
| 1 | NULL | NULL value |
| 2 | | Empty string |
+---+------+--------------+
-- Create a file format that describes the data and the guidelines for processing it
create or replace file format my_csv_format
field_optionally_enclosed_by='0x27' null_if=('null');
-- Unload table data into a stage
copy into @mystage
from null_empty1
file_format = (format_name = 'my_csv_format');
-- Output the data file contents
1,'null','NULL value'
2,'','Empty string'
-- Load data from the staged file into the target table (null_empty2)
copy into null_empty2
from @mystage/data_0_0_0.csv.gz
file_format = (format_name = 'my_csv_format');
select * from null_empty2;
+---+------+--------------+
| i | V | D |
|---+------+--------------|
| 1 | NULL | NULL value |
| 2 | | Empty string |
+---+------+--------------+
Example: Unloading and loading data without enclosing quotes¶
In the following example, a set of data is unloaded from the null_empty1 table to the user's stage. The output data file is then used to load data into the null_empty2 table:
-- Source table (null_empty1) contents
+---+------+--------------+
| i | V | D |
|---+------+--------------|
| 1 | NULL | NULL value |
| 2 | | Empty string |
+---+------+--------------+
-- Create a file format that describes the data and the guidelines for processing it
create or replace file format my_csv_format
empty_field_as_null=false null_if=('null');
-- Unload table data into a stage
copy into @mystage
from null_empty1
file_format = (format_name = 'my_csv_format');
-- Output the data file contents
1,null,NULL value
2,,Empty string
-- Load data from the staged file into the target table (null_empty2)
copy into null_empty2
from @mystage/data_0_0_0.csv.gz
file_format = (format_name = 'my_csv_format');
select * from null_empty2;
+---+------+--------------+
| i | V | D |
|---+------+--------------|
| 1 | NULL | NULL value |
| 2 | | Empty string |
+---+------+--------------+
Unloading to a single file¶
By default, COPY INTO <location> statements separate table data into a set of output files to take advantage of parallel operations. The maximum size for each file is set using the MAX_FILE_SIZE copy option. The default value is 16777216 (16 MB) but can be increased to accommodate larger files. The maximum file size supported is 5 GB for Amazon S3, Google Cloud Storage, or Microsoft Azure stages.

To unload data to a single output file (at the potential cost of decreased performance), specify the SINGLE = TRUE copy option in your statement. You can optionally specify a name for the file in the path.
Note
If the COMPRESSION option is set to true, specify a filename with the appropriate file extension for the compression method so that the output file can be decompressed. For example, specify the GZ file extension if the GZIP compression method is specified.
For example, unload the mytable table data to a single file named myfile.csv.gz in a named stage. Increase the MAX_FILE_SIZE limit to accommodate the large data set:
copy into @mystage/myfile.csv.gz from mytable
file_format = (type=csv compression='gzip')
single=true
max_file_size=4900000000;
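After the statement completes, you can confirm that a single file was written by listing the staged file:

-- Verify the unloaded file and its size
list @mystage/myfile.csv.gz;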
Unloading a relational table to JSON¶
You can use the OBJECT_CONSTRUCT function combined with the COPY command to convert the rows in a relational table to a single VARIANT column and unload the rows into a file.
For example:
-- Create a table
CREATE OR REPLACE TABLE mytable (
id number(8) NOT NULL,
first_name varchar(255) default NULL,
last_name varchar(255) default NULL,
city varchar(255),
state varchar(255)
);
-- Populate the table with data
INSERT INTO mytable (id,first_name,last_name,city,state)
VALUES
(1,'Ryan','Dalton','Salt Lake City','UT'),
(2,'Upton','Conway','Birmingham','AL'),
(3,'Kibo','Horton','Columbus','GA');
-- Unload the data to a file in a stage
COPY INTO @mystage
FROM (SELECT OBJECT_CONSTRUCT('id', id, 'first_name', first_name, 'last_name', last_name, 'city', city, 'state', state) FROM mytable)
FILE_FORMAT = (TYPE = JSON);
-- The COPY INTO location statement creates a file named data_0_0_0.json.gz in the stage.
-- The file contains the following data:
{"city":"Salt Lake City","first_name":"Ryan","id":1,"last_name":"Dalton","state":"UT"}
{"city":"Birmingham","first_name":"Upton","id":2,"last_name":"Conway","state":"AL"}
{"city":"Columbus","first_name":"Kibo","id":3,"last_name":"Horton","state":"GA"}
Unloading a relational table to Parquet with multiple columns¶
You can unload data in a relational table to a multi-column Parquet file by using a SELECT statement as input to the COPY statement. The SELECT statement specifies the column data in the relational table to include in the unloaded file. Use the HEADER = TRUE copy option to include the column headers in the output files.
For example, unload the rows from three columns (id, name, start_date) in the mytable table into one or more files that have the naming format myfile.parquet:
COPY INTO @mystage/myfile.parquet FROM (SELECT id, name, start_date FROM mytable)
FILE_FORMAT=(TYPE='parquet')
HEADER = TRUE;
Explicitly converting numeric columns to Parquet data types¶
By default, when table data is unloaded to Parquet files, fixed-point number columns are unloaded as DECIMAL columns, while floating-point number columns are unloaded as DOUBLE columns.
To choose the Parquet data types for sets of unloaded data, use the CAST function (or the equivalent :: operator) in the COPY INTO <location> statement to convert specific table columns to explicit data types. A query in a COPY INTO <location> statement supports the syntax and semantics of a SELECT statement, so you can select the specific columns to unload and apply conversion functions to transform the column data.
The following table maps Snowflake numeric data types to Parquet physical and logical data types:
Snowflake Logical Data Type | Parquet Physical Data Type | Parquet Logical Data Type
----------------------------|----------------------------|--------------------------
TINYINT                     | INT32                      | INT(8)
SMALLINT                    | INT32                      | INT(16)
INT                         | INT32                      | INT(32)
BIGINT                      | INT64                      | INT(64)
FLOAT                       | FLOAT                      | N/A
DOUBLE                      | DOUBLE                     | N/A
The following example shows a COPY INTO <location> statement that converts the numeric data in each unloaded column to a different data type to explicitly choose the data types in the Parquet files:
COPY INTO @mystage
FROM (SELECT CAST(C1 AS TINYINT),
             CAST(C2 AS SMALLINT),
             CAST(C3 AS INT),
             CAST(C4 AS BIGINT)
      FROM mytable)
FILE_FORMAT = (TYPE = PARQUET);
Floating-point numbers truncated¶
When floating-point number columns are unloaded to CSV or JSON files, Snowflake truncates the values to approximately (15,9); that is, to roughly 15 digits of total precision with at most 9 digits following the decimal point.
The values are not truncated when unloading floating-point number columns to Parquet files.
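The following minimal sketch illustrates the difference; the table name float_demo and the stage paths are hypothetical:

-- Hypothetical table with a single DOUBLE column
CREATE OR REPLACE TABLE float_demo (d DOUBLE);
INSERT INTO float_demo VALUES (1.234567890123456789);

-- Unloaded to CSV, the value is truncated to approximately (15,9)
COPY INTO @mystage/csv/
  FROM float_demo
  FILE_FORMAT = (TYPE = CSV);

-- Unloaded to Parquet, the full DOUBLE value is preserved
COPY INTO @mystage/parquet/
  FROM float_demo
  FILE_FORMAT = (TYPE = PARQUET);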