Preparing Your Data Files¶
This topic provides best practices, general guidelines, and important considerations for preparing your data files for loading.
In this Topic:
File Sizing Best Practices and Limitations¶
For best load performance and to avoid size limitations, consider the following data file sizing guidelines. Note that these recommendations apply to bulk data loads as well as continuous loading using Snowpipe.
General File Sizing Recommendations¶
The number of load operations that run in parallel cannot exceed the number of data files to be loaded. To optimize the number of parallel operations for a load, we recommend aiming to produce data files roughly 100-250 MB (or larger) in size compressed.
Loading very large files (e.g. 100 GB or larger) is not recommended.
Aggregate smaller files to minimize the processing overhead for each file. Split larger files into a greater number of smaller files to distribute the load among the servers in an active warehouse. The number of data files that are processed in parallel is determined by the number and capacity of servers in a warehouse. We recommend splitting large files by line to avoid records that span chunks.
If your source database does not allow you to export data files in smaller chunks, you can use a third-party utility to split large CSV files.
Linux or macOS¶
split utility enables you to split a CSV file into multiple smaller files.
split [-a suffix_length] [-b byte_count[k|m]] [-l line_count] [-p pattern] [file [name]]
For more information, type
man split in a terminal window.
split -l 100000 pagecounts-20151201.csv pages
This example splits a file named
pagecounts-20151201.csv by line length. Suppose the large single file is 8 GB in size and contains 10 million lines. Split by 100,000, each of the 100 smaller files is 80 MB in size (10 million / 100,000 = 100). The split files are named
Windows does not include a native file split utility; however, Windows supports many third-party tools and scripts that can split large data files.
Semi-structured Data Size Limitations¶
The VARIANT data type imposes a 16 MB (compressed) size limit on individual rows.
In general, JSON data sets are a simple concatenation of multiple documents. The JSON output from some software is composed of a single huge array containing multiple records. There is no need to separate the documents with line breaks or commas, though both are supported.
Instead, we recommend enabling the STRIP_OUTER_ARRAY file format option for the COPY INTO <table> command to remove the outer array structure and load the records into separate table rows:
COPY INTO <table> FROM @~/<file>.json FILE_FORMAT = (TYPE = 'JSON' STRIP_OUTER_ARRAY = true);
Parquet Data Size Limitations¶
Currently, data loads of large Parquet files (e.g. greater than 3 GB) could time out. Split large files into files 1 GB in size (or smaller) for loading.
Continuous Data Loads (i.e. Snowpipe) and File Sizing¶
Snowpipe is designed to load new data typically within a minute after a file notification is sent; however, loading can take significantly longer for really large files or in cases where an unusual amount of compute resources is necessary to decompress, decrypt, and transform the new data.
In addition to resource consumption, an overhead to manage files in the internal load queue is included in the utilization costs charged for Snowpipe. This overhead increases in relation to the number of files queued for loading. Snowpipe charges 0.06 credits per 1000 files queued.
For the most efficient and cost-effective load experience with Snowpipe, we recommend following the file sizing recommendations in File Sizing Best Practices and Limitations (in this topic). Loading data files roughly 100-250 MB in size or larger reduces the overhead charge relative to the amount of total data loaded to the point where the overhead cost is immaterial.
If it takes longer than one minute to accumulate MBs of data in your source application, consider creating a new (potentially smaller) data file once per minute. This approach typically leads to a good balance between cost (i.e. resources spent on Snowpipe queue management and the actual load) and performance (i.e. load latency).
Creating smaller data files and staging them in cloud storage more often than once per minute has the following disadvantages:
A reduction in latency between staging and loading the data cannot be guaranteed.
An overhead to manage files in the internal load queue is included in the utilization costs charged for Snowpipe. This overhead increases in relation to the number of files queued for loading.
Various tools can aggregate and batch data files. One convenient option is Amazon Kinesis Firehose. Firehose allows defining both the desired file size, called the buffer size, and the wait interval after which a new file is sent (to cloud storage in this case), called the buffer interval. For more information, see the Kinesis Firehose documentation
If your source application typically accumulates enough data within a minute to populate files larger than the recommended maximum for optimal parallel processing, you could increase the buffer size. Keeping the buffer interval setting at 60 seconds (the minimum value) helps avoid creating too many files or increasing latency.
Preparing Delimited Text Files¶
Consider the following guidelines when preparing your delimited text (CSV) files for loading:
UTF-8 is the default character set, however, additional encodings are supported. Use the ENCODING file format option to specify the character set for the data files. For more information, see CREATE FILE FORMAT.
Fields that contain delimiter characters should be enclosed in quotes (single or double). If the data contains single or double quotes, then those quotes must be escaped.
Carriage returns are commonly introduced on Windows systems in conjunction with a line feed character to mark the end of a line (
\r \n). Fields that contain carriage returns should also be enclosed in quotes (single or double).
The number of columns in each row should be consistent.
Semi-structured Data Files and Columnarization¶
When semi-structured data is inserted into a VARIANT column, Snowflake extracts as much of the data as possible to a columnar form, based on certain rules. The rest is stored as a single column in a parsed semi-structured structure. Currently, elements that have the following characteristics are not extracted into a column:
Elements that contain even a single “null” value are not extracted into a column. Note that this applies to elements with “null” values and not to elements with missing values, which are represented in columnar form.
This rule ensures that information is not lost, i.e, the difference between VARIANT “null” values and SQL NULL values is not obfuscated.
Elements that contain multiple data types. For example:
fooelement in one row contains a number:
The same element in another row contains a string:
When a semi-structured element is queried:
If the element was extracted into a column, Snowflake’s execution engine (which is columnar) scans only the extracted column.
If the element was not extracted into a column, the execution engine must scan the entire JSON structure, and then for each row traverse the structure to output values, impacting performance.
To avoid this performance impact:
Extract semi-structured data elements containing “null” values into relational columns before loading them.
Alternatively, if the “null” values in your files indicate missing values and have no other special meaning, we recommend setting the file format option STRIP_NULL_VALUES to TRUE when loading the semi-structured data files. This option removes object elements or array elements containing “null” values.
Ensure each unique element stores values of a single native data type (string or number).
Numeric Data Guidelines¶
Avoid embedded characters, such as commas (e.g.,
If a number includes a fractional component, it should be separated from the whole number portion by a decimal point (e.g.,
Oracle only. The Oracle NUMBER or NUMERIC types allow for arbitrary scale, meaning they accept values with decimal components even if the data type was not defined with a precision or scale. Whereas in Snowflake, columns designed for values with decimal components must be defined with a scale to preserve the decimal portion.
Date and Timestamp Data Guidelines¶
Date, time, and timestamp data should be formatted based on the following components:
Two-digit year, controlled by the TWO_DIGIT_CENTURY_START session parameter, e.g. when set to
1980, values of
Two-digit month (01=January, etc.).
Full or abbreviated month name.
Two-digit day of month (01 through 31).
Abbreviated day of week.
Two digits for hour (00 through 23); am/pm not allowed.
Two digits for hour (01 through 12); am/pm allowed.
Ante meridiem (am) / post meridiem (pm); for use with
Two digits for minute (00 through 59).
Two digits for second (00 through 59).
Fractional seconds with precision 0 (seconds) to 9 (nanoseconds), e.g.
FFis equivalent to
Time zone hour and minute, offset from UTC. Can be prefixed by
Oracle only. The Oracle DATE data type can contain date or timestamp information. If your Oracle database includes DATE columns that also store time-related information, map these columns to a TIMESTAMP data type in Snowflake rather than DATE.
Snowflake checks temporal data values at load time. Invalid date, time, and timestamp values (e.g.,
0000-00-00) produce an error.