Loading Objects Larger Than 16 MB in Size

In the past, an error occurred when you attempted to load an object larger than 16 MB from a file on a stage into a column. Because the maximum size of an object stored in a column is 16 MB, the following error occurred:

Max LOB size (16777216) exceeded

This error also occurred if you attempted to query a file on a stage that contained objects larger than 16 MB.

Although you still cannot store objects larger than 16 MB in a column, you can now query objects up to 128 MB in files on a stage and reduce their size before storing them in columns. An error no longer occurs when you query a file that contains objects larger than 16 MB but smaller than 128 MB.

For example, you can split large objects across multiple columns or rows, transform nested JSON into a tabular format, or simplify complex geometries.
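Before deciding how to split an object, it can help to inspect its size directly on the stage. As a sketch (the stage name @mystage and the file format name my_parquet_format are hypothetical), you can query the staged file and measure the large field:

```sql
-- Hypothetical stage and file format names.
-- Querying the staged file directly succeeds even when the objects
-- are larger than 16 MB (up to 128 MB).
SELECT
  $1:ID AS id,
  LENGTH($1:CustomerDetails::VARCHAR) AS customer_details_size
FROM @mystage (FILE_FORMAT => 'my_parquet_format');
```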

Example: Loading and Splitting JSON Objects from a Parquet File

Suppose that you are loading a Parquet file from a stage, and the Parquet file contains a JSON object that is larger than 16 MB in size:

{
  "ID": 1,
  "CustomerDetails": {
    "RegistrationDate": 158415500,
    "FirstName": "John",
    "LastName": "Doe",
    "Events": [
      {
        "Type": "LOGIN",
        "Time": 1584158401,
        "EventID": "NZ0000000001"
      },
      ...
      /* this array contains thousands of elements */
      /* with total size exceeding 16 MB */
      ...
      {
        "Type": "LOGOUT",
        "Time": 1584158402,
        "EventID": "NZ0000000002"
      }
    ]
  }
}

The following example creates a table to store the data from the file and loads the data into the table. Because the size of the array of events can exceed 16 MB, the example expands the array of events into separate rows (one for each array element).

CREATE OR REPLACE TABLE mytable AS
  SELECT
    t1.$1:ID AS id,
    t1.$1:CustomerDetails:RegistrationDate::VARCHAR AS registration_date,
    t1.$1:CustomerDetails:FirstName::VARCHAR AS first_name,
    t1.$1:CustomerDetails:LastName::VARCHAR AS last_name,
    t2.value AS event
  FROM @json t1,
    TABLE(FLATTEN(input => $1:CustomerDetails:Events)) t2;

The following is an example of the contents of the resulting table.

+----+-------------------+------------+------------+------------------------------+
| ID | REGISTRATION_DATE | FIRST_NAME | LAST_NAME  | EVENT                        |
|----+-------------------+------------+------------+------------------------------|
| 1  | 158415500         | John       | Doe        | {                            |
|    |                   |            |            |   "EventID": "NZ0000000001", |
|    |                   |            |            |   "Time": 1584158401,        |
|    |                   |            |            |   "Type": "LOGIN"            |
|    |                   |            |            | }                            |
|    |                ... thousands of rows ...    |                              |
| 1  | 158415500         | John       | Doe        | {                            |
|    |                   |            |            |   "EventID": "NZ0000000002", |
|    |                   |            |            |   "Time": 1584158402,        |
|    |                   |            |            |   "Type": "LOGOUT"           |
|    |                   |            |            | }                            |
+----+-------------------+------------+------------+------------------------------+
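After the split, you can confirm that no stored value approaches the 16 MB column limit. The following check is a sketch against the mytable created above:

```sql
-- Largest serialized event stored in the table, in bytes;
-- each value must stay under the 16 MB (16777216-byte) column limit.
SELECT MAX(OCTET_LENGTH(event::VARCHAR)) AS max_event_size
FROM mytable;
```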

Inserting FLATTEN Results to an Existing Table

To insert the result of FLATTEN into an existing table, use an INSERT statement. For example:

CREATE OR REPLACE TABLE mytable (
  id VARCHAR,
  registration_date VARCHAR,
  first_name VARCHAR,
  last_name VARCHAR,
  event VARCHAR
);

INSERT INTO mytable
  SELECT
    t1.$1:ID,
    t1.$1:CustomerDetails:RegistrationDate::VARCHAR,
    t1.$1:CustomerDetails:FirstName::VARCHAR,
    t1.$1:CustomerDetails:LastName::VARCHAR,
    t2.value
  FROM @json t1,
    TABLE(FLATTEN(input => $1:CustomerDetails:Events)) t2;

Example: Loading and Splitting XML

Suppose that you are loading an XML file from a stage, and the file contains an XML object that is larger than 16 MB in size:

<?xml version='1.0' encoding='UTF-8'?>
<osm version="0.6" generator="osmium/1.14.0">
  <node id="197798" version="17" timestamp="2021-09-06T17:01:27Z" />
  <node id="197824" version="7" timestamp="2021-08-04T23:17:18Z" >
    <tag k="highway" v="traffic_signals"/>
  </node>
  <!--  thousands of node elements with total size exceeding 16MB -->
  <node id="197826" version="4" timestamp="2021-08-04T16:43:28Z" />
</osm>

The following example creates a table to store the data from the file and loads the data into the table. Because the size of the XML can exceed 16 MB, the example expands the node elements into separate rows (one for each element).

CREATE OR REPLACE TABLE mytable AS
  SELECT
    value:"@id" AS id,
    value:"@version" AS version,
    value:"@timestamp"::datetime AS timestamp,
    value:"$" AS tags
  FROM @mystage,
  LATERAL FLATTEN(input => $1:"$")
  WHERE value:"@" = 'node';

The following is an example of the contents of the resulting table.

+--------+---------+-------------------------+---------------------------------------------+
| ID     | VERSION | TIMESTAMP               | TAGS                                        |
|--------+---------+-------------------------+---------------------------------------------|
| 197798 | 17      | 2021-09-06 17:01:27.000 | ""                                          |
| 197824 | 7       | 2021-08-04 23:17:18.000 | <tag k="highway" v="traffic_signals"></tag> |
|        |          ... thousands of rows ...                                              |
| 197826 | 4       | 2021-08-04 16:43:28.000 | ""                                          |
+--------+---------+-------------------------+---------------------------------------------+

Example: Loading and Simplifying Large Geospatial Objects Before Storing Them

Suppose that you are loading a Parquet file from a stage, and the Parquet file contains a geospatial object that is larger than 16 MB in size. You can load the file from the stage and simplify the geospatial object (by using ST_SIMPLIFY) before you store the object:

CREATE OR REPLACE TABLE mytable AS
  SELECT
    ST_SIMPLIFY($1:geo, 10) AS geo
  FROM @mystage;
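If the simplified geometry is still too large, you can increase the tolerance passed to ST_SIMPLIFY. As a rough sketch (the stage name @mystage and the geo field are the same assumptions as above), you can compare the point counts before and after simplification to gauge the effect of a given tolerance:

```sql
-- Compare vertex counts before and after simplification with a
-- tolerance of 10 meters; a larger tolerance removes more points.
SELECT
  ST_NPOINTS(TO_GEOGRAPHY($1:geo)) AS original_points,
  ST_NPOINTS(ST_SIMPLIFY(TO_GEOGRAPHY($1:geo), 10)) AS simplified_points
FROM @mystage;
```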

Example: Using COPY INTO <table>

If you need to use COPY INTO <table> to load data from a file on a stage, you cannot use FLATTEN to split up large objects. Instead, use a SELECT statement in the COPY INTO command to extract the individual elements, each of which is smaller than 16 MB. For example:

CREATE OR REPLACE TABLE mytable (
  id VARCHAR,
  registration_date VARCHAR,
  first_name VARCHAR,
  last_name VARCHAR
);

COPY INTO mytable (
  id,
  registration_date,
  first_name,
  last_name
) FROM (
  SELECT
      $1:ID,
      $1:CustomerDetails::OBJECT:RegistrationDate::VARCHAR,
      $1:CustomerDetails::OBJECT:FirstName::VARCHAR,
      $1:CustomerDetails::OBJECT:LastName::VARCHAR
  FROM @mystage
);