Preparing to Load Data

This topic provides an overview of supported data file formats and data compression. Depending on your data’s structure, you might need to prepare the data before loading it.

In this Topic:

Supported Data Types

See Data Types for descriptions of the data types supported by Snowflake.

Data File Compression

We recommend that you compress your data files when you are loading large data sets. See CREATE FILE FORMAT for the compression algorithms supported for each data type.

When loading compressed data, specify the compression method for your data files. The COMPRESSION file format option describes how your data files are already compressed in the stage. Set the COMPRESSION option in one of the following ways:

  • As a file format option specified directly in the COPY INTO <table> statement.

  • As a file format option specified for a named file format or stage object. The named file format/stage object can then be referenced in the COPY INTO <table> statement.

Supported File Formats

The following file formats are supported:

Structured/Semi-structured

Type

Notes

Structured

Delimited (CSV, TSV, etc.)

Any valid singlebyte delimiter is supported; default is comma (i.e. CSV).

Semi-structured

JSON

Avro

Includes automatic detection and processing of Snappy-compressed Avro files.

ORC

Includes automatic detection and processing of Snappy- or zlib-compressed ORC files.

Parquet

Includes automatic detection and processing of Snappy-compressed Parquet files.

XML

Supported as a preview feature.

File format options specify the type of data contained in a file, as well as other related characteristics about the format of the data. The file format options you can specify are different depending on the type of data you plan to load. Snowflake provides a full set of file format option defaults.

Semi-structured File Formats

Snowflake natively supports semi-structured data, which means semi-structured data can be loaded into relational tables without requiring the definition of a schema in advance. Snowflake supports loading semi-structured data directly into columns of type VARIANT (see Semi-structured Data Types for more details).

Currently supported semi-structured data formats include JSON, Avro, ORC, Parquet, or XML:

  • For JSON, Avro, ORC, and Parquet data, each top-level, complete object is loaded as a separate row in the table. Each object can contain new line characters and spaces as long as the object is valid.

  • For XML data, each top-level element is loaded as a separate row in the table. An element is identified by a start and close tag of the same name.

Typically, tables used to store semi-structured data consist of a single VARIANT column. Once the data is loaded, you can query the data similar to structured data. You can also perform other tasks, such as extracting values and objects from arrays. For more information, see the FLATTEN table function.

Note

Semi-structured data can be loaded into tables with multiple columns, but the semi-structured data must be stored as a field in a structured file (e.g. CSV file). Then, the data can be loaded into a specified column in the table.

Named File Formats

Snowflake supports creating named file formats, which are database objects that encapsulate all of the required format information. Named file formats can then be used as input in all the same places where you can specify individual file format options, thereby helping to streamline the data loading process for similarly-formatted data.

Named file formats are optional, but are recommended when you plan to regularly load similarly-formatted data.

Creating a Named File Format

You can create a file format using either the web interface or SQL:

Web Interface

Click on Databases Databases tab » <db_name> » File Formats

SQL

CREATE FILE FORMAT

For descriptions of all file format options and the default values, see CREATE FILE FORMAT.

Overriding Default File Format Options

You can define the file format settings for your staged data (i.e. override the default settings) in any of the following locations:

In the table definition

Explicitly set the options using the FILE_FORMAT parameter. For more information, see CREATE TABLE.

In the named stage definition

Explicitly set the options using the FILE_FORMAT parameter. The stage is then referenced in the COPY INTO TABLE statement. For more information, see CREATE STAGE.

Directly in the COPY INTO TABLE statement when loading data

Explicitly set the options separately. For more information, see COPY INTO <table>.

If file format options are specified in multiple locations, the load operation applies the options in the following order of precedence:

  1. COPY INTO TABLE statement.

  2. Stage definition.

  3. Table definition.

Note

File format options set in multiple locations are not cumulative. Any options set in one place override all options (whether the same or different options) set lower in the order of precedence.

File Format Support in Snowflake Features

The following table indicates the support for file format options across the feature set:

TYPE = CSV

File Format Option

Copy Load [1]

Copy Unload

External Table

Query on Stage [2]

COMPRESSION

RECORD_DELIMITER

FIELD_DELIMITER

FILE_EXTENSION

SKIP_HEADER

SKIP_BLANK_LINES

DATE_FORMAT

TIME_FORMAT

TIMESTAMP_FORMAT

BINARY_FORMAT

ESCAPE

ESCAPE_UNENCLOSED_FIELD

TRIM_SPACE

FIELD_OPTIONALLY_ENCLOSED_BY

NULL_IF

ERROR_ON_COLUMN_COUNT_MISMATCH

REPLACE_INVALID_CHARACTERS

VALIDATE_UTF8

EMPTY_FIELD_AS_NULL

SKIP_BYTE_ORDER_MARK

ENCODING

[1] Includes bulk data loading and Snowpipe.

[2] Includes querying data in staged files and transforming data during a load (via either bulk data loading or Snowpipe).

TYPE = JSON

File Format Option

Copy Load [1]

Copy Unload

External Table

Query on Stage [2]

COMPRESSION

FILE_EXTENSION

ENABLE_OCTAL

ALLOW_DUPLICATE

STRIP_OUTER_ARRAY

STRIP_NULL_VALUES

IGNORE_UTF8_ERRORS

SKIP_BYTE_ORDER_MARK

NULL_IF

TRIM_SPACE

DATE_FORMAT

TIME_FORMAT

TIMESTAMP_FORMAT

BINARY_FORMAT

[1] Includes bulk data loading and Snowpipe.

[2] Includes querying data in staged files and transforming data during a load (via either bulk data loading or Snowpipe).

TYPE = AVRO

File Format Option

Copy Load [1]

Copy Unload

External Table

Query on Stage [2]

COMPRESSION

NULL_IF

TRIM_SPACE

[1] Includes bulk data loading and Snowpipe.

[2] Includes querying data in staged files and transforming data during a load (via either bulk data loading or Snowpipe).

TYPE = ORC

File Format Option

Copy Load [1]

Copy Unload

External Table

Query on Stage [2]

NULL_IF

TRIM_SPACE

[1] Includes bulk data loading and Snowpipe.

[2] Includes querying data in staged files and transforming data during a load (via either bulk data loading or Snowpipe).

TYPE = PARQUET

File Format Option

Copy Load [1]

Copy Unload

External Table

Query on Stage [2]

COMPRESSION

SNAPPY_COMPRESSION

BINARY_AS_TEXT

NULL_IF

TRIM_SPACE

[1] Includes bulk data loading and Snowpipe.

[2] Includes querying data in staged files and transforming data during a load (via either bulk data loading or Snowpipe).

TYPE = XML

File Format Option

Copy Load [1]

Copy Unload

External Table

Query on Stage [2]

COMPRESSION

IGNORE_UTF8_ERRORS

???

PRESERVE_SPACE

???

STRIP_OUTER_ELEMENT

???

DISABLE_SNOWFLAKE_DATA

???

DISABLE_AUTO_CONVERT

???

SKIP_BYTE_ORDER_MARK

???

NULL_IF

???

TRIM_SPACE

???

[1] Includes bulk data loading and Snowpipe.

[2] Includes querying data in staged files and transforming data during a load (via either bulk data loading or Snowpipe).

Supported Copy Options

Copy options determine the behavior of a data load with regard to error handling, maximum data size, and so on.

For descriptions of all copy options and the default values, see COPY INTO <table>.

Overriding Default Copy Options

You can specify the desired load behavior (i.e. override the default settings) in any of the following locations:

In the table definition

Explicitly set the options using the STAGE_COPY_OPTIONS parameter. For more information, see CREATE TABLE.

In the named stage definition

Explicitly set the options using the COPY_OPTIONS parameter. The stage is then referenced in the COPY INTO TABLE statement. For more information, see CREATE STAGE.

Directly in the COPY INTO TABLE statement when loading data

Explicitly set the options separately. For more information, see COPY INTO <table>.

If copy options are specified in multiple locations, the load operation applies the options in the following order of precedence:

  1. COPY INTO TABLE statement.

  2. Stage definition.

  3. Table definition.

Note

Copy options set in multiple locations are cumulative. Individual options set in one place override the same option set lower in the order of precedence.