Overview of Data Loading¶
This topic provides an overview of the main options available to load data into Snowflake.
In this Topic:
Supported File Locations¶
Snowflake refers to the location of data files in cloud storage as a stage. The COPY INTO <table> command used for both bulk and continuous data loads (i.e. Snowpipe) supports cloud storage accounts managed by your business entity (i.e. external stages) as well as cloud storage contained in your Snowflake account (i.e. internal stages).
Loading data from any of the following cloud storage services is supported regardless of the cloud platform that hosts your Snowflake account:
Google Cloud Storage
You cannot access data held in archival cloud storage classes that requires restoration before it can be retrieved. These archival storage classes include, for example, the Amazon S3 Glacier Flexible Retrieval or Glacier Deep Archive storage class, or Microsoft Azure Archive Storage.
Upload (i.e. stage) files to your cloud storage account using the tools provided by the cloud storage service.
A named external stage is a database object created in a schema. This object stores the URL to files in cloud storage, the settings used to access the cloud storage account, and convenience settings such as the options that describe the format of staged files. Create stages using the CREATE STAGE command.
Some data transfer billing charges may apply when loading data from files in a cloud storage service in a different region or cloud platform from your Snowflake account. For more information, see Understanding Data Transfer Billing.
Snowflake maintains the following stage types in your account:
A user stage is allocated to each user for storing files. This stage type is designed to store files that are staged and managed by a single user but can be loaded into multiple tables. User stages cannot be altered or dropped.
A table stage is available for each table created in Snowflake. This stage type is designed to store files that are staged and managed by one or more users but only loaded into a single table. Table stages cannot be altered or dropped.
Note that a table stage is not a separate database object; rather, it is an implicit stage tied to the table itself. A table stage has no grantable privileges of its own. To stage files to a table stage, list the files, query them on the stage, or drop them, you must be the table owner (have the role with the OWNERSHIP privilege on the table).
A named internal stage is a database object created in a schema. This stage type can store files that are staged and managed by one or more users and loaded into one or more tables. Because named stages are database objects, the ability to create, modify, use, or drop them can be controlled using security access control privileges. Create stages using the CREATE STAGE command.
Upload files to any of the internal stage types from your local file system using the PUT command.
Bulk vs Continuous Loading¶
Snowflake provides the following main solutions for data loading. The best solution may depend upon the volume of data to load and the frequency of loading.
Bulk Loading Using the COPY Command¶
This option enables loading batches of data from files already available in cloud storage, or copying (i.e. staging) data files from a local machine to an internal (i.e. Snowflake) cloud storage location before loading the data into tables using the COPY command.
Bulk loading relies on user-provided virtual warehouses, which are specified in the COPY statement. Users are required to size the warehouse appropriately to accommodate expected loads.
Simple Transformations During a Load¶
Snowflake supports transforming data while loading it into a table using the COPY command. Options include:
Truncating text strings that exceed the target column length
There is no requirement for your data files to have the same number and ordering of columns as your target table.
Continuous Loading Using Snowpipe¶
This option is designed to load small volumes of data (i.e. micro-batches) and incrementally make them available for analysis. Snowpipe loads data within minutes after files are added to a stage and submitted for ingestion. This ensures users have the latest results, as soon as the raw data is available.
Snowpipe uses compute resources provided by Snowflake (i.e. a serverless compute model). These Snowflake-provided resources are automatically resized and scaled up or down as required, and are charged and itemized using per-second billing. Data ingestion is charged based upon the actual workloads.
Simple Transformations During a Load¶
The COPY statement in a pipe definition supports the same COPY transformation options as when bulk loading data.
In addition, data pipelines can leverage Snowpipe to continously load micro-batches of data into staging tables for transformation and optimization using automated tasks and the change data capture (CDC) information in streams.
Data Pipelines for Complex Transformations¶
A data pipeline enables applying complex transformations to loaded data. This workflow generally leverages Snowpipe to load “raw” data into a staging table and then uses a series of table streams and tasks to transform and optimize the new data for analysis.
Loading Data from Apache Kafka Topics¶
Detection of Column Definitions in Staged Semi-structured Data Files¶
Semi-structured data can include thousands of columns. Snowflake provides robust solutions for handling this data. Options include referencing the data directly in cloud storage using external tables, loading the data into a single column of type VARIANT, or transforming and loading the data into separate columns in a standard relational table. All of these options require some knowledge of the column definitions in the data.
A different solution involves automatically detecting the schema in a set of staged semi-structured data files and retrieving the column definitions. The column definitions include the names, data types, and ordering of columns in the files. Generate syntax in a format suitable for creating Snowflake standard tables, external tables, or views.
This feature is currently limited to Apache Parquet, Apache Avro, and ORC files.
This support is implemented through the following SQL functions:
Detects the column definitions in a set of staged data files and retrieves the metadata in a format suitable for creating Snowflake objects.
Generates a list of columns from a set of staged files using the INFER_SCHEMA function output.
These SQL functions support both internal and external stages.
Create tables with the column definitions derived from a set of staged files using the
CREATE TABLE … USING TEMPLATE syntax. The USING TEMPLATE clause accepts an expression that
calls the INFER_SCHEMA SQL function to detect the column definitions in the files. Once the table is created, you may then use a COPY statement with the
MATCH_BY_COLUMN_NAME option to load files directly into the structured table.
Alternatives to Loading Data¶
It is not always necessary to load data into Snowflake before executing queries.
External Tables (Data Lake)¶
External tables enable querying existing data stored in external cloud storage for analysis without first loading it into Snowflake. The source of truth for the data remains in the external cloud storage. Data sets materialized in Snowflake via materialized views are read-only.
This solution is especially beneficial to accounts that have a large amount of data stored in external cloud storage and only want to query a portion of the data; for example, the most recent data. Users can create materialized views on subsets of this data for improved query performance.