Tutorial: Bulk loading from Amazon S3 using COPY

Introduction

This tutorial describes how to load data from files in an existing Amazon Simple Storage Service (Amazon S3) bucket into a table. In this tutorial, you will learn how to:

  • Create named file formats that describe your data files.

  • Create named stage objects.

  • Load data located in your S3 bucket into Snowflake tables.

  • Resolve errors in your data files.

The tutorial covers loading of both CSV and JSON data.

Prerequisites

The tutorial assumes the following:

  • You have a Snowflake account that is configured to use Amazon Web Services (AWS) and a user with a role that grants the necessary privileges to create a database, tables, and virtual warehouse objects.

  • You have SnowSQL installed.

Refer to the Snowflake in 20 minutes for instructions to meet these requirements.

Snowflake provides sample data files in a public Amazon S3 bucket for use in this tutorial. But before you start, you need to create a database, tables, and a virtual warehouse for this tutorial. These are the basic Snowflake objects needed for most Snowflake activities.

About the sample data files

Snowflake provides sample data files staged in a public S3 bucket.

Note

In regular use, you would stage your own data files using the AWS Management Console, AWS Command Line Interface, or an equivalent client application. See the Amazon Web Services documentation for instructions.

The sample data files include sample contact information in the following formats:

  • CSV files that contain a header row and five records. The field delimiter is the pipe (|) character. The following example shows a header row and one record:

    ID|lastname|firstname|company|email|workphone|cellphone|streetaddress|city|postalcode
    6|Reed|Moses|Neque Corporation|eget.lacus@facilisis.com|1-449-871-0780|1-454-964-5318|Ap #225-4351 Dolor Ave|Titagarh|62631
    
    Copy
  • A single file in JSON format that contains one array and three objects. The following is an example of an array that contains one of the objects:

    [
     {
       "customer": {
         "address": "509 Kings Hwy, Comptche, Missouri, 4848",
         "phone": "+1 (999) 407-2274",
         "email": "blankenship.patrick@orbin.ca",
         "company": "ORBIN",
         "name": {
           "last": "Patrick",
           "first": "Blankenship"
         },
         "_id": "5730864df388f1d653e37e6f"
       }
     },
    ]
    
    Copy

Creating the database, tables, and warehouse

Execute the following statements to create a database, two tables (for csv and json data), and a virtual warehouse needed for this tutorial. After you complete the tutorial, you can drop these objects.

CREATE OR REPLACE DATABASE mydatabase;

CREATE OR REPLACE TEMPORARY TABLE mycsvtable (
     id INTEGER,
     last_name STRING,
     first_name STRING,
     company STRING,
     email STRING,
     workphone STRING,
     cellphone STRING,
     streetaddress STRING,
     city STRING,
     postalcode STRING);

CREATE OR REPLACE TEMPORARY TABLE myjsontable (
     json_data VARIANT);

CREATE OR REPLACE WAREHOUSE mywarehouse WITH
     WAREHOUSE_SIZE='X-SMALL'
     AUTO_SUSPEND = 120
     AUTO_RESUME = TRUE
     INITIALLY_SUSPENDED=TRUE;
Copy

Note the following:

  • The CREATE DATABASE statement creates a database. The database automatically includes a schema named ‘public’.

  • The CREATE TABLE statements create target tables for CSV and JSON data. The tables are temporary, that is, they persist only for the duration of the user session and are not visible to other users.

  • The CREATE WAREHOUSE statement creates an initially suspended warehouse. The statement also sets AUTO_RESUME = true, which starts the warehouse automatically when you execute SQL statements that require compute resources.

Create file format objects

When you load data files from an S3 bucket into a table, you must describe the format of the file and specify how the data in the file should be interpreted and processed. For example, if you are loading pipe-delimited data from a CSV file, you must specify that the file uses the CSV format with pipe symbols as delimiters.

When you execute the COPY INTO <table> command, you specify this format information. You can either specify this information as options in the command (e.g. TYPE = CSV, FIELD_DELIMITER = '|', etc.) or you can specify a file format object that contains this format information. You can create a named file format object using the CREATE FILE FORMAT command.

In this step, you create file format objects describing the data format of the sample CSV and JSON data provided for this tutorial.

Creating a file format object for CSV data

Execute the CREATE FILE FORMAT command to create the mycsvformat file format.

CREATE OR REPLACE FILE FORMAT mycsvformat
   TYPE = 'CSV'
   FIELD_DELIMITER = '|'
   SKIP_HEADER = 1;
Copy

Where:

  • TYPE = 'CSV' indicates the source file format type. CSV is the default file format type.

  • FIELD_DELIMITER = '|' indicates the ‘|’ character is a field separator. The default value is ‘,’.

  • SKIP_HEADER = 1 indicates the source file includes one header line. The COPY command skips these header lines when loading data. The default value is 0.

Creating a file format object for JSON data

Execute the CREATE FILE FORMAT command to create the myjsonformat file format.

CREATE OR REPLACE FILE FORMAT myjsonformat
  TYPE = 'JSON'
  STRIP_OUTER_ARRAY = TRUE;
Copy

Where:

  • TYPE = 'JSON' indicates the source file format type.

  • STRIP_OUTER_ARRAY = TRUE directs the COPY command to exclude the root brackets ([]) when loading data to the table.

Create stage objects

A stage specifies where data files are stored (i.e. “staged”) so that the data in the files can be loaded into a table. A named external stage is a cloud storage location managed by Snowflake. An external stage references data files stored in a S3 bucket. In this case, we are creating a stage that references the sample data files necessary to complete the tutorial.

Creating a named external stage is useful if you want multiple users or processes to upload files. If you plan to stage data files to load only by you, or to load only into a single table, then you may prefer to use your user stage or the table stage. For information, see Bulk loading from Amazon S3.

In this step, you create named stages for the different types of sample data files.

Creating a stage for CSV data files

Execute CREATE STAGE to create the my_csv_stage stage:

CREATE OR REPLACE STAGE my_csv_stage
  FILE_FORMAT = mycsvformat
  URL = 's3://snowflake-docs';
Copy

Creating a stage for JSON data files

Execute CREATE STAGE to create the my_json_stage stage:

CREATE OR REPLACE STAGE my_json_stage
  FILE_FORMAT = myjsonformat
  URL = 's3://snowflake-docs';
Copy

Note

In regular use, if you were creating a stage that pointed to your private data files, you would reference a storage integration created using CREATE STORAGE INTEGRATION by an account administrator (i.e. a user with the ACCOUNTADMIN role) or a role with the global CREATE INTEGRATION privilege:

CREATE OR REPLACE STAGE external_stage
  FILE_FORMAT = mycsvformat
  URL = 's3://private-bucket'
  STORAGE_INTEGRATION = myint;
Copy

Copy data into the target table

Execute COPY INTO <table> to load staged data into the target tables.

CSV

To load the data from the sample CSV files:

  1. Start by loading the data from one of the files in the /tutorials/dataloading/ prefix (folder) named contacts1.csv in the mycsvtable table. Execute the following:

    COPY INTO mycsvtable
      FROM @my_csv_stage/tutorials/dataloading/contacts1.csv
      ON_ERROR = 'skip_file';
    
    Copy

    Where:

    • The FROM clause specifies the location of the staged data file (stage name followed by the file name).

    • The ON_ERROR = 'skip_file' clause specifies what to do when the COPY command encounters errors in the files. In this case, when the command encounters a data error on any of the records in a file, it skips the file. If you do not specify an ON_ERROR clause, the default is abort_statement, which aborts the COPY command on the first error encountered on any of the records in a file.

    The COPY command returns a result showing the name of the file copied and related information:

    +---------------------------------------------------------+--------+-------------+-------------+-------------+-------------+-------------+------------------+-----------------------+-------------------------+
    | file                                                    | status | rows_parsed | rows_loaded | error_limit | errors_seen | first_error | first_error_line | first_error_character | first_error_column_name |
    |---------------------------------------------------------+--------+-------------+-------------+-------------+-------------+-------------+------------------+-----------------------+-------------------------|
    | s3://snowflake-docs/tutorials/dataloading/contacts1.csv | LOADED |           5 |           5 |           1 |           0 |        NULL |             NULL |                  NULL |                    NULL |
    +---------------------------------------------------------+--------+-------------+-------------+-------------+-------------+-------------+------------------+-----------------------+-------------------------+
    
    Copy
  2. Load the rest of the staged files in the mycsvtable table.

    The following example uses pattern matching to load data from files that match the regular expression .*contacts[1-5].csv into the mycsvtable table.

    COPY INTO mycsvtable
      FROM @my_csv_stage/tutorials/dataloading/
      PATTERN='.*contacts[1-5].csv'
      ON_ERROR = 'skip_file';
    
    Copy

    Where the PATTERN clause specifies that the command should load data from the filenames matching this regular expression .*contacts[1-5].csv.

    The COPY command returns a result showing the name of the file copied and related information:

    +---------------------------------------------------------+-------------+-------------+-------------+-------------+-------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------+-----------------------+-------------------------+
    | file                                                    | status      | rows_parsed | rows_loaded | error_limit | errors_seen | first_error                                                                                                                                                          | first_error_line | first_error_character | first_error_column_name |
    |---------------------------------------------------------+-------------+-------------+-------------+-------------+-------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------+-----------------------+-------------------------|
    | s3://snowflake-docs/tutorials/dataloading/contacts2.csv | LOADED      |           5 |           5 |           1 |           0 | NULL                                                                                                                                                                 |             NULL |                  NULL | NULL                    |
    | s3://snowflake-docs/tutorials/dataloading/contacts3.csv | LOAD_FAILED |           5 |           0 |           1 |           2 | Number of columns in file (11) does not match that of the corresponding table (10), use file format option error_on_column_count_mismatch=false to ignore this error |                3 |                     1 | "MYCSVTABLE"[11]        |
    | s3://snowflake-docs/tutorials/dataloading/contacts4.csv | LOADED      |           5 |           5 |           1 |           0 | NULL                                                                                                                                                                 |             NULL |                  NULL | NULL                    |
    | s3://snowflake-docs/tutorials/dataloading/contacts5.csv | LOADED      |           6 |           6 |           1 |           0 | NULL                                                                                                                                                                 |             NULL |                  NULL | NULL                    |
    +---------------------------------------------------------+-------------+-------------+-------------+-------------+-------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------+-----------------------+-------------------------+
    
    Copy

    Note the following highlights in the result:

    • The data in contacts1.csv is ignored because you already loaded the data successfully.

    • The data in these files was loaded successfully: contacts2.csv, contacts4.csv, and contacts5.csv.

    • The data in contacts3.csv was skipped due to 2 data errors. The next step in this tutorial addresses how to validate and fix the errors.

JSON

Load the contacts.json staged data file into the myjsontable table.

COPY INTO myjsontable
  FROM @my_json_stage/tutorials/dataloading/contacts.json
  ON_ERROR = 'skip_file';
Copy

The COPY returns a result showing the name of the file copied and related information:

+---------------------------------------------------------+--------+-------------+-------------+-------------+-------------+-------------+------------------+-----------------------+-------------------------+
| file                                                    | status | rows_parsed | rows_loaded | error_limit | errors_seen | first_error | first_error_line | first_error_character | first_error_column_name |
|---------------------------------------------------------+--------+-------------+-------------+-------------+-------------+-------------+------------------+-----------------------+-------------------------|
| s3://snowflake-docs/tutorials/dataloading/contacts.json | LOADED |           3 |           3 |           1 |           0 |        NULL |             NULL |                  NULL |                    NULL |
+---------------------------------------------------------+--------+-------------+-------------+-------------+-------------+-------------+------------------+-----------------------+-------------------------+
Copy

Clean up

Congratulations, you have successfully completed the tutorial.

Tutorial clean up (optional)

Execute the following DROP <object> commands to return your system to its state before you began the tutorial:

DROP DATABASE IF EXISTS mydatabase;
DROP WAREHOUSE IF EXISTS mywarehouse;
Copy

Dropping the database automatically removes all child database objects such as tables.

Other data loading tutorials