This tutorial describes how to load data from files in an existing Amazon Simple Storage Service (Amazon S3) bucket into a table. In this tutorial, you will learn how to:
Create named file formats that describe your data files.
Create named stage objects.
Load data located in your S3 bucket into Snowflake tables.
Resolve errors in your data files.
The tutorial covers loading of both CSV and JSON data.
This tutorial assumes that you have a Snowflake account configured to use Amazon Web Services (AWS) and a user with a role that grants the privileges needed to create database, table, and virtual warehouse objects.
Snowflake provides sample data files, staged in a public Amazon S3 bucket, for use in this tutorial. But before you start, you need to create a database, tables, and a virtual warehouse for this tutorial. These are the basic Snowflake objects needed for most Snowflake activities.
Note
In regular use, you would stage your own data files using the AWS Management Console, AWS Command
Line Interface, or an equivalent client application. See the
Amazon Web Services documentation for instructions.
The sample data files include sample contact information in the following formats:
CSV files that contain a header row and five records. The field delimiter is the pipe (|) character.
The following shows what a header row and one record might look like (the values are illustrative, not the actual contents of the sample files):
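id|last_name|first_name|company|email|workphone|cellphone|streetaddress|city|postalcode
1|Smith|Jane|Example Corp|jane.smith@example.com|1-555-0100|1-555-0101|123 Main St.|Springfield|01234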
Execute the following statements to create a database, two tables (for CSV and JSON data),
and a virtual warehouse needed for this tutorial. After you complete the tutorial, you can
drop these objects.
CREATE OR REPLACE DATABASE mydatabase;

CREATE OR REPLACE TEMPORARY TABLE mycsvtable (
  id INTEGER,
  last_name STRING,
  first_name STRING,
  company STRING,
  email STRING,
  workphone STRING,
  cellphone STRING,
  streetaddress STRING,
  city STRING,
  postalcode STRING);

CREATE OR REPLACE TEMPORARY TABLE myjsontable (
  json_data VARIANT);

CREATE OR REPLACE WAREHOUSE mywarehouse WITH
  WAREHOUSE_SIZE = 'X-SMALL'
  AUTO_SUSPEND = 120
  AUTO_RESUME = TRUE
  INITIALLY_SUSPENDED = TRUE;
Note the following:
The CREATE DATABASE statement creates a database. The database automatically includes a schema named PUBLIC.
The CREATE TABLE statements create target tables for CSV and JSON data. The tables are temporary, that is, they
persist only for the duration of the user session and are not visible to other users.
The CREATE WAREHOUSE statement creates an initially suspended warehouse. The
statement also sets AUTO_RESUME = true, which starts the warehouse automatically when
you execute SQL statements that require compute resources.
When you load data files from an S3 bucket into a table, you must describe the format of the file
and specify how the data in the file should be interpreted and processed. For example,
if you are loading pipe-delimited data from a CSV file, you must specify that the file
uses the CSV format with pipe symbols as delimiters.
When you execute the COPY INTO <table> command, you specify this format information. You can
either specify this information as options in the command (e.g.
TYPE = CSV, FIELD_DELIMITER = '|', etc.) or you can specify a
file format object that contains this format information. You can create a named file
format object using the CREATE FILE FORMAT command.
In this step, you create file format objects describing the data format of the sample CSV and JSON data provided for this tutorial. The key options for the CSV file format are described below, followed by a sketch of the statements.
TYPE = 'CSV' indicates the source file format type. CSV is the default file format type.
FIELD_DELIMITER = '|' indicates that the '|' character is the field separator. The default value is ','.
SKIP_HEADER = 1 indicates that the source file includes one header line. The COPY command skips the header line when loading data. The default value is 0.
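A minimal sketch of the statements, assuming the format object names mycsvformat and myjsonformat (any valid identifiers work; STRIP_OUTER_ARRAY is an assumption about the JSON files wrapping their records in an outer array):

CREATE OR REPLACE FILE FORMAT mycsvformat
  TYPE = 'CSV'
  FIELD_DELIMITER = '|'
  SKIP_HEADER = 1;

CREATE OR REPLACE FILE FORMAT myjsonformat
  TYPE = 'JSON'
  STRIP_OUTER_ARRAY = TRUE; -- assumption: records are wrapped in a JSON array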
A stage specifies where data files are stored (i.e. “staged”) so that the data
in the files can be loaded into a table.
A named external stage is a database object that references data files stored in a location outside Snowflake, in this case an S3 bucket. Here, you create stages that reference the sample data files necessary to complete the tutorial.
Creating a named external stage is useful if you want multiple users or processes
to upload files. If you plan to stage data files that only you will load, or that will be loaded into only a single table, then you may prefer to use your user stage or the table stage instead. For information, see
Bulk loading from Amazon S3.
In this step, you create named stages for the different types of sample data files.
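A minimal sketch of the stage definitions, assuming the stage names my_csv_stage and my_json_stage, the file format objects from the previous step, and a placeholder URL for the public bucket that hosts the sample files:

CREATE OR REPLACE STAGE my_csv_stage
  FILE_FORMAT = mycsvformat
  URL = 's3://<sample-data-bucket>';

CREATE OR REPLACE STAGE my_json_stage
  FILE_FORMAT = myjsonformat
  URL = 's3://<sample-data-bucket>';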
In regular use, if you were creating a stage that pointed to your private data files, you would reference a storage integration created using CREATE STORAGE INTEGRATION by an account administrator (i.e. a user with the ACCOUNTADMIN role) or a role with the global CREATE INTEGRATION privilege:
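A sketch of such a stage, using a hypothetical integration name (my_storage_int) and a placeholder bucket path:

CREATE OR REPLACE STAGE my_private_stage
  STORAGE_INTEGRATION = my_storage_int  -- hypothetical integration name
  URL = 's3://<your-bucket>/<path>/'
  FILE_FORMAT = mycsvformat;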
Start by loading the data from the file named contacts1.csv in the /tutorials/dataloading/ prefix (folder) into the mycsvtable table.
Execute the following:
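This sketch assumes the my_csv_stage stage created in the previous step; adjust the stage name and path to match your setup.

COPY INTO mycsvtable
  FROM @my_csv_stage/tutorials/dataloading/contacts1.csv
  ON_ERROR = 'skip_file';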
The FROM clause specifies the location of the staged data
file (stage name followed by the file name).
The ON_ERROR = 'skip_file' clause specifies what to do when the COPY command encounters errors
in the files. In this case, when the command encounters a data error on any of the records
in a file, it skips the file. If you do not specify an ON_ERROR clause, the default
is abort_statement, which aborts the COPY command on the first error
encountered on any of the records in a file.
The COPY command returns a result showing the name of the file copied and related information, such as the number of rows parsed and loaded.
In the preceding step, the COPY INTO command skipped loading one of the files when
it encountered the first error. You need to find all the errors.
In this step, you use the VALIDATE function
to validate the previous execution of the COPY INTO command and return all errors.
Validate the sample data files and retrieve any errors
You first need to retrieve the query ID associated with the COPY INTO command
that you previously executed. You then call the VALIDATE function,
specifying the query ID.
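A minimal sketch, assuming the COPY INTO command was the most recent statement executed in the current session (otherwise, look up its query ID on the History page and pass it as a string literal):

-- Capture the query ID of the most recently executed statement.
SET qid = LAST_QUERY_ID();

-- Return all errors encountered by that COPY INTO execution.
SELECT * FROM TABLE(VALIDATE(mycsvtable, JOB_ID => $qid));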
ERROR: Number of columns in file (11) does not match that of the corresponding table (10), use file format option error_on_column_count_mismatch=false to ignore this error
FILE: mycsvtable/contacts3.csv.gz
LINE: 3, CHARACTER: 1, BYTE_OFFSET: 234
CATEGORY: parsing, CODE: 100080, SQL_STATE: 22000
COLUMN_NAME: "MYCSVTABLE"[11]
ROW_NUMBER: 1, ROW_START_LINE: 2
REJECTED_RECORD: 11|Ishmael|Burnett|Dolor Elit Pellentesque Ltd|vitae.erat@necmollisvitae.ca|1-872|600-7301|1-513-592-6779|P.O. Box 975,553 Odio, Road|Hulste|63345

ERROR: Field delimiter '|' found while expecting record delimiter '\n'
FILE: mycsvtable/contacts3.csv.gz
LINE: 5, CHARACTER: 125, BYTE_OFFSET: 625
CATEGORY: parsing, CODE: 100016, SQL_STATE: 22000
COLUMN_NAME: "MYCSVTABLE"["POSTALCODE":10]
ROW_NUMBER: 4, ROW_START_LINE: 5
REJECTED_RECORD: 14|Sophia|Christian|Turpis Ltd|lectus.pede@non.ca|1-962-503-3253|1-157-|850-3602|P.O. Box 824,7971 Sagittis Rd.|Chattanooga|56188
The result shows two data errors in mycsvtable/contacts3.csv.gz:
Number of columns in file (11) does not match that of the corresponding table (10)
In Row 1, a hyphen was mistakenly replaced with the pipe (|) character, the data file delimiter, effectively creating an additional column in the record.
Field delimiter '|' found while expecting record delimiter '\n'
In Row 5, an additional pipe (|) character was introduced after a hyphen, breaking the record.
In regular use, you would fix the problematic records manually and write them to a new data file.
You would then stage the fixed data files to the S3 bucket and attempt to reload the data from
the files. For this tutorial, you are using the Snowflake-provided sample data, which you do not correct.