For this tutorial you need to download the sample data files provided by Snowflake.
To download and unzip the sample data files:
Right-click the name of the archive file, data-load-internal.zip, and save the link/file to your local file system.
Unzip the sample files. The tutorial assumes you unpacked the files into the following directories:
Linux/macOS: /tmp/load
Windows: C:\temp\load
These data files include sample contact data in the following formats:
CSV files that contain a header row and five records. The field
delimiter is the pipe (|) character.
The following example shows a header row and one record:
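The sample values below are illustrative only (the actual tutorial files contain different records); the column names match the target table created later in this tutorial:

```
id|last_name|first_name|company|email|workphone|cellphone|streetaddress|city|postalcode
1|Smith|Jane|Acme Ltd|jane.smith@example.com|1-555-555-0100|1-555-555-0101|123 Main St.|Springfield|62701
```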
Execute the following statements to create a database, two tables
(for csv and json data), and a virtual warehouse needed for this tutorial.
After you complete the tutorial, you can drop these objects.
-- Create a database. A database automatically includes a schema named 'public'.

CREATE OR REPLACE DATABASE mydatabase;

/* Create target tables for CSV and JSON data. The tables are temporary, meaning
   they persist only for the duration of the user session and are not visible
   to other users. */

CREATE OR REPLACE TEMPORARY TABLE mycsvtable (
  id INTEGER,
  last_name STRING,
  first_name STRING,
  company STRING,
  email STRING,
  workphone STRING,
  cellphone STRING,
  streetaddress STRING,
  city STRING,
  postalcode STRING);

CREATE OR REPLACE TEMPORARY TABLE myjsontable (
  json_data VARIANT);

-- Create a warehouse

CREATE OR REPLACE WAREHOUSE mywarehouse WITH
  WAREHOUSE_SIZE = 'X-SMALL'
  AUTO_SUSPEND = 120
  AUTO_RESUME = TRUE
  INITIALLY_SUSPENDED = TRUE;
The CREATE WAREHOUSE statement sets up the warehouse to be suspended initially.
The statement also sets AUTO_RESUME = true, which starts the warehouse automatically
when you execute SQL statements that require compute resources.
When you load data from a file into a table, you must describe the format of the file
and specify how the data in the file should be interpreted and processed. For example,
if you are loading pipe-delimited data from a CSV file, you must specify that the file
uses the CSV format with pipe symbols as delimiters.
When you execute the COPY INTO <table> command, you specify this format information. You can
either specify this information as options in the command (e.g.
TYPE = CSV, FIELD_DELIMITER = '|', etc.) or you can specify a
file format object that contains this format information. You can create a named file
format object using the CREATE FILE FORMAT command.
In this step, you create file format objects describing the data format of the sample CSV and
JSON data provided for this tutorial.
TYPE = 'CSV' indicates the source file format type. CSV is the default file format type.
FIELD_DELIMITER = '|' indicates the '|' character is a field separator. The default value is ','.
SKIP_HEADER = 1 indicates the source file includes one header line. The COPY command skips these header lines when loading data. The default value is 0.
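Put together, a CREATE FILE FORMAT statement for the sample CSV data could look like the sketch below. The object names (mycsvformat, myjsonformat) and the JSON option shown are assumptions for illustration, not values given in this section:

```sql
-- Describe the pipe-delimited CSV files with one header row.
CREATE OR REPLACE FILE FORMAT mycsvformat
  TYPE = 'CSV'
  FIELD_DELIMITER = '|'
  SKIP_HEADER = 1;

-- A possible format for the JSON sample data (STRIP_OUTER_ARRAY is an assumed option
-- that removes an enclosing array so each element loads as a separate row).
CREATE OR REPLACE FILE FORMAT myjsonformat
  TYPE = 'JSON'
  STRIP_OUTER_ARRAY = TRUE;
```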
A stage specifies where data files are stored (i.e. “staged”) so that the data
in the files can be loaded into a table.
A named internal stage
is a cloud storage location managed by Snowflake.
Creating a named stage is useful if you want multiple users or processes
to upload files. If you plan to stage data files to load only
by you, or to load only into a single table, then you may prefer
to use your user stage or the table stage. For information, see
Bulk loading from a local file system.
In this step, you create named stages for the different types of sample data files.
Note that if you specify the FILE_FORMAT option when creating
the stage, it is not necessary to specify the same FILE_FORMAT
option in the COPY command used to load data from the stage.
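As a sketch, the named stages could be created with the file formats attached (the stage name my_csv_stage is used later in this tutorial; my_json_stage and the file format names are assumed for illustration):

```sql
-- Attach a file format so COPY commands reading from the stage
-- do not need to repeat the FILE_FORMAT option.
CREATE OR REPLACE STAGE my_csv_stage
  FILE_FORMAT = mycsvformat;

CREATE OR REPLACE STAGE my_json_stage
  FILE_FORMAT = myjsonformat;
```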
file://<file-path>[/]contacts*.csv specifies the full directory path and names of the files on your local machine to stage. Note that file system wildcards are allowed.
@my_csv_stage is the name of the stage where the data is staged.
auto_compress=true directs the command to compress the data while staging it. This is also the default.
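Combining these parts, a PUT command along the following lines stages the CSV files (the path shown assumes the Linux/macOS download directory from the setup step; adjust it for Windows):

```sql
-- Upload all matching local files to the named internal stage,
-- gzip-compressing them on the way (the default behavior).
PUT file:///tmp/load/contacts*.csv @my_csv_stage AUTO_COMPRESS=TRUE;
```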
The command returns the following result, showing the staged files:
The FROM clause specifies the location of the staged data
file (stage name followed by the file name).
The ON_ERROR clause specifies what to do when the COPY command
encounters errors in the files. By default, the command stops
loading data when the first error is encountered; however, we’ve instructed it to skip any file containing an error and move on to loading the next file. Note that this is just for illustration purposes; none of the files in this tutorial contain errors.
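A COPY command with these clauses might look like the sketch below; the specific staged file name and the file format name mycsvformat are assumptions for illustration:

```sql
-- Load one staged, compressed file into the target table,
-- skipping any file that contains an error.
COPY INTO mycsvtable
  FROM @my_csv_stage/contacts1.csv.gz
  FILE_FORMAT = (FORMAT_NAME = mycsvformat)
  ON_ERROR = 'skip_file';
```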
The COPY command returns a result showing the name of the file copied and related information:
Load the rest of the staged files into the mycsvtable table.
The following example uses pattern matching to load data from all files
that match the regular expression .*contacts[1-5].csv.gz into the mycsvtable table.
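A pattern-matching COPY command along these lines covers all of the staged contact files (the file format name mycsvformat is an assumption carried over from the earlier sketch):

```sql
-- Load every staged file whose name matches the regular expression.
COPY INTO mycsvtable
  FROM @my_csv_stage
  FILE_FORMAT = (FORMAT_NAME = mycsvformat)
  PATTERN = '.*contacts[1-5].csv.gz'
  ON_ERROR = 'skip_file';
```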
In the preceding step, the COPY INTO command skipped loading one of the files when
it encountered the first error. You need to find all the errors and fix them.
In this step, you use the VALIDATE function
to validate the previous execution of the COPY INTO command and return all errors.
Validate the sample data files and retrieve any errors
You first need the query ID associated with the COPY INTO command
that you previously executed. You then call the VALIDATE function,
specifying the query ID.
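As a sketch, the call takes the target table and the query ID of the earlier COPY command; substitute your own query ID for the placeholder:

```sql
-- Return one row for each record rejected by the specified COPY job.
SELECT * FROM TABLE(VALIDATE(mycsvtable, JOB_ID => '<query_id>'));
```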
ERROR: Number of columns in file (11) does not match that of the corresponding table (10), use file format option error_on_column_count_mismatch=false to ignore this error
FILE: mycsvtable/contacts3.csv.gz | LINE: 3 | CHARACTER: 1 | BYTE_OFFSET: 234 | CATEGORY: parsing | CODE: 100080 | SQL_STATE: 22000 | COLUMN_NAME: "MYCSVTABLE"[11] | ROW_NUMBER: 1 | ROW_START_LINE: 2
REJECTED_RECORD: 11|Ishmael|Burnett|Dolor Elit Pellentesque Ltd|vitae.erat@necmollisvitae.ca|1-872|600-7301|1-513-592-6779|P.O. Box 975, 553 Odio, Road|Hulste|63345

ERROR: Field delimiter '|' found while expecting record delimiter '\n'
FILE: mycsvtable/contacts3.csv.gz | LINE: 5 | CHARACTER: 125 | BYTE_OFFSET: 625 | CATEGORY: parsing | CODE: 100016 | SQL_STATE: 22000 | COLUMN_NAME: "MYCSVTABLE"["POSTALCODE":10] | ROW_NUMBER: 4 | ROW_START_LINE: 5
REJECTED_RECORD: 14|Sophia|Christian|Turpis Ltd|lectus.pede@non.ca|1-962-503-3253|1-157-|850-3602|P.O. Box 824, 7971 Sagittis Rd.|Chattanooga|56188
The result shows two data errors in mycsvtable/contacts3.csv.gz:
Number of columns in file (11) does not match that of the corresponding table (10)
In Row 1, a hyphen was mistakenly replaced with the pipe (|) character, the data file delimiter, effectively creating an additional column in the record.
Field delimiter '|' found while expecting record delimiter '\n'
In Row 5, an additional pipe (|) character was introduced after a hyphen, breaking the record.
After you verify that you successfully copied data from your stage into the tables,
you can remove data files from the internal stage using the REMOVE
command to save on data storage.
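For example, a REMOVE command along these lines deletes the compressed CSV files from the stage (the pattern, and the my_json_stage name for the JSON files, are assumptions for illustration):

```sql
-- Delete staged files that are no longer needed after loading.
REMOVE @my_csv_stage PATTERN = '.*.csv.gz';
REMOVE @my_json_stage PATTERN = '.*.json.gz';
```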