You have a Snowflake account that is configured to use Amazon Web Services (AWS) and a user with
a role that grants the privileges needed to create database, table, and
virtual warehouse objects.
The Snowflake in 20 minutes tutorial provides step-by-step instructions
for meeting these requirements.
Snowflake provides sample data files in a public S3 bucket for use in this tutorial.
Before you start, you need to create a database, a table, a virtual warehouse,
and an external stage for this tutorial. These are the basic Snowflake objects
needed for most Snowflake activities.
The data represents sample events that applications upload to S3. A variety of devices and applications, such as servers, cell phones, and browsers, publish events. In a common data
collection scenario, a scalable web endpoint collects POSTed data from different sources and writes it to a queuing
system. An ingest service or utility then writes the data to an S3
bucket, from which you can load the data into Snowflake.
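The sample batches have a structure similar to the following sketch. The values shown here are drawn from the query output later in this tutorial and are illustrative; they are not the literal contents of the sample files:
{
  "device_type": "server",
  "version": 2.6,
  "events": [
    {
      "f": 83,
      "rv": "15219.64,783.63,48674.48,84679.52,27499.78,2178.83,0.42,74900.19",
      "t": 1437560931139,
      "v": { "ACHZ": 42869, "ACV": 709489, "DCA": 232, "DCV": 62287,
             "ENJR": 2599, "ERRS": 205, "MXEC": 487, "TMPI": 9 },
      "vd": 54,
      "z": 1437644222811
    },
    {
      "f": 1000083,
      "rv": "8070.52,54470.71,85331.27,9.10,70825.85,65191.82,46564.53,29422.22",
      "t": 1437036965027,
      "v": { "ACHZ": 6953, "ACV": 346795, "DCA": 250, "DCV": 46066,
             "ENJR": 9033, "ERRS": 615, "MXEC": 0, "TMPI": 112 },
      "vd": 626,
      "z": 1437660796958
    }
  ]
}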
The sample data illustrates the following concepts:
Applications can choose to group events in batches. A batch is a container
that holds header information common to all of the events in the batch. For example, the preceding JSON is a batch of two
events that share common header information: the device_type and version that generated the events.
Amazon S3 supports the concept of folders for organizing a bucket. Applications can leverage this feature to partition event data.
Partitioning schemes typically identify details, such as the application or location that generated the event, along with
the date when the event was written to S3. Such a partitioning scheme enables you to copy any fraction of the partitioned
data to Snowflake with a single COPY command. For example, you can copy event data by the hour, day, month, or year
when you initially populate tables.
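For example, a partitioned path in a bucket might look like the following sketch (the bucket and folder names are illustrative):
s3://<bucket_name>/application_a/location_c/2016/07/15/15/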
Note that application_a, application_b, location_c, etc. identify details about the source
of all data in the path. The data can be organized by the date when it was written.
An optional 24-hour directory reduces the amount of data in each directory.
Note
S3 transmits a directory list with each COPY statement used by Snowflake, so reducing
the number of files in each directory improves the performance of your COPY statements.
You may even consider creating folders in 10-15 minute increments within each hour.
The sample data provided in the S3 bucket uses a similar partitioning scheme. In the COPY command, you
specify a specific folder path from which to copy event data.
Creating the database, table, warehouse, and external stage
Execute the following statements to create a database, a table, a virtual warehouse,
and an external stage needed for this tutorial. After you complete the tutorial,
you can drop these objects.
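A sketch of these statements follows. The database, warehouse, and stage names, the warehouse size, and the stage URL are illustrative assumptions rather than values prescribed by the tutorial; only the RAW_SOURCE table name is referenced later in this tutorial.
-- Create the database; it automatically includes a schema named 'public'.
create or replace database mydatabase;

use schema mydatabase.public;

-- Target table with a single VARIANT column to hold the raw JSON.
create or replace table raw_source (
  src variant);

-- Warehouse that starts suspended and resumes automatically when queried.
create or replace warehouse mywarehouse with
  warehouse_size = 'X-SMALL'
  auto_suspend = 120
  auto_resume = true
  initially_suspended = true;

use warehouse mywarehouse;

-- External stage pointing to the S3 bucket with the sample files.
-- The URL below is a placeholder; substitute the bucket path given in the tutorial.
create or replace stage my_stage
  url = 's3://<sample_data_bucket>/<path>/';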
The CREATE DATABASE statement creates a database. The database automatically
includes a schema named ‘public’.
The USE SCHEMA statement specifies an active database and schema for the current user session.
Specifying a database now enables you to perform your work in this database without having
to qualify object names with the database name each time.
The CREATE TABLE statement creates a target table for JSON data.
The CREATE WAREHOUSE statement creates an initially suspended warehouse. The
statement also sets AUTO_RESUME = true, which starts the warehouse automatically
when you execute SQL statements that require compute resources.
The USE WAREHOUSE statement specifies the warehouse you created as the active
warehouse for the current user session.
The CREATE STAGE statement creates an external stage that points to the S3 bucket
containing the sample file for this tutorial.
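With these objects in place, you load the staged sample data using a COPY INTO statement, sketched below. The stage name my_stage carries over from the assumption above, and the folder path is hypothetical; adjust both to match the bucket's actual layout.
-- Load JSON files from a single partition of the (hypothetical) path layout,
-- here the hour 15 folder for July 15, 2016.
copy into raw_source
  from @my_stage/server/2.6/2016/07/15/15/
  file_format = (type = json);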
The command copies all new data from the specified path on the external stage
to the target RAW_SOURCE table. In this example, the specified path targets data
written on the 15th hour (3 PM) of July 15th, 2016.
Note that Snowflake checks each file’s S3 ETag value to ensure it is
copied only once.
Execute a SELECT query to verify the data is copied successfully.
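For example:
select * from raw_source;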
In this sample JSON data, there are two events. The device_type
and version key values identify the data source and version for
events from a specific device.
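For example, a query like the following (a sketch) retrieves the device_type element from each row:
select src:device_type
from raw_source;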
The query uses the src:device_type notation
to specify the column name and the JSON element name to retrieve. This
notation is similar to the
familiar SQL table.column notation.
Snowflake allows you to specify a
sub-column within a parent column, which Snowflake dynamically derives from the
schema definition embedded in the JSON data. For more information,
refer to Querying Semi-structured Data.
Note
The column name is case-insensitive; however, JSON element names
are case-sensitive.
Retrieve the device_type value without the quotes.
The preceding query returns the JSON value enclosed in double quotes. You can remove
the quotes by casting the value to a specific data type,
in this example a string.
The query also optionally assigns a name to the column using an alias.
SELECT src:device_type::string AS device_type
FROM raw_source;
FLATTEN is a table function that produces a lateral
view of a VARIANT, OBJECT, or ARRAY column. In this step, you use this function
to explore different levels of flattening.
You can flatten the event objects in the events array into separate rows
using the FLATTEN function. The function output includes a
VALUE column that stores these individual events.
You can then use the LATERAL modifier to join the FLATTEN function output
with any information outside of the object — in this example,
the device_type and version.
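Applied to this tutorial's data, the combined statement might look like the following sketch, which stores the flattened output in a new flattened_source table (the table name matches the usage in the next paragraph):
-- Flatten each element of the events array into its own row, keeping the
-- batch-level device_type and version alongside the event object (value).
create or replace table flattened_source as
  select
    src:device_type::string as device_type,
    src:version::string     as version,
    value                   as src
  from
    raw_source,
    lateral flatten(input => src:events);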
In the preceding example, you flattened the event objects in the events array
into separate rows. The resulting flattened_source table retained the event structure
in the src column of the VARIANT type.
One benefit of retaining the
event objects in the src column of the VARIANT type is that when the event format changes,
you don't have to recreate and repopulate such tables. But you also have the option to
copy individual keys in the event object into separate typed columns, as shown
in the following query.
The following CREATE TABLE AS SELECT statement creates a new table named events with the event
object keys stored in separate columns. Each value is cast to a data type that is appropriate
for the value, using a double-colon (::) followed by the type. If you omit the casting,
the column assumes the VARIANT data type, which can hold any value:
create or replace table events as
  select
    src:device_type::string as device_type
    , src:version::string   as version
    , value:f::number       as f
    , value:rv::variant     as rv
    , value:t::number       as t
    , value:v.ACHZ::number  as achz
    , value:v.ACV::number   as acv
    , value:v.DCA::number   as dca
    , value:v.DCV::number   as dcv
    , value:v.ENJR::number  as enjr
    , value:v.ERRS::number  as errs
    , value:v.MXEC::number  as mxec
    , value:v.TMPI::number  as tmpi
    , value:vd::number      as vd
    , value:z::number       as z
  from
    raw_source
    , lateral flatten(input => SRC:events);
The statement also flattens the nested data in the value:v key, adding a separate column for each of its sub-keys.
The statement outputs one row for each event in the events array. The following output shows the first two records in the new events table:
SELECT * FROM events;

+-------------+---------+---------+-----------------------------------------------------------------------+---------------+-------+--------+-----+-------+------+------+------+------+-----+---------------+
| DEVICE_TYPE | VERSION | F       | RV                                                                    | T             | ACHZ  | ACV    | DCA | DCV   | ENJR | ERRS | MXEC | TMPI | VD  | Z             |
|-------------+---------+---------+-----------------------------------------------------------------------+---------------+-------+--------+-----+-------+------+------+------+------+-----+---------------|
| server      | 2.6     | 83      | "15219.64,783.63,48674.48,84679.52,27499.78,2178.83,0.42,74900.19"   | 1437560931139 | 42869 | 709489 | 232 | 62287 | 2599 | 205  | 487  | 9    | 54  | 1437644222811 |
| server      | 2.6     | 1000083 | "8070.52,54470.71,85331.27,9.10,70825.85,65191.82,46564.53,29422.22" | 1437036965027 | 6953  | 346795 | 250 | 46066 | 9033 | 615  | 0    | 112  | 626 | 1437660796958 |
+-------------+---------+---------+-----------------------------------------------------------------------+---------------+-------+--------+-----+-------+------+------+------+------+-----+---------------+
In this tutorial, you copied sample JSON event data from an S3 bucket into the RAW_SOURCE table
and explored simple queries.
You also explored the FLATTEN function to flatten the JSON data and obtain a relational
representation of the data. For example, you extracted event keys and stored the keys
in separate columns in another EVENTS table.
At the beginning, the tutorial explains the application scenario where multiple sources generate
events and a web endpoint saves them to your S3 bucket. As new events are added to the S3 bucket,
you might use a script to continuously copy new data into the RAW_SOURCE table.
But how do you insert only new event data into the EVENTS table?
There are numerous ways to maintain data consistency. This section explains two options.
In this section, you add a primary key to the EVENTS table. You can then use the primary key columns to identify unique records.
Examine your JSON data for any values that are naturally unique and would be good
candidates for a primary key. For example, assume that the combination of
src:device_type and value:rv can be a primary key. These two JSON keys
correspond to the DEVICE_TYPE and RV columns in the EVENTS table.
Note
Snowflake does not enforce the primary key constraint. Rather, the constraint
serves as metadata that identifies the natural key in the Information Schema.
Add the primary key constraint to the EVENTS table:
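A sketch of the statement (the constraint name is an illustrative choice):
alter table events
  add constraint pk_device_type_rv primary key (device_type, rv);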
If the JSON data does not have fields that can be primary key candidates, you
could compare all repeating JSON keys in the RAW_SOURCE table with the
corresponding column values in the EVENTS table.
No changes to your existing EVENTS table are required.
Insert a new JSON event record into the RAW_SOURCE table:
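A sketch of such an insert follows; the device_type, version, and event values are illustrative and not part of the provided sample files:
-- A hypothetical new event batch from a different device type.
insert into raw_source
  select parse_json('{
    "device_type": "cell_phone",
    "version": 3.2,
    "events": [
      {
        "f": 79,
        "rv": "786954.67,492.68,3577.48,40.11,343.00,345.8,0.22,8765.22",
        "t": 5769784730576,
        "v": { "ACHZ": 75846, "ACV": 98355, "DCA": 789, "DCV": 62287,
               "ENJR": 2234, "ERRS": 578, "MXEC": 999, "TMPI": 9 },
        "vd": 54,
        "z": 1437644222811
      }
    ]
  }');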
Partitioning the event data in your S3 bucket using logical, granular paths allows you to copy a subset of the partitioned data into Snowflake with a single command.
Snowflake’s column:key notation, similar to the familiar SQL table.column notation,
allows you to effectively query a column within the column (i.e., a sub-column), which is
dynamically derived based on the schema definition embedded in the JSON data.
The FLATTEN function allows you to flatten nested JSON data into separate rows, whose values you can then store in separate typed columns.
Several options are available to update table data based on comparisons with staged data files.