Tutorial: JSON basics for Snowflake¶
Introduction¶
In this tutorial you will learn the basics of using JSON with Snowflake.
What you will learn¶
In this tutorial, you learn how to do the following:
- Upload sample JSON data from a public S3 bucket into a column of the
varianttype in a Snowflake table. - Test simple queries for JSON data in the table.
- Explore the FLATTEN function to flatten JSON data into a relational representation and save it in another table.
- Explore ways to ensure uniqueness as you insert rows in the flattened version of the data.
Prerequisites¶
The tutorial assumes the following:
- You have a Snowflake account that is configured to use Amazon AWS and a user with a role that grants the necessary privileges to create a database, tables, and virtual warehouse objects.
- You have SnowSQL (CLI client) installed.
The Snowflake in 20 minutes tutorial provides the related step-by-step instructions to meet these requirements.
Snowflake provides sample data files in a public S3 bucket for use in this tutorial. But before you start, you need to create a database, tables, a virtual warehouse, and an external stage for this tutorial. These are the basic Snowflake objects needed for most Snowflake activities.
About the sample data file¶
For this tutorial, you use the following sample application events JSON data provided in a public S3 bucket.
The data represents sample events that applications upload to S3. A variety of devices and applications, such as servers, cell phones, and browsers publish events. In a common data collection scenario, a scalable web endpoint collects POSTed data from different sources and writes them to a queuing system. An ingest service/utility then writes the data to a S3 bucket, from which you can load the data into Snowflake.
The sample data illustrates the following concepts:
-
Applications can choose to group events in batches. A batch is a container that holds header information common to all of the events in the batch. For example, the preceding JSON is a batch of two events with common header information:
device_typeandversionthat generated these events. -
Amazon S3 supports using folders concept to organize a bucket. Applications can leverage this feature to partition event data. Partitioning schemes typically identify details, such as application or location that generated the event, along with an event date when it was written to S3. Such a partitioning scheme enables you to copy any fraction of the partitioned data to Snowflake with a single COPY command. For example, you can copy event data by the hour, data, month, or year when you initially populate tables.
For example:
s3://bucket_name/application_a/2016/07/01/11/s3://bucket_name/application_b/location_c/2016/07/01/14/Note the
application_a,application_b,location_c, etc. identify details for the source of all data in the path. The data can be organized by the date when it was written. An optional 24-hour directory reduces the amount of data in each directory.Note
S3 transmits a directory list with each COPY statement used by Snowflake, so reducing the number of files in each directory improves the performance of your COPY statements. You may even consider creating 10-15 minute increment folders in each hour.
The sample data provided in the S3 bucket uses a similar partitioning scheme. In a COPY command you will specify a specific folder path to copy events data.
Creating the database, table, warehouse, and external stage¶
Execute the following statements to create a database, a table, a virtual warehouse, and an external stage needed for this tutorial. After you complete the tutorial, you can drop these objects.
Note the following:
- The
CREATE DATABASEstatement creates a database. The database automatically includes a schema named ‘public’. - The
USE SCHEMAstatement specifies an active database and schema for the current user session. Specifying a database now enables you to perform your work in this database without having to provide the name each time it is requested. - The
CREATE TABLEstatement creates a target table for JSON data. - The
CREATE WAREHOUSEstatement creates an initially suspended warehouse. The statement also sets AUTO_RESUME = true, which starts the warehouse automatically when you execute SQL statements that require compute resources. TheUSE WAREHOUSEstatement specifies the warehouse you created as the active warehouse for the current user session. - The
CREATE STAGEstatement creates an external stage that points to the S3 bucket containing the sample file for this tutorial.
Copy data into the target table¶
Execute COPY INTO <table> to load your staged data into
the target RAW_SOURCE table.
The command copies all new data from the specified path on the external stage
to the target RAW_SOURCE table. In this example, the specified path targets data
written on the 15th hour (3 PM) of July 15th, 2016.
Note that Snowflake checks each file’s S3 ETag value to ensure it is
copied only once.
Execute a SELECT query to verify the data is copied successfully.
The query returns the following result:
In this sample JSON data, there are two events. The device_type,
and version key values identify a data source and version for
events from a specific device.
Query data¶
In this section, you explore SELECT statements to query the JSON data.
-
Retrieve
device_type.The query return the following result:
The query uses the
src:device_typenotation to specify the column name and the JSON element name to retrieve. This notation is similar to the familiar SQLtable.columnnotation. Snowflake allows you to specify a sub-column within a parent column, which Snowflake dynamically derives from the schema definition embedded in the JSON data. For more information, refer to Querying Semi-structured Data.Note
The column name is case-insensitive, however JSON element names are case-sensitive.
-
Retrieve the
device_typevalue without the quotes.The preceding query returns the JSON data value in quote. You can remove the quotes by casting the data to a specific data type, in this example a string.
This query also optionally assigns a name to the column using an alias.
The query returns the following result:
-
Retrieve repeating
fkeys nested within the array event objects.The sample JSON data includes
eventsarray. Each event object in the array has theffield as shown.To retrieve these nested keys, you can use the FLATTEN function. The function flattens the events into separate rows.
The query returns the following result:
Note the
valueis one of the columns that FLATTEN function returns. The next step provides more details about using the FLATTEN function.
Flatten data¶
FLATTEN is a table function that produces a lateral view of a VARIANT, OBJECT, or ARRAY column. In this step, you use this function to explore different levels of flattening.
Flatten array objects in a variant column¶
You can flatten the event objects in the events array into separate rows
using the FLATTEN function. The function output includes a
VALUE column that stores these individual events.
You can then use the LATERAL modifier to join the FLATTEN function output
with any information outside of the object — in this example,
the device_type and version.
-
Query the data for each event:
The query returns the following result:
-
Use a CREATE TABLE AS SELECT statement to store the preceding query result in a table:
Query the resulting table.
The query returns the following result:
Flatten object keys in separate columns¶
In the preceding example, you flattened the event objects in the events array
into separate rows. The resulting flattened_source table retained the event structure
in the src column of the VARIANT type.
One benefit of retaining the
event objects in the src column of the VARIANT type is that when event format changes,
you don’t have to recreate and repopulate such tables. But you also have the option to
copy individual keys in the event object into separate typed columns as shown
in the following query.
The following CREATE TABLE AS SELECT statement creates a new table named events with the event
object keys stored in separate columns. Each value is cast to a data type that is appropriate
for the value, using a double-colon (::) followed by the type. If you omit the casting,
the column assumes the VARIANT data type, which can hold any value:
The statement flattens the nested data in the EVENTS.SRC:V key, adding a separate column for each value.
The statement outputs a row for each key/value pair. The following output shows the first two records in the new events table:
Update data¶
So far in this tutorial, you did the following:
- Copied sample JSON event data from an S3 bucket into the
RAW_SOURCEtable and explored simple queries. - You also explored the FLATTEN function to flatten the JSON data and obtain a relational representation of the data. For example, you extracted event keys and stored the keys in separate columns in another EVENTS table.
At the beginning, the tutorial explains the application scenario where multiple sources generate
events and a web endpoint saves it to your S3 bucket. As new events are added to the S3 bucket,
you might use a script to continuously copy new data into the RAW_SOURCE table.
But how do insert only new event data into the EVENTS table.
There are numerous ways to maintain data consistency. This section explains two options.
Use primary key columns for comparison¶
In this section you add a primary key to the EVENTS table. The primary key then guarantees uniqueness.
-
Examine your JSON data for any values that are naturally unique and would be good candidates for a primary key. For example, assume that the combination of
src:device_typeandvalue:rvcan be a primary key. These two JSON keys correspond to theDEVICE_TYPEandRVcolumns in theEVENTStable.Note
Snowflake does not enforce the primary key constraint. Rather, the constraint serves as metadata that identifies the natural key in the Information Schema.
-
Add the primary key constraint to the
EVENTStable: -
Insert a new JSON event record into the
RAW_SOURCEtable: -
Insert the new record that you added to the
RAW_SOURCEtable into theEVENTStable based on a comparison of the primary key values:Querying the
EVENTStable shows the added row:The query returns the following result:
Use all columns for comparison¶
If the JSON data does not have fields that can be primary key candidates, you
could compare all repeating JSON keys in the RAW_SOURCE table with the
corresponding column values in the EVENTS table.
No changes to your existing EVENTS table are required.
-
Insert a new JSON event record into the
RAW_SOURCEtable: -
Insert the new record in the
RAW_SOURCEtable into theEVENTStable based on a comparison of all repeating key values:Querying the
EVENTStable shows the added row:The query returns the following result:
Congratulations¶
Congratulations, you have successfully completed the tutorial.
Tutorial key points¶
- Partitioning the event data in your S3 bucket using logical, granular paths allows you to copy a subset of the partitioned data into Snowflake with a single command.
- Snowflake’s
column:keynotation, similar to the familiar SQLtable.columnnotation, allows you to effectively query a column within the column (i.e., a sub-column), which is dynamically derived based on the schema definition embedded in the JSON data. - The FLATTEN function allows you to parse JSON data into separate columns.
- Several options are available to update table data based on comparisons with staged data files.
Tutorial clean up (optional)¶
Execute the following DROP <object> commands to return your system to its state before you began the tutorial:
Dropping the database automatically removes all child database objects such as tables.