Prerequisites

The tutorial assumes the following:

  • You have a Snowflake account hosted on Amazon Web Services (AWS) and a user with a role that grants the privileges needed to create a database, tables, and virtual warehouse objects.

  • You have SnowSQL (CLI client) installed.

The Snowflake in 20 minutes tutorial provides step-by-step instructions for meeting these requirements.

Snowflake provides sample data files in a public S3 bucket for use in this tutorial. Before you start, you need to create a database, a table, a virtual warehouse, and an external stage. These are the basic Snowflake objects required for most Snowflake activities.

About the Sample Data File

For this tutorial, you use the following sample JSON data of application events, provided in a public S3 bucket.

{
  "device_type": "server",
  "events": [
    {
      "f": 83,
      "rv": "15219.64,783.63,48674.48,84679.52,27499.78,2178.83,0.42,74900.19",
      "t": 1437560931139,
      "v": {
        "ACHZ": 42869,
        "ACV": 709489,
        "DCA": 232,
        "DCV": 62287,
        "ENJR": 2599,
        "ERRS": 205,
        "MXEC": 487,
        "TMPI": 9
      },
      "vd": 54,
      "z": 1437644222811
    },
    {
      "f": 1000083,
      "rv": "8070.52,54470.71,85331.27,9.10,70825.85,65191.82,46564.53,29422.22",
      "t": 1437036965027,
      "v": {
        "ACHZ": 6953,
        "ACV": 346795,
        "DCA": 250,
        "DCV": 46066,
        "ENJR": 9033,
        "ERRS": 615,
        "MXEC": 0,
        "TMPI": 112
      },
      "vd": 626,
      "z": 1437660796958
    }
  ],
  "version": 2.6
}
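
To make the structure of this data concrete, here is a minimal query sketch showing how the batch header and the nested events array can be accessed once the JSON is loaded into a VARIANT column. It assumes the raw_source table (with its SRC column) that you create later in this tutorial; the query itself is illustrative, not a tutorial step:

-- Illustrative sketch only: flatten the events array and extract a few
-- fields. Assumes the raw_source table (SRC VARIANT) created later on.
SELECT
  src:device_type::STRING AS device_type,
  src:version::FLOAT      AS version,
  value:t::NUMBER         AS event_timestamp,
  value:v.ERRS::NUMBER    AS error_count
FROM raw_source,
  LATERAL FLATTEN(input => src:events);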

The data represents sample events that applications upload to S3. A variety of devices and applications, such as servers, cell phones, and browsers, publish events. In a common data collection scenario, a scalable web endpoint collects POSTed data from different sources and writes it to a queuing system. An ingest service or utility then writes the data to an S3 bucket, from which you can load the data into Snowflake.

The sample data illustrates the following concepts:

  • Applications can choose to group events in batches. A batch is a container that holds header information common to all of the events in the batch. For example, the preceding JSON is a batch of two events with common header information: the device_type and version of the source that generated the events.

  • Amazon S3 supports the concept of folders for organizing a bucket. Applications can use this feature to partition event data. Partitioning schemes typically identify details such as the application or location that generated the event, along with the date when the event was written to S3. Such a partitioning scheme enables you to copy any fraction of the partitioned data into Snowflake with a single COPY command. For example, you can copy event data by the hour, day, month, or year when you initially populate tables.

    For example:

    s3://bucket_name/application_a/2016/07/01/11/

    s3://bucket_name/application_b/location_c/2016/07/01/14/

    Note that application_a, application_b, location_c, and so on identify details about the source of all the data in the path. The data can be organized by the date when it was written to S3. An optional directory for each hour of the day reduces the amount of data in each directory.

    Note

    S3 transmits a directory list with each COPY statement used by Snowflake, so reducing the number of files in each directory improves the performance of your COPY statements. You might even consider creating folders for 10-15 minute increments within each hour.

    The sample data provided in the S3 bucket uses a similar partitioning scheme. In a COPY command, you specify a specific folder path to copy the event data, as illustrated in the sketch below.
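
    As a concrete illustration, here is a minimal sketch of such a COPY command. It assumes the raw_source table and my_stage external stage that you create in the next section, and uses the hypothetical application_a folder path from the example above rather than the actual path in the sample bucket:

    -- Illustrative sketch only: load just the files that application_a wrote
    -- during hour 11 on 2016-07-01. Both raw_source and my_stage are created
    -- in the next section; the folder path here is hypothetical.
    COPY INTO raw_source
      FROM @my_stage/application_a/2016/07/01/11/
      FILE_FORMAT = (TYPE = 'JSON');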

Creating the Database, Table, Warehouse, and External Stage

Execute the following statements to create the database, table, virtual warehouse, and external stage needed for this tutorial. After you complete the tutorial, you can drop these objects.

CREATE OR REPLACE DATABASE mydatabase;

USE SCHEMA mydatabase.public;

CREATE OR REPLACE TABLE raw_source (
  SRC VARIANT);

CREATE OR REPLACE WAREHOUSE mywarehouse WITH
  WAREHOUSE_SIZE = 'X-SMALL'
  AUTO_SUSPEND = 120
  AUTO_RESUME = TRUE
  INITIALLY_SUSPENDED = TRUE;

USE WAREHOUSE mywarehouse;

CREATE OR REPLACE STAGE my_stage
  URL = 's3://snowflake-docs/tutorials/json';

Note the following:

  • The CREATE DATABASE statement creates a database. The database automatically includes a schema named 'public'.

  • The USE SCHEMA statement specifies an active database and schema for the current user session. Specifying a database now enables you to perform your work in this database without having to fully qualify object names in each statement.

  • The CREATE TABLE statement creates a target table for the JSON data, with a single column (SRC) of type VARIANT to hold the raw JSON.

  • The CREATE WAREHOUSE statement creates an initially suspended warehouse. The statement also sets AUTO_RESUME = true, which starts the warehouse automatically when you execute SQL statements that require compute resources. The USE WAREHOUSE statement specifies the warehouse you created as the active warehouse for the current user session.

  • The CREATE STAGE statement creates an external stage that points to the S3 bucket containing the sample file for this tutorial.
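
As an optional sanity check (not part of the original tutorial steps), you can confirm that the stage resolves to the sample data files before continuing. LIST is a standard Snowflake command that shows the files a stage points to:

-- Optional: list the files available through the external stage.
LIST @my_stage;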

Next: Step 1. Copy Data Into the Target Table