Querying Semi-structured Data

This topic explains how to use special operators and functions to query complex hierarchical data stored in a VARIANT.

(For simple examples of how to extract values from ARRAYs and OBJECTs, see Accessing Elements of an ARRAY by Index or by Slice and Accessing Elements of an OBJECT by Key.)

Typically, hierarchical data has been imported into a VARIANT from one of the following supported data formats:

  • JSON

  • Avro

  • ORC

  • Parquet

(For information about querying XML data (e.g. data that originated in XML data format and was converted to an OBJECT by calling PARSE_XML), see XMLGET.)

In this Topic:

Sample Data Used in Examples

Except where noted, the examples in this topic refer to a table named car_sales that contains a single VARIANT column named src. This VARIANT contains nested ARRAYs and OBJECTs.

Create the table and load it:

CREATE OR REPLACE TABLE car_sales
( 
  src variant
)
AS
SELECT PARSE_JSON(column1) AS src
FROM VALUES
('{ 
    "date" : "2017-04-28", 
    "dealership" : "Valley View Auto Sales",
    "salesperson" : {
      "id": "55",
      "name": "Frank Beasley"
    },
    "customer" : [
      {"name": "Joyce Ridgely", "phone": "16504378889", "address": "San Francisco, CA"}
    ],
    "vehicle" : [
      {"make": "Honda", "model": "Civic", "year": "2017", "price": "20275", "extras":["ext warranty", "paint protection"]}
    ]
}'),
('{ 
    "date" : "2017-04-28", 
    "dealership" : "Tindel Toyota",
    "salesperson" : {
      "id": "274",
      "name": "Greg Northrup"
    },
    "customer" : [
      {"name": "Bradley Greenbloom", "phone": "12127593751", "address": "New York, NY"}
    ],
    "vehicle" : [
      {"make": "Toyota", "model": "Camry", "year": "2017", "price": "23500", "extras":["ext warranty", "rust proofing", "fabric protection"]}  
    ]
}') v;

Select the data:

SELECT * FROM car_sales;
+-------------------------------------------+
| SRC                                       |
|-------------------------------------------|
| {                                         |
|   "customer": [                           |
|     {                                     |
|       "address": "San Francisco, CA",     |
|       "name": "Joyce Ridgely",            |
|       "phone": "16504378889"              |
|     }                                     |
|   ],                                      |
|   "date": "2017-04-28",                   |
|   "dealership": "Valley View Auto Sales", |
|   "salesperson": {                        |
|     "id": "55",                           |
|     "name": "Frank Beasley"               |
|   },                                      |
|   "vehicle": [                            |
|     {                                     |
|       "extras": [                         |
|         "ext warranty",                   |
|         "paint protection"                |
|       ],                                  |
|       "make": "Honda",                    |
|       "model": "Civic",                   |
|       "price": "20275",                   |
|       "year": "2017"                      |
|     }                                     |
|   ]                                       |
| }                                         |
| {                                         |
|   "customer": [                           |
|     {                                     |
|       "address": "New York, NY",          |
|       "name": "Bradley Greenbloom",       |
|       "phone": "12127593751"              |
|     }                                     |
|   ],                                      |
|   "date": "2017-04-28",                   |
|   "dealership": "Tindel Toyota",          |
|   "salesperson": {                        |
|     "id": "274",                          |
|     "name": "Greg Northrup"               |
|   },                                      |
|   "vehicle": [                            |
|     {                                     |
|       "extras": [                         |
|         "ext warranty",                   |
|         "rust proofing",                  |
|         "fabric protection"               |
|       ],                                  |
|       "make": "Toyota",                   |
|       "model": "Camry",                   |
|       "price": "23500",                   |
|       "year": "2017"                      |
|     }                                     |
|   ]                                       |
| }                                         |
+-------------------------------------------+

Traversing Semi-structured Data

Insert a colon : between the VARIANT column name and any first-level element: <column>:<level1_element>.

Note

In the following examples, the query output is enclosed in double quotes because the query output is VARIANT, not VARCHAR. (The VARIANT values are not strings; the VARIANT values contain strings.) Operators : and subsequent . and [] always return VARIANT values containing strings.

For example, get a list of all dealership names:

SELECT src:dealership
    FROM car_sales
    ORDER BY 1;
+--------------------------+
| SRC:DEALERSHIP           |
|--------------------------|
| "Tindel Toyota"          |
| "Valley View Auto Sales" |
+--------------------------+

There are two ways to access elements in a JSON object:

Important

Regardless of which notation you use, the column name is case-insensitive but element names are case-sensitive. For example, in the following list, the first two paths are equivalent, but the third is not:

  • src:salesperson.name

  • SRC:salesperson.name

  • SRC:Salesperson.Name

Dot Notation

Use dot notation to traverse a path in a JSON object: <column>:<level1_element>.<level2_element>.<level3_element>. Optionally enclose element names in double quotes: <column>:"<level1_element>"."<level2_element>"."<level3_element>".

Note

The rules for JSON keys (element names) are different from the rules for Snowflake SQL identifiers.

For more information about the rules for Snowflake SQL identifiers, see: Identifier Requirements.

For more information about JSON keys, see http://json.org, in particular the description of a “string”.

If an element name does not conform to Snowflake SQL identifier rules, for example if it contains spaces, then you must enclose the name in double quotes. Below are some examples (not all of which are from the car_sales example above) of valid JSON element names that are not valid Snowflake identifier names unless they are surrounded by double quotes:

-- This contains a blank.
SELECT src:"company name" FROM partners;

-- This does not start with a letter or underscore.
SELECT zipcode_info:"94987" FROM addresses;

-- This contains characters that are not letters, digits, or underscores, and
-- it does not start with a letter or underscore.
SELECT measurements:"#sPerSquareInch" FROM english_metrics;

Get the names of all salespeople who sold cars:

SELECT src:salesperson.name
    FROM car_sales
    ORDER BY 1;
+----------------------+
| SRC:SALESPERSON.NAME |
|----------------------|
| "Frank Beasley"      |
| "Greg Northrup"      |
+----------------------+

Bracket Notation

Alternatively, use bracket notation to traverse the path in an object: <column>['<level1_element>']['<level2_element>']. Enclose element names in single quotes. Values are retrieved as strings.

Get the names of all salespeople who sold cars:

SELECT src['salesperson']['name']
    FROM car_sales
    ORDER BY 1;
+----------------------------+
| SRC['SALESPERSON']['NAME'] |
|----------------------------|
| "Frank Beasley"            |
| "Greg Northrup"            |
+----------------------------+

Retrieving a Single Instance of a Repeating Element

Retrieve a specific numbered instance of a child element in a repeating array by adding a numbered predicate (starting from 0) to the array reference.

Note that to retrieve all instances of a child element in a repeating array, it is necessary to flatten the array. See an example in Using the FLATTEN Function to Parse Arrays in this topic.

Get the vehicle details for each sale:

SELECT src:customer[0].name, src:vehicle[0]
    FROM car_sales
    ORDER BY 1;
+----------------------+-------------------------+
| SRC:CUSTOMER[0].NAME | SRC:VEHICLE[0]          |
|----------------------+-------------------------|
| "Bradley Greenbloom" | {                       |
|                      |   "extras": [           |
|                      |     "ext warranty",     |
|                      |     "rust proofing",    |
|                      |     "fabric protection" |
|                      |   ],                    |
|                      |   "make": "Toyota",     |
|                      |   "model": "Camry",     |
|                      |   "price": "23500",     |
|                      |   "year": "2017"        |
|                      | }                       |
| "Joyce Ridgely"      | {                       |
|                      |   "extras": [           |
|                      |     "ext warranty",     |
|                      |     "paint protection"  |
|                      |   ],                    |
|                      |   "make": "Honda",      |
|                      |   "model": "Civic",     |
|                      |   "price": "20275",     |
|                      |   "year": "2017"        |
|                      | }                       |
+----------------------+-------------------------+

Get the price of each car sold:

SELECT src:customer[0].name, src:vehicle[0].price
    FROM car_sales
    ORDER BY 1;
+----------------------+----------------------+
| SRC:CUSTOMER[0].NAME | SRC:VEHICLE[0].PRICE |
|----------------------+----------------------|
| "Bradley Greenbloom" | "23500"              |
| "Joyce Ridgely"      | "20275"              |
+----------------------+----------------------+

Explicitly Casting Values

When you extract values from a VARIANT, you can explicitly cast the values to the desired data type. For example, you can extract the prices as numeric values and perform calculations on them:

SELECT src:vehicle[0].price::NUMBER * 0.10 AS tax
    FROM car_sales
    ORDER BY tax;
+--------+
|    TAX |
|--------|
| 2027.5 |
| 2350.0 |
+--------+

By default, when VARCHARs, DATEs, TIMEs, and TIMESTAMPs are retrieved from a VARIANT column, the values are surrounded by double quotes. You can eliminate the double quotes by explicitly casting the values. For example:

SELECT src:dealership, src:dealership::VARCHAR
    FROM car_sales
    ORDER BY 2;
+--------------------------+-------------------------+
| SRC:DEALERSHIP           | SRC:DEALERSHIP::VARCHAR |
|--------------------------+-------------------------|
| "Tindel Toyota"          | Tindel Toyota           |
| "Valley View Auto Sales" | Valley View Auto Sales  |
+--------------------------+-------------------------+

For more information about casting VARIANT values, see Using Values in a VARIANT.

For more information about casting in general, see Data Type Conversion.

Using the FLATTEN Function to Parse Arrays

Parse an array using the FLATTEN function. FLATTEN is a table function that produces a lateral view of a VARIANT, OBJECT, or ARRAY column. The function returns a row for each object, and the LATERAL modifier joins the data with any information outside of the object.

Get the names and addresses of all customers. Cast the VARIANT output to string values:

SELECT
  value:name::string as "Customer Name",
  value:address::string as "Address"
  FROM
    car_sales
  , LATERAL FLATTEN(INPUT => SRC:customer);

+--------------------+-------------------+
| Customer Name      | Address           |
|--------------------+-------------------|
| Joyce Ridgely      | San Francisco, CA |
| Bradley Greenbloom | New York, NY      |
+--------------------+-------------------+

Using the FLATTEN Function to Parse Nested Arrays

The extras array is nested within the vehicle array in the sample data:

"vehicle" : [
     {"make": "Honda", "model": "Civic", "year": "2017", "price": "20275", "extras":["ext warranty", "paint protection"]}
   ]

Add a second FLATTEN clause to flatten the extras array within the flattened vehicle array and retrieve the “extras” purchased for each car sold:

SELECT
  vm.value:make::string as make,
  vm.value:model::string as model,
  ve.value::string as "Extras Purchased"
  FROM
    car_sales
    , LATERAL FLATTEN(INPUT => SRC:vehicle) vm
    , LATERAL FLATTEN(INPUT => vm.value:extras) ve
  ORDER BY make, model, "Extras Purchased";
+--------+-------+-------------------+
| MAKE   | MODEL | Extras Purchased  |
|--------+-------+-------------------|
| Honda  | Civic | ext warranty      |
| Honda  | Civic | paint protection  |
| Toyota | Camry | ext warranty      |
| Toyota | Camry | fabric protection |
| Toyota | Camry | rust proofing     |
+--------+-------+-------------------+

Parsing Text as VARIANT Values Using the PARSE_JSON Function

Parse text as a JSON document using the PARSE_JSON function.

If the input is NULL, the output will also be NULL. However, if the input string is null, it is interpreted as a VARIANT null value; that is, the result is not a SQL NULL but a real value used to represent a null value in semi-structured formats.

For an example, see Sample Data Used in Examples in this topic.

Extracting Values Using the GET Function

GET accepts a VARIANT, OBJECT, or ARRAY value as the first argument and extracts the VARIANT value of the element in the path provided as the second argument.

Compute and extract the last element of each array in a VARIANT column using the GET and ARRAY_SIZE functions. ARRAY_SIZE returns the size of the input array:

Note

This example departs from the car_sales table used elsewhere in this topic.

CREATE OR replace TABLE colors (v variant);

INSERT INTO
   colors
   SELECT
      parse_json(column1) AS v
   FROM
   VALUES
     ('[{r:255,g:12,b:0},{r:0,g:255,b:0},{r:0,g:0,b:255}]'),
     ('[{c:0,m:1,y:1,k:0},{c:1,m:0,y:1,k:0},{c:1,m:1,y:0,k:0}]')
    v;

SELECT *, GET(v, ARRAY_SIZE(v)-1) FROM colors;

+---------------+-------------------------+
| V             | GET(V, ARRAY_SIZE(V)-1) |
|---------------+-------------------------|
| [             | {                       |
|   {           |   "b": 255,             |
|     "b": 0,   |   "g": 0,               |
|     "g": 12,  |   "r": 0                |
|     "r": 255  | }                       |
|   },          |                         |
|   {           |                         |
|     "b": 0,   |                         |
|     "g": 255, |                         |
|     "r": 0    |                         |
|   },          |                         |
|   {           |                         |
|     "b": 255, |                         |
|     "g": 0,   |                         |
|     "r": 0    |                         |
|   }           |                         |
| ]             |                         |
| [             | {                       |
|   {           |   "c": 1,               |
|     "c": 0,   |   "k": 0,               |
|     "k": 0,   |   "m": 1,               |
|     "m": 1,   |   "y": 0                |
|     "y": 1    | }                       |
|   },          |                         |
|   {           |                         |
|     "c": 1,   |                         |
|     "k": 0,   |                         |
|     "m": 0,   |                         |
|     "y": 1    |                         |
|   },          |                         |
|   {           |                         |
|     "c": 1,   |                         |
|     "k": 0,   |                         |
|     "m": 1,   |                         |
|     "y": 0    |                         |
|   }           |                         |
| ]             |                         |
+---------------+-------------------------+

Extracting Values by Path Using the GET_PATH Function

Extract a value from a VARIANT column using the GET_PATH , : function. The function is a variation of GET, used to extract a value using a path name. GET_PATH is equivalent to a chain of GET functions.

Get the vehicle make for the car purchased by each customer:

SELECT GET_PATH(src, 'vehicle[0]:make') FROM car_sales;

+----------------------------------+
| GET_PATH(SRC, 'VEHICLE[0]:MAKE') |
|----------------------------------|
| "Honda"                          |
| "Toyota"                         |
+----------------------------------+

Traversing Semi-structured Data describes the path syntax used to retrieve elements in a VARIANT column. The syntax is shorthand for the GET or GET_PATH , : function. Unlike the path syntax, these functions can handle irregular paths or path elements.

The following queries produce the same results:

SELECT GET_PATH(src, 'vehicle[0].make') FROM car_sales;

SELECT src:vehicle[0].make FROM car_sales;

Parsing Arrays Directly from a Staged Data File

Assume a staged file named contacts.json.gz contains the following data:

{
    "root": [
        {
            "employees": [
                {
                    "firstName": "Anna",
                    "lastName": "Smith"
                },
                {
                    "firstName": "Peter",
                    "lastName": "Jones"
                }
            ]
        }
    ]
}

Also assume a file format named my_json_format includes TYPE=JSON in its definition.

Query the name of the first employee in the staged file. In this example, the file is located in the customers table stage, but it could be located in any internal (i.e. Snowflake) or external stage:

SELECT 'The First Employee Record is '||
    S.$1:root[0].employees[0].firstName||
    ' '||S.$1:root[0].employees[0].lastName
FROM @%customers/contacts.json.gz (file_format => 'my_json_format') as S;

+----------------------------------------------+
| 'THE FIRST EMPLOYEE RECORD IS '||            |
|      S.$1:ROOT[0].EMPLOYEES[0].FIRSTNAME||   |
|      ' '||S.$1:ROOT[0].EMPLOYEES[0].LASTNAME |
|----------------------------------------------|
| The First Employee Record is Anna Smith      |
+----------------------------------------------+
Back to top