Semi-structured Data Types

The following Snowflake data types can contain other data types:

  • VARIANT (can contain any other data type).

  • ARRAY (can directly contain VARIANT, and thus indirectly contain any other data type, including itself).

  • OBJECT (can directly contain VARIANT, and thus indirectly contain any other data type, including itself).

We often refer to these data types as semi-structured data types. Strictly speaking, OBJECT is the only one of these data types that, by itself, has all of the characteristics of a true semi-structured data type. However, combining these data types allows you to explicitly represent arbitrary hierarchical data structures, which can be used to load and operate on data in semi-structured formats (e.g. JSON, Avro, ORC, Parquet, or XML).

Each of these data types is described below.

In this Topic:

VARIANT

Characteristics of a VARIANT

A VARIANT can store a value of any other type, including OBJECT and ARRAY.

The maximum length of a VARIANT is 16 MB.

Using Values in a VARIANT

To convert a value to or from the VARIANT data type, you can explicitly cast using the CAST function, the TO_VARIANT function or the :: operator (e.g. expression::variant).

In some situations, values can be implicitly cast to a VARIANT. For details, see Data Type Conversion.

The sample code below shows how to use a VARIANT, including how to convert from VARIANT and to VARIANT.

Create a table and insert a value:

CREATE TABLE varia (float1 FLOAT, v VARIANT, float2 FLOAT);
INSERT INTO varia (float1, v, float2) VALUES (1.23, NULL, NULL);

The first UPDATE converts a value from a FLOAT to a VARIANT. The second UPDATE converts a value from a VARIANT to a FLOAT.

UPDATE varia SET v = TO_VARIANT(float1);  -- converts FROM a float TO a variant.
UPDATE varia SET float2 = v::FLOAT;       -- converts FROM a variant TO a float.

SELECT all the values:

SELECT * FROM varia;
+--------+-----------------------+--------+
| FLOAT1 | V                     | FLOAT2 |
|--------+-----------------------+--------|
|   1.23 | 1.230000000000000e+00 |   1.23 |
+--------+-----------------------+--------+

To convert a value from the VARIANT data type, specify the data type that you want to convert to. For example, the following statement uses the :: operator to specify that the VARIANT should be converted to FLOAT:

SELECT my_variant_column::FLOAT * 3.14 FROM ...;

VARIANT stores both the value and the data type of the value. This allows you to also use VARIANT values in expressions where the value’s data type is valid without first casting the VARIANT. For example, if VARIANT column my_variant_column contains a numeric value, then you can directly multiply my_variant_column by another numeric value:

SELECT my_variant_column * 3.14 FROM ...;

You can retrieve the value’s native data type by using the TYPEOF function.

A VARIANT value can be missing (contain SQL NULL), which is different from a VARIANT null value, which is a real value used to represent a null value in semi-structured data. VARIANT null is a true value that compares as equal to itself. For more information, see VARIANT null.

If data was loaded from JSON format and stored in VARIANT, then the following guideline(s) apply:

  • For data that is mostly regular and uses only native JSON types (strings and numbers, not timestamps), both storage and query performance for operations on relational data and data in a VARIANT column is very similar. However, non-native values such as dates and timestamps are stored as strings when loaded into a VARIANT column, so operations on these values could be slower and also consume more space than when stored in a relational column with the corresponding data type.

For more information about using VARIANT, see: Considerations for Semi-structured Data Stored in VARIANT.

For more information about querying semi-structured data stored in a VARIANT, see: Querying Semi-structured Data.

Example of Inserting a VARIANT

To insert VARIANT data directly, use IIS (INSERT INTO ... SELECT). The example below shows how to insert JSON-formatted data into a VARIANT:

INSERT INTO varia (v) 
    SELECT TO_VARIANT(PARSE_JSON('{"key3": "value3", "key4": "value4"}'));

Common Uses for a VARIANT

A VARIANT is typically used when:

  • You want to create hierarchical data by explicitly defining a hierarchy that contains two or more ARRAYs or OBJECTs.

  • You want to load JSON, Avro, ORC, or Parquet data directly, without explicitly describing the hierarchical structure of the data.

    Snowflake can convert data from JSON, Avro, ORC, or Parquet format to an internal hierarchy of ARRAY, OBJECT, and VARIANT data and store that hierarchical data directly into a VARIANT. Although you can manually construct the data hierarchy yourself, it is usually easier to let Snowflake do it for you.

    For more information about loading and converting semi-structured data, see Loading Semi-structured Data.

OBJECT

A Snowflake OBJECT is analogous to a JSON “object”. In other programming languages, the corresponding data type is often called a “dictionary”, “hash”, or “map”.

An OBJECT contains key-value pairs.

Characteristics of an OBJECT

In a Snowflake OBJECT, each key is a VARCHAR, and each value is a VARIANT.

Because VARIANT can store any other data type, different values (in different key-value pairs) can have different underlying data types. For example, an OBJECT can hold a person’s name as a VARCHAR and a person’s age as an INTEGER. In the example below, both the name and the age are cast to VARIANT.

SELECT OBJECT_CONSTRUCT(
    'name', 'Jones'::VARIANT,
    'age',  42::VARIANT
    );

Snowflake does not currently support explicitly-typed objects.

In a key-value pair, the key should not be an empty string, and neither the key nor the value should be NULL.

The maximum length of an OBJECT is 16 MB.

An OBJECT can contain semi-structured data.

An OBJECT can be used to create hierarchical data structures.

Accessing Elements of an OBJECT by Key

To retrieve the value in an object, specify the key in square brackets, as shown below:

select my_variant_column['key1'] from my_table;

You can also use the colon operator. The following command shows that the results are the same whether you use the colon or the square brackets:

SELECT object_column['thirteen'],
       object_column:thirteen
    FROM object_example;
+---------------------------+------------------------+
| OBJECT_COLUMN['THIRTEEN'] | OBJECT_COLUMN:THIRTEEN |
|---------------------------+------------------------|
| 13                        | 13                     |
+---------------------------+------------------------+

For more information about the colon operator, see Dot Notation, which describes the use of the : and . operators to access nested data.

Example of Inserting an OBJECT

To insert OBJECT data directly, use IIS (INSERT INTO ... SELECT).

The following code uses the OBJECT_CONSTRUCT function to construct the OBJECT that it inserts.

INSERT INTO object_example (object_column)
    SELECT OBJECT_CONSTRUCT('thirteen', 13::VARIANT, 'zero', 0::VARIANT);

The following code uses an object literal to construct the OBJECT that it inserts.

INSERT INTO object_example (object_column)
    SELECT { 'thirteen': 13::VARIANT, 'zero': 0::VARIANT };

Note that in each key-value pair, the value was explicitly cast to VARIANT. Explicit casting was not required in these cases. Snowflake can implicitly cast to VARIANT. (For information about implicit casting, see Data Type Conversion.)

Common Uses for an OBJECT

An OBJECT is typically used when one or more of the following are true:

  • You have multiple pieces of data that are identified by strings. For example, if you want to look up information by province name, you might want to use an OBJECT.

  • You want to store information about the data with the data; the names (keys) are not merely distinct identifiers, but are meaningful.

  • The information has no natural order, or the order can be inferred solely from the keys.

  • The structure of the data varies, or the data can be incomplete. For example, if you want to create a catalog of books that usually contains the title, author name, and publication date, but in some cases the publication date is unknown, then you might want to use an OBJECT.

ARRAY

A Snowflake ARRAY is similar to an array in many other programming languages. An ARRAY contains 0 or more pieces of data. Each element is accessed by specifying its position in the array.

Characteristics of an ARRAY

Each value in an ARRAY is of type VARIANT. (A VARIANT can contain a value of any other data type.)

Values of other data types can be cast to VARIANT and then stored in an array. Some functions for ARRAYs, including ARRAY_CONSTRUCT, can implicitly cast values to VARIANT for you.

Because ARRAYs store VARIANT values, and because VARIANT values can store other data types within them, the underlying data types of the values in an ARRAY do not need to be identical. However, in most cases, the data elements are of the same or compatible types so that all can be processed the same way.

Snowflake does not support arrays of elements of a specific non-VARIANT type.

A Snowflake ARRAY is declared without specifying the number of elements. An ARRAY can grow dynamically based on operations such as ARRAY_APPEND. Snowflake does not currently support fixed-size arrays.

An ARRAY can contain NULL values.

The theoretical maximum combined size of all values in an ARRAY is 16 MB. However, ARRAYs have internal overhead. The practical maximum data size is usually smaller, depending upon the number and values of the elements.

Accessing Elements of an ARRAY by Index or by Slice

Array indexes are 0-based; the first element in an array is element 0.

Values in an array are accessed by specifying an array element’s index number in square brackets. For example, the following query reads the value at index position 2 in the array stored in my_array_column.

select my_array_column[2] from my_table;

Arrays can be nested. The following query reads the zeroeth element of the zeroeth element of a nested array:

select my_array_column[0][0] from my_table;

Attempting to access an element beyond the end of an array returns NULL.

A slice of an array is a sequence of adjacent elements (i.e. a contiguous subset of the array).

You can access a slice of an array by calling the ARRAY_SLICE function. For example:

select array_slice(my_array_column, 5, 10) from my_table;

The ARRAY_SLICE() function returns elements from the specified starting element (5 in the example above) up to but not including the specified ending element (10 in the example above).

An empty array or an empty slice is often denoted by a pair of square braces with nothing between them ([]).

Sparse vs Dense ARRAYs

An array can be dense or sparse.

In a dense array, the index values of the elements start at zero and are sequential (0, 1, 2, etc.). However, in a sparse array, the index values can be non-sequential (e.g. 0, 2, 5). The values do not need to start at 0.

If an index has no corresponding element, then the value corresponding to that index is said to be undefined. For example, if a sparse array has three elements, and those elements are at indexes 0, 2, and 5, then the elements at indexes 1, 3, and 4 are undefined.

   0            2                  5
+-----+.....+-------+.....+.....+------+
| Ann |     | Carol |     |     | Fred |
+-----+.....+-------+.....+.....+------+

         ^             ^     ^
         |             |     |
        undefined--------------

An undefined element is treated as an element. For example, consider the earlier example of a sparse array that contains elements at indexes 0, 2, and 5 (and does not have any elements after index 5). If you read the slice containing elements at indexes 3 and 4, then the output is similar to the following:

[ undefined, undefined ]

Attempting to access a slice beyond the end of an array results in an empty array, not an array of undefined values. The following SELECT statement attempts to read beyond the last element in the sample sparse array:

select array_slice(array_column, 6, 8) from table_1;

The output is an empty array:

+---------------------------------+
| array_slice(array_column, 6, 8) |
+---------------------------------+
| [ ]                             |
+---------------------------------+

Note that undefined is different from NULL. A NULL value in an array is a defined element.

In a dense array, each element consumes storage space, even if the value of the element is NULL.

In a sparse array, undefined elements do not directly consume storage space.

In a dense array, the theoretical range of index values is from 0 to 16777215. (The maximum theoretical number of elements is 16777216 because the upper limit on size is 16 MB (16777216 bytes) and the smallest possible value is one byte.)

In a sparse array, the theoretical range of index values is from 0 to 231 - 1. However, due to the 16 MB limitation, a sparse array cannot hold 231 values. The maximum theoretical number of values is still limited to 16777216.

(Remember that due to internal overhead, the practical size limit in both dense and sparse arrays is at least slightly less than the theoretical maximum of 16 MB.)

You can create a sparse array by using the ARRAY_INSERT function to insert values at specific index points in an array (leaving other array elements undefined). Note that because ARRAY_INSERT() pushes elements rightward, which changes the index values required to access them, you almost always want to fill a sparse array from left to right (i.e. from 0 up, increasing the index value for each new value inserted).

Example of Inserting an ARRAY

To insert ARRAY data directly, use IIS (INSERT INTO ... SELECT).

The following code uses the ARRAY_CONSTRUCT function to construct the ARRAY that it inserts.

INSERT INTO array_example (array_column)
    SELECT ARRAY_CONSTRUCT(12, 'twelve', NULL);

The following code uses an array literal to construct the ARRAY that it inserts.

INSERT INTO array_example (array_column)
    SELECT [ 12, 'twelve', NULL ];

Common Uses for an ARRAY

An ARRAY is typically used when one or more of the following are true:

  • You have many pieces of data, each of which is structured the same or similarly.

  • Each piece of data should be processed similarly. For example, you might loop through the data, processing each piece the same way.

  • The data has a natural order, for example, chronological.

Examples

This first example shows the output of a DESC TABLE command on a table with VARIANT, ARRAY, and OBJECT data.

CREATE OR REPLACE TABLE test_semi_structured(var VARIANT,
                                    arr ARRAY,
                                    obj OBJECT
                                    );

DESC TABLE test_semi_structured;

+------+---------+--------+-------+---------+-------------+------------+-------+------------+---------+
| name | type    | kind   | null? | default | primary key | unique key | check | expression | comment |
|------+---------+--------+-------+---------+-------------+------------+-------+------------+---------|
| VAR  | VARIANT | COLUMN | Y     | NULL    | N           | N          | NULL  | NULL       | NULL    |
| ARR  | ARRAY   | COLUMN | Y     | NULL    | N           | N          | NULL  | NULL       | NULL    |
| OBJ  | OBJECT  | COLUMN | Y     | NULL    | N           | N          | NULL  | NULL       | NULL    |
+------+---------+--------+-------+---------+-------------+------------+-------+------------+---------+

This example shows how to load simple values in the table, and what those values look like when you query the table.

Create a table and load the data:

CREATE TABLE demonstration1 (
    ID INTEGER,
    array1 ARRAY,
    variant1 VARIANT,
    object1 OBJECT
    );

INSERT INTO demonstration1 (id, array1, variant1, object1) 
  SELECT 
    1, 
    ARRAY_CONSTRUCT(1, 2, 3), 
    PARSE_JSON(' { "key1": "value1", "key2": "value2" } '),
    PARSE_JSON(' { "outer_key1": { "inner_key1A": "1a", "inner_key1B": "1b" }, '
              ||
               '   "outer_key2": { "inner_key2": 2 } } ')
    ;

INSERT INTO demonstration1 (id, array1, variant1, object1) 
  SELECT 
    2,
    ARRAY_CONSTRUCT(1, 2, 3, NULL), 
    PARSE_JSON(' { "key1": "value1", "key2": NULL } '),
    PARSE_JSON(' { "outer_key1": { "inner_key1A": "1a", "inner_key1B": NULL }, '
              ||
               '   "outer_key2": { "inner_key2": 2 } '
              ||
               ' } ')
  ;

Now show the data in the table.

SELECT * 
    FROM demonstration1
    ORDER BY id;
+----+-------------+---------------------+--------------------------+
| ID | ARRAY1      | VARIANT1            | OBJECT1                  |
|----+-------------+---------------------+--------------------------|
|  1 | [           | {                   | {                        |
|    |   1,        |   "key1": "value1", |   "outer_key1": {        |
|    |   2,        |   "key2": "value2"  |     "inner_key1A": "1a", |
|    |   3         | }                   |     "inner_key1B": "1b"  |
|    | ]           |                     |   },                     |
|    |             |                     |   "outer_key2": {        |
|    |             |                     |     "inner_key2": 2      |
|    |             |                     |   }                      |
|    |             |                     | }                        |
|  2 | [           | {                   | {                        |
|    |   1,        |   "key1": "value1", |   "outer_key1": {        |
|    |   2,        |   "key2": null      |     "inner_key1A": "1a", |
|    |   3,        | }                   |     "inner_key1B": null  |
|    |   undefined |                     |   },                     |
|    | ]           |                     |   "outer_key2": {        |
|    |             |                     |     "inner_key2": 2      |
|    |             |                     |   }                      |
|    |             |                     | }                        |
+----+-------------+---------------------+--------------------------+

For additional semi-structured data usage examples, see Querying Semi-structured Data.