Considerations for Semi-structured Data Stored in VARIANT¶
This topic provides best practices, general guidelines, and important considerations for loading and working with VARIANT values that contain semi-structured data. This can be explicitly-constructed hierarchical data or data that you have loaded from semi-structured data formats such as JSON, Avro, ORC, and Parquet. The information in this topic does not necessarily apply to XML data.
Data Size Limitations¶
The VARIANT data type imposes a 16 MB size limit on individual rows.
For some semi-structured data formats (e.g. JSON), data sets are frequently a simple concatenation of multiple documents. The JSON output from some software is composed of a single huge array containing multiple records. There is no need to separate the documents with line breaks or commas, though both are supported.
If the data exceeds 16 MB, enable the STRIP_OUTER_ARRAY file format option for the COPY INTO <table> command to remove the outer array structure and load the records into separate table rows:
COPY INTO <table>
FROM @~/<file>.json
FILE_FORMAT = (TYPE = 'JSON' STRIP_OUTER_ARRAY = true);
Storing Semi-structured Data in a VARIANT Column vs. Flattening the Nested Structure¶
If you are not sure yet what types of operations you want to perform on your semi-structured data, Snowflake recommends storing the data in a VARIANT column for now.
For data that is mostly regular and uses only data types that are native to the semi-structured format you are using (e.g. strings and integers for JSON format), the storage requirements and query performance for operations on relational data and data in a VARIANT column is very similar.
For better pruning and less storage consumption, we recommend flattening your OBJECT and key data into separate relational columns if your semi-structured data includes:
Dates and timestamps, especially non-ISO 8601 dates and timestamps, as string values
Numbers within strings
Arrays
Non-native values (such as dates and timestamps in JSON) are stored as strings when loaded into a VARIANT column, so operations on these values could be slower and also consume more space than when stored in a relational column with the corresponding data type.
If you know your use cases for the data, perform tests on a typical data set. Load the data set into a VARIANT column in a table. Use the FLATTEN function to extract the OBJECTs and keys you plan to query into a separate table. Run a typical set of queries against both tables to see which structure provides the best performance.
Casting Key-Values¶
When extracting key-values from a VARIANT column, cast the values to the desired data type (using the ::
notation) to avoid unexpected results. For example, when you extract a string key-value without casting, the results are double-quoted (to show that the VARIANT value contains a string and not a different type; i.e. "1"
is a string while 1
is a number):
SELECT col1:city;
+----------------------+
| CITY |
|----------------------|
| "Los Angeles" |
+----------------------+
SELECT col1:city::string;
+----------------------+
| CITY |
|----------------------|
| Los Angeles |
+----------------------+
NULL Values¶
Snowflake supports two types of NULL values in semi-structured data:
SQL NULL: SQL NULL means the same thing for semi-structured data types as it means for structured data types: the value is missing or unknown.
JSON null (sometimes called “VARIANT NULL”): In a VARIANT column, JSON null values are stored as a string containing the word “null” to distinguish them from SQL NULL values.
The following example contrasts SQL NULL and JSON null:
select parse_json(NULL) AS "SQL NULL", parse_json('null') AS "JSON NULL", parse_json('[ null ]') AS "JSON NULL", parse_json('{ "a": null }'):a AS "JSON NULL", parse_json('{ "a": null }'):b AS "ABSENT VALUE"; +----------+-----------+-----------+-----------+--------------+ | SQL NULL | JSON NULL | JSON NULL | JSON NULL | ABSENT VALUE | |----------+-----------+-----------+-----------+--------------| | NULL | null | [ | null | NULL | | | | null | | | | | | ] | | | +----------+-----------+-----------+-----------+--------------+
To convert a VARIANT "null"
value to SQL NULL, cast it as a string. For example:
select parse_json('{ "a": null }'):a, to_char(parse_json('{ "a": null }'):a); +-------------------------------+----------------------------------------+ | PARSE_JSON('{ "A": NULL }'):A | TO_CHAR(PARSE_JSON('{ "A": NULL }'):A) | |-------------------------------+----------------------------------------| | null | NULL | +-------------------------------+----------------------------------------+
Semi-structured Data Files and Columnarization¶
When semi-structured data is inserted into a VARIANT column, Snowflake extracts as much of the data as possible to a columnar form, based on certain rules. The rest is stored as a single column in a parsed semi-structured structure. Currently, elements that have the following characteristics are not extracted into a column:
Elements that contain even a single “null” value are not extracted into a column. Note that this applies to elements with “null” values and not to elements with missing values, which are represented in columnar form.
This rule ensures that information is not lost, i.e, the difference between VARIANT “null” values and SQL NULL values is not obfuscated.
Elements that contain multiple data types. For example:
The
foo
element in one row contains a number:{"foo":1}
The same element in another row contains a string:
{"foo":"1"}
When a semi-structured element is queried:
If the element was extracted into a column, Snowflake’s execution engine (which is columnar) scans only the extracted column.
If the element was not extracted into a column, the execution engine must scan the entire JSON structure, and then for each row traverse the structure to output values, impacting performance.
To avoid this performance impact:
Extract semi-structured data elements containing “null” values into relational columns before loading them.
Alternatively, if the “null” values in your files indicate missing values and have no other special meaning, we recommend setting the file format option STRIP_NULL_VALUES to TRUE when loading the semi-structured data files. This option removes OBJECT elements or ARRAY elements containing “null” values.
Ensure each unique element stores values of a single data type native to the format (e.g. string or number for JSON).
Parsing NULL Values¶
To output a SQL NULL value from a VARIANT "null"
key-value, use the TO_CHAR , TO_VARCHAR function to cast the value as a string, e.g.:
SELECT column1
, TO_VARCHAR(PARSE_JSON(column1):a)
FROM
VALUES('{"a" : null}')
, ('{"b" : "hello"}')
, ('{"a" : "world"}');
+-----------------+-----------------------------------+
| COLUMN1 | TO_VARCHAR(PARSE_JSON(COLUMN1):A) |
|-----------------+-----------------------------------|
| {"a" : null} | NULL |
| {"b" : "hello"} | NULL |
| {"a" : "world"} | world |
+-----------------+-----------------------------------+
Using FLATTEN to Filter the Results in a WHERE Clause¶
The FLATTEN function explodes nested values into separate columns. You can use the function to filter query results in a WHERE clause.
The following example returns key-value pairs that match a WHERE clause and displays them in separate columns:
CREATE TABLE pets (v variant);
INSERT INTO pets SELECT PARSE_JSON ('{"species":"dog", "name":"Fido", "is_dog":"true"} ');
INSERT INTO pets SELECT PARSE_JSON ('{"species":"cat", "name":"Bubby", "is_dog":"false"}');
INSERT INTO pets SELECT PARSE_JSON ('{"species":"cat", "name":"dog terror", "is_dog":"false"}');
SELECT a.v, b.key, b.value FROM pets a,LATERAL FLATTEN(input => a.v) b
WHERE b.value LIKE '%dog%';
+-------------------------+---------+--------------+
| V | KEY | VALUE |
|-------------------------+---------+--------------|
| { | species | "dog" |
| "is_dog": "true", | | |
| "name": "Fido", | | |
| "species": "dog" | | |
| } | | |
| { | name | "dog terror" |
| "is_dog": "false", | | |
| "name": "dog terror", | | |
| "species": "cat" | | |
| } | | |
+-------------------------+---------+--------------+
Using FLATTEN to List Distinct Key Names¶
When working with unfamiliar semi-structured data, you might not know the key names in an OBJECT. You can use the FLATTEN function with the RECURSIVE argument to return the list of distinct key names in all nested elements in an OBJECT:
SELECT REGEXP_REPLACE(f.path, '\\[[0-9]+\\]', '[]') AS "Path",
TYPEOF(f.value) AS "Type",
COUNT(*) AS "Count"
FROM <table>,
LATERAL FLATTEN(<variant_column>, RECURSIVE=>true) f
GROUP BY 1, 2 ORDER BY 1, 2;
The REGEXP_REPLACE function removes the array index values (e.g. [0]
) and replaces them with brackets ([]
) to group array elements.
For example:
{"a": 1, "b": 2, "special" : "data"} <--- row 1 of VARIANT column
{"c": 3, "d": 4, "normal" : "data"} <----row 2 of VARIANT column
Output from query:
+---------+---------+-------+
| Path | Type | Count |
|---------+---------+-------|
| a | INTEGER | 1 |
| b | INTEGER | 1 |
| c | INTEGER | 1 |
| d | INTEGER | 1 |
| normal | VARCHAR | 1 |
| special | VARCHAR | 1 |
+---------+---------+-------+
Using FLATTEN to List Paths in an OBJECT¶
Related to Using FLATTEN to List Distinct Key Names, you can use the FLATTEN function with the RECURSIVE argument to retrieve all keys and paths in an OBJECT.
The following query returns keys, paths, and values (including VARIANT “null” values) for all data types stored in a VARIANT column. The code assumes that the VARIANT column contains an OBJECT in each row.
SELECT
t.<variant_column>,
f.seq,
f.key,
f.path,
REGEXP_COUNT(f.path,'\\.|\\[') +1 AS Level,
TYPEOF(f.value) AS "Type",
f.index,
f.value AS "Current Level Value",
f.this AS "Above Level Value"
FROM <table> t,
LATERAL FLATTEN(t.<variant_column>, recursive=>true) f;
The following query is similar to the first query, but excludes nested OBJECTs and ARRAYs:
SELECT
t.<variant_column>,
f.seq,
f.key,
f.path,
REGEXP_COUNT(f.path,'\\.|\\[') +1 AS Level,
TYPEOF(f.value) AS "Type",
f.value AS "Current Level Value",
f.this AS "Above Level Value"
FROM <table> t,
LATERAL FLATTEN(t.<variant_column>, recursive=>true) f
WHERE "Type" NOT IN ('OBJECT','ARRAY');
The queries return the following values:
- <variant_column>
OBJECT stored as a row in the VARIANT column.
- Seq
Unique sequence number associated with the data in the row.
- Key
String associated with a value in the data structure.
- Path
Path to the element within the data structure.
- Level
Level of the key-value pair within the data structure.
- Type
Data type for the value.
- Index
Index of the element in the data structure. Applies to ARRAY values only; otherwise NULL.
- Current Level Value
Value at the current level in the data structure.
- Above Level Value
Value one level higher in the data structure.