Speeding up queries of semi-structured data with search optimization

The search optimization service can improve the performance of point lookup and substring queries on semi-structured data in Snowflake tables (data in VARIANT, OBJECT, and ARRAY columns).

When VARIANT support for the search optimization service is configured for columns in a table, the search optimization service automatically includes VARIANT, OBJECT, and ARRAY columns in a search access path. This even applies to columns where the structure is deeply nested and the structure changes frequently. You may also enable search optimization for specific fields within a semi-structured column.

Note

You must explicitly enable this feature for specific columns or fields in columns using the ON clause in the ALTER TABLE … ADD SEARCH OPTIMIZATION command. (Enabling search optimization at the table level does not enable it for VARIANT columns.) For example:

ALTER TABLE mytable ADD SEARCH OPTIMIZATION ON EQUALITY(myvariantcol);
ALTER TABLE t1 ADD SEARCH OPTIMIZATION ON EQUALITY(c4:user.uuid);

ALTER TABLE mytable ADD SEARCH OPTIMIZATION ON SUBSTRING(myvariantcol);
ALTER TABLE t1 ADD SEARCH OPTIMIZATION ON SUBSTRING(c4:user.uuid);
Copy

Queries against a VARIANT column are not optimized if you omit the ON clause.

The next sections provide more details about this support:

Supported data types for constants and casts in predicates for VARIANT types

The search optimization service can improve the performance of point lookups of semi-structured data where the following types are used for the constant and the implicit or explicit cast for the element:

  • FIXED (including casts that specify a valid precision and scale)

  • INTEGER

  • TEXT

  • DATE (including casts that specify a scale)

  • TIME (including casts that specify a scale)

  • TIMESTAMP, TIMESTAMP_LTZ, TIMESTAMP_NTZ, TIMESTAMP_TZ (including casts that specify a scale)

The search optimization service supports the casting of types using:

Support for VARIANT values cast as TEXT

The search optimization service can also improve the performance of point lookups in which VARIANT columns are cast to TEXT and are compared to constants that are cast to TEXT.

For example, suppose that src is a VARIANT column containing BOOLEAN, DATE, and TIMESTAMP values that have been converted to VARIANT:

CREATE OR REPLACE TABLE test_table
(
  id INTEGER,
  src VARIANT
);

INSERT INTO test_table SELECT 1, TO_VARIANT('true'::BOOLEAN);
INSERT INTO test_table SELECT 2, TO_VARIANT('2020-01-09'::DATE);
INSERT INTO test_table SELECT 3, TO_VARIANT('2020-01-09 01:02:03.899'::TIMESTAMP);
Copy

For this table, the search optimization service can improve the following queries, which cast the VARIANT column to TEXT and compare the column to string constants:

SELECT * FROM test_table WHERE src::TEXT = 'true';
SELECT * FROM test_table WHERE src::TEXT = '2020-01-09';
SELECT * FROM test_table WHERE src::TEXT = '2020-01-09 01:02:03.899';
Copy

Supported predicates for point lookups on VARIANT types

The search optimization service can improve point lookup queries with the types of predicates listed below. In the examples below, src is the VARIANT column, and path_to_variant_field is a path to a field in the VARIANT column.

  • Equality predicates of the following form:

    WHERE path_to_variant_field[::target_data_type] = constant

    In this syntax, target_data_type (if specified) and the data type of constant must be one of the supported types.

    For example, the search optimization service supports:

    • Matching an element against a NUMBER constant without explicitly casting the element.

      WHERE src:person.age = 42;
      
      Copy
    • Explicitly casting an element to NUMBER with a specified precision and scale.

      WHERE src:location.temperature::NUMBER(8, 6) = 23.456789;
      
      Copy
    • Matching an element against a TEXT constant without explicitly casting the element.

      WHERE src:sender_info.ip_address = '123.123.123.123';
      
      Copy
    • Explicitly casting an element to TEXT.

      WHERE src:salesperson.name::TEXT = 'John Appleseed';
      
      Copy
    • Explicitly casting an element to DATE.

      WHERE src:events.date::DATE = '2021-03-26';
      
      Copy
    • Explicitly casting an element to TIMESTAMP with a specified scale.

      WHERE src:event_logs.exceptions.timestamp_info(3) = '2021-03-26 15:00:00.123 -0800';
      
      Copy
  • Predicates that use the ARRAY functions, such as:

    • WHERE ARRAY_CONTAINS(constant::VARIANT, path_to_variant_field)

      In this syntax, constant must not be NULL, and the data type of constant must be one of the supported types.

      For example:

      WHERE ARRAY_CONTAINS('77.146.211.88'::VARIANT, src:logs.ip_addresses)
      
      Copy
    • WHERE ARRAYS_OVERLAP(ARRAY_CONSTRUCT(constant_1, constant_2, .., constant_N), path_to_variant_field)

      The data type of each constant (constant_1, constant_2, etc.) must be one of the supported types. The constructed ARRAY can include NULL constants.

      For example:

      WHERE ARRAYS_OVERLAP(
        ARRAY_CONSTRUCT('122.63.45.75', '89.206.83.107'), src:senders.ip_addresses)
      
      Copy
  • The following predicates that check for NULL values:

    • WHERE IS_NULL_VALUE(path_to_variant_field)

      Note that IS_NULL_VALUE applies to JSON null values and not to SQL NULL values.

    • WHERE path_to_variant_field IS NOT NULL

    • WHERE variant_column IS NULL

      where variant_column refers to the column and not a path to an element in the semi-structured data.

      For example, the search optimization service supports using the VARIANT column src but not the path to the field src:person.age in that VARIANT column.

Substring search in VARIANT types

The search optimization service can optimize wildcard or regular expression searches in semi-structured columns (including ARRAY, OBJECT, and VARIANT columns) or fields in such columns. This includes predicates that use:

You can enable substring search optimization for a column or for multiple individual fields within a column. For example, the following statement enables substring search optimization for a nested field in a column.

ALTER TABLE test_table ADD SEARCH OPTIMIZATION ON SUBSTRING(col2:data.search);
Copy

Once the search access path has been built, the following query can be optimized:

SELECT * FROM test_table WHERE col2:data.search LIKE '%optimization%';
Copy

However, the following queries are not optimized because the WHERE clause filters do not apply to the field that was specified when search optimization was enabled (col2:data.search).

SELECT * FROM test_table WHERE col2:name LIKE '%simon%parker%';
SELECT * FROM test_table WHERE col2 LIKE '%hello%world%';
Copy

You can specify multiple fields to be optimized. Here, search optimization is enabled for two specific fields in the column col2.

ALTER TABLE test_table ADD SEARCH OPTIMIZATION ON SUBSTRING(col2:name);
ALTER TABLE test_table ADD SEARCH OPTIMIZATION ON SUBSTRING(col2:data.search);
Copy

If you enable search optimization for a given field, it is enabled for any subfields. The second ALTER TABLE statement below is redundant because the first statement enables search optimization for the entire data field, including the nested search field.

ALTER TABLE test_table ADD SEARCH OPTIMIZATION ON SUBSTRING(col2:data);
ALTER TABLE test_table ADD SEARCH OPTIMIZATION ON SUBSTRING(col2:data.search);
Copy

Similarly, enabling search optimization for an entire column allows all substring searches on that column to be optimized, including fields nested to any depth within it.

How constants are evaluated for VARIANT substring searches

When evaluating the constant string in a query (e.g. LIKE 'constant_string'), the search optimization service splits the string into tokens using the following characters as delimiters:

  • Square brackets ([ and ]).

  • Curly braces ({ and }).

  • Colons (:).

  • Commas (,).

  • Double quotes (").

After splitting the string into tokens, the search optimization service considers only tokens that are at least 5 characters long.

Example of a Predicate

How the Search Optimization Service Handles the Query

LIKE '%TEST%'

The search optimization service does not use search access paths for the following predicate because the substring is shorter than 5 characters.

LIKE '%SEARCH%IS%OPTIMIZED%'

The search optimization service can optimize this query, using search access paths to search for SEARCH and OPTIMIZED but not IS. IS is shorter than 5 characters.

LIKE '%HELLO_WORLD%'

The search optimization service can optimize this query, using search access paths to search for HELLO_WORLD.

LIKE '%COL:ON:S:EVE:RYWH:ERE%'

The search optimization service splits this string into COL, ON, S, EVE, RYWH, ERE. Because all of these tokens are shorter than 5 characters, the search optimization service cannot optimize this query.

LIKE '%{\"KEY01\":{\"KEY02\":\"value\"}%'

The search optimization service splits this into the tokens KEY01, KEY02, VALUE and uses the tokens when optimizing the query.

LIKE '%quo\"tes_and_com,mas,\"are_n\"ot\"_all,owed%'

The search optimization service splits this string into the tokens quo, tes_and_com, mas, are_n, ot, _all, owed. The search optimization service can only use the tokens that are 5 characters or longer (tes_and_com, are_n) when optimizing the query.

Current limitations in support for VARIANT types

Currently, support for VARIANT types in the search optimization service has the following limitations:

  • Predicates that use XMLGET are not supported.

  • Predicates of the form variant_field IS NULL are not supported.

  • Predicates where the constants are results of scalar subqueries are not supported.

  • Predicates that specify paths to elements that contain sub-elements are not supported.

The current limitations of the search optimization service also apply to this feature.