Speeding up substring and regular expression queries with search optimization

Search optimization can improve the performance of queries with predicates that search for substrings or use regular expressions in text or semi-structured data. For details on how substring searches work with semi-structured data, see Speeding up queries of semi-structured data with search optimization.

This capability includes predicates that use:

Note

You must enable this feature for specific columns using the ON clause in the ALTER TABLE … ADD SEARCH OPTIMIZATION command. (Enabling search optimization at the table level does not optimize substring search.) For example:

ALTER TABLE mytable ADD SEARCH OPTIMIZATION ON SUBSTRING(mycol);
Copy

Substring searches are not optimized if you omit the ON clause.

The search optimization service can improve performance when searching for substrings that are 5 or more characters long. (More selective substrings can result in better performance.) The search optimization service does not use search access paths for the following predicate because the substring is shorter than 5 characters:

LIKE '%TEST%'
Copy

For the following predicate, the search optimization service can optimize this query, using search access paths to search for the substrings for SEARCH and OPTIMIZED. However, search access paths are not used for IS because the substring is shorter than 5 characters.

LIKE '%SEARCH%IS%OPTIMIZED%'
Copy

For queries that use RLIKE, REGEXP, and REGEXP_LIKE against text:

  • The subject argument must be a TEXT column in a table that has search optimization enabled.

  • The pattern argument must be a string constant.

For regular expressions, the search optimization service works best when:

  • The pattern contains at least one substring literal that is 5 or more characters long.

  • The pattern specifies that the substring should appear at least once.

For example, the following pattern specifies that string should appear one or more times in the subject:

RLIKE '(string)+'
Copy

The search optimization service can improve the performance of queries with the following patterns because each predicate specifies that a substring of 5 or more characters must appear at least once. (Note that the first example uses a dollar-quoted string constant to avoid escaping the backslash characters.)

RLIKE $$.*email=[\w\.]+@snowflake\.com.*$$
Copy
RLIKE '.*country=(Germany|France|Spain).*'
Copy
RLIKE '.*phone=[0-9]{3}-?[0-9]{3}-?[0-9]{4}.*'
Copy

In contrast, search optimization does not use search access paths for queries with the following patterns:

  • Patterns without any substrings:

    RLIKE '.*[0-9]{3}-?[0-9]{3}-?[0-9]{4}.*'
    
    Copy
  • Patterns that only contain substrings shorter than 5 characters:

    RLIKE '.*tel=[0-9]{3}-?[0-9]{3}-?[0-9]{4}.*'
    
    Copy
  • Patterns that use the alternation operator where one option is a substring shorter than 5 characters:

    RLIKE '.*(option1|option2|opt3).*'
    
    Copy
  • Patterns in which the substring is optional:

    RLIKE '.*[a-zA-z]+(string)?[0-9]+.*'
    
    Copy

Even when the substring literals are shorter than 5 characters, the search optimization service can still improve query performance if expanding the regular expression produces a substring literal that is 5 characters or longer.

For example, consider the pattern:

.*st=(CA|AZ|NV).*(-->){2,4}.*
Copy

In this example:

  • Although the substring literals (e.g. st=, CA, etc) are shorter than 5 characters, the search optimization service recognizes that the substring st=CA, st=AZ, or st=NV (each of which is 5 characters long) must appear in the text.

  • Similarly, even though the substring literal --> is shorter than 5 characters, the search optimization service determines that the substring -->--> (which is longer than 5 characters) must appear in the text.

The search optimization service can use search access paths to match these substrings, which can improve the performance of the query.