Speeding up substring and regular expression queries with search optimization¶
Search optimization can improve the performance of queries with predicates that search for substrings or use regular expressions in text or semi-structured data. For details on how substring searches work with semi-structured data, see Speeding up queries of semi-structured data with search optimization.
This capability includes predicates that use:
Note
You must enable this feature for specific columns using the ON clause in the ALTER TABLE … ADD SEARCH OPTIMIZATION command. (Enabling search optimization at the table level does not optimize substring search.) For example:
ALTER TABLE mytable ADD SEARCH OPTIMIZATION ON SUBSTRING(mycol);
Substring searches are not optimized if you omit the ON clause.
The search optimization service can improve performance when searching for substrings that are 5 or more characters long. (More selective substrings can result in better performance.) The search optimization service does not use search access paths for the following predicate because the substring is shorter than 5 characters:
LIKE '%TEST%'
For the following predicate, the search optimization service can optimize this query, using search access paths to search for the
substrings for SEARCH
and OPTIMIZED
. However, search access paths are not used for IS
because the substring is shorter
than 5 characters.
LIKE '%SEARCH%IS%OPTIMIZED%'
For queries that use RLIKE, REGEXP, and REGEXP_LIKE against text:
The
subject
argument must be a TEXT column in a table that has search optimization enabled.The
pattern
argument must be a string constant.
For regular expressions, the search optimization service works best when:
The pattern contains at least one substring literal that is 5 or more characters long.
The pattern specifies that the substring should appear at least once.
For example, the following pattern specifies that string
should appear one or more times in the subject:
RLIKE '(string)+'
The search optimization service can improve the performance of queries with the following patterns because each predicate specifies that a substring of 5 or more characters must appear at least once. (Note that the first example uses a dollar-quoted string constant to avoid escaping the backslash characters.)
RLIKE $$.*email=[\w\.]+@snowflake\.com.*$$RLIKE '.*country=(Germany|France|Spain).*'RLIKE '.*phone=[0-9]{3}-?[0-9]{3}-?[0-9]{4}.*'
In contrast, search optimization does not use search access paths for queries with the following patterns:
Patterns without any substrings:
RLIKE '.*[0-9]{3}-?[0-9]{3}-?[0-9]{4}.*'
Patterns that only contain substrings shorter than 5 characters:
RLIKE '.*tel=[0-9]{3}-?[0-9]{3}-?[0-9]{4}.*'
Patterns that use the alternation operator where one option is a substring shorter than 5 characters:
RLIKE '.*(option1|option2|opt3).*'
Patterns in which the substring is optional:
RLIKE '.*[a-zA-z]+(string)?[0-9]+.*'
Even when the substring literals are shorter than 5 characters, the search optimization service can still improve query performance if expanding the regular expression produces a substring literal that is 5 characters or longer.
For example, consider the pattern:
.*st=(CA|AZ|NV).*(-->){2,4}.*
In this example:
Although the substring literals (e.g.
st=
,CA
, etc) are shorter than 5 characters, the search optimization service recognizes that the substringst=CA
,st=AZ
, orst=NV
(each of which is 5 characters long) must appear in the text.Similarly, even though the substring literal
-->
is shorter than 5 characters, the search optimization service determines that the substring-->-->
(which is longer than 5 characters) must appear in the text.
The search optimization service can use search access paths to match these substrings, which can improve the performance of the query.