String Functions (Regular Expressions)¶
These string functions perform operations that match a regular expression (often referred to as a “regex”).
List of Regex Functions¶
Alias for RLIKE.
Alias for REGEXP_SUBSTR_ALL.
Alias for RLIKE.
General Usage Notes¶
In these notes, “subject” refers to the string to operate on and “pattern” refers to the regular expression:
The subject is typically a variable column, while the pattern is typically a constant, but this is not required; every argument to a regular expression function can be either a constant or variable.
Patterns support the full POSIX ERE (Extended Regular Expression) syntax. For details, see the POSIX basic and extended section (in Wikipedia).
Patterns also support the following Perl backslash-sequences:
\d: decimal digit (0-9).
\D: not a decimal digit.
\s: whitespace character.
\S: not a whitespace character.
\w: “word” character (a-z, A-Z, underscore (“_”), or decimal digit).
\W: not a word character.
\b: word boundary.
\B: not a word boundary.
For details, see the Character classes section (in Wikipedia) or the Backslash sequences section (in the Perl documentation).
In single-quoted string constants, you must escape the backslash character in the backslash-sequence. For example, to specify
\\d. For details, see Specifying Regular Expressions in Single-Quoted String Constants (in this topic).
You do not need to escape backslashes if you are delimiting the string with pairs of dollar signs ($$) (rather than single quotes).
By default, the POSIX wildcard character
.(in the pattern) does not include newline characters
\n(in the subject) as matches.
To also match newline characters, either replace
patternargument, or use the
sparameter in the
parametersargument (described below).
All the regular expression functions support Unicode. A single Unicode character always counts as one character (i.e. the POSIX meta-character
.matches exactly one Unicode character), regardless of the byte-length of the corresponding binary representation of that character. Also, for functions that take or return subject offsets, a single Unicode character counts as 1.
Specifying the Parameters for the Regular Expression¶
Most regular expression functions support an optional
parameters argument as the very last input. The
parameters argument is a VARCHAR string that specifies the matching
behavior of the regular expression function. The following parameters are supported:
Enables case-sensitive matching.
Enables case-insensitive matching.
Enables multi-line mode (i.e. meta-characters
Extracts sub-matches; applies only to REGEXP_INSTR, REGEXP_SUBSTR, REGEXP_SUBSTR_ALL, and the aliases for these functions.
Enables the POSIX wildcard character
The default string is simply
c, which specifies:
No sub-match extraction, except for REGEXP_REPLACE, which always uses sub-match extraction.
POSIX wildcard character
.does not match
When specifying multiple parameters, the string is entered with no spaces or delimiters.
ims specifies case-insensitive matching in multi-line mode with POSIX wildcard matching.
i are included in the
parameters string, the one that occurs last in the string dictates whether the function performs case-sensitive or case-insensitive
matching. For example,
ci specifies case-insensitive matching because the “i” occurs last in the string.
Matching Characters That Are Metacharacters¶
In regular expressions, some characters are treated as metacharacters that have a specific meaning. For example:
*is a quantifier that matches zero or more instances of the preceding element. For example,
?is a quantifier that matches zero or one instance of the preceding element.
To match the actual character (e.g. an actual period, asterisk, or question mark), you must escape the metacharacter with a
If you are using the regular expression in a single-quoted string constant,
you must escape the backslash with a second backslash (e.g.
\\?, etc.). For details, see
Specifying Regular Expressions in Single-Quoted String Constants
For example, suppose that you need to find an open parenthesis (
() in a string. One way to specify this is to use a backslash
to escape the character in the pattern (e.g.
If you are specifying the pattern as a single-quoted string constant, you must also escape that backslash with a second backslash.
The following pattern matches a sequence of alphanumeric characters that appear inside parentheses (e.g.
SELECT REGEXP_SUBSTR('Customers - (NY)','\\([[:alnum:]]+\\)') as customers; +-----------+ | CUSTOMERS | |-----------| | (NY) | +-----------+
For additional examples, see Example of Using Metacharacters in a Single-Quoted String Constant.
Note that you do not need to escape the backslash character if you are using a dollar-quoted string constant:
SELECT REGEXP_SUBSTR('Customers - (NY)',$$\([[:alnum:]]+\)$$) as customers; +-----------+ | CUSTOMERS | |-----------| | (NY) | +-----------+
Snowflake does not support backreferences in regular expression patterns (known as “squares” in formal language theory); however, backreferences are supported in the replacement string of the REGEXP_REPLACE function.
Specifying an Empty Pattern¶
In most regexp functions, an empty pattern (i.e.
'') matches nothing, not even an empty subject.
The exceptions are REGEXP_LIKE and its aliases REGEXP and RLIKE,
in which the empty pattern matches the empty subject because the pattern is implicitly anchored at both ends
'' automatically becomes
An empty group (i.e. sub-expression
()), matches the space in between characters, including the beginning and end of the subject.
Specifying Regular Expressions in Dollar-Quoted String Constants¶
If you are using a string constant to specify the regular expression for a function, you can use a dollar-quoted string constant to avoid having to escape the backslash characters in the regular expression. (If you are using single-quoted string constants, you need to escape the backslashes.)
The content of a dollar-quoted string constant is always interpreted literally.
For example, when escaping a metacharacter, you only need to use a single backslash:
SELECT w2 FROM wildcards WHERE REGEXP_LIKE(w2, $$\?$$);
When using a backreference, you only need to use a single backslash:
SELECT w2, REGEXP_REPLACE(w2, '(.old)', $$very \1$$) FROM wildcards ORDER BY w2;
Specifying Regular Expressions in Single-Quoted String Constants¶
If you are using a regular expression in a single-quoted string constant, you must escape any backslashes in backslash-sequences with a second backslash.
To avoid escaping backslashes in a regular expression, you can use a dollar-quoted string constant, rather than a single-quoted string constant.
If you are escaping a metacharacter with a backslash, you must escape the backslash with a second backslash. See Example of Using Metacharacters in a Single-Quoted String Constant.
If you are using a backslash-sequence, you must escape the backslash in the sequence.
If you are using a backreference, you must escape the backslash in the backeference. See Example of Using Backreferences in a Single-Quoted String Constant.
Example of Using Metacharacters in a Single-Quoted String Constant¶
This example uses the backslash as part of an escape sequence in a regular expression that searches for a question mark (
Create a table and insert a row that contains a single backslash in one column and a question mark in another column:create table wildcards (w varchar, w2 varchar); insert into wildcards (w, w2) values ('\\', '?');
The following SELECT searches for the question mark literal. The search uses a regular expression, and the question mark is a meta-character in regular expressions, so the search must escape the question mark to treat it as a literal. Because the backslash appears in a string literal, the backslash itself must also be escaped, so the query looks like:select w2 from wildcards where regexp_like(w2, '\\?'); +----+ | W2 | |----| | ? | +----+
The following query makes it easier to see that the regular expression is composed of two characters (the backslash escape character and the question mark):select w2 from wildcards where regexp_like(w2, '\\' || '?'); +----+ | W2 | |----| | ? | +----+
In the previous example, the extra backslash was needed only because the escape character was part of a string literal, not for the regular expression itself. The following SELECT statement does not need to parse a string literal as part of the SQL command string, and therefore does not need the extra escape character that the string literal needed:select w, w2, w || w2 as escape_sequence, w2 from wildcards where regexp_like(w2, w || w2); +---+----+-----------------+----+ | W | W2 | ESCAPE_SEQUENCE | W2 | |---+----+-----------------+----| | \ | ? | \? | ? | +---+----+-----------------+----+
Example of Using Backreferences in a Single-Quoted String Constant¶
If you use a backreference (e.g.
\1) in a string literal, you must escape the backslash
that is a part of that backreference. For example, to specify the backreference
\1 in a replacement string literal of
The following example uses the table created earlier. The SELECT uses a backreference to replace each occurrence of the regular
.old with a copy of the matched string preceded by the word “very”:
insert into wildcards (w, w2) values (NULL, 'When I am cold, I am bold.');select w2, regexp_replace(w2, '(.old)', 'very \\1') from wildcards order by w2; +----------------------------+------------------------------------------+ | W2 | REGEXP_REPLACE(W2, '(.OLD)', 'VERY \\1') | |----------------------------+------------------------------------------| | ? | ? | | When I am cold, I am bold. | When I am very cold, I am very bold. | +----------------------------+------------------------------------------+