String Functions (Regular Expressions)

These string functions perform operations that match a regular expression (often referred to as a “regex”).

In this Topic:

List of Regex Functions

Function

Notes

REGEXP

Alias for RLIKE.

REGEXP_COUNT

REGEXP_EXTRACT_ALL

Alias for REGEXP_SUBSTR_ALL.

REGEXP_INSTR

REGEXP_LIKE

Alias for RLIKE.

REGEXP_REPLACE

REGEXP_SUBSTR

REGEXP_SUBSTR_ALL

RLIKE

General Usage Notes

In these notes, “subject” refers to the string to operate on and “pattern” refers to the regular expression:

  • The subject is typically a variable column, while the pattern is typically a constant, but this is not required; every argument to a regular expression function can be either a constant or variable.

  • Patterns support the full POSIX ERE (Extended Regular Expression) syntax. For details, see the POSIX basic and extended section (in Wikipedia).

  • Patterns also support the following Perl backslash-sequences:

    • \d: decimal digit (0-9).

    • \D: not a decimal digit.

    • \s: whitespace character.

    • \S: not a whitespace character.

    • \w: “word” character (a-z, A-Z, underscore (“_”), or decimal digit).

    • \W: not a word character.

    • \b: word boundary.

    • \B: not a word boundary.

    For details, see the Character classes section (in Wikipedia) or the Backslash sequences section (in the Perl documentation).

    Note

    In single-quoted string constants, you must escape the backslash character in the backslash-sequence. For example, to specify \d, use \\d. For details, see Specifying Regular Expressions in Single-Quoted String Constants (in this topic).

    You do not need to escape backslashes if you are delimiting the string with pairs of dollar signs ($$) (rather than single quotes).

  • By default, the POSIX wildcard character . (in the pattern) does not include newline characters \n (in the subject) as matches.

    To also match newline characters, either replace . with (.|\n) in the pattern argument, or use the s parameter in the parameters argument (described below).

  • All the regular expression functions support Unicode. A single Unicode character always counts as one character (i.e. the POSIX meta-character . matches exactly one Unicode character), regardless of the byte-length of the corresponding binary representation of that character. Also, for functions that take or return subject offsets, a single Unicode character counts as 1.

Specifying the Parameters for the Regular Expression

Most regular expression functions support an optional parameters argument as the very last input. The parameters argument is a VARCHAR string that specifies the matching behavior of the regular expression function. The following parameters are supported:

Parameter

Effect

c

Enables case-sensitive matching.

i

Enables case-insensitive matching.

m

Enables multi-line mode (i.e. meta-characters ^ and $ mark the beginning and end of any line of the subject). By default, multi-line mode is disabled (i.e. ^ and $ mark the beginning and end of the entire subject).

e

Extracts sub-matches; applies only to REGEXP_INSTR, REGEXP_SUBSTR, REGEXP_SUBSTR_ALL, and the aliases for these functions.

s

Enables the POSIX wildcard character . to match \n. By default, wildcard character matching is disabled.

The default string is simply c, which specifies:

  • Case-sensitive matching.

  • Single-line mode.

  • No sub-match extraction, except for REGEXP_REPLACE, which always uses sub-match extraction.

  • POSIX wildcard character . does not match \n newline characters.

When specifying multiple parameters, the string is entered with no spaces or delimiters. For example, ims specifies case-insensitive matching in multi-line mode with POSIX wildcard matching.

If both c and i are included in the parameters string, the one that occurs last in the string dictates whether the function performs case-sensitive or case-insensitive matching. For example, ci specifies case-insensitive matching because the “i” occurs last in the string.

Matching Characters That Are Metacharacters

In regular expressions, some characters are treated as metacharacters that have a specific meaning. For example:

To match the actual character (e.g. an actual period, asterisk, or question mark), you must escape the metacharacter with a backslash (e.g. \., \*, \?, etc.).

Note

If you are using the regular expression in a single-quoted string constant, you must escape the backslash with a second backslash (e.g. \\., \\*, \\?, etc.). For details, see Specifying Regular Expressions in Single-Quoted String Constants

For example, suppose that you need to find an open parenthesis (() in a string. One way to specify this is to use a backslash to escape the character in the pattern (e.g. \().

If you are specifying the pattern as a single-quoted string constant, you must also escape that backslash with a second backslash.

The following pattern matches a sequence of alphanumeric characters that appear inside parentheses (e.g. (NY)):

SELECT REGEXP_SUBSTR('Customers - (NY)','\\([[:alnum:]]+\\)') as customers;

  +-----------+
  | CUSTOMERS |
  |-----------|
  | (NY)      |
  +-----------+

For additional examples, see Example of Using Metacharacters in a Single-Quoted String Constant.

Note that you do not need to escape the backslash character if you are using a dollar-quoted string constant:

SELECT REGEXP_SUBSTR('Customers - (NY)',$$\([[:alnum:]]+\)$$) as customers;

  +-----------+
  | CUSTOMERS |
  |-----------|
  | (NY)      |
  +-----------+

Using Backreferences

Snowflake does not support backreferences in regular expression patterns (known as “squares” in formal language theory); however, backreferences are supported in the replacement string of the REGEXP_REPLACE function.

Specifying an Empty Pattern

In most regexp functions, an empty pattern (i.e. '') matches nothing, not even an empty subject.

The exceptions are REGEXP_LIKE and its aliases REGEXP and RLIKE, in which the empty pattern matches the empty subject because the pattern is implicitly anchored at both ends (i.e. '' automatically becomes '^$').

An empty group (i.e. sub-expression ()), matches the space in between characters, including the beginning and end of the subject.

Specifying Regular Expressions in Dollar-Quoted String Constants

If you are using a string constant to specify the regular expression for a function, you can use a dollar-quoted string constant to avoid having to escape the backslash characters in the regular expression. (If you are using single-quoted string constants, you need to escape the backslashes.)

The content of a dollar-quoted string constant is always interpreted literally.

For example, when escaping a metacharacter, you only need to use a single backslash:

SELECT w2
  FROM wildcards
  WHERE REGEXP_LIKE(w2, $$\?$$);

When using a backreference, you only need to use a single backslash:

SELECT w2, REGEXP_REPLACE(w2, '(.old)', $$very \1$$)
  FROM wildcards
  ORDER BY w2;

Specifying Regular Expressions in Single-Quoted String Constants

If you are using a regular expression in a single-quoted string constant, you must escape any backslashes in backslash-sequences with a second backslash.

Note

To avoid escaping backslashes in a regular expression, you can use a dollar-quoted string constant, rather than a single-quoted string constant.

For example:

Example of Using Metacharacters in a Single-Quoted String Constant

This example uses the backslash as part of an escape sequence in a regular expression that searches for a question mark (?).

Create a table and insert a row that contains a single backslash in one column and a question mark in another column:

create table wildcards (w varchar, w2 varchar);
insert into wildcards (w, w2) values ('\\', '?');

The following SELECT searches for the question mark literal. The search uses a regular expression, and the question mark is a meta-character in regular expressions, so the search must escape the question mark to treat it as a literal. Because the backslash appears in a string literal, the backslash itself must also be escaped, so the query looks like:

select w2
    from wildcards
    where regexp_like(w2, '\\?');
+----+
| W2 |
|----|
| ?  |
+----+

The following query makes it easier to see that the regular expression is composed of two characters (the backslash escape character and the question mark):

select w2
    from wildcards
    where regexp_like(w2, '\\' || '?');
+----+
| W2 |
|----|
| ?  |
+----+

In the previous example, the extra backslash was needed only because the escape character was part of a string literal, not for the regular expression itself. The following SELECT statement does not need to parse a string literal as part of the SQL command string, and therefore does not need the extra escape character that the string literal needed:

select w, w2, w || w2 as escape_sequence, w2
    from wildcards
    where regexp_like(w2, w || w2);
+---+----+-----------------+----+
| W | W2 | ESCAPE_SEQUENCE | W2 |
|---+----+-----------------+----|
| \ | ?  | \?              | ?  |
+---+----+-----------------+----+

Example of Using Backreferences in a Single-Quoted String Constant

If you use a backreference (e.g. \1) in a string literal, you must escape the backslash that is a part of that backreference. For example, to specify the backreference \1 in a replacement string literal of REGEXP_REPLACE, use \\1.

The following example uses the table created earlier. The SELECT uses a backreference to replace each occurrence of the regular expression .old with a copy of the matched string preceded by the word “very”:

insert into wildcards (w, w2) values (NULL, 'When I am cold, I am bold.');
select w2, regexp_replace(w2, '(.old)', 'very \\1')
    from wildcards
    order by w2;
+----------------------------+------------------------------------------+
| W2                         | REGEXP_REPLACE(W2, '(.OLD)', 'VERY \\1') |
|----------------------------+------------------------------------------|
| ?                          | ?                                        |
| When I am cold, I am bold. | When I am very cold, I am very bold.     |
+----------------------------+------------------------------------------+