- Categories:
REGEXP_SUBSTR¶
Returns the substring that matches a regular expression within a string. If no match is found, returns NULL.
See also: String Functions (Regular Expressions)
Syntax¶
REGEXP_SUBSTR( <subject> , <pattern> [ , <position> [ , <occurrence> [ , <regex_parameters> [ , <group_num> ] ] ] ] )
Arguments¶
Required:
subject
The string to search for matches.
pattern
Pattern to match.
For guidelines on specifying patterns, see String Functions (Regular Expressions).
Optional:
position
Number of characters from the beginning of the string where the function starts searching for matches.
Default:
1
(the search for a match starts at the first character on the left)occurrence
Specifies which occurrence of the pattern to match. The function skips the first
occurrence - 1
matches.Default:
1
regex_parameters
String of one or more characters that specifies the regular expression parameters used to search for matches. The supported values are:
c
: case-sensitive.i
: case-insensitive.m
: multi-line mode.e
: extract sub-matches.s
: the ‘.’ wildcard also matches newline.
For more details, see the regular expression parameters documentation.
Default:
c
Note
By default, REGEXP_SUBSTR returns the entire matching part of the subject. However, if the
e
(for “extract”) parameter is specified, REGEXP_SUBSTR returns the part of the subject that matches the first group in the pattern. Ife
is specified but agroup_num
is not also specified, then thegroup_num
defaults to 1 (the first group). If there is no sub-expression in the pattern, REGEXP_SUBSTR behaves as ife
was not set. For examples that usee
, see Examples in this topic.group_num
The
group_num
parameter specifies which group to extract. Groups are specified by using parentheses in the regular expression.If a
group_num
is specified, Snowflake allows extraction even if the'e'
option was not also specified. The'e'
is implied.Snowflake supports up to 1024 groups.
For examples that use
group_num
, see the Examples in this topic.
Returns¶
The function returns a value of type VARCHAR that is the matching substring.
Usage Notes¶
For additional information on using regular expressions, see String Functions (Regular Expressions).
Collation Details¶
Arguments with collation specifications are currently not supported.
Examples¶
The documentation of the REGEXP_INSTR function contains many examples that use both REGEXP_SUBSTR and REGEXP_INSTR. You might want to look at those examples, too.
The next few examples use the table of strings created below:
CREATE TABLE demo2 (id INT, string1 VARCHAR);; INSERT INTO demo2 (id, string1) VALUES -- A string with multiple occurrences of the word "the". (2, 'It was the best of times, it was the worst of times.'), -- A string with multiple occurrences of the word "the" and with extra -- blanks between words. (3, 'In the string the extra spaces are redundant.'), -- A string with the character sequence "the" inside multiple words -- ("thespian" and "theater"), but without the word "the" by itself. (4, 'A thespian theater is nearby.') ;
The next example looks for:
the word “the”
followed by one or more non-word characters
followed by one or more word characters.
“Word characters” include not only the letters a-z and A-Z, but also the underscore (“_”) and the decimal digits 0-9, but not whitespace, punctuation, etc.
select id, regexp_substr(string1, 'the\\W+\\w+') as "RESULT" from demo2 order by id; +----+--------------+ | ID | RESULT | |----+--------------| | 2 | the best | | 3 | the string | | 4 | NULL | +----+--------------+
Starting from position 1 of the string, look for the 2nd occurrence of
the word “the”
followed by one or more non-word characters
followed by one or more word characters.
select id, regexp_substr(string1, 'the\\W+\\w+', 1, 2) as "RESULT" from demo2 order by id; +----+-------------+ | ID | RESULT | |----+-------------| | 2 | the worst | | 3 | the extra | | 4 | NULL | +----+-------------+
Starting from position 1 of the string, look for the 2nd occurrence of
the word “the”
followed by one or more non-word characters
followed by one or more word characters.
Rather than returning the entire match, return only the “group” (i.e. the portion of the substring that matches the part of the regular expression in parentheses). In this case, the returned value should be the word after “the”.
select id, regexp_substr(string1, 'the\\W+(\\w+)', 1, 2, 'e', 1) as "RESULT" from demo2 order by id; +----+--------+ | ID | RESULT | |----+--------| | 2 | worst | | 3 | extra | | 4 | NULL | +----+--------+
This example shows how to retrieve the second word from the first, second, and third matches of
a two-word pattern in which the first word is A
.
This also shows that trying to go beyond the last pattern causes Snowflake to return NULL.
CREATE TABLE demo3 (id INT, string1 VARCHAR);; INSERT INTO demo3 (id, string1) VALUES (5, 'A MAN A PLAN A CANAL') ;select id, regexp_substr(string1, 'A\\W+(\\w+)', 1, 1, 'e', 1) as "RESULT1", regexp_substr(string1, 'A\\W+(\\w+)', 1, 2, 'e', 1) as "RESULT2", regexp_substr(string1, 'A\\W+(\\w+)', 1, 3, 'e', 1) as "RESULT3", regexp_substr(string1, 'A\\W+(\\w+)', 1, 4, 'e', 1) as "RESULT4" from demo3; +----+---------+---------+---------+---------+ | ID | RESULT1 | RESULT2 | RESULT3 | RESULT4 | |----+---------+---------+---------+---------| | 5 | MAN | PLAN | CANAL | NULL | +----+---------+---------+---------+---------+
This example shows how to retrieve the first, second, and third groups within the first occurrence of the pattern.
In this case, the returned values are the individual letters of the word MAN
.
select id, regexp_substr(string1, 'A\\W+(\\w)(\\w)(\\w)', 1, 1, 'e', 1) as "RESULT1", regexp_substr(string1, 'A\\W+(\\w)(\\w)(\\w)', 1, 1, 'e', 2) as "RESULT2", regexp_substr(string1, 'A\\W+(\\w)(\\w)(\\w)', 1, 1, 'e', 3) as "RESULT3" from demo3; +----+---------+---------+---------+ | ID | RESULT1 | RESULT2 | RESULT3 | |----+---------+---------+---------| | 5 | M | A | N | +----+---------+---------+---------+
Here are some additional examples.
-- Prepare example
create or replace table message(body varchar(255));
insert into message values
('Hellooo World'),
('How are you doing today?'),
('the quick brown fox jumps over the lazy dog'),
('PACK MY BOX WITH FIVE DOZEN LIQUOR JUGS');
SELECT body, REGEXP_SUBSTR(body, '\\b\\S*o\\S*\\b') AS result FROM message;
+---------------------------------------------+------------------------------------------+
| BODY | result |
|---------------------------------------------+------------------------------------------|
| Hellooo World | Hellooo |
| How are you doing today? | How |
| the quick brown fox jumps over the lazy dog | brown |
| PACK MY BOX WITH FIVE DOZEN LIQUOR JUGS | NULL |
+---------------------------------------------+------------------------------------------+
SELECT body, REGEXP_SUBSTR(body, '\\b\\S*o\\S*\\b', 3) AS result FROM message;
+---------------------------------------------+-------------------------------------------+
| BODY | result |
|---------------------------------------------+-------------------------------------------|
| Hellooo World | llooo |
| How are you doing today? | you |
| the quick brown fox jumps over the lazy dog | brown |
| PACK MY BOX WITH FIVE DOZEN LIQUOR JUGS | NULL |
+---------------------------------------------+-------------------------------------------+
SELECT body, REGEXP_SUBSTR(body, '\\b\\S*o\\S*\\b', 3, 3) AS result FROM message;
+---------------------------------------------+----------------------------------------------+
| BODY | result |
|---------------------------------------------+----------------------------------------------|
| Hellooo World | NULL |
| How are you doing today? | today |
| the quick brown fox jumps over the lazy dog | over |
| PACK MY BOX WITH FIVE DOZEN LIQUOR JUGS | NULL |
+---------------------------------------------+----------------------------------------------+
SELECT body, REGEXP_SUBSTR(body, '\\b\\S*o\\S*\\b', 3, 3, 'i') AS result FROM message;
+---------------------------------------------+---------------------------------------------------+
| BODY | result |
|---------------------------------------------+---------------------------------------------------|
| Hellooo World | NULL |
| How are you doing today? | today |
| the quick brown fox jumps over the lazy dog | over |
| PACK MY BOX WITH FIVE DOZEN LIQUOR JUGS | LIQUOR |
+---------------------------------------------+---------------------------------------------------+
This example shows that you can explicitly omit any regular expression parameters by specifying empty string.
SELECT body, REGEXP_SUBSTR(body, '(H\\S*o\\S*\\b).*', 1, 1, '') AS result FROM message;
+---------------------------------------------+----------------------------------------------------+
| BODY | result |
|---------------------------------------------+----------------------------------------------------|
| Hellooo World | Hellooo World |
| How are you doing today? | How are you doing today? |
| the quick brown fox jumps over the lazy dog | NULL |
| PACK MY BOX WITH FIVE DOZEN LIQUOR JUGS | NULL |
+---------------------------------------------+----------------------------------------------------+
SELECT body, REGEXP_SUBSTR(body, '(H\\S*o\\S*\\b) .*', 1, 1, 'e') AS result FROM message;
+---------------------------------------------+------------------------------------------------------+
| BODY | result |
|---------------------------------------------+------------------------------------------------------|
| Hellooo World | Hellooo |
| How are you doing today? | How |
| the quick brown fox jumps over the lazy dog | NULL |
| PACK MY BOX WITH FIVE DOZEN LIQUOR JUGS | NULL |
+---------------------------------------------+------------------------------------------------------+
The following example illustrates overlapping occurrences:
-- Prepare example create or replace table overlap (id number, a string); insert into overlap values (1,',abc,def,ghi,jkl,'); insert into overlap values (2,',abc,,def,,ghi,,jkl,'); select * from overlap; select id, regexp_substr(a,'[[:punct:]][[:alnum:]]+[[:punct:]]', 1, 2) as result from overlap; +----+--------+ | ID | result | |----+--------| | 1 | ,ghi, | | 2 | ,def, | +----+--------+
The following example creates a JSON object from an Apache HTTP Server access log using pattern matching and concatenation:
-- Prepare example CREATE OR REPLACE TABLE log (logs varchar); INSERT INTO log (logs) VALUES ('127.0.0.1 - - [10/Jan/2018:16:55:36 -0800] "GET / HTTP/1.0" 200 2216'), ('192.168.2.20 - - [14/Feb/2018:10:27:10 -0800] "GET /cgi-bin/try/ HTTP/1.0" 200 3395');SELECT '{ "ip_addr":"' || REGEXP_SUBSTR (logs,'\\b\\d{1,3}\.\\d{1,3}\.\\d{1,3}\.\\d{1,3}\\b') || '", "date":"' || REGEXP_SUBSTR (logs,'([\\w:\/]+\\s[+\-]\\d{4})') || '", "request":"' || REGEXP_SUBSTR (logs,'\"((\\S+) (\\S+) (\\S+))\"', 1, 1, 'e') || '", "status":"' || REGEXP_SUBSTR (logs,'(\\d{3}) \\d+', 1, 1, 'e') || '", "size":"' || REGEXP_SUBSTR (logs,'\\d{3} (\\d+)', 1, 1, 'e') || '"}' as Apache_HTTP_Server_Access FROM log; +-----------------------------------------------------------------------------------------------------------------------------------------+ | APACHE_HTTP_SERVER_ACCESS | |-----------------------------------------------------------------------------------------------------------------------------------------| | { "ip_addr":"127.0.0.1", "date":"10/Jan/2018:16:55:36 -0800", "request":"GET / HTTP/1.0", "status":"200", "size":"2216"} | | { "ip_addr":"192.168.2.20", "date":"14/Feb/2018:10:27:10 -0800", "request":"GET /cgi-bin/try/ HTTP/1.0", "status":"200", "size":"3395"} | +-----------------------------------------------------------------------------------------------------------------------------------------+