String Functions (Regular Expressions)

These string functions perform operations that match a regular expression (often referred to as a “regex”).

In this Topic:

List of Regex Functions

Function

Notes

REGEXP

Alias for RLIKE.

REGEXP_COUNT

REGEXP_INSTR

REGEXP_LIKE

Alias for RLIKE.

REGEXP_REPLACE

REGEXP_SUBSTR

RLIKE

General Usage Notes

In these notes, “subject” refers to the string to operate on and “pattern” refers to the regular expression:

  • The subject is typically a variable column, while the pattern is typically a constant, but this is not required; every argument to a regular expression function can be either a constant or variable.

  • Patterns support the full POSIX ERE (Extended Regular Expression) syntax. For details, see the POSIX basic and extended section (in Wikipedia).

  • Patterns also support the following Perl backslash-sequences:

    • \d: decimal digit (0-9).

    • \D: not a decimal digit.

    • \s: whitespace character.

    • \S: not a whitespace character.

    • \w: “word” character (a-z, A-Z, underscore (“_”), or decimal digit).

    • \W: not a word character.

    • \b: word boundary.

    • \B: not a word boundary.

    For details, see the Character classes section (in Wikipedia) or the Backslash sequences section (in the Perl documentation).

    Note that the backslash character needs to be escaped (i.e. \\ becomes \). For more information, see Escape Characters and Caveats (in this topic).

  • By default, the POSIX wildcard character . (in the pattern) does not include newline characters \n (in the subject) as matches.

    To also match newline characters, either replace . with (.|\n) in the pattern argument, or use the s parameter in the parameters argument (described below).

  • All the regular expression functions support Unicode. A single Unicode character always counts as one character (i.e. the POSIX meta-character . matches exactly one Unicode character), regardless of the byte-length of the corresponding binary representation of that character. Also, for functions that take or return subject offsets, a single Unicode character counts as 1.

Parameters

Most regular expression functions support an optional parameters argument as the very last input. The parameters argument is a VARCHAR string that specifies the matching behavior of the regular expression function. The following parameters are supported:

Parameter

Effect

c

Enables case-sensitive matching.

i

Enables case-insensitive matching.

m

Enables multi-line mode (i.e. meta-characters ^ and $ mark the beginning and end of any line of the subject). By default, multi-line mode is disabled (i.e. ^ and $ mark the beginning and end of the entire subject).

e

Extracts sub-matches; applies only to REGEXP_INSTR and REGEXP_SUBSTR.

s

Enables the POSIX wildcard character . to match \n. By default, wildcard character matching is disabled.

The default string is simply c, which specifies:

  • Case-sensitive matching.

  • Single-line mode.

  • No sub-match extraction, except for REGEXP_REPLACE, which always uses sub-match extraction.

  • POSIX wildcard character . does not match \n newline characters.

When specifying multiple parameters, the string is entered with no spaces or delimiters. For example, is specifies case-sensitive matching in multi-line mode with POSIX wildcard matching.

If both c and i are included in the parameters string, the one that occurs last in the string dictates whether the function performs case-sensitive or case-insensitive matching. For example, ci specifies case-sensitive matching because the “i” occurs last in the string.

Backreferences

Snowflake does not support backreferences in regular expression patterns (known as “squares” in formal language theory); however, backreferences are supported in the replacement string of the REGEXP_REPLACE function.

Corner Cases

In most regexp functions, an empty pattern (i.e. '') matches nothing, not even an empty subject.

The exceptions are REGEXP_LIKE and its aliases REGEXP and RLIKE, in which the empty pattern matches the empty subject because the pattern is implicitly anchored at both ends, i.e. '' automatically becomes '^$'.

An empty group (i.e. sub-expression ()), matches the space in between characters, including the beginning and end of the subject.

Escape Characters and Caveats

As mandated by the POSIX standard, the single backslash character \ is used to escape meta-characters (e.g. \* or \?). The backslash character is also used for so-called backslash-sequences (e.g. \w).

Note that the backslash character is further used to insert control characters into SQL strings (e.g. \n to insert a newline). As a result, to insert a single backslash character into a SQL string literal, the backslash character needs to be escaped (i.e. \\ becomes \).

For example, to insert the backreference \1 into a replacement string literal of REGEXP_LIKE, you might need to use \\\\1.