Skip to content

Instantly share code, notes, and snippets.

@glv
Last active November 1, 2024 18:55
Show Gist options
  • Save glv/24bedd7d39f16a762528d7b30e366aa7 to your computer and use it in GitHub Desktop.
Save glv/24bedd7d39f16a762528d7b30e366aa7 to your computer and use it in GitHub Desktop.
Ruby and PostgreSQL Regular Expressions

Ruby and Postgres Regular Expression Syntaxes

Ruby's regular expressions are unusually powerful. Postgres' regular expressions are not as powerful, but they come close; close enough that it's possible to do many pattern-based queries and string transformations entirely in a query.

And sometimes, it's very useful to have a single regular expression that works in both environments, so that one can reliably query for data in a particular format and then transform the results in Ruby using the same pattern.

In any case, if you're working in a Postgres-based Rails app, you occasionally need to work with both kinds of regular expression. Unfortunately, there are subtle differences between the syntaxes that are hard to remember. Furthermore, the two systems organize their regexp documentation is different ways, so it's hard to even map between them easily. This table is aimed at making it easier.

(The Regular-Expressions.info site does support both of these flavors of regular expressions, and lets you do side-by-side comparisons, although finding the Postgres flavor isn't obvious. But I find it helpful to have all of the information on one page; and the process of building this file taught me a lot about both flavors of regular expression.)

Operators and Functions

Ruby Postgres Explanation
=~ ~ matches
~* matches (case-independent)
!~ !~ does not match
!~* does not match (case-independent)
string.match(re) returns a MatchData instance or nil if re does not match string
re.match(string) returns a MatchData instance or nil if re does not match string
substring(string from pattern) returns NULL if no match, the match text for the first capture group (if any), or the entire matched substring
regexp_replace(source, pattern, replacement [, flags ]) returns source, transformed per pattern and replacement (and flags, which can include g)
regexp_matches(string, pattern [, flags ]) returns no rows (if no match); a row containing an array of capture matches (if the pattern uses capture groups), or a row containing the entire matched substring. (The g flag, if supplied, results in one returned row for each match in string.)
regexp_split_to_table(string, pattern [, flags ]) splits string using pattern as a delimiter, returning a row for each fragment (ignoring zero-length fragments)
regexp_split_to_array(string, pattern [, flags ]) same as regexp_split_to_table except it returns an array of strings rather than rows

Alternation

Ruby Postgres Explanation
| | combines two expressions into a single one that matches either of the expressions; each expression is an alternative

Atoms

Atoms match a sequence of one or more characters (or zero or more characters, in the case of subpatterns).

Ruby Postgres Explanation
(re) (re) a sub-pattern (capturing)
(?<name>re) a sub-pattern (named capture)
(?"name"re) a sub-pattern (named capture)
(?:re) (?:re) a sub-pattern (non-capturing)
(?>re) a sub-pattern (atomic, non-capturing)
. . any single character
[chars] [chars] a character class
[^*chars*] [^*chars*] a negated character class
\k \k where k is non-alphanumeric: matches k
\c \c where c is alphanumeric: an escape
{ { if followed by a digit, introduces a bound quantifier; otherwise matches {
x x other characters match themselves

(Additionally, backreferences and escapes function as atoms.)

Quantifiers

Quantifiers can follow atoms, and they change the number of occurrences of the atom that can be matched. (Without a subsequent quantifier, an atom will always match exactly one consecutive occurrence.)

Greedy quantifiers match as many occurrences as possible while still allowing the overall match to succeed. By contrast, lazy quantifiers match the fewest number of occurrences as possible for overall success. A possessive quantifier does not backtrack once it has matched. It behaves like a greedy quantifier, but having matched it refuses to “give up” its match even if this jeopardises the overall match. This is sometimes helpful for optimizing the performance of a regular expression.

Ruby Postgres Explanation
* * zero or more (greedy)
+ + one or more (greedy)
? ? zero or one (greedy)
{m} {m} exactly m
{m,} {m,} at least m (greedy)
{,m} at most m (greedy)
{m,n} {m,n} at least m but no more than n (greedy)
*? *? zero or more (lazy)
+? +? one or more (lazy)
?? ?? zero or one (lazy)
{m}? {m}? exactly m
{m,}? {m,}? at least m (lazy)
{,m}? at most m (lazy)
{m,n}? {m,n}? at least m but no more than n (lazy)
*+ zero or more (possessive)
++ one or more (possessive)
?+ zero or one (possessive)
{m}+ exactly m
{m,}+ at least m (possessive)
{,m}+ at most m (possessive)
{m,n}+ at least m but no more than n (possessive)

Constraints / Anchors

Ruby Postgres Explanation
^ ^ beginning of line
\A \A beginning of string
$ $ end of line
\Z \Z end of string (just before a terminating newline, if any)
\z end of string
(?=re) (?=re) empty string when following characters match re
(?!re) (?!re) empty string when following characters do not match re
(?<=re) empty string when preceding characters match re
(?<!re) empty string when preceding characters do not match re
\b \y word boundary
\B \Y non-word-boundary
\m beginning of word
\M end of word

Character Classes

The following syntaxes are available within character class (bracket) expressions to enumerate the characters included (or excluded, in the case of negated classes).

Ruby Postgres Explanation
c c a single character
a-z a-z (where a and z are any characters) the entire sequence of characters from a to z inclusive
[class] embedded class (OR'd with neighboring characters)
&& intersection operator on embedded classes : [a-w&&[^c-g]z] means "([a-w] AND ([^c-g] OR z))" (in other words, [abh-w])
[:alnum:] [:alnum:] all alphanumeric characters
[:alpha:] [:alpha:] all alphabetic characters
[:blank:] [:blank:] all space and tab characters
[:cntrl:] [:cntrl:] all control characters
[:digit:] [:digit:] all decimal digits
[:graph:] [:graph:] all non-blank characters
[:lower:] [:lower:] all lowercase characters
[:print:] [:print:] all printable characters (like [:graph:], but includes space)
[:punct:] [:punct:] all punctuation characters
[:space:] [:space:] all whitespace characters (except tab)
[:upper:] [:upper:] all uppercase characters
[:xdigit:] [:xdigit:] all hexadecimal digits (same as [:digit:]A-Fa-f)
[:word:] all characters in the following Unicode general categories: Letter, Mark, Number, Connector_Punctuation
[:ascii:] all ASCII characters

Backreferences

Ruby Postgres Explanation
\nnn \nnn (up to three digits, with no leading zeroes) a backreference if not greater than the number of capturing groups; octal escape otherwise
\k<name> reference to a named capture
\g<name> reference to a named subpattern (re-evaluates the subpattern, rather than matching the same text)

Substitution Backreferences

In Ruby, information from the match may be used in any of three ways: as global variables ($something), as substitution references in the second argument of sub or gsub (\\something) or by calling into the MatchData object returned from match (md.something or md[:something]).

In Postgres, these can be used in the second argument to regexp_replace.

Ruby Postgres Explanation
$-, Regexp.last_match the MatchData object from the most recent match
$&, \&, md[0] \& the complete matched text
$`, \`, md.pre_match the text of the string preceding the match
$', \', md.post_match the text of the string after the match
$1, \1, md[1] \1 the first capture group (and so on for other numbered capture groups)
$+, \+ the last capture group
\k<name>, md[:name] the named capture group with name name

Escapes

Pattern Escapes

Ruby Postgres Explanation
\d \d [[:digit:]]
\D \D [^[:digit:]]
\h [[:xdigit:]]
\H [^[:xdigit:]]
\s \s [[:space:]]
\S \S [^[:space:]]
\w \w [[:alnum:]_] (note underscore is included)
\W \W [^[:alnum:]_] (note underscore is included)
\p{Alnum} Alphabetic and numeric character
\p{Alpha} Alphabetic character
\p{Blank} Space or tab
\p{Cntrl} Control character
\p{Digit} Digit
\p{Graph} Non-blank character (excludes spaces, control characters, and similar)
\p{Lower} Lowercase alphabetical character
\p{Print} Like \p{Graph}, but includes the space character
\p{Punct} Punctuation character
\p{Space} Whitespace character ([:blank:], newline, carriage return, etc.)
\p{Upper} Uppercase alphabetical
\p{XDigit} Digit allowed in a hexadecimal number (i.e., 0-9a-fA-F)
\p{Word} A member of one of the following Unicode general category Letter, Mark, Number, Connector_Punctuation
\p{ASCII} A character in the ASCII character set
\p{Any} Any Unicode character (including unassigned characters)
\p{Assigned} An assigned character
\p{L} Any character with Unicode General Category 'Letter'
\p{Ll} Any character with Unicode General Category 'Letter: Lowercase'
\p{Lm} Any character with Unicode General Category 'Letter: Mark'
\p{Lo} Any character with Unicode General Category 'Letter: Other'
\p{Lt} Any character with Unicode General Category 'Letter: Titlecase'
\p{Lu} Any character with Unicode General Category 'Letter: Uppercase
\p{Lo} Any character with Unicode General Category 'Letter: Other'
\p{M} Any character with Unicode General Category 'Mark'
\p{Mn} Any character with Unicode General Category 'Mark: Nonspacing'
\p{Mc} Any character with Unicode General Category 'Mark: Spacing Combining'
\p{Me} Any character with Unicode General Category 'Mark: Enclosing'
\p{N} Any character with Unicode General Category 'Number'
\p{Nd} Any character with Unicode General Category 'Number: Decimal Digit'
\p{Nl} Any character with Unicode General Category 'Number: Letter'
\p{No} Any character with Unicode General Category 'Number: Other'
\p{P} Any character with Unicode General Category 'Punctuation'
\p{Pc} Any character with Unicode General Category 'Punctuation: Connector'
\p{Pd} Any character with Unicode General Category 'Punctuation: Dash'
\p{Ps} Any character with Unicode General Category 'Punctuation: Open'
\p{Pe} Any character with Unicode General Category 'Punctuation: Close'
\p{Pi} Any character with Unicode General Category 'Punctuation: Initial Quote'
\p{Pf} Any character with Unicode General Category 'Punctuation: Final Quote'
\p{Po} Any character with Unicode General Category 'Punctuation: Other'
\p{S} Any character with Unicode General Category 'Symbol'
\p{Sm} Any character with Unicode General Category 'Symbol: Math'
\p{Sc} Any character with Unicode General Category 'Symbol: Currency'
\p{Sc} Any character with Unicode General Category 'Symbol: Currency'
\p{Sk} Any character with Unicode General Category 'Symbol: Modifier'
\p{So} Any character with Unicode General Category 'Symbol: Other'
\p{Z} Any character with Unicode General Category 'Separator'
\p{Zs} Any character with Unicode General Category 'Separator: Space'
\p{Zl} Any character with Unicode General Category 'Separator: Line'
\p{Zp} Any character with Unicode General Category 'Separator: Paragraph'
\p{C} Any character with Unicode General Category 'Other'
\p{Cc} Any character with Unicode General Category 'Other: Control'
\p{Cf} Any character with Unicode General Category 'Other: Format'
\p{Cn} Any character with Unicode General Category 'Other: Not Assigned'
\p{Co} Any character with Unicode General Category 'Other: Private Use'
\p{Cs} Any character with Unicode General Category 'Other: Surrogate'
\p{script} Any character from the Unicode script, where script is one of Arabic, Armenian, Balinese, Bengali, Bopomofo, Braille, Buginese, Buhid, Canadian_Aboriginal, Carian, Cham, Cherokee, Common, Coptic, Cuneiform, Cypriot, Cyrillic, Deseret, Devanagari, Ethiopic, Georgian, Glagolitic, Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hiragana, Inherited, Kannada, Katakana, Kayah_Li, Kharoshthi, Khmer, Lao, Latin, Lepcha, Limbu, Linear_B, Lycian, Lydian, Malayalam, Mongolian, Myanmar, New_Tai_Lue, Nko, Ogham, Ol_Chiki, Old_Italic, Old_Persian, Oriya, Osmanya, Phags_Pa, Phoenician, Rejang, Runic, Saurashtra, Shavian, Sinhala, Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Vai, and Yi

(Any of the above escapes of the form \p{something} can be negated by using the ^ character, as \p{^something}.)

Literal Character Escapes

Ruby Postgres Explanation
\\ \\ backslash
\a \a alert (bell) character, as in C
\b backspace (only in character class)
\b backspace, as in C
\B synonym for backslash (\) to help reduce the need for backslash doubling
\cX (where X is any character) the character whose low-order 5 bits are the same as those of X, and whose other bits are all zero
\e the escape character
\e the character whose collating-sequence name is ESC, or failing that, the character with octal value 033
\f \f form feed, as in C
\n \n newline, as in C
\r \r carriage return, as in C
\t \t horizontal tab, as in C
\uhhhh (where hhhh is exactly four hexadecimal digits) the character whose hexadecimal value is 0xhhhh
\Uhhhhhhhh (where hhhhhhhh is exactly eight hexadecimal digits) the character whose hexadecimal value is 0xhhhhhhhh
\v \v vertical tab, as in C
\xhhh (where hhh is any sequence of hexadecimal digits) the character whose hexadecimal value is 0xhhh (a single character no matter how many hexadecimal digits are used)
\0 the character whose value is 0 (the null byte)
\oo (where oo is exactly two octal digits, and is not a back reference) the character whose octal value is 0oo
\ooo (where ooo is exactly three octal digits, and is not a back reference) the character whose octal value is 0ooo

Options

In Ruby, the single-letter version of an option can be specified after the closing delimiter of the regexp. Options i, m, and x can also be embedded within the expression using the (?on-off:re) syntax, which turns on options on and turns off options off while interpreting subpattern re. The Regexp::CONSTANT version can be passed as the second parameter to Regexp.new (optionally combined with other constants with |).

In Postgres, the options can be included at the very start of an expression (possibly after an initial ***=) using the syntax (?opts); the options included in the string opts are in effect for the entire regular expression. The options can also be included in a string passed as a parameter to various pattern-related functions.

Do not assume that an option means the same thing in Ruby and Postgres just because it has the same letter! Consult the details of Ruby and Postgres options.

Ruby Postgres Explanation
i or Regexp::IGNORECASE case-insensitive
m or Regexp::MULTILINE multiline (treat newline as a character matched by .)
x or Regexp::EXTENDED ignore whitespace and comments in pattern
o perform #{} interpolation only once
u regexp is encoded as UTF-8
e regexp is encoded as EUC-JP
s regexp is encoded as Windows-31J
n regexp is encoded as ASCII-8BIT
***: (at beginning of pattern) the rest of the pattern is an ARE (Postgres advanced regular expression)
***= (at beginning of pattern) the rest of the pattern is a literal string
b rest of RE is a BRE (POSIX basic regular expression)
c case-sensitive matching (overrides operator type)
e rest of RE is an ERE (POSIX extended regular experession)
i case-insensitive matching (overrides operator type)
m historical synonym for n
n newline-sensitive matching
p partial newline-sensitive matching
q rest of RE is a literal ("quoted") string, all ordinary characters
s non-newline-sensitive matching (default)
t tight syntax (default)
w inverse partial newline-sensitive ("weird") matching
x expanded syntax
g (with regexp_replace and regexp_matches only) operate on all matches, not just the first
@masasakano
Copy link

As for the newline-related Regexp options of PostgreSQL (snwp), this is my understanding:

  • PostgreSQL default ("s"): No Ruby counterpart. (I think: . matches a newline and ^ is equivalent to \A.)
  • PostgreSQL "n" (=non-newline-sensitive; alias is "m") option: Ruby Default – Regexp without Regexp::MULTILINE (Ruby "m")
  • PostgreSQL "w" (=weird!) option: Ruby with Regexp::MULTILINE ("m") option
  • PostgreSQL "p" (=partial): No Ruby counterpart. Reverse of Postgres-"w" (. does not match a newline whereas ^ is equivalent to \A).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment