Skip to content

Instantly share code, notes, and snippets.

@hectoregm
Created November 28, 2011 20:30
Show Gist options
  • Save hectoregm/1401897 to your computer and use it in GitHub Desktop.
Save hectoregm/1401897 to your computer and use it in GitHub Desktop.
Regexp, Ruby and You

Regular Expressions: The Good Parts (Ruby Edition)

Introduction

So what are regular expressions?, Wikipedia has a nice definition:

Regular expressions provides a concise and flexible means for "matching" (specifying and recognizing) strings of text, such as particular characters, words, or patterns of characters. Abbreviations for "regular expression" include "regex" and "regexp".

Regexp are present in each facet of modern computing, from command-line utilities, to text editors up to programming languages.

History

Regular expressions originate from automata theory and formal language theory. These fields study models of computation (automata) and ways to describe and classify formal languages. Using this formal knowledge Ken Thompson implemented an adaptation of regular expressions for searching text in the QED editor and later in the ed editor. An explosion of variations of Thompson adaptations came to being around the UNIX system including grep, emacs, vi and lex.

Syntax

The Regexp class implements regular expressions in Ruby, being an integral part of the language it has a literal form:

/<regexp>/<modifier>     # standard literal form, modifiers are optional
%r|<regexp>|<modifier>   # alternate form with delimiters choosen, like %Q literals.

Where <regexp> is a regular expression enclosed in slashes after which there are optional modifiers that change the behavior of the match. In a regular expression most characters are treated as literals, i.e. they only match themselves.

/Ruby!/        # matches "Ruby!"
/Coffeescript/ # matches "Coffeescript"

The exceptions are called metacharacters, I will describe them next.

. (Dot)

The dot matches any single character, except newline. Using the multiline modifier ('m') the dot will match newline.

[ ] (Character Classes)

The character classes define a set of literals that will match a single character, you can define a range of literals so you don't have to type all numbers or characters. If the character class begins with ^ then any character NOT in the class will match.

/[0-9]/     # matches any digit
/[^0-9]/    # matches anything other than a digit
/[0-9a-z]/  # matches any digit or downcase ASCII character
/[Rr]uby!/  # matches "Ruby!" or "ruby!"

?, +, *, { } (Quantifiers)

Quantifiers attempt to match the preceding expression a set of number of times:

<regexp>?        # matches zero or one time the previous expression
<regexp>+        # matches one or more times the previous expression
<regexp>*        # matches zero or more times the previous expression
<rexgexp>{m[,n]} # matches at least m but at most n times the previous expression

/ruby!?/     # matches "ruby" or "ruby!" the ! is optional
/ruby!+/     # matches "ruby" plus one or more !s
/ruby!*/     # matches "ruby" plus zero or more !s
/[0-9]{3}/   # matches exactly 3 digits
/[0-9]{3,5}/ # matches 3, 4 or 5 digits

By default the repetition is greedy, i.e the regex engine tries to repeat the pattern as often as possible. You can define a lazy repetition in which the regex engine tries to repeat the pattern as few times as possible.

/<.*>/   # matches "<p><span></span></p>"
/<.*?>/  # matches "<p>" in "<p><span></span></p>"
         # also lazy ??, +?, {n,m}?

( ) (Capturing group)

Subexpressions enclosed in parenthesis define a single syntactic unit that can be used with quantifiers, and it captures the matched text to be used later.

/[0-9][0-9]+/     # matches two or more digits
/([0-9][0-9])+/   # matches one or more pairs of numbers
/([Rr]uby(, )?)+/ # matches "Ruby", "Ruby, ruby, ruby"

| (Set union)

The alternation or choice operator matches either the expression before or the expression after the operator

/ruby|python/    # matches "ruby" or "python"
/ho(me|ur)/      # matches "home" or "hour"

^, $ (Anchors)

Anchors do not match characters but instead match the zero-width positions between characters, "anchoring" the match to a position at which a specific condition holds.

\[0-9]* (Backreferences)

It matches what the nth capturing group matched. Backreferences allows you to reuse a subexpression match.

/(.+)\1/                              # matches repeated words like "WikiWiki", "foofoo", etc
result = /(.+)\1/.match "WikiWiki"    # #<MatchData "WikiWiki" 1:"Wiki">
result.captures[0]                    # "Wiki"
                                      # each element in the captures array contains the
                                      # matched text of a capturing group

Modifiers

A regular expression literal can be followed by one or more optional flag characters that specify additional information about how the pattern matching is to be done.

/<regexp>/i       # ignore case when matching
/<regexp>/m       # allow . to match newline
/<regexp>/o       # Perform interpolations only once
/<regexp>/x       # allow whitespace and comments in regexp
/<regexp>/[neus]  # encoding: ASCII, EUC, UTF-8, SJIS

Further Reading

This is an overview of the syntax that is supported for regular expressions in Ruby, several more advanced features like named captures, lookahead and lookbehind have been omitted.

This material can get you to the next level:

String and Regexp

Regexp are deeply integrated in almost all scripting programming languages, Ruby is not an exception having Regexp as a core part of the language. Regexp and String have a symbiotic relationship that is made present given the amount of String methods that take regexp as argument, e.g. slice, gsub, sub, index, scan and split.

Good Parts

Powerful And Fast

"With great power comes great responsibility." - Stan Lee

One of the main strengths of regular expressions is the expressive power they can pack:

/4[0-9]{12}(?:[0-9]{3})?/                       # matches all Visa card numbers
/[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[A-Z]{2,4}/  # matches 99.99% of all email addresses

With less than 40 characters one can define a regular expression to match almost all email addresses or all valid Visa card numbers. For most cases the matching is super fast, but as we will see in the Ugly Parts there are some bumps in the road.

Bad Parts

What the hell was I thinking ?

So at the start of a Rails project you felt cocky and made one hell of a regexp to validate the email of a user. Months later a bug report comes in that users from a new top level domain are getting validation errors for their email at registration so your start up your trusty editor to find a complex regexp starring right back at you, and in that moment you swear to never again use regular expressions.

Like in other similar situations the presence of good comments can save the day. While properly documenting a regexp is not easy feat it's very doable using the x modifier:

email_matcher = /[A-Z0-9._%+-]+         # local-part
                @                       # @
                [A-Z0-9.-]+\.[A-Z]{2,4} # domain
                /ix

While I will like to tell you that some whitespace and formatting using the 'x' flag is all you will ever need the reality is that given the high compactness of the syntax some expressions and still be a bit difficult to fully grok.

One regexp to bind them all

"Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems." - Jamie Zawinski

One of the most frequent ways to get in trouble using regular expression is try to define a big badass expression that will handle all our needs. In the Good Parts we showed a regexp to match up to 99.99% of all valid email addresses, we could define a perfect expression to be able to match any email conformant to RFC5322 but it will be massive and unreadable. By making trade-offs we can define a much readable expression and still be able to match up to 99.99% of the set.

So before rushing to make a big badass regexp:

  • Ponder if what you think is a pattern is really a pattern.
  • Know exactly what are you trying to match, and what not.
  • Accept possible trade-offs between what's exact and what is practical

Ugly Parts

Exponential time in worst case

While almost all regexp implementations follow the Thompson notation, a division in the internal implementation of regular expressions have taken place over the years. The addition of features like backreferences provides an expressive power that exceeds that of regular languages, so a backtracking algorithm was needed. While this new expressive power is very welcome it brings along performance issues that the original algorithm didn't had, the backtracking algorithm runs in exponential time in the worst case while the original runs in linear time for all inputs.

Ruby supports backrefences so it uses a backtracking implementation so a programmer should be alert that some expressions can trigger exponential times for certain inputs, e.g:

hector@courant ~/regexp $ time echo "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" | grep -E "(a?){1,40}a{40}"
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

real	0m0.012s
user	0m0.004s
sys 	0m0.005s

hector@courant ~/regexp $ time ruby -e 'p ("a"*40).match /(a?){1,40}a{40}/'
#<MatchData "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" 1:"">

real	1m44.143s
user	1m39.664s
sys 	0m0.504s

For a detail exposure of this problem I recommend the following essay: Regular Expression Matching Can Be Simple And Fast.

Conclusions

Regular expressions are a potent tool to search in a sea of strings for the information that you want. Its power comes with a price, the need to be proactive in the implementation of good practices to prevent huge unreadable expressions. As Ruby programmers we have to test the behavior of our expressions to detect exponential run times given our expected input.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment