Use http://rubular.com/ or https://jex.im/regulex/#!flags=&re=%5E(a%7Cb)*%3F%24 to test your regex.
#####Regex Basics
- [abc] A single character of: a, b, or c
- [^abc] Any single character except: a, b, or c
- [a-z] Any single character in the range a-z
- [a-zA-Z] Any single character in the range a-z or A-Z
- ^ Start of line
- $ End of line
- \A Start of string
- \z End of string
- . Any single character (except a newline, unless the /m flag is set)
- \s Any whitespace character
- \S Any non-whitespace character
- \d Any digit
- \D Any non-digit
- \w Any word character (letter, number, underscore)
- \W Any non-word character
- \b Any word boundary
- (...) Capture everything enclosed
- (a|b) a or b
- a? Zero or one of a
- a* Zero or more of a
- a+ One or more of a
- a{3} Exactly 3 of a
- a{3,} 3 or more of a
- a{3,6} Between 3 and 6 of a
- ? (question mark) denotes optionality. It matches either zero or one of the preceding character or group. For example, the pattern ab?c will match either of the strings "abc" or "ac" because the b is optional.
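A few of the constructs above can be checked directly in Ruby. This is only a small sketch; the sample strings are made up for illustration.

```ruby
# Sample strings are invented for illustration only.
set_match     = "xyzc"[/[abc]/]       # first character that is a, b, or c
negated_match = "abcx"[/[^abc]/]      # first character that is none of a, b, c
digits        = "order 66"[/\d+/]     # one or more digits
bounded       = "aaaaaaaa"[/a{3,6}/]  # between 3 and 6 of a (takes as many as allowed)

# ab?c accepts "abc" and "ac" but not "abbc", since b may appear at most once.
optional = ["abc", "ac", "abbc"].grep(/\Aab?c\z/)

puts set_match      # c
puts negated_match  # x
puts digits         # 66
puts bounded        # aaaaaa
p optional          # ["abc", "ac"]
```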
#####Useful Regex matches
- (.*) groups zero or more of any character
- /^fill_and_stroke_(.*)/ captures everything that follows the prefix "fill_and_stroke_" at the start of a line (the captured text is saved in $1)
#####Ruby's $1 and \1 variables (see: http://www.regular-expressions.info/ruby.html)
- $1, $2, etc. are special variables that hold the text matched by the first, second, etc. parenthesized groups of the last regular expression match.
- To re-insert the regex match, use \0 in the replacement string. You can use the contents of capturing groups in the replacement string with backreferences \1, \2, \3, etc.
Example:
"fill_and_stroke_some_method" =~ /^fill_and_stroke_(.*)/
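$1 => "some_method"
\1 => syntax error, unexpected $undefined (a bare \1 is not valid Ruby outside a regex or string)
'\1' => "\\1" (an ordinary string; it only acts as a backreference inside a replacement string)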
"foobar".sub(/foo(.*)/, '\1\1') => "barbar"
"foobar".sub(/foo(.*)/, $1 + $1) => "barbar"
"foobar".sub(/foo(.*)/, $1 + '\1') => "barbar"
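In summary, \1 and $1 refer to the same captured text, but \1 is only meaningful inside the regex or the replacement string, where it always refers to the current match. $1 is a regular Ruby expression that is evaluated before sub is called, so in the last two lines it holds "bar" only because the preceding sub call had already set it.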
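The timing difference is easy to miss, so here is a minimal sketch of it. It primes $1 with an unrelated match first; the variable names are made up for illustration.

```ruby
# Prime $1 with an unrelated match.
"fill_and_stroke_some_method" =~ /^fill_and_stroke_(.*)/
stale = $1

# $1 + $1 is evaluated BEFORE sub runs, so it uses the capture from the
# earlier match, not from /foo(.*)/.
surprising = "foobar".sub(/foo(.*)/, $1 + $1)

# '\1' is interpreted by sub itself, so it refers to the current match.
with_backref = "foobar".sub(/foo(.*)/, '\1\1')

# The block form runs after the match, so $1 is up to date inside the block.
with_block = "foobar".sub(/foo(.*)/) { $1 + $1 }

puts stale         # some_method
puts surprising    # some_methodsome_method
puts with_backref  # barbar
puts with_block    # barbar
```

When the replacement needs Ruby logic rather than plain text, the block form is usually the safer choice.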
$1 through $9 hold the capture groups of the most recent pattern match (they are nil if that match failed).
>> "hello world".match(/(hello) (world)/)
=> #<MatchData:0x12b06f0>
>> $1
=> "hello"
>> $2
=> "world"
>> $3
=> nil
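As a small sketch of how these variables behave over time: Regexp.last_match is the method form of the same data, and a failed match resets the variables instead of leaving the old values in place.

```ruby
"hello world".match(/(hello) (world)/)
first       = $1                     # capture from the match above
same_method = Regexp.last_match(1)   # method form of $1

# A failed match clears the special variables.
"hello world".match(/(goodbye)/)
after_failure = $1

p first          # "hello"
p same_method    # "hello"
p after_failure  # nil
```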
The following two lines of code are equivalent:
name = line[/\bN\s+(\.?\w+)\s*;/, 1]
name = line =~ /\bN\s+(\.?\w+)\s*;/ && $1
The String#[] method: str[regexp, capture] → new_str or nil. If a Regexp is supplied, the matching portion of the string is returned. If a capture follows the regular expression (either a capture group index or a group name), that component of the MatchData is returned instead.
>> "hello world"[/(hello) (world)/]
=> "hello world"
>> "hello world"[/(hello) (world)/, 0]
=> "hello world"
>> "hello world"[/(hello) (world)/, 1]
=> "hello"
>> "hello world"[/(hello) (world)/, 2]
=> "world"
>> "hello world" =~ /(hello) (world)/
=> 0
>> "hello world" =~ /(hello) (world)/ && $1
=> "hello"
>> "hello world" =~ /(hello) (world)/ && $2
=> "world"
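The capture argument can also be a group name. A minimal sketch, assuming named groups called salutation and target (both names are invented for this example):

```ruby
greeting = "hello world"
# (?<name>...) defines a named capture group.
pattern = /(?<salutation>hello) (?<target>world)/

by_name  = greeting[pattern, "target"]      # look up the group by name
by_name2 = greeting[pattern, "salutation"]
no_match = "goodbye"[pattern, "target"]     # nil when the regex does not match

p by_name   # "world"
p by_name2  # "hello"
p no_match  # nil
```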
#####From Chapter 4 of the book Ruby Best Practices
If you want to match the name “James Gray” but also match “James gray”, “james Gray”, and “james gray”, the following code will do the trick:
# In regex [abc] matches a single character of: a, b, or c
>> ["James Gray", "James gray", "james gray", "james Gray"].all? { |e| e.match(/[Jj]ames [Gg]ray/) }
=> true
To match a four-digit number, we can write:
/\d{4}/
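Note that this pattern is unanchored, so it also matches four digits inside a longer run. The sketch below compares it with two stricter variants; the input strings are made up for illustration.

```ruby
inputs = ["2024", "12345", "year 2024", "20a4"]

# Unanchored: any string containing four consecutive digits.
loose = inputs.grep(/\d{4}/)

# Word boundaries: a run of exactly four digits somewhere in the string.
standalone = inputs.grep(/\b\d{4}\b/)

# Anchored to the whole string: nothing but four digits.
strict = inputs.grep(/\A\d{4}\z/)

p loose       # ["2024", "12345", "year 2024"]
p standalone  # ["2024", "year 2024"]
p strict      # ["2024"]
```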
Oftentimes we mean “match this phrase,” but we write “match this sequence of characters.” The solution is to make use of anchors to clarify what we mean. Sometimes we want to match only if a string starts with a phrase:
>> phrases = ["Mr. Gregory Browne", "Mr. Gregory Brown is cool", "Gregory Brown is cool", "Gregory Brown"]
>> phrases.grep /\AGregory Brown\b/
=> ["Gregory Brown is cool", "Gregory Brown"]
Other times we want to ensure that the string contains the phrase:
>> phrases.grep /\bGregory Brown\b/
=> ["Mr. Gregory Brown is cool", "Gregory Brown is cool", "Gregory Brown"]
Sometimes we want to ensure that the string matches an exact phrase:
>> phrases.grep /\AGregory Brown\z/
=> ["Gregory Brown"]
The key thing to take away from this is that anchors don’t actually match characters. Instead, they match between characters to allow you to assert certain expectations about your strings.
The full list of available anchors in Ruby is \A , \Z , \z , ^ , $ , and \b .
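The line anchors and the string anchors only differ on multi-line input. A minimal sketch with a made-up two-line string:

```ruby
text = "first line\nGregory Brown\n"

line_start   = text.match?(/^Gregory/)    # ^ matches at the start of any line
string_start = text.match?(/\AGregory/)   # \A matches only at the start of the string

line_end     = text.match?(/Brown$/)      # $ matches at the end of any line
before_final = text.match?(/Brown\Z/)     # \Z also allows one trailing newline
string_end   = text.match?(/Brown\z/)     # \z is the absolute end of the string

p line_start    # true
p string_start  # false
p line_end      # true
p before_final  # true
p string_end    # false
```

This is why \A and \z are the safer choice when validating a whole string.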
Use Caution When Working with Quantifiers
Don't use .* everywhere. If you really mean “at least one,” use + instead.
# this works
>> "1234Foo"[/(\d*)Foo/,1]
=> "1234"
# but this returns an empty string
>> "xFoo"[/(\d*)Foo/,1]
=> ""
# so this will fail: "" is truthy in Ruby, and Integer("") raises ArgumentError
if num = string[/(\d*)Foo/,1]
Integer(num)
end
# instead use this
if num = string[/(\d+)Foo/,1]
Integer(num)
end
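The same idea as a small helper. This is a sketch only: number_before_foo is a hypothetical name, and passing base 10 to Integer is an added assumption so that digits with a leading zero are not read as octal.

```ruby
# Hypothetical helper: returns the integer written directly before "Foo",
# or nil when no digits precede it.
def number_before_foo(string)
  digits = string[/(\d+)Foo/, 1]
  digits && Integer(digits, 10)
end

# For comparison: with \d* the capture is "", which Integer rejects.
empty_capture = "xFoo"[/(\d*)Foo/, 1]
error_class = begin
  Integer(empty_capture)
rescue ArgumentError => e
  e.class
end

p number_before_foo("1234Foo")  # 1234
p number_before_foo("xFoo")     # nil
p error_class                   # ArgumentError
```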
Unbounded zero-matching quantifiers are tautologies. They can never fail to match, so you need to be sure to account for that. For example, if you want a pattern to match Greg or Gregory, use word boundaries:
>> "Gregory"[/Greg(ory)?/]
=> "Gregory"
>> "Greg"[/Greg(ory)?/]
=> "Greg"
# this is wrong: "Gregor" should not match, but the optional group lets "Greg" match
>> "Gregor"[/Greg(ory)?/]
=> "Greg"
>> "Gregory"[/\bGreg(ory)?\b/]
=> "Gregory"
>> "Greg"[/\bGreg(ory)?\b/]
=> "Greg"
# this is correct
>> "Gregor"[/\bGreg(ory)?\b/]
=> nil
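The bounded pattern can also pull names out of longer text. A minimal sketch with a made-up sentence; it uses a non-capturing group (?:...), which is not covered above, because scan returns only the groups when a pattern contains capturing ones.

```ruby
sentence = "Greg met Gregory and Gregor"

# (?:ory)? is optional like (ory)? but does not create a capture group,
# so scan returns the whole matches.
names = sentence.scan(/\bGreg(?:ory)?\b/)

p names  # ["Greg", "Gregory"]
```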
Quantifiers are greedy by default. This means they’ll try to consume as much of the string as possible while still letting the overall pattern match. The following is an example of a greedy match:
# this code matches everything between the first and last # character
>> "# x # y # z #"[/#(.*)#/,1]
=> " x # y # z "
# to match from the left and stop as soon as there is a match, append a ? to the quantifier
>> "# x # y # z #"[/#(.*?)#/,1]
=> " x "
All quantifiers can be made nongreedy this way.
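For instance, the same trailing ? works on + and on {n,m}. A short sketch with made-up inputs:

```ruby
html = "<b>bold</b> and <i>italic</i>"

greedy_tag = html[/<.+>/]    # runs from the first < to the last >
lazy_tag   = html[/<.+?>/]   # stops at the first >

greedy_digits = "123456"[/\d{2,4}/]    # takes as many as allowed
lazy_digits   = "123456"[/\d{2,4}?/]   # takes as few as allowed

p greedy_tag     # "<b>bold</b> and <i>italic</i>"
p lazy_tag       # "<b>"
p greedy_digits  # "1234"
p lazy_digits    # "12"
```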
Key things to remember are:
- Regular expressions are nothing more than a special language for find-and-replace operations, built on simple logical constructs.
- There are lots of shortcuts built in for common regular expression operations, so be sure to make use of special character classes and other simplifications when you can.
- Anchors provide a way to set up some expectation about where in a string you want to look for a match. These help with both optimization and pattern correctness.
- Quantifiers such as * and ? can match zero characters, so they will always match; they should not be used without sufficient boundaries.
- Quantifiers are greedy by default, and can be made nongreedy via ?.