- A regular expression defines a set of strings, called its language
- If a string matches the regular expression, it's in that set; if it doesn't match, it isn't
- There are only three operations (because math strives to be minimal)
- Concatenation - "a concatenated with b" is just `ab`
- Or - written `a|b` in most regex libraries
- Kleene Star - zero or more of the expression it's applied to; written `a*`
- There are also parentheses, but they're not technically an operation. They're just how you denote what the arguments to your operations are.
- Languages (sets of strings) which can be recognized by regular expressions are called regular languages.
- Not every set of strings is a regular language
- The next larger class of languages in common use is the context-free languages
- XML, for example, is not a regular language (tags can nest arbitrarily deep); it is usually treated as context free
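To make the limit concrete, here is a minimal sketch using Python's `re` module. Balanced parentheses stand in for nested tags. The helper names `nested_parens_regex` and `is_balanced` are made up for this illustration. A regular expression can be written for any fixed nesting depth, but each one breaks one level deeper, while a simple counter handles every depth.

```python
import re

def nested_parens_regex(max_depth):
    """Build a regex for balanced parentheses nested at most max_depth deep."""
    pattern = ""  # depth 0: only the empty string
    for _ in range(max_depth):
        # One more level: zero or more "( <previous level> )" groups.
        pattern = r"(\(" + pattern + r"\))*"
    return pattern

def is_balanced(text):
    """Check balanced parentheses of any depth with a counter (not a regex)."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ")" with no matching "("
                return False
        else:
            return False   # only parentheses are in this alphabet
    return depth == 0

depth3 = nested_parens_regex(3)
print(re.fullmatch(depth3, "((()))") is not None)    # True: depth 3 fits
print(re.fullmatch(depth3, "(((())))") is not None)  # False: depth 4 does not
print(is_balanced("(((())))"))                       # True: the counter has no depth limit
```

Raising `max_depth` only moves the problem: whatever depth you pick, a string nested one level deeper falls outside the language.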
- `abc` ⇒ L = {"abc"}
- `a*` ⇒ L = {"", "a", "aa", "aaa", ...}
- `abc*` ⇒ L = {"ab", "abc", "abcc", ...}
- `a(bc)*` ⇒ L = {"a", "abc", "abcbc", "abcbcbc", ...}
- `a*b*` ⇒ L = {"", "a", "b", "ab", "aab", "abb", ...} (any number of 'a's followed by any number of 'b's, including none of either)
- `a|b` ⇒ L = {"a", "b"}
- `a|b|c|d` ⇒ L = {"a", "b", "c", "d"}
- `one|two|(1|2|3)*|three` ⇒ L = {"one", "two", "three", "", "1", "11", ..., "132", "2331", ...} (every string made up only of 1, 2, and 3 is in the language)
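The examples above can be checked mechanically. This is a small sketch assuming Python's `re` module; `in_language` is a hypothetical helper name. It uses `re.fullmatch`, which asks whether the whole string is in the language rather than whether some part of it matches.

```python
import re

def in_language(pattern, text):
    """Return True if the whole of `text` is in the language of `pattern`."""
    return re.fullmatch(pattern, text) is not None

print(in_language(r"abc*", "ab"))       # True
print(in_language(r"abc*", "abcc"))     # True
print(in_language(r"abc*", "abcabc"))   # False: the star only applies to c
print(in_language(r"a(bc)*", "abcbc"))  # True
print(in_language(r"a*b*", "aab"))      # True
print(in_language(r"a*b*", "aba"))      # False: no a may follow a b
print(in_language(r"one|two|(1|2|3)*|three", "2331"))  # True
print(in_language(r"one|two|(1|2|3)*|three", "134"))   # False: 4 is not allowed
```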
- Basically all libraries add nice shortcuts on top of these (most of the shortcuts don't let you describe any languages beyond the regular languages)
- Expressions are made of three kinds of pieces
- Character classes
- A set of characters, any one of which can appear in that position.
- Single characters - `a`
- Any character / wildcard - `.`
- Sets of characters
- `[abc]` which is the same as `(a|b|c)`
- `[^abc]` any char that's not a, b, or c. (In theory the alphabet is finite, so this is just a giant or expression listing every character that is in the alphabet and not in the set)
- `[a-z]`, `[A-Za-z]` ranges of characters. The second one is upper case or lower case letters
- Predefined character classes
- `\w` characters in an identifier (word); same as `[A-Za-z0-9_]` for ASCII text
- `\d` numerical digits only
- `\s` whitespace characters
- ... (check your library's documentation)
- Quantifiers
- Repeats the preceding character class, or the preceding expression in parentheses, a specific number of times
- `a*` zero or more repeats
- `a+` one or more repeats (i.e., this regex does not match the empty string)
- `a?` exactly zero or one copies. L = {"", "a"}
- `a{k}`, `a{n,}`, `a{n,m}` exactly k copies, n or more copies, and between n and m copies respectively
- Or expressions
- No extras beyond what I showed before
- None of the extras above let you recognize anything beyond regular languages, but they do let you write shorter regular expressions that fit real-world problems
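One way to convince yourself that the shortcuts add no power is to compare each one against an expression built only from concatenation, or, and star. The sketch below assumes Python's `re` module and a tiny alphabet of `abcd`; `language` and `pairs` are names made up for this illustration. It enumerates every string up to a small length and checks that both expressions accept exactly the same ones. That is evidence rather than a proof, since only short strings are checked.

```python
import re
from itertools import product

def language(pattern, alphabet, max_len):
    """All strings over `alphabet` up to `max_len` chars that fully match `pattern`."""
    return {
        "".join(chars)
        for n in range(max_len + 1)
        for chars in product(alphabet, repeat=n)
        if re.fullmatch(pattern, "".join(chars))
    }

# (shortcut, equivalent built only from concatenation, or, and star)
pairs = [
    (r"[abc]", r"(a|b|c)"),
    (r"a+", r"aa*"),
    (r"a?", r"(a|)"),
    (r"a{2}", r"aa"),
    (r"a{2,}", r"aaa*"),
    (r"a{1,3}", r"a|aa|aaa"),
    (r"[^ab]", r"c|d"),  # only equivalent because the alphabet here is "abcd"
]

for shortcut, basic in pairs:
    same = language(shortcut, "abcd", 4) == language(basic, "abcd", 4)
    print(f"{shortcut:8} vs {basic:10} -> {same}")
```

Every pair should report `True`. The `[^ab]` row shows the finite-alphabet argument in action: the negated set is just an or over the remaining characters.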
- Different libraries do different things
- Perl probably adds the most new features to regex
- There's a POSIX standard for regex
- Lua patterns don't even recognise all regular languages! They don't have a general or operator, only character classes.
- In vim you have to escape a lot of the operators; the default is to match them literally, which means a plain `/search term` works better
- Most libraries can also search for matches inside a longer string, instead of just answering true/false to "is this whole string in the language?"
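As a rough illustration of the difference, here is how it looks in Python's `re` module (one library among many; names and behaviour differ elsewhere). `fullmatch` answers the membership question, while `search` and `findall` look for matching substrings.

```python
import re

line = "2017-04-13 module.cpp (INFO) - this is a message"
date = r"\d{4}-\d{2}-\d{2}"

# Membership: is the *whole* line in the language of `date`? No.
print(re.fullmatch(date, line))        # None

# Searching: is there a substring of the line in the language? Yes.
m = re.search(date, line)
print(m.group(), m.span())             # 2017-04-13 (0, 10)

# All non-overlapping matches of a pattern in the line.
print(re.findall(r"\w+\.cpp", line))   # ['module.cpp']
```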
- Most libraries also have replacements and capture groups.
- Syntax is like `s/regex/replacement/` (in vim, to do the whole file it's `:%s/regex/replacement/g`; if you put `c` on the end of that, it will ask you about each match)
- Capture groups let you paste the substring that an expression in parentheses matched wherever you want in the replacement
- ex: `s/.+\((.+)\) - .+/\1/` would change "2017-04-13 module.cpp (INFO) - this is a message" into "INFO". Here I escaped the parentheses to mean a literal paren character, and the unescaped parens are the active capture group parens. I used the vim syntax for the first capture group, `\1`
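The same replacement can be sketched with Python's `re.sub`, which also uses `\1` in the replacement for the first capture group. This is an illustration in one library, not the only way to write it; the variable names are made up.

```python
import re

line = "2017-04-13 module.cpp (INFO) - this is a message"

# \( and \) match literal parentheses; the unescaped ( ) form capture group 1.
level = re.sub(r".+\((.+)\) - .+", r"\1", line)
print(level)  # INFO

# If you only want the captured text, searching and reading the group also works.
m = re.search(r"\((.+)\)", line)
print(m.group(1))  # INFO
```

The pattern matches the whole line, so the whole line is replaced by the contents of the first group.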
- For one-off replacements you usually don't have to be very specific with your character classes. Use a lot of `.`
- Regex can also be used for input validation: you specify a language (set of strings) that is acceptable input.
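A minimal validation sketch, assuming Python's `re` module and a made-up requirement that input looks like `YYYY-MM-DD`. `DATE_SHAPE` and `is_valid_date_format` are hypothetical names. The key point is to use a whole-string match, so that the input has to be in the language rather than merely contain something that is.

```python
import re

# [0-9] rather than \d: in Python 3, \d also accepts non-ASCII digits.
DATE_SHAPE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")

def is_valid_date_format(text):
    """Return True if `text` has the shape YYYY-MM-DD (digits and dashes only)."""
    return DATE_SHAPE.fullmatch(text) is not None

print(is_valid_date_format("2017-04-13"))     # True
print(is_valid_date_format("2017-4-13"))      # False: month needs two digits
print(is_valid_date_format("on 2017-04-13"))  # False: extra text around the date
print(is_valid_date_format("2017-13-99"))     # True: right shape, not a real date
```

The last line shows the limit of this approach: the expression checks the shape of the input, not whether the date exists.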
Now play around with some regular expressions in a nice online regex tester:
- regexr.com has a nice documentation browser and preloaded examples.
- regex101.com has a cool expression explanation generator and community examples. It supports a few different regex implementations.