Recently, I have been experimenting with regular expressions (fondly called 'regex'), with the goal of understanding their practical application in day-to-day work. I am sharing what I learned in this article to help anyone read, understand, and write the basic constructs of regular expressions. A regex can be as simple or as complex as we like. Happy reading!
The example I would like to start with is validating names. You might wonder what the big deal is in validating names; they are just words made of letters. But when I started implementing it, I noticed how many different attributes a name can have, to name a few: title (Mr., Mrs., Ms., Dr.), first name, middle name, last name, special characters (such as hyphens and apostrophes), and suffix (Jr, Sr, I, IV, etc.). So what looks simple can get complex, depending on how precise we need to be.
I'm going to present 2 examples covering different basic patterns.
Example 1: regex to validate a name of format "Title. Lastname, Firstname Middlename":
/(?:(Mr.|Mrs.|Dr.|Ms.)?)\s+([-,a-z.',]*),?(?:\s*([-,a-z.',]*)?)\s*(?:\s*([-,a-z.',]*)?)/gi
Example 2: regex to validate US/Canada phone numbers of formats like 425.222.3333, 425-222-3333, (425) 222-3333, 425 222 3333
/^\(?\d{3}\)?[-.\s]?\d{3}[-.\s]\d{4}$/gmi
The rest of this article covers the following regex components:
- Anchors
- Quantifiers
- Grouping Constructs
- Bracket Expressions
- Character Classes
- Alternation
- Flags
- Character Escapes
Anchors have a special meaning in regex. An anchor does not match a character; it matches a position before or after characters. ^ (caret) matches the beginning of the text, and $ (dollar) matches the end of the text. If a pattern is enclosed between ^ and $, it matches only when the entire text fits the pattern. By default, ^ and $ anchor to the start and end of the whole string. To match at the start and end of each line instead, add the m flag after the pattern.
In example 2 above, phone numbers in all of these formats (425.222.3333, 425-222-3333, (425) 222-3333, 425 222 3333) are matched as a whole, because the entire pattern is enclosed between ^ and $. Example: /^\(?\d{3}\)?[-.\s]?\d{3}[-.\s]\d{4}$/gmi
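To see what the anchors contribute, here is a minimal sketch that runs the example 2 pattern with and without ^ and $. The variable names and the sample sentence are my own, chosen only for illustration, and the g flag is left out so that test() keeps no state between calls.

```javascript
// Example 2 pattern, anchored with ^ and $ (no g flag, so test() is stateless).
const phonePattern = /^\(?\d{3}\)?[-.\s]?\d{3}[-.\s]\d{4}$/;
// The same pattern without anchors, for comparison.
const unanchoredPattern = /\(?\d{3}\)?[-.\s]?\d{3}[-.\s]\d{4}/;

const sampleNumbers = ["425.222.3333", "425-222-3333", "(425) 222-3333", "425 222 3333"];
const allSamplesMatch = sampleNumbers.every((n) => phonePattern.test(n));

// A phone number embedded in a longer text.
const embedded = "call 425-222-3333 now";
const anchoredResult = phonePattern.test(embedded);        // text before and after the number
const unanchoredResult = unanchoredPattern.test(embedded); // matches inside the text

console.log(allSamplesMatch, anchoredResult, unanchoredResult); // true false true
```

The anchored pattern is the one to use for validation, while the unanchored one is better suited to finding numbers inside longer text.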
Quantifiers | Description |
---|---|
* | Matches the previous token zero or more times, as many times as possible, giving back as needed (greedy). In example 1 above, the * in ([-,a-z.',]*), which validates the last name, repeats a character class containing dash (-), comma (,), period (.), apostrophe (') and the letters a through z. |
+ | Matches the previous token one or more times, as many times as possible, giving back as needed (greedy). In example 1 above, the + in \s+ (right after the optional title group) requires at least one whitespace character. |
? | Matches the previous token zero or one time, preferring one (greedy). In example 1 above, the ? in ([-,a-z.',]*)? makes the preceding group optional. |
{n} | n is a positive integer; matches exactly n occurrences of the preceding token. In example 2 above, {3} in \(?\d{3}\)? requires exactly 3 digits. |
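The quantifiers in the table can be tried out with a few tiny patterns. This is only a sketch; the patterns and variable names below are made up for the demonstration and are not part of the two examples.

```javascript
// Each pattern is anchored so that the whole string must fit.
const zeroOrMore = /^ab*c$/; // b may appear zero or more times
const oneOrMore = /^ab+c$/;  // b must appear at least once
const zeroOrOne = /^ab?c$/;  // b may appear at most once
const exactlyThree = /^\d{3}$/; // exactly three digits

console.log(zeroOrMore.test("ac"), zeroOrMore.test("abbbc")); // true true
console.log(oneOrMore.test("ac"), oneOrMore.test("abc"));     // false true
console.log(zeroOrOne.test("abc"), zeroOrOne.test("abbc"));   // true false
console.log(exactlyThree.test("425"), exactlyThree.test("42")); // true false

// Greedy behaviour: + takes as many characters as it can.
const greedyMatch = "aaa".match(/a+/)[0];
console.log(greedyMatch); // "aaa"
```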
There are 2 types of grouping constructs.
- Capturing group: Parentheses group the regex between them and capture the text matched inside them into a numbered group that can be reused with a numbered backreference. Regex operators, such as quantifiers, can be applied to the entire group. In example 1 above, the token ([-,a-z.',]*) that validates the last name is a capturing group.
- Non-capturing group: A group that starts with ?: groups its contents without capturing the matched text, so it does not get a group number. It is useful when a group is needed only for structure, for example to make part of a pattern optional. In example 1 above, (?:\s*([-,a-z.',]*)?) is a non-capturing group.
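The difference between the two group types is easiest to see in the match result. The sketch below uses a simplified name pattern that I wrote only for this illustration (it is not the full example 1 regex): the title is captured as group 1, while the outer (?: ... ) that makes it optional gets no number.

```javascript
// Simplified, hypothetical pattern for "Title. Lastname, Firstname".
// (?: ... )? groups the title and the following whitespace without capturing them together.
const simpleName = /^(?:(Mr|Mrs|Ms|Dr)\.\s+)?([a-z'-]+),\s*([a-z'-]+)$/i;

const withTitle = "Dr. Smith, John".match(simpleName);
console.log(withTitle[1], withTitle[2], withTitle[3]); // Dr Smith John

const noTitle = "O'Neil, Mary".match(simpleName);
console.log(noTitle[1], noTitle[2], noTitle[3]); // undefined O'Neil Mary

// A numbered backreference reuses what a capturing group matched:
// \1 must repeat the exact three digits captured by group 1.
const repeatedBlock = /^(\d{3})-\1$/;
console.log(repeatedBlock.test("425-425"), repeatedBlock.test("425-222")); // true false
```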
A character class, written as a bracket expression, is a set of characters enclosed in square brackets; it matches any single character from that set. In example 1 above, the pattern ([-,a-z.',]*) for the last name matches dash (-), comma (,), period (.), apostrophe (') and the letters a through z. Inside the brackets, the period is a literal period, not a wildcard.
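As a small sketch of this, the class from example 1 can be tested on its own, first against single characters and then repeated with *. The variable names and sample inputs are assumptions made for the demonstration.

```javascript
// One character from the set: dash, comma, letters, period, apostrophe.
const oneNameChar = /^[-,a-z.']$/i;
console.log(oneNameChar.test("o"), oneNameChar.test("'"), oneNameChar.test("-")); // true true true
console.log(oneNameChar.test("5"), oneNameChar.test("ab")); // false false (not in set; two characters)

// The same class repeated with * accepts a whole name made of those characters.
const nameChars = /^[-,a-z.']*$/i;
console.log(nameChars.test("O'Neil-Smith"), nameChars.test("Smith3")); // true false
```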
Alternation in a regular expression is simply an “OR”. It is denoted by the vertical bar character | . The engine tries the alternatives separated by the bar from left to right and uses the first one that matches.
In example 1 above, the title validation uses alternation: (?:(Mr.|Mrs.|Dr.|Ms.)?). It looks for Mr., Mrs., Dr. or Ms. Note that an unescaped . matches any character, so a stricter version would escape each period as \. to accept only a literal period.
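Here is a short sketch of alternation with the titles from example 1. It compares the unescaped form with an escaped variant; the escaped variant and the variable names are my own additions for illustration.

```javascript
// Titles as written in example 1: the . matches any character.
const looseTitle = /^(Mr.|Mrs.|Dr.|Ms.)$/;
// Stricter variant (an assumption, not the original): \. matches only a period.
const strictTitle = /^(Mr\.|Mrs\.|Dr\.|Ms\.)$/;

console.log(looseTitle.test("Mrs."), strictTitle.test("Mrs.")); // true true
console.log(looseTitle.test("Mrx"), strictTitle.test("Mrx"));   // true false

// Alternatives are tried left to right: Mr. succeeds first and consumes "Mrs".
const firstAlternative = "Mrs.".match(/Mr.|Mrs./)[0];
console.log(firstAlternative); // "Mrs"
```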
A flag is an optional parameter that modifies how a regex searches. Flags are placed at the end of the regular expression, after the closing slash. Example 2 above uses the g, m and i flags: /^\(?\d{3}\)?[-.\s]?\d{3}[-.\s]\d{4}$/gmi.
Flag | Description |
---|---|
g | Makes the expression search for all occurrences |
m | Makes the boundary characters ^ and $ match the beginning and ending of every single line instead of the beginning and ending of the whole string. |
i | Makes the expression search case insensitive. |
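The effect of each flag shows up when the same pattern is run with and without it. The following is a sketch only; the multi-line sample text and the variable names are assumptions made for the demonstration.

```javascript
const lines = "425-222-3333\n(425) 222-3333\nnot a number";

// g + m: find every line that is a phone number on its own.
const perLine = lines.match(/^\(?\d{3}\)?[-.\s]?\d{3}[-.\s]\d{4}$/gm);
console.log(perLine); // [ '425-222-3333', '(425) 222-3333' ]

// Without m, ^ and $ refer to the whole string, so nothing matches.
const wholeString = lines.match(/^\(?\d{3}\)?[-.\s]?\d{3}[-.\s]\d{4}$/g);
console.log(wholeString); // null

// g: all occurrences instead of only the first one.
const allRuns = "a1b22c333".match(/\d+/g);
const firstRun = "a1b22c333".match(/\d+/)[0];
console.log(allRuns, firstRun); // [ '1', '22', '333' ] 1

// i: ignore case.
const withI = /dr\./i.test("Dr.");
const withoutI = /dr\./.test("Dr.");
console.log(withI, withoutI); // true false
```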
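A character escape is a backslash followed by a character; together they take on a special meaning. The backslash can also turn a special character into a literal one, as \( does in example 2. In example 2 above, \d matches any digit from 0 to 9. A few more examples: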
Character Escape | Description |
---|---|
\w | Matches any alphanumeric character from the basic Latin alphabet, including the underscore |
\W | Matches any character that is not a word character from the basic Latin alphabet |
\r | Matches a carriage return |
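A quick sketch of the escapes from the table, plus \d from example 2, applied to made-up sample strings (the strings and variable names are mine, for illustration only):

```javascript
// \d: digits
const digits = "Room 42".match(/\d+/)[0];
console.log(digits); // "42"

// \w: letters, digits and underscore; the dash is not a word character
const words = "snake_case-name".match(/\w+/g);
console.log(words); // [ 'snake_case', 'name' ]

// \W: anything that is not a word character
const replaced = "a-b c".replace(/\W/g, "_");
console.log(replaced); // "a_b_c"

// \r: carriage return, as found in Windows line endings
const hasCarriageReturn = /\r/.test("line one\r\nline two");
console.log(hasCarriageReturn); // true

// A backslash before a special character makes it literal: \. is a real period.
const literalDot = /^\d\.\d$/;
console.log(literalDot.test("1.5"), literalDot.test("1x5")); // true false
```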
I am an aspiring entrepreneur and a full stack engineer, using this opportunity to learn and write. If you liked this article or have feedback, please drop it in the comments; I would appreciate it! If you are curious about my other work, please follow me on GitHub.