Regex: Matching an Email

Regular expressions are powerful search tools composed of a sequence of characters that defines a search pattern. With regex, you can search, validate, replace, and extract strings of text based on that pattern.

Summary

In this tutorial, we will break down a regex used to match an email using the following code snippet:
/^([a-zA-Z0-9._%+-]+)@([a-zA-Z0-9.-]+)\.([a-zA-Z]{2,})$/
Going character by character, we will work through this snippet of code to understand how it is defining search parameters. Regular expressions can feel overwhelming at first, but by breaking the code down into different defining groups, we can easily understand how this expression is working.
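Before breaking the expression down, it can help to see it run. The following is a minimal sketch, assuming a JavaScript environment such as Node.js (the regex is written in the /.../ literal style). The sample strings and the names emailRegex, samples, and results are made up for illustration.

```javascript
// The email regex from this tutorial, stored as a regex literal.
const emailRegex = /^([a-zA-Z0-9._%+-]+)@([a-zA-Z0-9.-]+)\.([a-zA-Z]{2,})$/;

// Hypothetical sample inputs, chosen only for illustration.
const samples = [
  "jane.doe@example.com",   // well-formed address
  "no-at-sign.example.com", // missing the @ symbol
  "user@domain",            // missing the period and TLD
];

// test() returns true when the string matches the pattern.
const results = samples.map((sample) => emailRegex.test(sample));

console.log(results); // [ true, false, false ]
```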

Anchors
Quantifiers
Grouping Constructs
Bracket Expressions
Character Classes
The OR Operator
Flags
Character Escapes

Regex Components

Anchors

Anchors are special metacharacters that define positions within a string. Anchors help when looking for patterns that occur at certain points in a string. Some common anchor characters include:

^: Requires the following pattern to be at the beginning of the string
$: Requires the preceding pattern to be at the end of the string

Let's look at our email matching regex, /^([a-zA-Z0-9._%+-]+)@([a-zA-Z0-9.-]+)\.([a-zA-Z]{2,})$/. There are two anchor characters in this regex. The ^ at the beginning of the snippet asserts the start of the string, so the pattern that follows it, ([a-zA-Z0-9._%+-]+), must begin at the very first character. The $ at the end of the snippet asserts the end of the string, so the pattern that precedes it, ([a-zA-Z]{2,}), must finish at the very last character. Together, the two anchors require the entire string to be an email address, not just to contain one.
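To see what the anchors contribute, we can compare the regex with a copy that has ^ and $ removed. This is a small JavaScript sketch. The unanchored variant and the sample text are assumptions added purely for comparison and are not part of the tutorial's regex.

```javascript
// The tutorial's regex, anchored at both ends.
const anchored = /^([a-zA-Z0-9._%+-]+)@([a-zA-Z0-9.-]+)\.([a-zA-Z]{2,})$/;

// The same pattern with ^ and $ removed (for comparison only).
const unanchored = /([a-zA-Z0-9._%+-]+)@([a-zA-Z0-9.-]+)\.([a-zA-Z]{2,})/;

// Hypothetical text that contains an address but is not only an address.
const text = "Contact: jane@example.com today";

const anchoredResult = anchored.test(text);     // false: extra text before and after
const unanchoredResult = unanchored.test(text); // true: an address appears somewhere inside

console.log(anchoredResult, unanchoredResult);
```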

Quantifiers

Regular expression quantifiers are characters that control the number of times an element can occur in the input text. Let's look at some examples.

*: matches zero or more occurrences of a particular element.
+: matches one or more occurrences of a particular element.
?: matches zero or one occurrence of a particular element.
{n}: matches exactly n occurrences of a particular element.
{n,}: matches n or more occurrences of a particular element.

Let's look at the quantifiers in our regex. The + that follows the bracket expression in each of the first two groups ensures that each group contains at least one character. One or more of the allowed characters must be matched. The {2,} at the end of the last group ensures that at least two consecutive letters must be matched.
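Each quantifier above can be tried on a tiny pattern. The sketch below uses JavaScript, and the short patterns and inputs are hypothetical examples chosen to isolate one quantifier at a time.

```javascript
// Each entry pairs a small pattern with an input and records whether it matches.
const quantifierChecks = {
  starZero: /^ab*c$/.test("ac"),          // true:  * allows zero b's
  starMany: /^ab*c$/.test("abbbc"),       // true:  * allows many b's
  plusZero: /^ab+c$/.test("ac"),          // false: + needs at least one b
  plusOne: /^ab+c$/.test("abc"),          // true
  optionalAbsent: /^colou?r$/.test("color"),   // true:  ? allows zero u's
  optionalPresent: /^colou?r$/.test("colour"), // true:  ? allows one u
  exactThree: /^[0-9]{3}$/.test("123"),   // true:  exactly three digits
  exactFour: /^[0-9]{3}$/.test("1234"),   // false: four digits is too many
  atLeastTwoShort: /^[a-zA-Z]{2,}$/.test("c"),  // false: only one letter
  atLeastTwoLong: /^[a-zA-Z]{2,}$/.test("com"), // true:  two or more letters
};

console.log(quantifierChecks);
```

The last two checks use the same [a-zA-Z]{2,} element that our email regex uses for the TLD.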

Grouping Constructs

Grouping constructs divide a larger regex into smaller subexpressions. These subexpressions help keep a regex pattern organized. In our regex example, we will look more in depth at the parentheses grouping construct, which creates a capturing group. However, there are other commonly used grouping constructs, including:

(?:...): Non-capturing group
(?<name>...): Named capturing group

Let's look at the use of parentheses for grouping constructs in our regex, /^([a-zA-Z0-9._%+-]+)@([a-zA-Z0-9.-]+)\.([a-zA-Z]{2,})$/. The first set of parentheses contains the following capture group, ([a-zA-Z0-9._%+-]+). This capturing group is used to identify the user-specific part of the email address, which can include any uppercase or lowercase letters, digits, periods, underscores, percent signs, plus signs, or hyphens. The next set of parentheses holds the following, ([a-zA-Z0-9.-]+). This subexpression is used to identify the domain of the email address (gmail, yahoo, hotmail, etc). The domain can include any uppercase or lowercase letters, digits, periods, and hyphens. The last subexpression, ([a-zA-Z]{2,}), looks for the TLD (top level domain) of the email (.com, .org, etc), matching two or more uppercase or lowercase letters.
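Capturing groups also let us pull the three parts out of a matched address. The following JavaScript sketch shows this with String.prototype.match. The sample address is hypothetical, and the named-group variant (with the group names user, domain, and tld) is an illustrative assumption, not part of the tutorial's regex.

```javascript
const emailRegex = /^([a-zA-Z0-9._%+-]+)@([a-zA-Z0-9.-]+)\.([a-zA-Z]{2,})$/;

// Hypothetical address with a subdomain, used only for illustration.
const address = "jane.doe@mail.example.com";

// Without the g flag, match() returns the full match at index 0
// followed by one entry per capturing group.
const parts = address.match(emailRegex);
console.log(parts[1]); // "jane.doe"      (user-specific part)
console.log(parts[2]); // "mail.example"  (domain, including the subdomain)
console.log(parts[3]); // "com"           (TLD)

// The same pattern rewritten with named capturing groups (illustrative names).
const namedRegex =
  /^(?<user>[a-zA-Z0-9._%+-]+)@(?<domain>[a-zA-Z0-9.-]+)\.(?<tld>[a-zA-Z]{2,})$/;
const named = address.match(namedRegex).groups;
console.log(named.user, named.domain, named.tld);
```

Because the domain group allows periods, it captures everything up to the last period, and only the final label is captured as the TLD.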

Bracket Expressions

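Bracket expressions are used to define the search parameters for matching a single character. In our example, [a-zA-Z0-9._%+-], [a-zA-Z0-9.-], and [a-zA-Z] are all bracket expressions outlining the search parameters for a specific character. Inside the brackets, a - between two characters defines a range (a-z, A-Z, 0-9), a - placed last matches a literal hyphen, and a . matches a literal period.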
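A bracket expression on its own matches exactly one character, which we can check directly. This JavaScript sketch anchors the first bracket expression from our regex so that it must match a single-character string. The test characters and variable names are arbitrary examples.

```javascript
// One character from the user-specific part's allowed set, and nothing else.
const userChar = /^[a-zA-Z0-9._%+-]$/;

const bracketChecks = {
  letter: userChar.test("a"),      // true:  inside the a-z range
  plusSign: userChar.test("+"),    // true:  + is listed literally
  hyphen: userChar.test("-"),      // true:  trailing - is a literal hyphen
  bang: userChar.test("!"),        // false: ! is not in the set
  twoLetters: userChar.test("ab"), // false: the brackets match one character only
};

// Inside brackets, the period is a literal period rather than a wildcard.
const dotOnly = /^[.]$/;
const dotMatchesLetter = dotOnly.test("x"); // false
const dotMatchesDot = dotOnly.test(".");    // true

console.log(bracketChecks, dotMatchesLetter, dotMatchesDot);
```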

Character Classes

Character classes are predefined shorthands that allow us to match specific categories of characters in a string. This shorthand notation provides a more concise way to match characters in a regular expression. Let's look at some examples.

\d matches all digits and is equivalent to [0-9].
\w matches any word character and is equivalent to [a-zA-Z0-9_].
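The equivalences above can be checked side by side. In this JavaScript sketch, each shorthand class and its bracket-expression form are tested against the same made-up inputs.

```javascript
// \d versus [0-9]
const digitsShort = /^\d+$/;
const digitsLong = /^[0-9]+$/;

// \w versus [a-zA-Z0-9_]
const wordShort = /^\w+$/;
const wordLong = /^[a-zA-Z0-9_]+$/;

const classChecks = {
  digitsOk: [digitsShort.test("2024"), digitsLong.test("2024")],    // [true, true]
  digitsBad: [digitsShort.test("20a4"), digitsLong.test("20a4")],   // [false, false]
  wordOk: [wordShort.test("user_name1"), wordLong.test("user_name1")], // [true, true]
  wordBad: [wordShort.test("user-name"), wordLong.test("user-name")],  // [false, false]
};

console.log(classChecks);
```

The last check fails because a hyphen is not a word character, which is one reason our email regex spells out its own bracket expressions instead of relying on \w.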

The OR Operator

The OR Operator, defined by the | symbol, allows the matching of one defined pattern or another within the same expression. Our specific example does not use the OR Operator. Let's look at some different examples to see it in use.

green|blue: This will look to match green or blue.
(i|e)ngrain: This will look to match ingrain or engrain.

The OR Operator is helpful when creating alternative searches inside the regex. It allows a single position in the pattern to accept any one of several different alternatives.
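Here are the two OR examples from above as a short JavaScript sketch. The input strings are made up for illustration, and the second pattern is anchored with ^ and $ (an addition for this example) so that the whole word has to match.

```javascript
// Matches text containing either "green" or "blue".
const color = /green|blue/;
const hasBlue = color.test("a blue door"); // true
const hasRed = color.test("a red door");   // false

// The group limits the alternation to the first letter only.
const ingrain = /^(i|e)ngrain$/;
const spellings = ["ingrain", "engrain", "angrain"].map((word) => ingrain.test(word));

console.log(hasBlue, hasRed, spellings); // true false [ true, true, false ]
```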

Flags

In regex, flags are characters placed after the closing / of the expression that modify the behavior of the search. Our email regex does not use any flags. Let's look at some examples.

The i flag makes the match case insensitive.
The g flag looks for all instances of the match in the text, as opposed to stopping after the first match is found.
The m flag changes the behavior of the ^ and $ anchors to match at the start and the end of each line of the text, as opposed to the start and end of the entire text.
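The effect of each flag is easiest to see by running the same pattern with and without it. This is a JavaScript sketch with hypothetical inputs and variable names.

```javascript
// i: case-insensitive matching.
const caseSensitive = /hello/.test("HELLO");    // false
const caseInsensitive = /hello/i.test("HELLO"); // true

// g: find every match instead of stopping at the first one.
const firstOnly = "cat hat bat".match(/at/);    // one match plus match details
const allMatches = "cat hat bat".match(/at/g);  // [ "at", "at", "at" ]

// m: ^ and $ match at the start and end of each line.
const twoLines = "one\ntwo";
const wholeText = twoLines.match(/^\w+$/g);  // null: the text as a whole is not one word
const perLine = twoLines.match(/^\w+$/gm);   // [ "one", "two" ]

console.log(caseSensitive, caseInsensitive, firstOnly.length, allMatches, wholeText, perLine);
```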

Character Escapes

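Character escapes use a backslash to change how the next character is read. An escape can represent a character that would be difficult to represent directly in a string, or it can remove the special meaning of a metacharacter so that it is matched literally. Let's look at some examples.

\n represents a newline character.
\t represents a tab character.
\d represents any digit from 0-9.
\. represents a literal period. Our regex uses this escape between the domain and the TLD, because an unescaped . would match almost any character.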
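Our email regex relies on one escape, the \. that separates the domain from the TLD. The JavaScript sketch below shows the difference between an escaped and an unescaped period, along with the \t and \n escapes. All inputs and names are made-up examples.

```javascript
// An unescaped period is a wildcard, while an escaped period is literal.
const wildcardDot = /^a.c$/.test("abc");      // true:  . matches the b
const literalDotMiss = /^a\.c$/.test("abc");  // false: \. needs a real period
const literalDotHit = /^a\.c$/.test("a.c");   // true

// \t and \n stand in for characters that are awkward to type directly.
const columns = "col1\tcol2".split(/\t/);   // [ "col1", "col2" ]
const lines = "line1\nline2".split(/\n/);   // [ "line1", "line2" ]

console.log(wildcardDot, literalDotMiss, literalDotHit, columns, lines);
```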

Author

Hi, my name is Peyton Engborg and I am currently enrolled as a full stack web development student through the University of Utah. To check out more of my work, you can visit my GitHub.

phechzzz/regexTutorial.md