Regex Tutorial: Matching an HTML Tag

A regex, which is short for regular expression, is a sequence of characters that defines a specific search pattern. When included in code or search algorithms, regular expressions can be used to find certain patterns of characters within a string, or to find and replace a character or sequence of characters within a string. They are also frequently used to validate input.

Summary

This regex tutorial will break down a regular expression that can be used to search for tags, and potentially extract their data, in an HTML document. The expression is:

/^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$/
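
Before breaking the expression down, it may help to see it run. The following is a minimal sketch in Python, which is an assumption on our part since the tutorial itself does not name a language. Python does not use the surrounding / delimiters, so they are dropped. The names TAG_PATTERN and parse_tag are hypothetical and only used for illustration.

```python
import re

# The tutorial's pattern, without the surrounding / delimiters.
TAG_PATTERN = re.compile(r"^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$")


def parse_tag(text):
    """Return (tag name, attribute text, inner content), or None if there is no match."""
    match = TAG_PATTERN.match(text)
    return match.groups() if match else None


print(parse_tag('<a href="x">link</a>'))  # ('a', ' href="x"', 'link')
print(parse_tag("<br />"))                # ('br', None, None)
print(parse_tag("<p>hi</div>"))           # None
```

A group that takes no part in the match comes back as None, which is why the self-closing tag has no attribute text or inner content.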

Anchors
Quantifiers
Grouping Constructs
Bracket Expressions
Character Classes
The OR Operator
Character Escapes

Regex Components

Anchors

Anchors match the position before, after, or between characters in strings.

^ - Matches the position before the first character in the string.

$ - Matches the position after the last character in the string.
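
As a small illustration of what the anchors add, here is a sketch in Python (our choice of language, not the tutorial's). The simplified patterns and the names unanchored and anchored are assumptions made for this example only.

```python
import re

# Without anchors, the pattern can match anywhere inside the string.
unanchored = re.compile(r"<([a-z]+)>")
# With ^ and $, the whole string must be the tag and nothing else.
anchored = re.compile(r"^<([a-z]+)>$")

text = "see <b> here"
print(bool(unanchored.search(text)))  # True
print(bool(anchored.search(text)))    # False
print(bool(anchored.search("<b>")))   # True
```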

Quantifiers

Quantifiers specify how many instances of a character, group, or character class must be present in the input for a match to be found.

+ - Matches one or more of the preceding token. This appears three times within our expression: 1) After [a-z]; 2) After [^<]; 3) After \s.

* - Matches zero or more of the preceding token. Like +, it is a 'greedy' quantifier: it consumes as many characters as it can while still allowing the rest of the expression to match. This appears twice within our expression: 1) After the ([^<]+) group; 2) After the . character.
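
The difference between the two quantifiers, and what 'greedy' means in practice, can be sketched in Python as follows. The sample strings are made up for illustration, and the lazy form .*? is not part of the tutorial's expression; it is shown only for contrast.

```python
import re

# + needs at least one character; * is satisfied by zero.
plus_on_empty = re.fullmatch(r"[a-z]+", "")
star_on_empty = re.fullmatch(r"[a-z]*", "")
print(plus_on_empty is not None)  # False
print(star_on_empty is not None)  # True

sample = "<b>one</b> and <b>two</b>"

# Greedy: .* takes as much as it can while still letting the rest match.
greedy = re.search(r"<b>(.*)</b>", sample).group(1)
print(greedy)  # one</b> and <b>two

# Lazy (for contrast only): .*? takes as little as it can.
lazy = re.search(r"<b>(.*?)</b>", sample).group(1)
print(lazy)  # one
```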

Grouping Constructs

Grouping constructs delineate the subexpressions of a regular expression and capture the substrings of an input string. Parentheses group multiple tokens together and create a capture group for extracting a substring or using a backreference.

In our string, we have the following groups:

([a-z]+) - This group matches lowercase letters in the range 'a-z' and captures the tag name. As mentioned above, the + quantifier indicates that at least one letter must match for this portion of the string.

([^<]+) - This group uses a special character class which will be described later on. In this instance, we are searching for any character that is NOT '<'. Once again, we are using the + quantifier.

(?:>(.*)<\/\1>|\s+\/>) - This is a 'non-capturing group', which means we do not need this group to capture its match. To be more clear, this is defined by the ?: syntax right after the opening parenthesis.

Within this group, we also have another group (.*). We've already defined the quantifier here, but we also have the . character. This will be further explained in Character Classes.

Another special case we have is a 'backreference'. Backreferences match the same text as previously matched by a capturing group. You'll notice the \1 in our string. The '1' indicates that we want to use the first capturing group within our expression. Here, it makes sure the closing tag has the same name as the opening tag, without repeating the pattern. Always remember to use the backslash \ before the number!
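
To see capturing groups, a non-capturing group, and a backreference working together, here is a short Python sketch. The pattern is a simplified version of the tutorial's expression (no attributes, no self-closing form), and the variable names are hypothetical.

```python
import re

# (?:...) groups without capturing, so only the two (...) groups are numbered.
pattern = re.compile(r"^<([a-z]+)>(?:(.*)<\/\1>)$")

matched = pattern.match("<em>text</em>")
print(matched.groups())  # ('em', 'text')

# \1 must repeat the exact text captured by group 1, so mismatched tags fail.
mismatched = pattern.match("<em>text</b>")
print(mismatched)  # None
```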

Bracket Expressions

Anything inside a pair of square brackets ([]) represents a set of characters, any one of which can match. These patterns are known as bracket expressions, but they are also known as a positive character group, because they outline the characters we want to include.

In our regular expression, we have [a-z] and [^<].

[a-z] will match any single lowercase character in the alphabetical range 'a' through 'z'. Note that if we wanted to match uppercase characters, we would need to include 'A-Z'. For example: [a-zA-Z]

[^<] is more of a special case. On its own, '<' would match only that specific character, but the ^ tells us that we want any character BUT the '<' character. ^ falls into our special character classes, explained in the next section.
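
The three bracket expressions discussed here can be compared side by side. This is a sketch in Python with a made-up sample string; re.findall simply lists every single character that each expression matches.

```python
import re

sample = "aB <1"

lower_only = re.findall(r"[a-z]", sample)
any_letter = re.findall(r"[a-zA-Z]", sample)
not_angle = re.findall(r"[^<]", sample)

print(lower_only)  # ['a']
print(any_letter)  # ['a', 'B']
print(not_angle)   # ['a', 'B', ' ', '1']
```

Note that the negated set accepts the space and the digit as well: it matches anything that is not '<', not just letters.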

Character Classes

Character classes further expand on what we know about bracket expressions. Bracket expressions list character classes within themselves.

With a-z, we have already explained what it is doing, but its character class is lowercase 'alphabetic'.

In regards to ^ (the caret), when it is the first character inside the square brackets it identifies a negated set, or a negative character group. Any time we want to EXCLUDE the characters listed in our set, we place the ^ immediately after the opening bracket.

The dot . indicates that we want to match any character except for line breaks. It is one of the most commonly misused metacharacters, so be careful using this one!

\s will match any whitespace character (a space, a tab, or a line break), relative to its position in the expression. In our expression, \s+ sits directly before \/>, so a self-closing tag must have at least one whitespace character before its closing />, as in <br />. A tag written as <br/> will not match.
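
The behavior of . and \s can be checked with a few short calls. This is a sketch in Python; the self_closing pattern is a cut-down, hypothetical version of the tutorial's expression that keeps only the self-closing branch.

```python
import re

# . matches any character except a line break (by default).
dot_matches = re.findall(r".", "a\nb")
print(dot_matches)  # ['a', 'b']

# \s matches spaces, tabs, and line breaks alike.
space_matches = re.findall(r"\s", "a b\tc\n")
print(space_matches)  # [' ', '\t', '\n']

# \s+ makes at least one whitespace character before /> mandatory.
self_closing = re.compile(r"^<([a-z]+)\s+\/>$")
print(bool(self_closing.match("<br />")))  # True
print(bool(self_closing.match("<br/>")))   # False
```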

The OR Operator

| - Also known as an 'alternation', the OR operator acts like a boolean OR, and matches the expression before or after the |. It can operate within a group, or on a whole expression. The patterns will be tested in order.
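
Here is a short Python sketch of alternation. The ending pattern is a simplified, hypothetical version of the tutorial's expression without the attribute group, and the last two lines use made-up patterns to show that alternatives are tried from left to right.

```python
import re

# Either ">...</tag>" or " />" may finish the tag.
ending = re.compile(r"^<([a-z]+)(?:>(.*)<\/\1>|\s+\/>)$")

paired = ending.match("<p>hi</p>")
void = ending.match("<hr />")
print(paired.group(2))  # hi
print(void.group(2))    # None (the first alternative was not used)

# Alternatives are tested in order, so the first one that works wins.
first_wins = re.match(r"a|ab", "ab").group()
longer_first = re.match(r"ab|a", "ab").group()
print(first_wins)    # a
print(longer_first)  # ab
```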

Character Escapes

The backslash \ escapes the character that follows it. Most often this turns a character with a special meaning into a literal, although with \s and \1 it does the opposite and gives an ordinary character a special meaning. In this instance, we want to ensure that our forward slashes are matched literally. Since forward slashes are commonly used as the delimiters that mark the start and end of a regular expression, as in JavaScript, we escape each forward slash inside the expression with a backslash.
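
The effect of escaping can be sketched in Python as follows. Python has no / delimiter, so the escaped and unescaped slash behave the same there; the dot example is an addition of ours, chosen because . does have a special meaning that escaping removes.

```python
import re

# With no / delimiter in Python, \/ and / match the same literal slash.
escaped_slash = re.search(r"<\/p>", "</p>")
plain_slash = re.search(r"</p>", "</p>")
print(escaped_slash is not None, plain_slash is not None)  # True True

# An unescaped . matches any character; \. matches only a literal dot.
any_char = re.findall(r"a.c", "abc a.c")
literal_dot = re.findall(r"a\.c", "abc a.c")
print(any_char)     # ['abc', 'a.c']
print(literal_dot)  # ['a.c']
```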

Author

Name: James Antley

Github Account: Jimmant91

Github Gist: Jimmant91
