Regular Expression Tutorial - Matching a URL

In this tutorial, we will examine a regular expression for matching a URL. Using this example, we can begin to build a basic understanding of regular expressions, what each component represents, how they can be implemented in code to define search patterns, and how they can be used to validate that certain strings match specific criteria.

Summary

The following regular expression can be used to validate that a string is a URL:

/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/

To put it simply, this expression breaks down to the following:

The string may begin with http:// or https://. The protocol is optional
The string must then include one or more characters, each of which is a digit 0-9, a lowercase letter a-z, ., or -
A literal . must follow the previous pattern
The next part of the string must consist of lowercase letters a-z or ., and be 2 to 6 characters in length
Next, the string may include any number of /, alphanumeric characters from the basic Latin alphabet (including _), spaces, ., or -. This pattern can repeat 0 or more times
The previous pattern can be followed by / 0 or 1 times
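
To see how the expression behaves in practice, here is a minimal sketch that runs it against a few strings. The sample inputs are hypothetical and chosen only for illustration. Note that the pattern has no case-insensitive flag, so uppercase letters in the domain cause the match to fail.

```javascript
// The URL pattern from this tutorial, stored as a regex literal.
const urlPattern = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/;

// Hypothetical sample inputs, chosen for illustration only.
const samples = [
  "https://www.example.com",            // protocol + domain
  "http://example.org/docs/index.html", // protocol + domain + path
  "example.com",                        // the protocol is optional
  "https://EXAMPLE.com",                // uppercase is not in [\da-z\.-]
  "https://example",                    // no "." followed by a 2-6 character suffix
];

// test() returns true when the whole string fits the pattern.
for (const sample of samples) {
  console.log(sample, urlPattern.test(sample));
}
```

The first three samples pass and the last two fail.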

Table of Contents

Anchors
Quantifiers
Grouping Constructs
Bracket Expressions
Character Classes
Character Escapes
Author

Regex Components

There are many individual components that go into building regular expressions. In languages such as JavaScript, a regular expression written as a literal is wrapped in slash characters (/), which mark where the pattern begins and ends. Looking at our example expression for matching a URL, we can see / at the beginning and end.

/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/

We will now take a look at each of the components that make up this expression.

Anchors

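^ and $ are the two anchor characters used in our expression.

^ - This anchor marks the beginning of the string: the pattern that follows it must match starting at the first character. The ^ character can be used in two different ways:

as an anchor at the start of a pattern, for example ^ABC. "ABC" and "ABCDEFG" would both be matches because both begin with "ABC", while "XABC" would not.
as a negation when it is the first character inside a bracket expression, for example [^abc], which matches any character except a, b, or c. We will look into bracket expressions in a later section.

$ - This anchor marks the end of the string: the pattern that precedes it must match up to the last character. In our URL expression, ^ and $ together require the entire string to fit the pattern, not just part of it.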
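
As a short illustration of how the two anchors interact, the sketch below compares a pattern anchored only at the start with one anchored at both ends. The variable names are made up for this example.

```javascript
// Anchored at the start only: the string must begin with "ABC".
const startsWithABC = /^ABC/;
// Anchored at both ends: the string must be exactly "ABC".
const exactlyABC = /^ABC$/;

console.log(startsWithABC.test("ABCDEFG")); // true
console.log(startsWithABC.test("XABC"));    // false
console.log(exactlyABC.test("ABC"));        // true
console.log(exactlyABC.test("ABCDEFG"));    // false
```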

Quantifiers

In regex, quantifiers define numbers of characters or expressions to match. First, we will look at all different types of quantifiers, then look at the specific quantifiers within our regex example for matching a URL.

* - Matches the preceding pattern 0 or more times. Example: /fun*/ matches "funn" in "funny" and "fu" in "fuel", but there are no matches in "bird"
+ - Matches the preceding pattern 1 or more times. Example: /k+/ matches each k in "kick" when used with the global flag
? - Matches the preceding pattern 0 or 1 times. Example: /e?le?/ matches the "el" in "angel" and the "le" in "angle". An important note about ? is that if it is used immediately following another quantifier (*, +, ?, {}), it makes that quantifier non-greedy, meaning it will match the minimum number of times, as opposed to the default, which matches the maximum number of times
{n} - Matches the preceding pattern exactly n times
{n,} - Matches the preceding pattern at least n times
{n,x} - Matches the preceding pattern a minimum of n times and a maximum of x times
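
The quantifiers above can be tried out directly. The following is a small sketch using made-up sample strings. It also contrasts a greedy quantifier with its non-greedy form.

```javascript
// Each entry pairs a label with the text a pattern actually matched.
const quantifierDemo = {
  star: "funny".match(/fun*/)[0],          // "fu" followed by 0 or more "n"
  plus: "kick".match(/k+/g),               // every run of 1 or more "k"
  optional: "angle".match(/e?le?/)[0],     // optional "e" on either side of "l"
  exactly: "aaaa".match(/a{2}/)[0],        // exactly 2
  atLeast: "aaaa".match(/a{2,}/)[0],       // 2 or more
  between: "aaaa".match(/a{1,3}/)[0],      // between 1 and 3
  greedy: "<b>bold</b>".match(/<.+>/)[0],  // takes as much as possible
  lazy: "<b>bold</b>".match(/<.+?>/)[0],   // takes as little as possible
};

console.log(quantifierDemo);
// star: "funn", plus: ["k", "k"], optional: "le", exactly: "aa",
// atLeast: "aaaa", between: "aaa", greedy: "<b>bold</b>", lazy: "<b>"
```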

Looking back to our regex example for matching a URL, we can examine the different quantifiers present in the expression.

/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/

We see ? used in a few different places. First, to signify that the "s" in "https" may match 0 or 1 times, then to signify that the whole group (https?:\/\/) may match 0 or 1 times, and finally at the end to signify that / may match 0 or 1 times
We see + used one time to signify that the pattern [\da-z\.-] must match 1 or more times
We see {2,6} used to signify that the pattern [a-z\.] must match a minimum of 2 and a maximum of 6 times
We see * used twice: once inside the last group to signify that [\/\w \.-] may match 0 or more times, and once after it to signify that the group ([\/\w \.-]*) itself may match 0 or more times
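
To isolate two of these quantifiers, the sketch below pulls the protocol and suffix pieces out of the full expression and anchors each one on its own. The names protocolOnly and suffixOnly are hypothetical helpers for this example, not part of the original expression.

```javascript
// The protocol piece on its own: "s" is optional because of ?.
const protocolOnly = /^https?:\/\/$/;
console.log(protocolOnly.test("http://"));   // true
console.log(protocolOnly.test("https://"));  // true
console.log(protocolOnly.test("httpss://")); // false, ? allows at most one "s"

// The suffix piece on its own: 2 to 6 characters from [a-z\.].
const suffixOnly = /^[a-z\.]{2,6}$/;
console.log(suffixOnly.test("com"));     // true, 3 characters
console.log(suffixOnly.test("co.uk"));   // true, 5 characters including the dot
console.log(suffixOnly.test("c"));       // false, fewer than 2
console.log(suffixOnly.test("example")); // false, more than 6
```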

Grouping Constructs

Grouping Constructs are used in regex to check multiple parts or sections of a string for different requirements. By wrapping different sections of the regex in parentheses, we can create subexpressions that have separate requirements from each other. A quantifier placed after a group applies to the whole group, and each group also captures the text it matched so that it can be read back later.

Unless they contain special characters, subexpressions match their contents literally and in order. For example, given the expression (abc):(def), "abc:def" would match, whereas "cba:fed" would not.

In our regex example for matching a URL: /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/ we see the following subexpressions:

(https?:\/\/)
([\da-z\.-]+)
([a-z\.]{2,6})
([\/\w \.-]*)
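
Because each pair of parentheses also captures what it matched, the four subexpressions can be read back individually. Here is a minimal sketch, assuming a made-up input URL. The destructured names (protocol, domain, suffix, path) are labels chosen for this example.

```javascript
const urlPattern = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/;

// match() returns the full match first, then one entry per group.
const groups = "https://www.example.com/docs/index.html".match(urlPattern);
const [fullMatch, protocol, domain, suffix, path] = groups;

console.log(protocol); // "https://"
console.log(domain);   // "www.example"
console.log(suffix);   // "com"
console.log(path);     // "/docs/index.html"
```

The domain group is greedy, so it first takes "www.example.com" and then gives back ".com" so that the rest of the pattern can still match.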

Bracket Expressions

Bracket Expressions, also known as Positive Character Groups, represent a set of characters that we want to match in our string. When writing Bracket Expressions, we can list every character that we want to match, but a more common practice is to use a hyphen to represent a range of characters. For example, [abcdef] has the same meaning as [a-f] and indicates that we are looking for a or b or c or d or e or f. A bracket expression matches a single character at a time: any one of the characters listed produces a match, and a quantifier after the brackets controls how many such characters may appear in a row.

In our regex example for matching a URL: /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/ we see the following Bracket Expressions:

[\da-z\.-] - this bracket expression indicates that any numerical digit, any lowercase letter a-z, a dot, or a hyphen will produce a match. The backslash before the dot is an escape, not a character to match
[a-z\.] - this bracket expression indicates that any lowercase letter a-z or a dot will produce a match
[\/\w \.-] - this bracket expression indicates that a forward slash, any alphanumeric character or underscore, a space, a dot, or a hyphen will produce a match
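
To check which characters these bracket expressions accept, the sketch below anchors two of them and tests a few made-up strings. The names hostChars and pathChars are hypothetical and used only here.

```javascript
// Digits, lowercase letters, dots, and hyphens only.
const hostChars = /^[\da-z\.-]+$/;
console.log(hostChars.test("my-site.2024")); // true
console.log(hostChars.test("my_site"));      // false, "_" is not in the set
console.log(hostChars.test("My-Site"));      // false, uppercase is not in the set

// Slashes, word characters, spaces, dots, and hyphens only.
const pathChars = /^[\/\w \.-]*$/;
console.log(pathChars.test("/my docs/file_1.txt")); // true
console.log(pathChars.test("/search?q=1"));         // false, "?" and "=" are not in the set
```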

Character Classes

Character Classes are used in regex to define sets of characters of which one can occur in a string to produce a match. Bracket Expressions, discussed in the previous section, are one popular type of character class. Here are some other common character classes.

. - Matches any character except for \n (new line)
\d - Matches any numerical digit
\w - Matches any alphanumeric character from the Latin alphabet, including an underscore (_)
\s - Matches a single whitespace character, including tabs and line breaks

Note that \d, \w, and \s can each be changed to an inverse match by capitalizing the letter (\D, \W, and \S).
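
In JavaScript these shorthand classes are written with a backslash. A short sketch with made-up sample strings shows each one, along with an inverse class:

```javascript
const classDemo = {
  digits: "Room 42".match(/\d+/)[0],       // one or more digits
  nonDigits: "Room 42".match(/\D+/)[0],    // one or more non-digits
  word: "snake_case!".match(/\w+/)[0],     // letters, digits, and underscore
  pieces: "a b\tc".split(/\s/),            // split on a space or a tab
  dotMatchesNewline: /./.test("\n"),       // . does not match a new line
};

console.log(classDemo);
// digits: "42", nonDigits: "Room ", word: "snake_case",
// pieces: ["a", "b", "c"], dotMatchesNewline: false
```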

In our regex example for matching a URL: /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/ we see character classes applied in the following ways:

[\da-z\.-] - within this bracket expression (a character class unto itself), we see \d being used to indicate that any numerical digit will produce a match
[\/\w \.-] - within this bracket expression, we see \w being used to indicate that any alphanumeric character from the Latin alphabet, or an underscore, will produce a match

Character Escapes

Character Escapes are notated using a backslash (\) in regex. A character escape is used when a character that normally has a special meaning should instead be interpreted literally. For example, { usually indicates the beginning of a quantifier, but if we precede the curly brace with a backslash (\{), regex will look for an opening curly brace rather than the beginning of a quantifier. This can be useful when looking for strings that include special characters. One caveat is that most special characters, such as . and *, already lose their special meaning inside bracket expressions, so escaping them there is optional. The backslash itself keeps working inside brackets, which is why \d and \w can be used there.

Looking at our regex example for matching a URL: /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/ we can see where a character escape is being utilized.

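\. - here it is used to indicate that we are looking for a literal . character, and not the . character class that matches any character
\/ - here it is used to match a literal /. Within a regex literal, an unescaped / would end the pattern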
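
The difference an escape makes is easiest to see side by side. The sketch below uses made-up patterns and inputs. The last line is an aside showing that when a pattern is built from a string with the RegExp constructor, the backslash itself has to be doubled.

```javascript
// Unescaped: . matches any single character.
const anyMiddle = /^a.c$/;
// Escaped: \. matches only a literal dot.
const dotMiddle = /^a\.c$/;

console.log(anyMiddle.test("abc")); // true
console.log(dotMiddle.test("abc")); // false
console.log(dotMiddle.test("a.c")); // true

// Same pattern built from a string: "\\." becomes \. in the regex.
const dotMiddleFromString = new RegExp("^a\\.c$");
console.log(dotMiddleFromString.test("a.c")); // true
```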

Author

Andrew Mason studied web development at UC Berkeley in 2022. He enjoys creating new projects and being an active member of the developer community. To view more of his work, you can visit his GitHub Profile.
