In this tutorial, we will examine a regular expression for matching a URL. Using this example, we can begin to build a basic understanding of regular expressions, what each component represents, how they can be implemented in code to define search patterns, and how they can be used to validate that certain strings match specific criteria.
The following regular expression can be used to validate that a string is a URL:
/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/
To put it simply, this expression breaks down to the following:
- The string may begin with http:// or https:// (the protocol is optional)
- The string must then include one or more characters that are each a digit (0-9), a lowercase letter (a-z), a dot (.), or a hyphen (-)
- A dot (.) must follow the previous pattern
- The next part of the string must consist of lowercase letters (a-z) or dots (.), and be between 2 and 6 characters in length
- Next, the string may include any combination of slashes (/), alphanumeric characters from the basic Latin alphabet (including _), spaces, dots (.), or hyphens (-). This pattern can repeat 0 or more times
- The previous pattern can be followed by / 0 or 1 times
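As a quick check of the breakdown above, the following sketch runs the expression against a few sample strings using JavaScript's RegExp.prototype.test. The helper name isUrl and the sample strings are assumptions made for illustration; only the expression itself comes from this tutorial.

```javascript
// The URL-matching expression from this tutorial, stored as a regex literal.
const urlPattern = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/;

// Hypothetical helper: returns true when the whole string matches the pattern.
function isUrl(candidate) {
  return urlPattern.test(candidate);
}

// Sample inputs chosen for illustration only.
const sampleStrings = [
  "https://www.example.com/path/page", // protocol, host, and path
  "example.com",                       // protocol omitted
  "HTTPS://EXAMPLE.COM",               // uppercase letters are not covered by a-z
  "not a url",                         // no dot-separated host
];

const sampleResults = sampleStrings.map(isUrl);
console.log(sampleResults); // [ true, true, false, false ]
```

Because the pattern only lists a-z, uppercase input is rejected unless the expression is given the i (case-insensitive) flag.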
- Anchors
- Quantifiers
- Grouping Constructs
- Bracket Expressions
- Character Classes
- Character Escapes
- Author
There are many individual components that go into building regular expressions. Because a regex is written as a literal (in JavaScript, for example), it is important to remember that the expression must be wrapped in slash characters (/). When looking at our example expression for matching a URL, we can see / at the beginning and end.
/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/
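To illustrate that the surrounding slashes are delimiters rather than part of the pattern, here is a minimal sketch that assumes JavaScript. The pattern abc and the variable names are made up for this example.

```javascript
// A regex literal: the slashes mark where the pattern starts and ends.
const fromLiteral = /abc/;

// The same pattern built from a plain string with the RegExp constructor.
const fromConstructor = new RegExp("abc");

console.log(fromLiteral.source);          // abc
console.log(fromConstructor.source);      // abc
console.log(fromLiteral.test("xxabcxx")); // true
```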
We will now take a look at each of the components that make up this expression.
^ and $ are the two anchor characters used in regex.
- ^ - This anchor marks the beginning of the string: the pattern that follows it must match at the very start. This anchor can be used in two different ways:
  - to signify an exact match, for example ^ABC. "ABC" and "ABCDEFG" would both be matches.
  - to signify a range of possible matches using bracket expressions. We will look into bracket expressions in a later section.
- $ - This anchor marks the end of the string: the pattern that precedes it must match at the very end.
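The behavior of both anchors can be sketched with a few small patterns. This example assumes JavaScript, and the pattern names and test strings are illustrative only.

```javascript
const startsWithAbc = /^ABC/;     // "ABC" must appear at the start
const endsWithXyz = /XYZ$/;       // "XYZ" must appear at the end
const exactlyAbc = /^ABC$/;       // the whole string must be "ABC"
const startsWithLower = /^[a-z]/; // first character must be a lowercase letter

console.log(startsWithAbc.test("ABCDEFG")); // true
console.log(startsWithAbc.test("XABC"));    // false
console.log(endsWithXyz.test("UVWXYZ"));    // true
console.log(exactlyAbc.test("ABC"));        // true
console.log(exactlyAbc.test("ABCDEFG"));    // false
console.log(startsWithLower.test("apple")); // true
console.log(startsWithLower.test("Apple")); // false
```

Using ^ and $ together, as our URL expression does, means the entire string has to fit the pattern, not just a portion of it.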
In regex, quantifiers define how many times a character or expression must occur to produce a match. First, we will look at the different types of quantifiers, then at the specific quantifiers within our regex example for matching a URL.
- * - Matches the preceding pattern 0 or more times. Example: /fun*/ matches "funn" in "funny" and "fu" in "fuel", but there are no matches in "thunder" or "bird"
- + - Matches the preceding pattern 1 or more times. Example: /k+/ matches each k in "kick"
- ? - Matches the preceding pattern 0 or 1 times. Example: /e?le?/ matches the "el" in "gel" and the "le" in "cradle". An important note about ? is that if it is used immediately following another quantifier (*, +, ?, {}), it makes that quantifier non-greedy, meaning it will match the minimum number of times, as opposed to the default, which matches the maximum number of times
- {n} - Matches the preceding pattern exactly n times
- {n,} - Matches the preceding pattern at least n times
- {n,x} - Matches the preceding pattern a minimum of n times and a maximum of x times
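To see how each quantifier behaves, the following sketch uses String.prototype.match in JavaScript. The sample words are illustrative, and the helper firstMatch is a hypothetical convenience written for this example.

```javascript
// Hypothetical helper: returns the first matched text, or null if nothing matches.
function firstMatch(text, pattern) {
  const result = text.match(pattern);
  return result ? result[0] : null;
}

console.log(firstMatch("funny", /fun*/));    // funn  (n* takes both n characters)
console.log(firstMatch("fuel", /fun*/));     // fu    (n* matches zero times)
console.log(firstMatch("thunder", /fun*/));  // null  (there is no "fu" to start from)
console.log("kick".match(/k+/g));            // [ 'k', 'k' ]
console.log(firstMatch("gel", /e?le?/));     // el
console.log(firstMatch("cradle", /e?le?/));  // le

// Greedy versus non-greedy: adding ? after + stops at the first possible end.
console.log(firstMatch("<a><b>", /<.+>/));   // <a><b>
console.log(firstMatch("<a><b>", /<.+?>/));  // <a>

// Curly-brace quantifiers.
console.log(firstMatch("aaaa", /a{2}/));     // aa
console.log(firstMatch("aaaa", /a{2,}/));    // aaaa
console.log(firstMatch("aaaa", /a{1,3}/));   // aaa
```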
Looking back to our regex example for matching a URL, we can examine the different quantifiers present in the expression.
/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/
- We see ? used in a few different places. First, to signify that the s in "https" may match 0 or 1 times (so both "http" and "https" are accepted), then to signify that the whole protocol group (https?:\/\/) may match 0 or 1 times, and finally at the end to signify that / may match 0 or 1 times
- We see + used one time to signify that the pattern [\da-z\.-] must match 1 or more times
- We see {2,6} used to signify that the pattern [a-z\.] must match a minimum of 2 and a maximum of 6 times
- We see * used twice: inside the last group to signify that [\/\w \.-] may match 0 or more times, and after that group to signify that ([\/\w \.-]*) as a whole may match 0 or more times
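The effect of these quantifiers can be observed by testing strings that leave out the optional parts or fall short of the required counts. This is a sketch assuming JavaScript; the variable name urlRegex and the test strings are illustrative.

```javascript
const urlRegex = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/;

console.log(urlRegex.test("example.com"));          // true: the protocol group matched 0 times
console.log(urlRegex.test("http://example.com"));   // true: the s in https matched 0 times
console.log(urlRegex.test("https://example.com/")); // true: protocol, host, and a trailing slash
console.log(urlRegex.test("example.c"));            // false: {2,6} needs at least 2 characters
console.log(urlRegex.test("https://"));             // false: + needs at least 1 host character
```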
Grouping Constructs are used in regex to check multiple parts or sections of a string for different requirements. By wrapping different sections of the regex in parentheses, ( and ), we can create subexpressions that have separate requirements from each other.
Unless instructed otherwise, subexpressions look for exact matches. For example, given the expression (abc):(def), "abc:def" would match, whereas "cba:fed" would not.
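The (abc):(def) example can be tried directly. In this sketch, which assumes JavaScript, each parenthesized subexpression also becomes a numbered capture group on the match result; the variable names are illustrative.

```javascript
const pairPattern = /(abc):(def)/;

const pairMatch = "abc:def".match(pairPattern);
console.log(pairMatch[0]); // abc:def  (the full match)
console.log(pairMatch[1]); // abc      (first subexpression)
console.log(pairMatch[2]); // def      (second subexpression)

console.log(pairPattern.test("cba:fed")); // false: the letters are in the wrong order
```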
In our regex example for matching a URL:
/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/
we see the following subexpressions:
- (https?:\/\/)
- ([\da-z\.-]+)
- ([a-z\.]{2,6})
- ([\/\w \.-]*)
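Each of these subexpressions captures the part of the string it matched, so a successful match can be taken apart afterwards. The following sketch assumes JavaScript; the sample URLs and variable names are made up for illustration.

```javascript
const urlGroups = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/;

const urlParts = "https://www.example.com/docs".match(urlGroups);
console.log(urlParts[1]); // https://      (protocol)
console.log(urlParts[2]); // www.example   (host name up to the last dot)
console.log(urlParts[3]); // com           (top-level domain)
console.log(urlParts[4]); // /docs         (path)

// When an optional group does not take part in the match, its slot is undefined.
const bareParts = "example.com".match(urlGroups);
console.log(bareParts[1]); // undefined
```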
Bracket Expressions, also known as Positive Character Groups, represent a range of characters that we want to match in our string. When writing Bracket Expressions, we can include all characters that we want to match, but a more common practice is to use a hyphen to represent a range of these characters. For example, [abcdef] has the same meaning as [a-f] and would indicate that we are looking for a string that includes a, b, c, d, e, or f. As long as the string includes one of the characters indicated, it will be considered a match.
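A short sketch, assuming JavaScript, shows that the explicit list and the hyphenated range behave the same way. The sample words are illustrative.

```javascript
const explicitSet = /[abcdef]/; // every allowed character listed out
const rangeSet = /[a-f]/;       // the same set written as a range

const sampleWords = ["zebra", "sky", "face"];
const explicitResults = sampleWords.map((word) => explicitSet.test(word));
const rangeResults = sampleWords.map((word) => rangeSet.test(word));

console.log(explicitResults); // [ true, false, true ]
console.log(rangeResults);    // [ true, false, true ]

// With the g flag, match returns every character that falls in the range.
const rangeHits = "cabbage".match(/[a-f]/g);
console.log(rangeHits); // [ 'c', 'a', 'b', 'b', 'a', 'e' ]
```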
In our regex example for matching a URL:
/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/
we see the following Bracket Expressions:
- [\da-z\.-] - this bracket expression indicates that any numerical digit, any lowercase letter a-z, a dot, or a hyphen will produce a match
- [a-z\.] - this bracket expression indicates that any lowercase letter a-z or a dot will produce a match
- [\/\w \.-] - this bracket expression indicates that a slash, any alphanumeric character or underscore, a space, a dot, or a hyphen will produce a match
Character Classes are used in regex to define sets of characters, any one of which can occur in a string to produce a match. Bracket Expressions, discussed in the previous section, are one popular type of character class. Here are some other common character classes.
- . - Matches any character except for \n (new line)
- \d - Matches any numerical digit
- \w - Matches any alphanumeric character from the Latin alphabet, including an underscore (_)
- \s - Matches a single whitespace character, including tabs and line breaks
Note that \d, \w, and \s can each be changed to an inverse match by capitalizing the letter (\D, \W, \S).
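The classes listed above can be compared side by side on one sample string. This sketch assumes JavaScript, and the sample text is made up for illustration.

```javascript
const sampleText = "Room 42_b";

const digits = sampleText.match(/\d/g);     // digits only
const wordChars = sampleText.match(/\w/g);  // letters, digits, and underscore
const spaces = sampleText.match(/\s/g);     // whitespace
const nonDigits = sampleText.match(/\D/g);  // inverse of \d

console.log(digits);    // [ '4', '2' ]
console.log(wordChars); // [ 'R', 'o', 'o', 'm', '4', '2', '_', 'b' ]
console.log(spaces);    // [ ' ' ]
console.log(nonDigits); // [ 'R', 'o', 'o', 'm', ' ', '_', 'b' ]

// The dot matches every character except the line break.
const dotHits = "a\nb".match(/./g);
console.log(dotHits);   // [ 'a', 'b' ]
```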
In our regex example for matching a URL:
/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/
we see character classes applied in the following ways:
- [\da-z\.-] - within this bracket expression (a character class unto itself), we see \d being used to indicate that any numerical digit will produce a match
- [\/\w \.-] - within this bracket expression, we see \w being used to indicate that any alphanumeric character from the Latin alphabet, or an underscore, will produce a match
Character Escapes are notated using a backslash (\) in regex. Character escapes are used when a character that normally has a special meaning should be interpreted literally. For example, { usually indicates the beginning of a quantifier, but if we precede the curly brace with a backslash (\{), regex will look for an opening curly brace rather than the beginning of a quantifier. This can be useful when looking for strings that include special characters. One caveat is that most special characters (such as ., *, and ?) lose their special meaning when included inside bracket expressions, so escaping them there is optional. The backslash itself keeps working inside brackets, which is why \d and \w still act as character classes in our example.
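A few one-line checks can make the difference between an escaped and an unescaped character concrete. This sketch assumes JavaScript; the patterns and test strings are illustrative, and the results are stored in an object named escapeChecks purely for this example.

```javascript
const escapeChecks = {
  dotAsWildcard: /a.c/.test("abc"),     // unescaped . matches any character
  escapedDotOnAbc: /a\.c/.test("abc"),  // \. only matches a literal dot
  escapedDotOnDot: /a\.c/.test("a.c"),
  escapedBrace: /\{/.test("a{b}"),      // \{ matches a literal opening brace
  dotInBrackets: /[.]/.test("abc"),     // inside brackets, . is a literal dot
  digitInBrackets: /[\d]/.test("7"),    // \d is still a digit class inside brackets
};

console.log(escapeChecks.dotAsWildcard);   // true
console.log(escapeChecks.escapedDotOnAbc); // false
console.log(escapeChecks.escapedDotOnDot); // true
console.log(escapeChecks.escapedBrace);    // true
console.log(escapeChecks.dotInBrackets);   // false
console.log(escapeChecks.digitInBrackets); // true
```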
Looking at our regex example for matching a URL:
/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/
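we can see where character escapes are being utilized.
- \. - here it is used to indicate that we are looking for the literal character . and not the character class . that matches any character
- \/ - here it is used to indicate that we are looking for a literal / rather than the slash that ends the regex literal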
Andrew Mason studied web development at UC Berkeley in 2022. He enjoys creating new projects and being an active member of the developer community. To view more of his work, you can visit his GitHub Profile.