This tutorial explains how a regular expression (regex) can match a URL by breaking the expression down piece by piece.
Refer to the reference at the end for what the various operators mean.
Here is the URL-matching regex:
/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/
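As a quick check of what the expression accepts, the following minimal sketch runs it against a few inputs. The sample strings and the names urlRegex and samples are hypothetical, chosen only for illustration.

```javascript
// The URL regex from above, stored as a JavaScript regex literal.
const urlRegex = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/;

// Hypothetical sample inputs (not from the original tutorial).
const samples = [
  "https://example.com",
  "example.org/docs/",
  "http://sub.example.co.uk/a-b_c",
  "not a url",
];

for (const s of samples) {
  // test() returns true when the whole string matches the pattern.
  console.log(s, "->", urlRegex.test(s));
}
// Only the last sample ("not a url") fails to match.
```

Note that the scheme is optional, so a bare "example.org/docs/" is accepted as well.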
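First, look at the anchors, which mark the beginning and the end of the expression:
Beginning - ^ anchors the match to the start of the string.
Ending - $ anchors the match to the end of the string.
Neither anchor consumes any characters. The body between them is:
(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?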
We break down the body into capturing groups and literals, in order of evaluation:
- (https?:\/\/)?
- ([\da-z\.-]+)
- \.
- ([a-z\.]{2,6})
- ([\/\w \.-]*)*
- \/?
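To see how the capturing groups line up with these pieces, here is a small sketch using String.prototype.match. The input URL and the variable names are hypothetical.

```javascript
const urlRegex = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/;

// Hypothetical input, used only to illustrate the groups.
const m = "https://www.example.com/docs/page".match(urlRegex);

// m[0] is the whole match; m[1], m[2], m[3] are the first three groups.
const scheme = m[1];      // (https?:\/\/)?
const domainName = m[2];  // ([\da-z\.-]+)
const domainType = m[3];  // ([a-z\.]{2,6})

console.log(scheme);      // https://
console.log(domainName);  // www.example
console.log(domainType);  // com
```

The literal \. between the second and third groups is matched but not captured, which is why the period appears in neither value.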
We can break down what each piece of the expression is telling us.
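- (https?:\/\/)?
  http (literal characters): matches the characters 'http'.
  s? (literal character with quantifier): the ? makes the 's' optional, so both 'http' and 'https' match. This ? is the zero-or-one quantifier, not the lazy match operator.
  :\/\/ (literal character and escape operator): matches '://'; each slash is escaped with a backslash.
  ? after the group (quantifier): makes the entire group optional.
  In short: the URL may begin with 'http://' or 'https://'.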
- ([\da-z\.-]+)
  [] (bracket expression): matches any one character listed within the brackets.
  \d (character class): matches any numeric digit.
  a-z (literal character range): matches any lowercase letter.
  \. (escape operator): matches a literal period.
  - (literal character): matches a dash; placed last in the brackets, it is not read as a range.
  + (quantifier): matches one or more of the preceding element.
  In short: at least one digit, lowercase letter, period, or dash. This would be the domain name.
- \.
  \. (escape operator): matches a literal period.
  In short: a single period. This would separate the domain name from the domain type (domain.com, domain.net, ...).
- ([a-z\.]{2,6})
  [] (bracket expression): matches any one character listed within the brackets.
  a-z (literal character range): matches any lowercase letter.
  \. (escape operator): matches a literal period.
  {2,6} (quantifier): matches between two and six of the preceding element.
  In short: between 2 and 6 lowercase letters or periods. This would be the domain type.
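- ([\/\w \.-]*)*
  [] (bracket expression): matches any one character listed within the brackets.
  \/ (escape operator): matches a literal slash.
  \w (character class): matches any alphanumeric character or the underscore.
  space (literal character): matches a single space.
  \. (escape operator): matches a literal period.
  - (literal character): matches a dash.
  * (quantifier): matches zero or more of the preceding element; it appears once inside the group and once after it.
  In short: any number of slashes, alphanumeric characters, underscores, spaces, periods, or dashes, any number of times. This would be the path after the domain.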
- \/?
  \/ (escape operator): matches a literal slash.
  ? (quantifier): matches zero or one time.
  In short: the URL can end with a slash.
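One way to check the breakdown is to test some pieces on their own. The sketch below wraps three of them in ^ and $ so each can be tried in isolation; the variable names and sample strings are hypothetical.

```javascript
// Individual pieces of the URL regex, each anchored so it is tested alone.
const schemePart = /^(https?:\/\/)?$/;
const domainNamePart = /^([\da-z\.-]+)$/;
const domainTypePart = /^([a-z\.]{2,6})$/;

// The 's' is optional, and so is the whole scheme group.
console.log(schemePart.test("http://"));   // true
console.log(schemePart.test("https://"));  // true
console.log(schemePart.test(""));          // true
console.log(schemePart.test("ftp://"));    // false

// Only digits, lowercase letters, periods, and dashes are allowed.
console.log(domainNamePart.test("my-site2"));  // true
console.log(domainNamePart.test("My_Site"));   // false

// Between two and six lowercase letters or periods.
console.log(domainTypePart.test("com"));      // true
console.log(domainTypePart.test("c"));        // false
console.log(domainTypePart.test("museums"));  // false (seven characters)
```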
- Literal Characters
- Escape Operator
- Anchors
- Quantifiers
- OR Operator
- Character Classes
- Flags
- Grouping and Capturing
- Bracket Expressions
- Greedy and Lazy Match
- Boundaries
- Back-references
- Look-ahead and Look-behind
Literal characters are matched exactly as written. For instance, the regex "a" searches the text for the character 'a'.
Inside a bracket expression, a range of literal characters can be defined using "-". For instance, [a-z] matches any lowercase letter.
The escape character "\" makes a character that would otherwise act as an operator match literally. For instance, "\." matches a literal period rather than any character.
Anchors match a position in the string rather than a character.
^ - anchors to the start of a string.
$ - anchors to the end of a string.
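A minimal sketch of the two anchors, using hypothetical sample words:

```javascript
// ^ requires the match to begin at the start of the string.
console.log(/^cat/.test("catalog"));  // true
console.log(/^cat/.test("bobcat"));   // false

// $ requires the match to finish at the end of the string.
console.log(/cat$/.test("bobcat"));   // true
console.log(/cat$/.test("catalog"));  // false

// Together they require the whole string to match.
console.log(/^cat$/.test("cat"));     // true
console.log(/^cat$/.test("cats"));    // false
```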
Quantifiers specify how many times the preceding element may repeat.
* - matches zero or more times.
+ - matches at least once.
? - matches zero or one time.
{} - matches a specific number or range of times.
{n} - matches exactly n times.
{n,} - matches at least n times.
{n,m} - matches between n and m times.
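The quantifiers can be compared side by side in a short sketch. The patterns are anchored with ^ and $ so the counts are exact; the sample strings are hypothetical.

```javascript
// * : zero or more 'b' characters after 'a'.
console.log(/^ab*$/.test("a"));       // true
console.log(/^ab*$/.test("abbb"));    // true

// + : at least one 'b'.
console.log(/^ab+$/.test("a"));       // false
console.log(/^ab+$/.test("ab"));      // true

// ? : zero or one 'b'.
console.log(/^ab?$/.test("ab"));      // true
console.log(/^ab?$/.test("abb"));     // false

// {n}, {n,}, {n,m} : explicit counts.
console.log(/^a{2}$/.test("aaa"));    // false
console.log(/^a{2,}$/.test("aaaa"));  // true
console.log(/^a{2,3}$/.test("aaaa")); // false
```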
With the OR operator "|", the expression on either side of the operator can match.
Character classes are shorthand for sets of literal characters.
. - matches any character except a newline.
\d - matches any numeric digit.
\w - matches any alphanumeric character or the underscore; equivalent to [A-Za-z0-9_].
\s - matches a single whitespace character, including tabs and line breaks.
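The sketch below shows what each character class picks out of a hypothetical sample string, and contrasts the unescaped period with the escaped one. The g flag is used only so that match() returns every match.

```javascript
const sample = "a1_ b"; // hypothetical input

console.log(sample.match(/\d/g)); // [ '1' ]
console.log(sample.match(/\w/g)); // [ 'a', '1', '_', 'b' ]
console.log(sample.match(/\s/g)); // [ ' ' ]

// . matches any single character except a newline.
console.log(/^.$/.test("x"));   // true
console.log(/^.$/.test("\n"));  // false

// Unescaped, the period is a character class; escaped, it is a literal.
console.log(/a.c/.test("abc"));   // true
console.log(/a\.c/.test("abc"));  // false
console.log(/a\.c/.test("a.c"));  // true
```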
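Flags are optional. They are placed at the end of the expression, after the closing slash, and change how the search is performed.
g - Global search: find every match rather than stopping at the first.
i - Case-insensitive search.
m - Multi-line search: ^ and $ also match at the start and end of each line.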
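A short sketch of the three flags in JavaScript, with hypothetical sample text:

```javascript
const text = "Cat cat CAT"; // hypothetical input

// No flags: only the first case-sensitive match is returned.
console.log(text.match(/cat/)[0]);   // cat
// g: every case-sensitive match.
console.log(text.match(/cat/g));     // [ 'cat' ]
// g and i combined: every match, ignoring case.
console.log(text.match(/cat/gi));    // [ 'Cat', 'cat', 'CAT' ]

const lines = "one\ntwo";
// Without m, ^ and $ only match at the ends of the whole string.
console.log(/^\w+$/.test(lines));    // false
// With m, they also match at each line break.
console.log(lines.match(/^\w+$/gm)); // [ 'one', 'two' ]
```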
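Groups are written within parentheses. A group treats several expressions as a single unit, so a quantifier or OR operator applies to the whole group, and the text the group matches is captured.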
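The following sketch illustrates three uses of a group: repeating a unit, limiting the reach of the OR operator, and capturing text. All sample strings and names are hypothetical.

```javascript
// With a group, + repeats the whole unit "ab"; without it, only the "b".
console.log(/^(ab)+$/.test("ababab")); // true
console.log(/^ab+$/.test("ababab"));   // false

// The group limits the OR operator to the two alternatives inside it.
const pet = /^my (cat|dog)$/;
console.log(pet.test("my cat"));  // true
console.log(pet.test("my dog"));  // true
console.log(pet.test("my bird")); // false

// Each group captures the text it matched.
const parts = "2024-05".match(/^(\d{4})-(\d{2})$/);
console.log(parts[1], parts[2]);  // 2024 05
```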
Bracket expressions are written within square brackets. They match any single character from the set of characters listed inside.
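A minimal sketch of a bracket expression, using hypothetical sample words:

```javascript
// [ae] matches exactly one character: either 'a' or 'e'.
const grayOrGrey = /^gr[ae]y$/;
console.log(grayOrGrey.test("gray"));  // true
console.log(grayOrGrey.test("grey"));  // true
console.log(grayOrGrey.test("groy"));  // false
console.log(grayOrGrey.test("graey")); // false (two characters, not one)

// A range inside the brackets lists the characters more compactly.
console.log(/^[a-c]$/.test("b"));      // true
console.log(/^[a-c]$/.test("d"));      // false
```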
A greedy expression seeks the longest possible match for a given expression. Quantifiers are greedy by default.
A lazy expression places "?" directly after a quantifier to stop at the shortest possible match.
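The difference between greedy and lazy matching can be seen in a short sketch. The sample markup and variable names are hypothetical.

```javascript
const markup = "<b>one</b><b>two</b>"; // hypothetical input

// Greedy: .* takes as much as it can, so the match runs to the last </b>.
const greedy = markup.match(/<b>.*<\/b>/)[0];
console.log(greedy); // <b>one</b><b>two</b>

// Lazy: .*? takes as little as it can, so the match stops at the first </b>.
const lazy = markup.match(/<b>.*?<\/b>/)[0];
console.log(lazy);   // <b>one</b>
```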
\b - Defines a word boundary: the position between a word character and a non-word character, or the start or end of the string. For instance, dog\b matches in Houndog, but \bdog does not.
\B - Defines a non-word boundary: any position that is not a word boundary. For instance, \Bdog matches in Houndog, but not in dog.
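The boundary examples above can be checked directly with a small sketch:

```javascript
// "dog" is followed by the end of the string, which is a word boundary.
console.log(/dog\b/.test("Houndog")); // true

// "dog" is preceded by 'n', a word character, so there is no boundary before it.
console.log(/\bdog/.test("Houndog")); // false

// \B requires that there is no word boundary before "dog".
console.log(/\Bdog/.test("Houndog")); // true
console.log(/\Bdog/.test("dog"));     // false
```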
\1, \2, ... - A back-reference matches again the same text that the capturing group with that number matched.
For instance, \1 refers to the first group.
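As an illustration, the sketch below uses \1 to find a word that is immediately repeated. The pattern name and sample sentences are hypothetical.

```javascript
// (\w+) captures a word; \1 then requires the same word again after a space.
const doubledWord = /\b(\w+) \1\b/;

console.log(doubledWord.test("the the cat")); // true
console.log(doubledWord.test("the cat"));     // false

// The captured group holds the repeated word.
const repeated = "the the cat".match(doubledWord)[1];
console.log(repeated); // the
```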
(?=Something) - Look-ahead: checks that what follows the match is Something, without including it in the match.
(?<=Something) - Look-behind: checks that what precedes the match is Something, without including it in the match.
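A minimal sketch of both assertions, assuming a JavaScript engine that supports look-behind (recent versions of Node.js and current browsers do). The sample sentence and variable names are hypothetical.

```javascript
const sentence = "price: $42 and 17 items"; // hypothetical input

// Look-ahead: digits that are followed by " items".
const count = sentence.match(/\d+(?= items)/)[0];
console.log(count); // 17

// Look-behind: digits that are preceded by a dollar sign.
const price = sentence.match(/(?<=\$)\d+/)[0];
console.log(price); // 42
```

In both cases the text inside the assertion is checked but left out of the match itself.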