This tutorial breaks down a regular expression (regex) that matches a URL.
Refer to the reference below for what the various operators mean.
Here is the URL-matching regex:
/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/
The anchors separate the beginning and the end of the expression from the pattern between them:
- ^ : Beginning of the string
- (https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/? : The pattern the whole string must match
- $ : End of the string, no operations follow it
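The effect of the anchors can be sketched in a few lines. This is a minimal example that assumes JavaScript regex literal syntax; the sample string and the variable names are made up for illustration.

```javascript
// The URL regex from this tutorial, with and without its anchors.
const anchoredUrl = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/;
const unanchoredUrl = /(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?/;

// Hypothetical sample text: a URL surrounded by other characters.
const sentence = "link: https://example.com!";

// With anchors, the whole string has to be a URL, so this fails.
const wholeStringIsUrl = anchoredUrl.test(sentence);
console.log(wholeStringIsUrl); // false

// Without anchors, the URL can be found anywhere inside the string.
const foundInside = unanchoredUrl.exec(sentence)[0];
console.log(foundInside); // "https://example.com"

// A string that is a URL from start to end passes the anchored version.
const bareUrlPasses = anchoredUrl.test("https://example.com");
console.log(bareUrlPasses); // true
```

The anchored form suits validating a single input field, while the unanchored form is closer to what a search through free text would need.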
We break down the expression between the anchors into capturing groups and literals, in order of evaluation:
- (https?:\/\/)?
- ([\da-z\.-]+)
- \.
- ([a-z\.]{2,6})
- ([\/\w \.-]*)*
- \/?
We can break down what each part of the expression is telling us.
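- (https?:\/\/)?
  - http : Literal characters 'http'
  - s? : Literal character 's' followed by a quantifier, so the 's' may appear zero or one time
  - :\/\/ : Literal characters and escape operator for '://'
  - (...)? : Quantifier on the group, so the whole scheme may appear zero or one time
  - Result: optionally start with http:// or https://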
- ([\da-z\.-]+)
  - [] : Bracket expression, compares the character to what is within the brackets
  - \d : Character class, matches any numeric digit
  - a-z : Literal character range, matches any lowercase letter
  - \. : Escape operator for a period
  - - : Literal character for a dash (it comes last in the brackets, so it does not define a range)
  - + : Quantifier, one or more characters match the bracket expression
  - Result: have at least one digit, lowercase letter, period, or dash (this would be the domain name)
- \.
  - \. : Escape operator for a period
  - Result: have a period, which separates the domain name from the domain type (domain.com, domain.net, ...)
- ([a-z\.]{2,6})
  - [] : Bracket expression, compares the character to what is within the brackets
  - a-z : Literal character range, matches any lowercase letter
  - \. : Escape operator for a period
  - {2,6} : Quantifier, between two and six characters
  - Result: match between 2 and 6 lowercase letters or periods (this would be the domain type)
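- ([\/\w \.-]*)*
  - [] : Bracket expression, compares the character to what is within the brackets
  - \/ : Escape operator for a slash
  - \w : Character class, matches any alphanumeric character or the underscore
  - (space) : Literal character for a space
  - \. : Escape operator for a period
  - - : Literal character for a dash
  - * : Quantifier, zero or more characters match the bracket expression
  - (...)* : Quantifier on the group, so the whole group may repeat zero or more times
  - Result: match any number of slashes, alphanumeric characters, spaces, periods, or dashes, any number of times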
- \/?
  - \/ : Escape operator for a slash
  - ? : Quantifier, matches zero or one time
  - Result: it can end with a slash
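To see how the pieces above map onto a real string, the capturing groups can be printed one by one. This is a minimal sketch in JavaScript; the sample URLs and the names `urlPattern` and `parts` are hypothetical.

```javascript
// The full URL regex from this tutorial.
const urlPattern = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/;

// exec() returns the whole match at index 0 and one entry per capturing group.
const parts = urlPattern.exec("https://www.example.com/docs/index.html");

console.log(parts[1]); // "https://"          from (https?:\/\/)?
console.log(parts[2]); // "www.example"       from ([\da-z\.-]+)
console.log(parts[3]); // "com"               from ([a-z\.]{2,6})
console.log(parts[4]); // "/docs/index.html"  from ([\/\w \.-]*)*

// A string with no period between a name and a type is rejected.
const plainText = urlPattern.test("not a url");
console.log(plainText); // false

// '?' and '=' are not in the last bracket expression, so a query string is rejected.
const withQuery = urlPattern.test("https://example.com/search?q=1");
console.log(withQuery); // false
```

The last check shows a limit of this particular pattern: it only accepts the characters listed in its bracket expressions, so it is better read as a teaching example than as a complete URL validator.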
- Literal Characters
- Escape Operator
- Anchors
- Quantifiers
- OR Operator
- Character Classes
- Flags
- Grouping and Capturing
- Bracket Expressions
- Greedy and Lazy Match
- Boundaries
- Back-references
- Look-ahead and Look-behind
Literal characters are matched exactly as written. For instance, if the regex is " a ", it will search the text for the character 'a'.
Inside a bracket expression, they can be defined as a range using " - ". For instance, [a-z] defines all lowercase letters.
The escape character " \ " turns a character that would otherwise act as an operator into a literal character. For instance, " \. " matches a period.
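The difference between an escaped and an unescaped period can be checked directly. This is a small sketch in JavaScript with made-up sample strings; it relies on the unescaped period matching any character, as described under Character Classes.

```javascript
// A literal character matches itself anywhere in the text.
const literalHit = /a/.test("cat");
console.log(literalHit); // true

// Unescaped, the period is an operator that matches any character.
const dotAsOperator = /a.c/.test("abc");
console.log(dotAsOperator); // true

// Escaped, the period only matches an actual period.
const dotAsLiteralMiss = /a\.c/.test("abc");
const dotAsLiteralHit = /a\.c/.test("a.c");
console.log(dotAsLiteralMiss); // false
console.log(dotAsLiteralHit);  // true
```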
Anchors tie a match to a position in the string rather than to a character.
^ - Anchors to the start of a string.
$ - Anchors to the end of a string.
Quantifiers set how many times the preceding character, group, or bracket expression may repeat.
* - Matches zero or more times.
+ - Matches at least once.
? - Matches zero or one time.
{} - Matches a specific number of times.
{n} - Matches exactly n times.
{n,} - Matches at least n times.
{n,m} - Matches between n and m times.
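Each quantifier above can be tried on a short string. The following is a minimal sketch in JavaScript; the patterns and sample strings are made up for illustration, and the anchors are there so the whole string has to satisfy the quantifier.

```javascript
// * : zero or more 'b' characters between 'a' and 'c'.
const starZero = /^ab*c$/.test("ac");
const starMany = /^ab*c$/.test("abbbc");
console.log(starZero, starMany); // true true

// + : at least one 'b' is required.
const plusZero = /^ab+c$/.test("ac");
console.log(plusZero); // false

// ? : the 'u' may appear zero or one time.
const optionalU = /^colou?r$/.test("color") && /^colou?r$/.test("colour");
console.log(optionalU); // true

// {n} : exactly four digits.
const exactlyFour = /^\d{4}$/.test("2024");
console.log(exactlyFour); // true

// {n,} : at least two digits.
const atLeastTwo = /^\d{2,}$/.test("7");
console.log(atLeastTwo); // false

// {n,m} : between two and three 'a' characters.
const twoToThree = /^a{2,3}$/.test("aaa");
const tooMany = /^a{2,3}$/.test("aaaa");
console.log(twoToThree, tooMany); // true false
```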
With the OR operator " | ", the expression on either side of the operator can match.
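As a quick illustration, an OR inside a group accepts either alternative. This is a sketch in JavaScript with hypothetical sample words.

```javascript
// Either 'cat' or 'dog' is accepted as the whole string.
const petPattern = /^(cat|dog)$/;

const acceptsCat = petPattern.test("cat");
const acceptsDog = petPattern.test("dog");
const acceptsBird = petPattern.test("bird");

console.log(acceptsCat, acceptsDog, acceptsBird); // true true false
```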
Character classes are used as shorthand for sets of literal characters.
. - Matches any character except a newline.
\d - Matches any numeric digit.
\w - Matches any alphanumeric character or the underscore, equivalent to [A-Za-z0-9_].
\s - Matches a single whitespace character, including tabs and line breaks.
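The character classes above can be checked against a few sample strings. This is a minimal sketch in JavaScript; the strings are made up for illustration.

```javascript
// . matches any single character except a newline.
const dotMatchesLetter = /^.$/.test("x");
const dotMatchesNewline = /^.$/.test("\n");
console.log(dotMatchesLetter, dotMatchesNewline); // true false

// \d matches a digit.
const hasDigit = /\d/.test("abc");
console.log(hasDigit); // false

// \w matches letters, digits, and the underscore, but not a space.
const allWordChars = /^\w+$/.test("snake_case1");
const spaceIsWordChar = /^\w+$/.test("two words");
console.log(allWordChars, spaceIsWordChar); // true false

// \s matches whitespace such as a space or a tab.
const hasWhitespace = /\s/.test("two words") && /\s/.test("tab\there");
console.log(hasWhitespace); // true
```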
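Flags are optional. They are placed at the end of the expression and change how the search is performed.
g - Global search: finds every match instead of stopping at the first one.
i - Case-insensitive search.
m - Multi-line search: ^ and $ also match at the start and end of each line.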
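The effect of each flag is easiest to see by running the same pattern with and without it. The sketch below assumes JavaScript, where flags follow the closing slash; the sample strings are hypothetical.

```javascript
const pets = "Cat cat";

// No flags: only the first case-sensitive match is found.
const firstOnly = pets.match(/cat/)[0];
console.log(firstOnly); // "cat"

// g: every match is returned; i: 'Cat' now matches as well.
const globalOnly = pets.match(/cat/g);
const globalIgnoreCase = pets.match(/cat/gi);
console.log(globalOnly);       // ["cat"]
console.log(globalIgnoreCase); // ["Cat", "cat"]

// m: ^ matches at the start of every line, not only at the start of the string.
const lines = "one\ntwo";
const withoutMultiline = lines.match(/^\w+/g);
const withMultiline = lines.match(/^\w+/gm);
console.log(withoutMultiline); // ["one"]
console.log(withMultiline);    // ["one", "two"]
```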
Groups are written within parentheses. A group treats multiple expressions as one unit, and a capturing group also saves the text it matched.
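Both uses of a group, capturing and treating several characters as one unit, can be sketched briefly. This example assumes JavaScript; the date-like string and the variable names are made up.

```javascript
// Capturing: each pair of parentheses saves the text it matched.
const dateMatch = /(\d{4})-(\d{2})/.exec("2024-05");
console.log(dateMatch[1]); // "2024"
console.log(dateMatch[2]); // "05"

// Grouping: the + applies to the whole group 'ab', not just to 'b'.
const groupRepeats = /^(ab)+$/.test("ababab");
const onlyBRepeats = /^ab+$/.test("ababab");
console.log(groupRepeats, onlyBRepeats); // true false
```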
Bracket expressions are written within square brackets. They list the characters that are allowed to match at a single position.
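A bracket expression can be tried on its own before it is combined with a quantifier. This is a small sketch in JavaScript with hypothetical sample strings.

```javascript
// Any one of the listed vowels matches; g collects every occurrence.
const vowels = "regex".match(/[aeiou]/g);
console.log(vowels); // ["e", "e"]

// Lowercase letters, digits, and a dash, one or more times.
const slugPattern = /^[a-z0-9-]+$/;
const validSlug = slugPattern.test("my-site-1");
const invalidSlug = slugPattern.test("My_Site");
console.log(validSlug, invalidSlug); // true false
```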
A greedy expression seeks the longest possible match. Quantifiers are greedy by default.
A lazy expression adds " ? " after a quantifier (for example *? or +?) so that it ends at the shortest possible match.
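Running a greedy and a lazy version of the same pattern on one string shows the difference. This is a minimal sketch in JavaScript; the tag-like sample string is made up.

```javascript
const tags = "<a><b>";

// Greedy: .+ takes as much as it can, so the match runs to the last '>'.
const greedyMatch = tags.match(/<.+>/)[0];
console.log(greedyMatch); // "<a><b>"

// Lazy: .+? takes as little as it can, so the match stops at the first '>'.
const lazyMatch = tags.match(/<.+?>/)[0];
console.log(lazyMatch); // "<a>"
```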
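\b - Defines a word boundary, the position between a word character and a non-word character or an end of the string. For instance, dog\b matches the 'dog' in Houndog because the string ends right after it, but \bdog does not, because a letter comes directly before 'dog'.
\B - Defines a non-word boundary, any position that is not a word boundary. For instance, \Bdog matches the 'dog' in Houndog, but does not match dog on its own.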
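The Houndog examples above can be verified directly. This is a short sketch in JavaScript; the variable names are hypothetical.

```javascript
// dog\b : 'dog' must be followed by a word boundary (here, the end of the string).
const endsWord = /dog\b/.test("Houndog");
console.log(endsWord); // true

// \bdog : 'dog' must start at a word boundary, but 'n' comes right before it.
const startsWord = /\bdog/.test("Houndog");
console.log(startsWord); // false

// \Bdog : 'dog' must NOT start at a word boundary.
const insideWord = /\Bdog/.test("Houndog");
const standalone = /\Bdog/.test("dog");
console.log(insideWord, standalone); // true false
```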
\# - Matches the same text that an earlier capturing group matched, where # is the number of that group.
For instance, \1 references the first group.
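A back-reference is handy for finding repeated text. The following is a minimal sketch in JavaScript; the sample words are made up for illustration.

```javascript
// (\w)\1 : any word character followed by the same character again.
const hasDoubleLetter = /(\w)\1/.test("hello");
const noDoubleLetter = /(\w)\1/.test("world");
console.log(hasDoubleLetter, noDoubleLetter); // true false

// ^(\w+) \1$ : the same word twice, separated by a space.
const repeatedWord = /^(\w+) \1$/.test("bye bye");
const differentWords = /^(\w+) \1$/.test("bye now");
console.log(repeatedWord, differentWords); // true false
```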
(?=Something) - Look-ahead: checks that what follows the match is equal to Something, without including it in the match.
(?<=Something) - Look-behind: checks that what comes before the match is equal to Something, without including it in the match.
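Both checks can be sketched on short strings. This example assumes a JavaScript engine that supports look-behind (recent versions of Node.js and current browsers do); the sample strings are hypothetical.

```javascript
// Look-ahead: digits that are followed by '%'. The '%' is checked but not matched.
const percentValue = "10% of 20".match(/\d+(?=%)/)[0];
console.log(percentValue); // "10"

// Look-behind: digits that are preceded by '$'. The '$' is checked but not matched.
const dollarValue = "$30 and 40".match(/(?<=\$)\d+/)[0];
console.log(dollarValue); // "30"
```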