@Cleffy
Last active October 18, 2023 01:03
Tutorial: Breaking down a matching URL regex

Tutorial: Matching a URL regex

This tutorial explains how a regular expression (regex) can be used to match a URL.

Summary

This breaks down a regular expression (regex) that matches a URL.
Refer to the reference below for what the various operators mean.
Here is the URL-matching regex:

/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/
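To try the expression out, here is a minimal sketch. It assumes Python's `re` module, which takes the pattern as a string without the surrounding `/` delimiters; the names `URL_PATTERN` and `is_url` and the sample URLs are made up for illustration.

```python
import re

# The tutorial's pattern, minus the / delimiters of the regex literal.
URL_PATTERN = re.compile(r"^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$")

def is_url(text):
    """Return True when the whole string matches the tutorial's URL pattern."""
    return URL_PATTERN.match(text) is not None

samples = [
    "https://github.com/Cleffy",       # scheme, domain, and path
    "example.org",                     # the scheme is optional
    "http://",                         # no domain, so no match
    "https://example.com/search?q=1",  # '?' and '=' are not allowed by the pattern
]
for sample in samples:
    print(sample, is_url(sample))
```

The last sample shows a limit of this pattern: query strings are rejected because `?` and `=` are not in the final bracket expression.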

Step One

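We look at the anchors, which mark the beginning and the end of the expression:

Beginning - ^ requires the match to start at the beginning of the string.

Ending - $ requires the match to finish at the end of the string. No operations follow it.

Everything between the two anchors must therefore match the entire string:

(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?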
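The effect of the anchors can be seen by running the same expression with and without them. This is a small sketch assuming Python's `re` module; `BODY`, `anchored`, `unanchored`, and the sample text are illustrative names only.

```python
import re

# The expression between the anchors, as a Python string.
BODY = r"(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?"

anchored = re.compile("^" + BODY + "$")
unanchored = re.compile(BODY)

text = "see example.com!"

# With ^ and $ the whole string has to be a URL, so nothing is found.
whole = anchored.search(text)

# Without anchors the pattern may match a piece inside the string.
partial = unanchored.search(text)

print(whole)             # None
print(partial.group(0))  # example.com
```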

Step Two

We break the expression down into its capturing groups and literals, in the order they are evaluated.

  • (https?:\/\/)?

  • ([\da-z\.-]+)

  • \.

  • ([a-z\.]{2,6})

  • ([\/\w \.-]*)*

  • \/?
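The capturing groups listed above can be read back after a match. The sketch below assumes Python's `re` module and a made-up sample URL. It reads only the first three groups: the fourth group sits under a `*`, so it keeps only its last repetition, and the value it ends up with can differ between regex engines.

```python
import re

URL_PATTERN = re.compile(r"^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$")

match = URL_PATTERN.match("https://gist.github.com/Cleffy")

scheme = match.group(1)  # (https?:\/\/)?
domain = match.group(2)  # ([\da-z\.-]+)
suffix = match.group(3)  # ([a-z\.]{2,6})

print(scheme)  # https://
print(domain)  # gist.github
print(suffix)  # com
```

Note that the second group is greedy, so it takes everything up to the last period that still lets the rest of the expression match.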

Step Three

We can break down what each part of the expression is telling us.

  • (https?:\/\/)?

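    • http Literal Characters 'http'
    • s? Literal Character 's' followed by the Quantifier ? The 's' may appear zero or one time, so both 'http' and 'https' match
    • :\/\/ Literal Characters and Escape Operator for '://'
    • ? Quantifier after the group The whole group may appear zero or one time
    • Optionally match http:// or https:// at the start of the URL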
  • ([\da-z\.-]+)

    • [] Bracket Expressions Compare the character to what's within the brackets
    • \d Character Class Matches any numeric digit
    • a-z Literal Character Range Match any lower case letter
    • \. Escape Operator for period
    • - Literal Character for dash
    • + Quantifier One or more characters match the query
    • Have at least one digit, lowercase letter, period, or dash (This would be the domain name, including any subdomains)
  • \.

    • \. Escape Operator for period
    • Have a period. This would separate the domain name from the domain type (domain.com, domain.net, ...)
  • ([a-z\.]{2,6})

    • [] Bracket Expressions Compare the character to what's within the brackets
    • a-z Literal Character Range Match any lower case letter
    • \. Escape Operator for period
    • {2,6} Quantifier Between two and six characters
    • Match between 2 and 6 lower case letters or periods (This would be the domain type)
  • ([\/\w \.-]*)*

    • [] Bracket Expressions Compare the character to what's within the brackets
    • \/ Escape Operator for slash
    • \w Character Class Matches any alphanumeric character including the underscore
    • (space) Literal Character for a space
    • \. Escape Operator for period
    • - Literal Character for dash
    • * Quantifier Zero or more characters match the query (the first * repeats the bracket expression, the second * repeats the whole group)
    • Match any number of slashes, alphanumeric characters, spaces, periods, or dashes any number of times (This would be the path)
  • \/?

    • \/ Escape Operator for slash
    • ? Quantifier Matches zero or one time
    • It can end with a slash
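The breakdown above can be kept next to the expression itself. The sketch below assumes Python's `re.VERBOSE` flag, which ignores whitespace and `#` comments outside bracket expressions, so each piece carries its own explanation; `URL_VERBOSE` is an illustrative name.

```python
import re

URL_VERBOSE = re.compile(r"""
    ^                  # start of the string
    (https?:\/\/)?     # optional http:// or https://
    ([\da-z\.-]+)      # domain name: digits, lowercase letters, periods, dashes
    \.                 # period before the domain type
    ([a-z\.]{2,6})     # domain type: 2 to 6 lowercase letters or periods
    ([\/\w \.-]*)*     # slashes, word characters, spaces, periods, dashes
    \/?                # optional trailing slash
    $                  # end of the string
""", re.VERBOSE)

for candidate in ["http://example.com/docs/index.html",
                  "example.com/my file.txt",
                  "example.c",
                  "EXAMPLE.COM"]:
    print(candidate, URL_VERBOSE.match(candidate) is not None)
```

The last two candidates fail for reasons visible in the breakdown: the domain type needs at least two characters, and the domain name only accepts lowercase letters.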

RegEx Reference

Regex Components

Literal Characters

Literal characters are evaluated as is. For instance, if the regex is " a " it will search the text for the character 'a'.
Inside a bracket expression they can be defined as a range using a " - ". For instance [a-z] defines all lowercase letters.

Escape Operator

The escape character " \ " is placed before a character that would otherwise act as an operator, so that it is treated as a literal character. For instance " \. " defines a period.
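A short sketch of the difference an escape makes, assuming Python's `re` module (the variable names are illustrative):

```python
import re

# Unescaped, "." is an operator that accepts any character, so "abc" fits.
loose = re.fullmatch(r"a.c", "abc") is not None

# Escaped, "\." accepts only a real period.
strict_abc = re.fullmatch(r"a\.c", "abc") is not None
strict_dot = re.fullmatch(r"a\.c", "a.c") is not None

print(loose, strict_abc, strict_dot)  # True False True
```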

Anchors

Anchors match a position in the string rather than a character.

^ - anchors to the start of a string.
$ - anchors to the end of a string.

Quantifiers

Quantifiers specify how many times the preceding character, bracket expression, or group may repeat.

* - The query can be matched any number of times, including zero.
+ - The query must match at least once.
? - The query matches zero or one time.
{} - The query matches a range of times.
{n} - match exactly n times
{n,} - match at least n times
{n,m} - match n to m times
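Each quantifier can be checked against one string that fits and one that does not. This is a minimal sketch assuming Python's `re` module; `fully_matches` and the sample strings are made up for illustration.

```python
import re

def fully_matches(pattern, text):
    """Return True when the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

# Each entry: pattern, a string that matches, a string that does not.
cases = [
    (r"ab*", "a", "ba"),            # * : zero or more 'b'
    (r"ab+", "abb", "a"),           # + : at least one 'b'
    (r"ab?", "ab", "abb"),          # ? : zero or one 'b'
    (r"ab{2}", "abb", "ab"),        # {n} : exactly two 'b'
    (r"ab{2,}", "abbbb", "ab"),     # {n,} : two or more 'b'
    (r"ab{2,3}", "abbb", "abbbb"),  # {n,m} : two to three 'b'
]

for pattern, good, bad in cases:
    print(pattern, fully_matches(pattern, good), fully_matches(pattern, bad))
```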

OR Operator

With the OR - " | " operator, the query on either side of the OR can match.
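As a small illustration, assuming Python's `re` module and a made-up sentence, either alternative can produce a match:

```python
import re

# "cat|dog" matches wherever either word appears.
animals = re.findall(r"cat|dog", "cat, dog, bird")
print(animals)  # ['cat', 'dog']
```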

Character Classes

Used as a shorthand for common sets of characters.

. - Matches any character except a newline.
\d - Matches any numeric digit.
\w - Matches any alphanumeric character including the underscore, equivalent to [A-Za-z0-9_].
\s - Matches a single whitespace character, including tabs and line breaks.
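The classes above can be tried one at a time. The sketch assumes Python's `re` module. Python's `\w` also accepts non-ASCII letters by default, so `re.ASCII` is passed to keep it equivalent to [A-Za-z0-9_]; the sample strings are illustrative.

```python
import re

digits = re.findall(r"\d", "a1b22")                       # each numeric digit
words = re.findall(r"\w+", "snake_case 42!", re.ASCII)    # runs of [A-Za-z0-9_]
pieces = re.split(r"\s+", "a \t b\nc")                    # split on any whitespace
not_newline = re.findall(r".", "a\nb")                    # "." skips the newline

print(digits)       # ['1', '2', '2']
print(words)        # ['snake_case', '42']
print(pieces)       # ['a', 'b', 'c']
print(not_newline)  # ['a', 'b']
```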

Flags

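Flags are optional. They are placed at the end of the expression, after the closing " / ", and change how the search is performed.

g - Global search: find every match instead of stopping at the first
i - Case insensitive search
m - Multi-line search: ^ and $ also match at the start and end of each line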
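The letters above are the form used with slash-delimited regex literals. As a rough equivalent, and assuming Python's `re` module, `i` and `m` correspond to `re.IGNORECASE` and `re.MULTILINE`; Python has no `g` flag, and `re.findall` plays that role by returning every match.

```python
import re

# i: case insensitive search
case_sensitive = re.search(r"github", "GitHub")
case_insensitive = re.search(r"github", "GitHub", re.IGNORECASE)

# m: ^ matches at the start of every line, not only the start of the string
lines = "one\ntwo\nthree"
first_only = re.findall(r"^\w+", lines)
every_line = re.findall(r"^\w+", lines, re.MULTILINE)

# g: search stops at the first match, findall collects all of them
first_number = re.search(r"\d+", "10 20 30").group(0)
all_numbers = re.findall(r"\d+", "10 20 30")

print(case_sensitive, case_insensitive.group(0))  # None GitHub
print(first_only, every_line)    # ['one'] ['one', 'two', 'three']
print(first_number, all_numbers) # 10 ['10', '20', '30']
```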

Grouping and Capturing

Groups are parts of a regex enclosed in parentheses. A group treats several expressions as one unit, so a quantifier can apply to all of them, and it captures the text it matched for later use.

Bracket Expressions

Bracket expressions are enclosed in square brackets. They list a set of characters, and any one character from the set can match.

Greedy and Lazy Match

A greedy expression seeks the maximum possible match for a given expression. Quantifiers are greedy by default.
A lazy expression adds a " ? " after a quantifier (for instance *? or +?) to end the query at the first possible match.
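The difference shows up when more than one ending is possible. A minimal sketch, assuming Python's `re` module and a made-up snippet of markup:

```python
import re

text = "<b>bold</b>"

# Greedy: .+ runs to the last '>' it can reach.
greedy = re.search(r"<.+>", text).group(0)

# Lazy: .+? stops at the first '>' that completes the match.
lazy = re.search(r"<.+?>", text).group(0)

print(greedy)  # <b>bold</b>
print(lazy)    # <b>
```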

Boundaries

\b - Defines a word boundary, the position between a word character and a non-word character (or the start or end of the string). For instance dog\b would match in Houndog, but \bdog would not.
\B - Defines a non-word boundary, any position that is not a word boundary. For instance \Bdog would match in Houndog, but not in dog.
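The Houndog examples can be checked directly. This sketch assumes Python's `re` module; `found` is an illustrative helper.

```python
import re

def found(pattern, text):
    """Return True when the pattern matches somewhere in the text."""
    return re.search(pattern, text) is not None

print(found(r"dog\b", "Houndog"))  # True: 'dog' is followed by the end of the word
print(found(r"\bdog", "Houndog"))  # False: 'dog' is preceded by the letter 'n'
print(found(r"\Bdog", "Houndog"))  # True: 'dog' starts inside a word
print(found(r"\Bdog", "dog"))      # False: 'dog' starts at a word boundary
```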

Back-references

\# - Matches the same text that was captured by the previous group number '#'.
For instance \1 would reference the first group.
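One common use is spotting a repeated word. The sketch below assumes Python's `re` module and a made-up sentence: `(\w+)` captures a word and `\1` requires the same word again.

```python
import re

# \b keeps the match from starting in the middle of "this".
repeat = re.search(r"\b(\w+) \1\b", "this is is a test")

print(repeat.group(0))  # is is
print(repeat.group(1))  # is
```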

Look-ahead and Look-behind

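(?=Something) - Look-ahead - Checks to see if what follows the current position is equal to Something, without including it in the match.
(?<=Something) - Look-behind - Checks to see if what is before the current position is equal to Something, without including it in the match.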
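Both checks leave the surrounding text out of the result. A minimal sketch, assuming Python's `re` module and made-up sample strings:

```python
import re

# Look-ahead: numbers that are followed by " USD" (the " USD" is not returned).
usd_amounts = re.findall(r"\d+(?= USD)", "10 USD, 20 EUR, 30 USD")

# Look-behind: numbers that are preceded by "$" (the "$" is not returned).
dollar_amounts = re.findall(r"(?<=\$)\d+", "cost $15, qty 3")

print(usd_amounts)     # ['10', '30']
print(dollar_amounts)  # ['15']
```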

Author

GitHub - Cleffy
