Matching a URL using regular expressions - a tutorial

This gist describes the central components of a regular expression that matches a URL in JavaScript.

Summary

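A regex uses a sequence of characters to define a specific search pattern. The URL-matching regex below matches strings that have the shape of a typical web URL: an optional http:// or https:// prefix, a domain name, a top-level domain, and an optional file path.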

/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/

This regex answers one basic and all-important question:

- Is the input in question a valid URL?

This regex can have a wide variety of applications, the most salient of which is input data validation.
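
As a minimal sketch of the validation use case, the snippet below wraps the pattern in a small helper. The name isValidUrl and the sample inputs are hypothetical, chosen only for illustration.

```javascript
// The URL pattern from this tutorial, unchanged.
const urlPattern = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/;

// isValidUrl is a hypothetical helper name, not part of any library.
function isValidUrl(input) {
  return urlPattern.test(input);
}

console.log(isValidUrl("https://www.google.com")); // true
console.log(isValidUrl("www.google.com/maps/"));   // true
console.log(isValidUrl("not a url"));              // false
console.log(isValidUrl("HTTPS://WWW.GOOGLE.COM")); // false
```

The last result shows that the pattern is case-sensitive: it has no i flag and its character classes list only lowercase letters. Lowercasing the input before testing is one possible workaround.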

Anchors
Grouping and Capturing
Quantifiers
OR Operator
Character Classes
Flags
Bracket Expressions
Greedy and Lazy Match
Boundaries
Back-references
Look-ahead and Look-behind

Regex Components

While this expression may seem obscure upon initial inspection, let's see if we can't break it down into its individual elements in order to understand how the sequence works.

Anchors

The two principal anchors in this regex are the ^ at the beginning and the $ at the end. Used alone, the ^ anchor matches any string that begins with the pattern that follows it, and the $ anchor matches any string that ends with the pattern that precedes it. By enclosing the regex between these two anchors, we ask the search function to match the entire string against what lies between them: the input must both begin AND end with the pattern, with nothing extra on either side.
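
To see what the anchors contribute, the sketch below compares the tutorial's pattern with a copy that has the ^ and $ removed. The unanchored copy and the sample sentence are assumptions made for illustration.

```javascript
// The tutorial's pattern, with and without its ^ and $ anchors.
const anchored = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/;
const unanchored = /(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?/;

// A made-up sentence that merely contains something URL-like.
const sentence = "see www.google.com now!";

console.log(anchored.test("www.google.com")); // true: the whole string is a URL
console.log(anchored.test(sentence));         // false: extra text around the URL
console.log(unanchored.test(sentence));       // true: a URL appears somewhere inside
```

For input validation the anchored form is usually the one you want, since the unanchored form accepts any string that merely contains a URL-like substring.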

Grouping and Capturing

So what is included between the two anchors? If we examine the expression, we can see a number of components enclosed in parentheses (). Parentheses are used in regex to create groups, and each group also captures the text it matches so that it can be retrieved later. Each group contains a smaller regex that we can examine separately to see what it evaluates. These include:

the initial https component: (https?:\/\/)
the domain name (e.g. www.google, or pets): ([\da-z\.-]+)\.
the top level domain (.com, .gov, etc): ([a-z\.]{2,6})
the file path: ([\/\w \.-]*)*
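
The four groups listed above can be read back from a match result. The sketch below is one way to do so; splitUrl is a hypothetical helper name, and the sample inputs are made up.

```javascript
const urlPattern = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/;

// splitUrl is a hypothetical helper: it returns the captured groups,
// or null when the input does not match the pattern at all.
function splitUrl(input) {
  const match = input.match(urlPattern);
  if (match === null) return null;
  // Index 0 holds the whole match; indexes 1 to 4 hold the capture groups.
  const [, scheme, domain, tld, path] = match;
  return { scheme, domain, tld, path };
}

const parts = splitUrl("https://www.google.com/maps");
console.log(parts.scheme); // "https://"
console.log(parts.domain); // "www.google"
console.log(parts.tld);    // "com"
console.log(parts.path);   // "/maps"

console.log(splitUrl("not a url")); // null
```

When an optional group takes no part in the match, as with the scheme in www.google.com, its slot in the result is undefined.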

Quantifiers

Notice that some of the capture groups, and some of the components inside them, end with a ? or a *. These are known as quantifiers. Quantifiers define how many times the preceding element may occur. The ? makes the preceding element optional (zero or one occurrence), whereas the * allows the preceding element to occur any number of times, including not at all (zero or more occurrences). For example, the grouping

(https?:\/\/)?

contains two ? quantifiers. This expression is looking for an http:// OR an https://, which is why the single internal s is optional. The same is true of the entire expression enclosed in the parentheses, which is why the group is followed by a ?. In other words, a valid URL may begin with http:// OR https://, or with neither (for example, the input may begin with www.). The same applies to the / at the very end of the regex, which is also followed by a ?.
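
A small sketch can isolate the effect of the two ? quantifiers. Here the scheme group is tested on its own; the surrounding ^ and $ and the name schemePattern are added for illustration and are not part of the original group.

```javascript
// The scheme group from the tutorial, anchored so it can be tested in isolation.
const schemePattern = /^(https?:\/\/)?$/;

console.log(schemePattern.test("http://"));   // true: the s is optional
console.log(schemePattern.test("https://"));  // true: one s is allowed
console.log(schemePattern.test(""));          // true: the whole group is optional
console.log(schemePattern.test("httpss://")); // false: ? allows at most one s
```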

Similarly, the * makes the preceding expression optional, but it also allows that expression to repeat: it matches zero or more instances. So if we look at the fourth grouping:

([\/\w \.-]*)*

This expression allows any number of file path characters (slashes, word characters, spaces, dots, and hyphens) to follow the domain.
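
The sketch below tests the file path class by itself. As a simplifying assumption it uses a single * on the bracket expression rather than the tutorial's nested ([...]*)* form; both accept the same strings. The name pathPattern and the anchors are added for illustration.

```javascript
// The path character class with one * quantifier, anchored for isolated testing.
const pathPattern = /^[\/\w \.-]*$/;

console.log(pathPattern.test(""));                   // true: zero characters is fine
console.log(pathPattern.test("/docs/read-me.html")); // true: every character is in the class
console.log(pathPattern.test("/search?q=1"));        // false: ? and = are not in the class
```

The last case shows a limitation worth knowing about: URLs that carry a query string are rejected by this character class.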

Finally, the {} quantifier defines a range for the number of times the preceding element may occur. In the top-level domain group, {2,6} allows 2 to 6 characters, each of which may be a lowercase letter or a dot (so com, museum, and co.uk all fit).
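
The range quantifier can be checked in isolation with the sketch below. The name tldPattern, the anchors, and the sample strings are assumptions added for illustration.

```javascript
// The top-level domain class with its {2,6} range, anchored for isolated testing.
const tldPattern = /^[a-z\.]{2,6}$/;

console.log(tldPattern.test("io"));      // true: 2 characters
console.log(tldPattern.test("museum"));  // true: 6 characters
console.log(tldPattern.test("co.uk"));   // true: the dot is in the class
console.log(tldPattern.test("c"));       // false: fewer than 2 characters
console.log(tldPattern.test("company")); // false: more than 6 characters
```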

OR Operator / Bracket expressions

The regex above does not use the | alternation operator; instead, its OR logic comes from bracket expressions []. A bracket expression matches any single character, or any member of a character class, listed inside the brackets. For example, [\da-z\.-] matches any digit (\d) OR any lowercase letter from a to z (a-z) OR a '.' (\.) OR a '-' (-).
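
To illustrate, the sketch below applies the domain bracket expression to one character at a time. The name domainChar and the anchors are added for illustration only.

```javascript
// The domain bracket expression, anchored so it must match exactly one character.
const domainChar = /^[\da-z\.-]$/;

console.log(domainChar.test("7")); // true: a digit
console.log(domainChar.test("g")); // true: a lowercase letter
console.log(domainChar.test(".")); // true: a literal dot
console.log(domainChar.test("-")); // true: a literal hyphen
console.log(domainChar.test("G")); // false: uppercase is not listed
console.log(domainChar.test("_")); // false: underscore is not listed
```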

Character Classes

The main character classes in the above expression are \d, which matches any digit (0-9), and \w, which matches any word character: a letter, a digit, or an underscore.
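
The two classes can be compared side by side with a short sketch. The names digit and wordChar and the anchors are added for illustration.

```javascript
// Each pattern is anchored so that it must match exactly one character.
const digit = /^\d$/;
const wordChar = /^\w$/;

console.log(digit.test("5"));    // true
console.log(digit.test("a"));    // false: letters are not digits
console.log(wordChar.test("a")); // true
console.log(wordChar.test("Z")); // true: \w includes uppercase letters
console.log(wordChar.test("5")); // true: \w includes digits
console.log(wordChar.test("_")); // true: \w includes the underscore
console.log(wordChar.test("-")); // false: the hyphen is not a word character
```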

Author

This regex tutorial was created by Eduardo Castro. GitHub: mambru82
