This gist describes the central components of a regular expression that matches URLs in JavaScript.
A regular expression (regex) is a sequence of characters that defines a search pattern. The regex examined here matches strings that have the general shape of a URL.
/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/
This regex answers a basic but important question:
- Is the input in question a valid URL?
This regex can have a wide variety of applications, the most salient of which is input data validation.
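As a minimal sketch of the validation use case, the pattern can be wrapped in a small function. The helper name isValidUrl and the sample inputs are assumptions made for illustration; they are not part of the original gist or of any library.

```javascript
// The pattern from this gist, stored for reuse.
const urlPattern = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/;

// isValidUrl is a hypothetical helper name chosen for this example.
function isValidUrl(input) {
  // RegExp.prototype.test returns true when the pattern matches the input.
  return urlPattern.test(input);
}

console.log(isValidUrl("https://www.google.com")); // true
console.log(isValidUrl("www.google.com/maps"));    // true
console.log(isValidUrl("not a url"));              // false
```

The pattern checks only the general shape of the input, so a true result is best read as "looks like a URL" rather than proof that the address exists. It also has no i flag and its domain classes list only lowercase letters, so an uppercase host name is rejected.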
- Anchors
- Grouping and Capturing
- Quantifiers
- OR Operator
- Character Classes
- Flags
- Bracket Expressions
- Greedy and Lazy Match
- Boundaries
- Back-references
- Look-ahead and Look-behind
While this expression may seem obscure upon initial inspection, let's see if we can't break it down into its individual elements in order to understand how the sequence works.
The two principal anchors in this regex are the ^ at the beginning and the $ at the end. Together they require the whole input string to match the pattern enclosed between them. When used alone, the ^ anchor matches any string that begins with the pattern that follows the anchor, and the $ anchor matches any string that ends with the pattern that precedes it. By enclosing the regex between these two anchors, we are asking the search function to match exactly what is included between them (what the string begins AND ends with).
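The effect of each anchor is easier to see on a much smaller pattern. The sketch below uses the literal cat and a few made-up input strings purely for illustration.

```javascript
const unanchored = /cat/;  // "cat" anywhere in the string
const startsWith = /^cat/; // string must begin with "cat"
const endsWith = /cat$/;   // string must end with "cat"
const exact = /^cat$/;     // string must be exactly "cat"

console.log(unanchored.test("concatenate")); // true
console.log(startsWith.test("catalog"));     // true
console.log(startsWith.test("bobcat"));      // false
console.log(endsWith.test("bobcat"));        // true
console.log(exact.test("bobcat"));           // false
console.log(exact.test("cat"));              // true
```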
So what is included between the two anchors? If we examine the expression, we can see that there are a number of components enclosed in parentheses (). Parentheses are used in regex to create separate groups of interest, and each group also captures the text it matches. Within each of these groups, there is a sub-expression that we may look at separately to see what is evaluated. These include:
- the initial https component: (https?:\/\/)
- the domain name (e.g. www.google, or pets): ([\da-z\.-]+)\.
- the top level domain (.com, .gov, etc.): ([a-z\.]{2,6})
- the file path: ([\/\w \.-]*)*
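The four groups listed above can be inspected directly, because String.prototype.match returns the text captured by each pair of parentheses. This is a sketch with an arbitrary sample URL; the variable names are assumptions chosen for readability.

```javascript
const urlPattern = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/;

// match() returns an array: index 0 is the whole match,
// indexes 1 to 4 are the four capture groups in order.
const result = "https://www.google.com/maps".match(urlPattern);
const [whole, scheme, domain, tld, path] = result;

console.log(scheme); // "https://"
console.log(domain); // "www.google"
console.log(tld);    // "com"
console.log(path);   // "/maps"
```

Note that the dot between the domain and the top level domain sits outside both groups, so it appears in neither capture.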
Notice that some of the components end with a ? or a *. These are known as quantifiers. Quantifiers define how many times the preceding element may occur. The ? makes the element preceding it optional (zero or one occurrence), whereas the * allows the preceding element to occur zero or more times.
For example, the grouping (https?:\/\/)? contains two ? quantifiers. This expression is looking for an http:// OR an https://. For this reason the single internal s is optional. The same is true for the entire expression enclosed in the parentheses, which is why the group is followed by a ?. In other words, a valid URL may begin with http:// OR https://, or it may not begin with either of them at all (for example, when the input begins with www.). The same applies to the / at the very end of the expression.
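The two uses of ? can be tried in isolation. In this sketch the host www.example.com is a placeholder, and both patterns are simplified fragments written for illustration rather than the full regex.

```javascript
// ? after a single character: the "s" may appear zero or one time.
const scheme = /^https?:\/\/$/;
console.log(scheme.test("http://"));   // true
console.log(scheme.test("https://"));  // true
console.log(scheme.test("httpss://")); // false

// ? after a group and after the final slash: both are optional.
const optionalParts = /^(https?:\/\/)?www\.example\.com\/?$/;
console.log(optionalParts.test("www.example.com"));          // true
console.log(optionalParts.test("https://www.example.com/")); // true
console.log(optionalParts.test("ftp://www.example.com"));    // false
```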
Similarly, the * makes the preceding element optional, but it also allows that element to repeat, so zero or more instances may be matched. So if we look at the fourth grouping:
([\/\w \.-]*)*
The expression allows any number of file path characters (slashes, word characters, spaces, dots, and hyphens) to follow the domain.
Finally, the {} quantifier defines a range for the number of times the preceding element may occur. For the top level domain, {2,6} allows the match to consist of 2 to 6 characters.
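A short sketch of the * and {2,6} quantifiers on their own follows. The first pattern is a toy example, and the second is the top level domain class from the regex, anchored here so it can be tested in isolation; the sample strings are assumptions.

```javascript
// * allows zero or more occurrences of the preceding "b".
const zeroOrMore = /^ab*$/;
console.log(zeroOrMore.test("a"));    // true
console.log(zeroOrMore.test("abbb")); // true
console.log(zeroOrMore.test("abc"));  // false

// {2,6} requires between 2 and 6 characters from the class.
const tld = /^[a-z\.]{2,6}$/;
console.log(tld.test("io"));      // true
console.log(tld.test("co.uk"));   // true (the class also allows a dot)
console.log(tld.test("c"));       // false, too short
console.log(tld.test("company")); // false, too long
```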
The main OR operator used in the above regex is the bracket expression []. A bracket expression matches any single character from the characters or character classes listed inside the brackets. For example, the [\da-z\.-] expression matches any digit (\d) OR any lowercase letter between a and z (a-z) OR a '.' (\.) OR a '-' (-).
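To make the OR behaviour concrete, the sketch below tests the domain bracket expression alone and compares it with an equivalent pattern written with the explicit | operator. The second pattern and the sample strings are assumptions added for illustration; the original regex uses only the bracket form.

```javascript
// The domain character set from the regex, anchored for testing.
const domainChars = /^[\da-z\.-]+$/;
console.log(domainChars.test("www.google")); // true
console.log(domainChars.test("my-site42"));  // true
console.log(domainChars.test("my_site"));    // false, "_" is not listed
console.log(domainChars.test("Google"));     // false, "G" is uppercase

// The same set of alternatives spelled out with the | operator.
const alternation = /^(\d|[a-z]|\.|-)+$/;
console.log(alternation.test("my-site42"));  // true
console.log(alternation.test("my_site"));    // false
```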
The main character classes to consider in the above expression are the \d character class, which matches any digit, and the \w character class, which matches any alphanumeric character or an underscore.
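A quick way to check what these two classes accept is to test them against a few strings. The inputs below are made up for illustration.

```javascript
// \d matches a single digit, so \d+ matches a run of digits.
const digitsOnly = /^\d+$/;
console.log(digitsOnly.test("2024")); // true
console.log(digitsOnly.test("20a4")); // false

// \w matches letters, digits, and the underscore.
const wordChars = /^\w+$/;
console.log(wordChars.test("file_name1")); // true
console.log(wordChars.test("file-name"));  // false, "-" is not in \w
```

This is why the file path group lists the hyphen, dot, slash, and space separately alongside \w.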
This regex tutorial was created by Eduardo Castro. Github: mambru82