RegEx Tutorial on matching a URL

This tutorial introduces some basic concepts of regular expressions, more commonly known as RegEx, in which a pattern string of characters defines a search pattern used to locate and/or validate character sequences in a string. I will break down a RegEx used to validate a URL in order to demonstrate how such patterns can be built and used.

Summary

This tutorial will provide basic regular expression information and then use a given expression to validate a URL to elucidate RegEx pattern matching concepts and syntax.

Sample RegEx for matching and validating a URL:
/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/

We will begin with a brief introduction to some RegEx concepts, then apply those concepts to break down and understand our selected regular expression with examples.

Terminology
Anchors
Quantifiers
Grouping Constructs
Bracket Expressions
Character Classes
The OR Operator
Flags
Character Escapes
Additional Information on RegEx Basics
Expression Breakdown

Regex Components

Terminology

pattern: regular expression pattern of characters to be matched
string: test string used for pattern matching
character: refers to any single character, can be alphanumeric or a special character (symbol)
digit: any single number character (0-9)
letter: any letter of the alphabet from a to z regardless of case (a-z, A-Z)
alphanumeric: any combination of letters, regardless of case, or digits
symbol: any special character (!$%^&*()_+|~-=`{}[]:";'<>?,./)
whitespace: includes spaces, tabs and in some cases line breaks

Anchors

Anchors match a position in the test string rather than a character. The ^ anchor matches the beginning of the string and the $ anchor matches the end, so a pattern wrapped in ^ and $ must match the whole string.
Note: In JavaScript, regular expressions are wrapped in forward slashes (/).
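
As a small illustration of how the anchors behave, the sketch below tests a few patterns with and without ^ and $. The sample patterns, strings and variable names are made up for this example and are not part of the URL expression.

```javascript
// Sample patterns and strings are illustrative assumptions.
const startsWithHttp = /^http/; // "http" must sit at the very start of the string
const endsWithCom = /com$/;     // "com" must sit at the very end of the string
const exactlyCat = /^cat$/;     // the whole string must be exactly "cat"

console.log(startsWithHttp.test("http://example.com"));     // true
console.log(startsWithHttp.test("see http://example.com")); // false
console.log(endsWithCom.test("example.com"));               // true
console.log(endsWithCom.test("example.com/about"));         // false
console.log(exactlyCat.test("cat"));                        // true
console.log(exactlyCat.test("concatenate"));                // false
```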

Quantifiers

Quantifiers specify how many times the preceding character, bracket expression or group must repeat in order to match.
Quantifiers are greedy by default, meaning they will match as many characters as possible. To make a quantifier lazy, add the ? symbol after it so that it matches as few characters as possible instead.
The asterisk (*) will match the preceding item zero or more times. The plus sign (+) will match one or more times, while the question mark (?) will match zero or one time. Exact counts are written as digits inside curly braces ({}), as follows:

Syntax	Description
{n}	Match exactly n number of times
{n,}	Match at least n number of times
{n,m}	Match minimum of n times, up to a maximum of m
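
The quantifier rules above can be tried out with a short sketch like the following. The sample strings are assumptions chosen for illustration; note that JavaScript expects the counted forms to be written without spaces inside the braces.

```javascript
// Counted quantifiers: the braces apply to the token right before them (\d here).
console.log(/^\d{4}$/.test("2024"));   // true: exactly four digits
console.log(/^\d{2,3}$/.test("2024")); // false: four digits exceeds the maximum of three
console.log(/^\d{2,}$/.test("2024"));  // true: at least two digits

// Shorthand quantifier: ? makes the "u" optional.
console.log(/^colou?r$/.test("color"));  // true
console.log(/^colou?r$/.test("colour")); // true

// Greedy versus lazy matching on a hypothetical sample string.
const sample = "<b>bold</b>";
const greedyMatch = sample.match(/<.+>/)[0]; // takes as much as possible
const lazyMatch = sample.match(/<.+?>/)[0];  // takes as little as possible
console.log(greedyMatch); // <b>bold</b>
console.log(lazyMatch);   // <b>
```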

Grouping Constructs

Many patterns contain subexpressions; a phone number, for example, is divided into sets of digits in specific places. Subexpressions are wrapped in parentheses, as in (425)(555). Any character placed between groups, such as the colon in (425):(555-), is matched literally. You can think of parentheses in RegEx like those in an algebra equation, where they group your expression into sections.
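
A minimal sketch of grouping, assuming a hypothetical phone number format of "(425) 555-0123". The pattern and variable names are illustrative only; the point is that each pair of unescaped parentheses captures one section of the match.

```javascript
// \( and \) are escaped literal parentheses; the unescaped ones create groups.
const phonePattern = /^\((\d{3})\) (\d{3})-(\d{4})$/;
const phoneMatch = "(425) 555-0123".match(phonePattern);

// Index 0 holds the full match; indexes 1 to 3 hold the captured groups.
const [fullMatch, areaCode, prefix, lineNumber] = phoneMatch;
console.log(fullMatch);  // (425) 555-0123
console.log(areaCode);   // 425
console.log(prefix);     // 555
console.log(lineNumber); // 0123
```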

Bracket Expressions

Characters to match are placed inside square brackets ( [ ] ). Any one of the characters or ranges listed inside the brackets will be matched. Unless combined with a quantifier, a bracket expression represents a single character in the pattern. For instance, [0-9] matches a single digit, while [A-Z][A-Z] matches any two capital letters, like the abbreviation for a state such as 'AZ' or 'WA'.
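
The two bracket expressions mentioned here could be checked roughly as follows. The anchors are added in this sketch (an assumption) so that the whole test string has to fit the pattern.

```javascript
const singleDigit = /^[0-9]$/;    // exactly one digit
const stateCode = /^[A-Z][A-Z]$/; // exactly two capital letters

console.log(singleDigit.test("7"));  // true
console.log(singleDigit.test("42")); // false: two characters, but one bracket matches one character
console.log(stateCode.test("WA"));   // true
console.log(stateCode.test("wa"));   // false: lower case is outside the A-Z range
console.log(stateCode.test("WAS"));  // false: three letters
```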

Character Classes

There are multiple character classes used in RegEx:

Literal characters match themselves exactly, whether they are alphanumeric or special characters. Examples of literal expressions include [abc] or [405-].
Note: square brackets are most commonly used to match one character from a set. Parentheses are used for capturing groups for reuse. More can be found on this topic here.
Meta characters are used to represent generalized patterns like a string of alphanumerics or a string of digits. The special metacharacters listed below must be escaped in order to be matched literally outside of a bracket expression (see Character Escapes below for more).

These are:

Syntax	Description
( )	Open and closing parentheses
[ ]	Open and closing brackets
{ }	Open and closing curly braces
\	The backslash character itself
^	The start anchor, caret character
$	The end anchor, dollar sign
.	The dot, which matches any character except newlines
|	The pipe, used as the OR operator
?	The question mark
*	The asterisk
+	The plus sign

Character sets are typically wrapped in brackets to indicate that any one of the included characters is accepted. For instance, [bcfhmprs]at would match words like 'bat', 'cat', 'fat', 'hat', 'pat', but not 'dat' or 'zat'. Characters wrapped in parentheses are matched exactly as written, so (abc) will match 'abc' but not 'acb'.

Note: RegEx is case sensitive, so unless the case insensitive flag is set (see below), a lower case pattern only matches lower case characters and 'Cat' would not be a match.

Character sets can include a range of alphanumeric characters by using a hyphen (-) to identify the range. For instance, [a-f] will include all lower case letters from a to f while [A-Z] will include all capital letters of the alphabet. Digits can be identified in ranges also; for instance, [0-9] would include all digits. Mixed ranges can be set by stacking ranges, so an all alphanumeric pattern would be [a-zA-Z0-9]. Symbols need to be explicitly listed in the set, as in [!@#$%^&*()]. All of these examples will still match ONLY a single character.

To match a repeating set of characters, you can use multiple ranges, as in an area code where [0-9][0-9][0-9] would match '425' or '208'. If you want to match longer sets, you can use curly braces ({}) to set a quantifier on the set (see above), as in [0-9]{3} or \d{3}.
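
The character set examples from this section can be sketched as below. The anchors and the list of candidate words are assumptions added so each test string is checked in full.

```javascript
// A set followed by literals: one letter from the set, then "at".
const rhymesWithAt = /^[bcfhmprs]at$/;
const candidates = ["bat", "cat", "dat", "zat", "Cat"];
const matched = candidates.filter((word) => rhymesWithAt.test(word));
console.log(matched); // [ 'bat', 'cat' ] ("Cat" fails because matching is case sensitive)

// Three ways of writing the same three-digit area code check.
const areaCodeLong = /^[0-9][0-9][0-9]$/;
const areaCodeRange = /^[0-9]{3}$/;
const areaCodeShort = /^\d{3}$/;
console.log(areaCodeLong.test("425"), areaCodeRange.test("425"), areaCodeShort.test("425")); // true true true
console.log(areaCodeLong.test("42"), areaCodeRange.test("42"), areaCodeShort.test("42"));    // false false false

// Stacked ranges with a quantifier: one or more alphanumeric characters.
const alphanumeric = /^[a-zA-Z0-9]+$/;
console.log(alphanumeric.test("abc123"));  // true
console.log(alphanumeric.test("abc-123")); // false: the hyphen is not in the set
```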

String patterns can be represented using these character sets:

Syntax	Description
/abc/	A sequence of characters
/[abc]/	Any character from a set of characters
/[^abc]/	Any character not in a set of characters
/[0-9]/	Any character in a range of characters
/x+/	One or more occurrences of the pattern x
/x+?/	One or more occurrences, nongreedy
/x*/	Zero or more occurrences
/x?/	Zero or one occurrence
/x{2,4}/	Two to four occurrences
/(abc)/	A group
/a|b/	Either a or b
/\d/	Any digit character
/\w/	An alphanumeric character or underscore
/\s/	Any whitespace character
/\t/	A tab
/\n/	A new line
/\r/	A carriage return
/./	Any character except newlines
/\b/	A word boundary
/^/	Start of input (start anchor)
/$/	End of input (end anchor)

Negations of these character classes can be expressed with these sets:

Syntax	Description
\D	A character that is not a digit
\W	A nonalphanumeric character
\S	A nonwhitespace character

Note: You need to use the backslash (\) to escape special characters (see below).
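
A rough sketch of the negated classes in use, with made-up sample strings. Each line removes or replaces every character that falls into the class.

```javascript
// \D: everything that is not a digit.
const digitsOnly = "Order #42".replace(/\D/g, "");
console.log(digitsOnly); // 42

// \W: everything that is not a letter, digit or underscore.
const wordsOnly = "snake_case-name".replace(/\W/g, " ");
console.log(wordsOnly); // snake_case name

// \s removes whitespace; \S checks that no whitespace is present.
const noSpaces = "a b\tc".replace(/\s/g, "");
console.log(noSpaces);                    // abc
console.log(/^\S+$/.test("no-spaces"));   // true
console.log(/^\S+$/.test("has a space")); // false
```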

The OR Operator

The pipe symbol (|) is used as a logical OR within the expression, usually inside parentheses that identify the specific alternatives. Either alternative is acceptable, but nothing that isn't explicitly included in the set. For example, a word that can be spelled two ways can be matched with gr(e|a)y, which will match both gray and grey, but not griy.
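
The gr(e|a)y example can be verified with a few lines like these. The anchors are an addition in this sketch so that only the whole word is tested.

```javascript
const greyOrGray = /^gr(e|a)y$/;
const spellings = ["grey", "gray", "griy"];
const spellingResults = spellings.map((word) => greyOrGray.test(word));
console.log(spellingResults); // [ true, true, false ]
```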

Flags

Flags are placed at the end of the pattern string, outside of the closing slash, to define additional functionality or limits on the regular expression. The three most common flags are 'g', 'i' and 'm'. To use a flag, add it to the end of the expression in the form '/pattern/flags'; for example, a pattern with the global flag would look like /\w+\s/g.

Syntax	Description
g	Global search to include all possible matches in the string
i	To make the search case insensitive
m	Multiline search, so ^ and $ match at the start and end of each line
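
To see what each flag changes, here is a small sketch using made-up sample text. It uses String.prototype.match and RegExp.prototype.test from standard JavaScript.

```javascript
const text = "Cat cat CAT";

const firstOnly = text.match(/cat/);    // no flags: first lower case match only
const allLower = text.match(/cat/g);    // g: every lower case match
const allAnyCase = text.match(/cat/gi); // g and i: every match in any case

console.log(firstOnly[0], firstOnly.index); // cat 4
console.log(allLower);                      // [ 'cat' ]
console.log(allAnyCase);                    // [ 'Cat', 'cat', 'CAT' ]

// m: ^ and $ also match at line breaks inside the string.
const twoLines = "one\ntwo";
console.log(/^two$/.test(twoLines));  // false
console.log(/^two$/m.test(twoLines)); // true
```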

Character Escapes

You need to use the backslash (\) to escape special characters, for example to turn a quantifier symbol into a literal, unless it is used inside of brackets.
The dot (.), asterisk (*) and backslash (\) are examples of special characters that must be escaped. For example, to match a literal asterisk, you would type '\*'.
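
The difference an escape makes can be sketched as follows, using sample strings that are assumptions for this example.

```javascript
// Unescaped * is a quantifier: "zero or more 2s, then a 3".
console.log(/^2*3$/.test("2*3"));  // false
console.log(/^2*3$/.test("223"));  // true
// Escaped \* is a literal asterisk.
console.log(/^2\*3$/.test("2*3")); // true

// Unescaped . matches any character; escaped \. matches only a dot.
console.log(/^a.c$/.test("abc"));  // true
console.log(/^a\.c$/.test("abc")); // false
console.log(/^a\.c$/.test("a.c")); // true

// Inside brackets, . and * are already literal.
console.log(/^[.*]$/.test("*"));   // true
```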

Additional Information

There are many great tutorials and learning resources on RegEx available on the internet. These were referenced and helpful for putting together this tutorial:
Coding Bootcamp RegEx Tutorial
Mozilla JS RegEx Guide
Eloquent JavaScript 3rd Ed, Chapter 9
Anatomy of a URL

Expression Breakdown

Test patterns

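The RegEx pattern we will break down is intended to pick up common URL patterns. The test cases below exercise it; note that, as written, the pattern does not accept the ?, =, &, # or : characters after the domain, so the last three cases (parameters, fragment and port) will not match: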

Test case	Example
Standard URL	http://www.google.com
Secure URL	https://www.google.com
Non-qualified URL	google.com
Case insensitive URLs	www.URL.com, uRl.com or www.Url.com
All alphanumeric combo URL	abc123.com
Short alt ending URL	http://urlwithshortalt.io
Longer/subdomain URL	http://longer.url.com
Longer/subdomain secure URL	https://longer.url.com
API URL	http://apiurl.com/notes
API URL with additional data	http://apiurl2.com/category/4
API URL with parameters	https://apiurl3.com/products?category=4&id=3
URL with fragment/ID tag	http://urlwithid.com#description
URL with port	http://urlwithport.com:8080
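
One way to check these test cases is to run each through the expression, as in the sketch below. The variable names are illustrative, and the g flag is deliberately left off in this sketch because a global regex keeps its position between test() calls, which would skew a loop like this. Run this way, the cases up to 'API URL with additional data' match, while the last three (parameters, fragment and port) do not, because ?, =, &, # and : are not in any of the pattern's character sets.

```javascript
// The tutorial's expression with only the i flag (an assumption of this sketch).
const urlPattern = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/i;

const testUrls = [
  "http://www.google.com",
  "https://www.google.com",
  "google.com",
  "www.URL.com",
  "uRl.com",
  "www.Url.com",
  "abc123.com",
  "http://urlwithshortalt.io",
  "http://longer.url.com",
  "https://longer.url.com",
  "http://apiurl.com/notes",
  "http://apiurl2.com/category/4",
  "https://apiurl3.com/products?category=4&id=3",
  "http://urlwithid.com#description",
  "http://urlwithport.com:8080",
];

// Pair every URL with the result of testing it against the pattern.
const urlResults = testUrls.map((url) => ({ url, matches: urlPattern.test(url) }));

for (const { url, matches } of urlResults) {
  console.log(`${matches ? "match   " : "no match"}  ${url}`);
}
```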

ACTUAL EXPRESSION to pattern match:

/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/gi

To define what each section is responsible for matching, we can break this long pattern into subsets. In each table, the top line contains the full subset, then the remaining entries define what each matching set, character, literal, meta character or quantifier is doing.

GROUP 1

SET	DEFINITION
Full set
`/^`	Starting section includes initialization forward slash and start of string anchor.
	Components
/	opening delimiter, identifies that what follows is a regular expression
^	start of string anchor, so the match must begin at the start of the test string

GROUP 2

SET	DEFINITION
Full set
`(https?:\/\/)?`	parentheses indicate this is a subset of the full pattern
	Components
(	start of group
http	literal characters so must match exactly
s?	literal s with the ? quantifier, which matches one or none, so the s is optional
:\/\/	literal colon and two escaped forward slashes, which match the literal ://
)	end of group
?	quantifier added to end of group to indicate this grouping is optional so it would match to both https://www.google.com as well as www.google.com.
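
The optional protocol group can be looked at in isolation with a sketch like this one. The pattern below is only the first part of the full expression, cut out here for illustration.

```javascript
// Start anchor plus the optional protocol group from the full expression.
const protocolPattern = /^(https?:\/\/)?/;

const secure = "https://www.google.com".match(protocolPattern);
const standard = "http://www.google.com".match(protocolPattern);
const bare = "www.google.com".match(protocolPattern);

console.log(secure[1]);   // https://
console.log(standard[1]); // http://
// The group is optional, so the pattern still matches (with an empty string)
// and the capture is undefined because the group did not take part.
console.log(bare[0] === "", bare[1]); // true undefined
```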

GROUP 3

SET	DEFINITION
Full set
`([\da-z\.-]+)\.`	parentheses indicate this is a subset of the full pattern, followed by an escaped dot literal
	Components
(	start of group
[\da-z\.-]	is a bracketed matching group, so each accepted character can be any member of this set including any digit (\d), any letter (flag at end makes it non-case sensitive), and can include a dot or dash separator between sets within this group.
+	quantifier that requires at least one character from the previous bracketed set and, being greedy, matches as many as possible
)	end of group
\.	must include a dot literal next, so the subset will end with a . before the next pattern set starts

GROUP 4

SET	DEFINITION
Full set
`([a-z\.]{2,6})`	parentheses indicate this is a subset of the full pattern
	Components
(	start of group
[a-z\.]	is a bracketed matching group, so each accepted character can be any letter or a dot but not a digit
{2,6}	is a quantifier for length, so it must be at least 2 characters long but not more than 6, which covers .io, .com, and multi level endings like .ac.in or .co.uk. Note: this search limits length of the TLD to 6 though it could be increased to the current longest TLD of 24 characters and still be accurate.
)	end of group
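
To see how the domain group and the TLD group divide a host name between them, the two can be tested on their own. The pattern below is a cut-down fragment of the full expression (an assumption made for illustration), with anchors added. It also shows that, because the domain group is greedy, a multi level ending such as .co.uk is accepted but the 'co' part lands in the domain group rather than the TLD group.

```javascript
// Domain group, literal dot, then TLD group, taken from the full expression.
const hostPattern = /^([\da-z\.-]+)\.([a-z\.]{2,6})$/i;

const simpleHost = "google.com".match(hostPattern);
console.log(simpleHost[1], simpleHost[2]); // google com

const multiLevelHost = "example.co.uk".match(hostPattern);
console.log(multiLevelHost[1], multiLevelHost[2]); // example.co uk

console.log(hostPattern.test("example.c"));          // false: TLD shorter than 2
console.log(hostPattern.test("example.technology")); // false: TLD longer than 6
```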

GROUP 5

SET	DEFINITION
Full set
`([\/\w \.-]*)*\/?`	parentheses indicate this is a subset of the full pattern, characters following are literals and quantifiers
	Components
(	start of group
[\/\w \.-]	is a bracketed matching group, so each accepted character can be a forward slash (/), a word character (\w: any letter, digit or underscore), a space, a dot or a dash. This allows for file paths and filenames to follow the domain.
*	The asterisk allows any number or none of these characters in this group to be used so it can be left off or used to build up filenames with file paths or other nested URL endings.
)	end of group
*	Additional asterisk allows repeating of the previous group, so you can build up file paths or other nested structures like '/mydocuments/homework/project3/index.html'
\/?	escaped forward slash literal with the ? quantifier, allowing one or none of the trailing forward slash

GROUP 6

SET	DEFINITION
Full set
`$/`	Ending section includes final forward slash and end of string anchor.
$	End anchor, so the match must finish at the end of the test string.
/	closing delimiter, identifies that the regular expression is ending

FLAGS

Full Set
`gi`	The full set of flags comes after the end of the expression, after the final forward slash
g	Global search will match all occurrences so all possible matches will be tested
i	Case insensitive, so both upper and lower case will match letter patterns
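
Putting the pieces together, the sketch below shows what each capturing group holds for one sample URL, and then a side effect of the g flag that is worth knowing about when calling test() repeatedly on the same regex object. The variable names and the sample URL are assumptions for this example.

```javascript
// Full expression with only the i flag, to inspect the captured groups.
const fullPattern = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/i;
const parts = "https://longer.url.com/notes".match(fullPattern);

console.log(parts[1]); // https://   (protocol)
console.log(parts[2]); // longer.url (domain, including the subdomain)
console.log(parts[3]); // com        (TLD)
console.log(parts[4]); // /notes     (path)

// With the g flag the regex object remembers where the last match ended
// (its lastIndex), so a second test() on the same object starts from there.
const globalPattern = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/gi;
const firstCall = globalPattern.test("google.com");
const secondCall = globalPattern.test("google.com");
console.log(firstCall, secondCall); // true false
```

For a one-off validation of a single string, dropping the g flag (or creating a fresh regex each time) avoids this behaviour.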

Author

My name is Sheri Elgin and I am an IT Professional currently learning Full Stack Development through the UW Coding Bootcamp. You can check out my progress on my Github repo. Thanks for stopping by.
