Last active
June 29, 2021 12:01
# URL Matching Regex
This Gist explores the ideas and theory behind regular expressions, using a URL-matching pattern as the running example.
## Summary
Regular expressions, or regex for short, are sequences of literal characters, special characters and position markers that define a search pattern. They are not tied to any one programming language (JavaScript, Python and C++ all support them) and are widely used throughout the coding community for validation, for example of input fields and passwords.
Take the following example of a regular expression, which we’ll call “Matching a URL”:
`/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/`
This series of characters might look like nonsense, but it is actually a search pattern meant for basic validation of a URL; that is, it checks whether a string fulfills the basic requirements of a URL.
- The URL can start with a literal 'http://' or 'https://'. The `s` is optional, and so is the whole prefix (zero or one, using `?`). Without the `i` flag the match is case-sensitive, so 'HTTP://' does not match.
- With or without that prefix, the URL continues with one or more digits, lowercase letters, '.' or '-' characters.
- Followed by a literal '.' (escaped with `\`).
- Followed by lowercase letters or '.' characters, between 2 and 6 of them.
- Followed by a repeated capturing group, which can repeat any number of times but only captures its last iteration (for example, the path after the domain).
- Followed by an optional forward slash (http://www.google.com/, http://www.google.com).
This Gist breaks the pattern down further.
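
As a quick check of the rules listed above, the pattern can be run against a few strings with `test()`. This is a minimal sketch: the name `urlPattern` and the sample strings are chosen for illustration and are not part of the original pattern.

```javascript
// The "Matching a URL" pattern, stored under an illustrative name.
const urlPattern = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/;

// Sample inputs chosen for illustration only.
const samples = [
  'https://www.google.com/', // prefix and trailing slash
  'http://www.google.com',   // no trailing slash
  'www.google.com',          // no prefix at all
  'HTTP://www.google.com',   // upper-case prefix, no i flag
  'google',                  // no dot, so no domain ending
];

// test() returns true when the whole string satisfies the pattern.
const results = samples.map((s) => urlPattern.test(s));
samples.forEach((s, i) => console.log(s, '->', results[i]));
```

The first three strings pass and the last two fail, which matches the bullet points: the prefix and trailing slash are optional, the match is case-sensitive, and a dot followed by a 2 to 6 character ending is required.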
## Table of Contents
- [Anchors](#anchors)
- [Quantifiers](#quantifiers)
- [Grouping Constructs](#grouping-constructs)
- [Bracket Expressions](#bracket-expressions)
- [Character Classes](#character-classes)
- [The OR Operator](#the-or-operator)
- [Flags](#flags)
- [Character Escapes](#character-escapes)
## Regex Components
In JavaScript, a regex can be written as a literal by wrapping the pattern in slash characters. The “Matching a URL” regex follows this form:
`/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/`
Note: JavaScript provides two ways to create a regex object. The first, shown in our example, uses literal notation. The second uses the RegExp constructor, whose pattern argument is a string in quotation marks rather than a pattern between slashes. To learn more, search for "regex" in the MDN documentation at https://developer.mozilla.org/en-US/ or at https://www.w3schools.com/.
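
The two notations can be compared side by side. The sketch below uses only the protocol part of the pattern to keep it short; the names `literalForm` and `constructorForm` are illustrative.

```javascript
// Literal notation: the pattern sits between slashes, so "/" must be escaped.
const literalForm = /^https?:\/\//;

// Constructor notation: the pattern is a string, so "/" needs no escaping.
// Any backslash in the pattern would have to be doubled inside the string.
const constructorForm = new RegExp('^https?://');

const literalHit = literalForm.test('https://www.google.com');
const constructorHit = constructorForm.test('https://www.google.com');
const constructorMiss = constructorForm.test('ftp://www.google.com');

console.log(literalHit);      // true
console.log(constructorHit);  // true
console.log(constructorMiss); // false
```

The constructor form is mainly useful when the pattern is built at runtime, for example from user input or configuration.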
### Anchors
The characters `^` (caret) and `$` (dollar) are both anchors. They match positions rather than characters.
In our example, `^` matches the start of the string and `$` matches the end.
Because the rest of the pattern sits between the two, the whole string must satisfy it, not just part of the string.
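
The effect of the anchors is easiest to see by running the same pattern with and without them. This is a small sketch using a shortened pattern; the sample sentence is made up for illustration.

```javascript
const sentence = 'visit https://www.google.com today';

// Without anchors the pattern may match anywhere inside the string.
const unanchored = /https?:\/\/[\da-z\.-]+/;
// With ^ and $ the pattern has to account for the whole string.
const anchored = /^https?:\/\/[\da-z\.-]+$/;

const foundInside = unanchored.test(sentence);
const wholeSentence = anchored.test(sentence);
const wholeUrl = anchored.test('https://www.google.com');

console.log(foundInside);   // true
console.log(wholeSentence); // false
console.log(wholeUrl);      // true
```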
### Quantifiers
Quantifiers specify how many times the preceding character, class or group may repeat. There are two types, greedy and lazy. The standard quantifiers are:
- `*` Match zero or more times.
- `+` Match one or more times.
- `?` Match zero or one time.
- `{n}` Match exactly n times.
- `{n,}` Match at least n times.
- `{n,m}` Match from n to m times.
These standard quantifiers are greedy: they match as many characters as they possibly can, then give back only as much as is needed for the rest of the regex to match.
Alternatively, there is a second type of quantifier that overrides this default behavior.
Adding a `?` after a quantifier makes it lazy (non-greedy), so it matches as few characters as possible:
`*?`, `+?`, `??` and so on.
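
The difference between greedy and lazy matching shows up when more than one match length is possible. The sketch below uses a made-up string of two tags, plus the `{2,6}` range from the URL pattern; all variable names are illustrative.

```javascript
const tags = '<a><b>';

// Greedy: .+ takes as much as it can, so the match runs to the last ">".
const greedyMatch = tags.match(/<.+>/)[0];
// Lazy: .+? takes as little as it can, so the match stops at the first ">".
const lazyMatch = tags.match(/<.+?>/)[0];

console.log(greedyMatch); // <a><b>
console.log(lazyMatch);   // <a>

// {2,6} as used in the URL pattern: between 2 and 6 repetitions.
const ending = /^[a-z\.]{2,6}$/;
const endingResults = ['com', 'c', 'toolong'].map((s) => ending.test(s));
console.log(endingResults); // [ true, false, false ]
```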
### Grouping Constructs
Capturing groups are a way to treat multiple characters as a single unit. They are created by placing the characters to be grouped inside a set of parentheses `()`. In reference to “Matching a URL”, we can break our example down into four groups.
`/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/`
- Group one: `(https?:\/\/)` matches http:// or https://
- Group two: `([\da-z\.-]+)`
- Group three: `([a-z\.]{2,6})`
- Group four: `([\/\w \.-]*)`
Together these return a single match for a URL.
Let's break this down with a real-world example using the URL 'https://www.google.com/'. Group two is greedy, but it has to give back enough characters for the literal '.' and group three to match. The '.' between groups two and three is matched by `\.`, which sits outside any group.
- Group one: https://
- Group two: www.google
- Group three: com
- Group four: /
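
Running the pattern with `exec()` shows what each group actually captures for 'https://www.google.com/'. This is a minimal sketch; `urlRegex` and `match` are illustrative names.

```javascript
const urlRegex = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/;

// exec() returns an array: index 0 is the whole match,
// indexes 1 to 4 are the four capturing groups.
const match = urlRegex.exec('https://www.google.com/');

console.log(match[0]); // https://www.google.com/
console.log(match[1]); // https://
console.log(match[2]); // www.google
console.log(match[3]); // com
console.log(match[4]); // /
```

Group two stops at 'www.google' because it has to leave a '.' and at least two characters for group three.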
### Bracket Expressions
Certain named classes of characters are predefined within bracket expressions in POSIX regular expressions.
Their interpretation depends on the LC_CTYPE locale; for example, `[[:alnum:]]` means the character class of numbers and letters in the current locale.
JavaScript does not support these named classes, so the equivalent is spelled out by hand, for example `[A-Za-z0-9]`.
More examples can readily be found online in regex cheat sheets.
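
Since JavaScript has no POSIX named classes, a hand-written range is the usual stand-in. The sketch below assumes an ASCII-only equivalent of `[[:alnum:]]`, which ignores the locale-dependent part of the POSIX definition, and also shows how a plain JavaScript regex literal reads the POSIX spelling.

```javascript
// Hand-written, ASCII-only stand-in for POSIX [[:alnum:]] (an assumption:
// the POSIX class depends on the locale, this range does not).
const alnum = /^[A-Za-z0-9]+$/;
const alnumPlain = alnum.test('Regex101');
const alnumHyphen = alnum.test('regex-101');
console.log(alnumPlain);  // true
console.log(alnumHyphen); // false, "-" is not a letter or digit

// Written without flags, [[:alnum:]] is not a named class in JavaScript.
// It is read as the class [[:alnum:] (the characters [ : a l n u m)
// followed by a literal "]".
const posixLookalike = /[[:alnum:]]/;
const lookalikeLetter = posixLookalike.test('z');
const lookalikePair = posixLookalike.test('a]');
console.log(lookalikeLetter); // false
console.log(lookalikePair);   // true
```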
### Character Classes
Expressions inside `[]` define a character class.
A character class matches a single character that is contained within the brackets. For example, `[xbc]` matches "x", "b", or "c".
We can look at group four, between the parentheses, for another example:
`([\/\w \.-]*)`
- `()` is the capture group.
- `[]` denotes the class inside the capture group.
- In this case the class matches '/', `\w` (a word character: a-z, A-Z, 0-9 or underscore), a space, and a literal '.' or '-'.
- `*` means the class can repeat zero or more times.
Note: if we insert the caret symbol at the start of a character class, we are asking to match any character that is NOT contained within the class.
`([^abc])` matches any single character except "a", "b" or "c".
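
The three kinds of class discussed here can be tried out directly. This is a small sketch; the sample strings and variable names are made up for illustration.

```javascript
// A simple class: exactly one of x, b or c.
const simpleClass = /^[xbc]$/;
const simpleResults = ['b', 'a'].map((s) => simpleClass.test(s));
console.log(simpleResults); // [ true, false ]

// The class from group four: "/", word characters, space, "." and "-".
const pathChars = /^[\/\w \.-]*$/;
const pathResults = ['/maps/my_place-1.html', '/search?q=regex'].map((s) => pathChars.test(s));
console.log(pathResults); // [ true, false ] because "?" and "=" are not in the class

// A negated class: any single character except a, b or c.
const notAbc = /^[^abc]$/;
const negatedResults = ['d', 'a'].map((s) => notAbc.test(s));
console.log(negatedResults); // [ true, false ]
```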
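### The OR Operator
To achieve OR in regex we can use the pipe character `|`. It is a powerful tool, usually used together with a group `( )` to limit what the alternatives apply to.
For example, `(http|https)` matches either "http" or "https", which is the same set of strings our example matches with `https?`.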
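
A short sketch of alternation, using the protocol prefix as the example. The names `protocol` and `ungrouped` are illustrative; the second pattern shows why the parentheses matter.

```javascript
// Grouped: "http" OR "https", then "://".
const protocol = /^(http|https):\/\//;
const protocolResults = [
  'http://www.google.com',
  'https://www.google.com',
  'ftp://www.google.com',
].map((s) => protocol.test(s));
console.log(protocolResults); // [ true, true, false ]

// Ungrouped: | splits the whole pattern into "^http" OR "https://".
const ungrouped = /^http|https:\/\//;
const ungroupedLoose = ungrouped.test('httpXYZ');
const groupedStrict = protocol.test('httpXYZ');
console.log(ungroupedLoose); // true, the left alternative matches on its own
console.log(groupedStrict);  // false
```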
### Flags
Regular expressions may have flags that affect the search.
This section covers seven of the flags available in JavaScript:
- `i`: the search is case-insensitive, so there is no difference between A and a when matching.
- `g`: the search looks for all matches; without it, only the first match is returned. This is also known as the global flag.
- `m`: multiline mode. In this mode `^` and `$` match not only at the beginning and end of the string, but also at the start and end of each line.
- `d`: generate indices for substring matches.
- `s`: allows `.` to match newline characters.
- `u`: "unicode"; treat a pattern as a sequence of Unicode code points.
- `y`: perform a "sticky" search that matches starting at the current position in the target string.
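
Four of these flags are easy to demonstrate by running the same pattern with and without the flag. This is a minimal sketch with made-up sample strings; it leaves out `d`, `u` and `y`.

```javascript
// i: case-insensitive matching.
const withoutI = /http/.test('HTTP://www.google.com');
const withI = /http/i.test('HTTP://www.google.com');
console.log(withoutI, withI); // false true

// g: match() returns every match instead of only the first one.
const dotCount = 'www.google.com'.match(/\./g).length;
console.log(dotCount); // 2

// m: ^ also matches at the start of each line.
const twoLines = 'first line\nhttp://www.google.com';
const withoutM = /^http/.test(twoLines);
const withM = /^http/m.test(twoLines);
console.log(withoutM, withM); // false true

// s: the dot also matches a newline character.
const withoutS = /a.b/.test('a\nb');
const withS = /a.b/s.test('a\nb');
console.log(withoutS, withS); // false true
```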
### Character Escapes
The backslash in a regular expression escapes the character that follows it. Placed before a special character such as `.` or `/`, it makes that character match literally. Placed before certain letters, it forms a shorthand for a common character class, such as `\w` for a word character or `\s` for whitespace.
`\w`, the word character, matches `[A-Za-z0-9_]`: all upper and lower case letters, all digits, and the underscore.
`\s` matches any whitespace character, such as a space, a tab or a line break.
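
The sketch below shows both uses of the backslash: turning a special character into a literal one, and forming the `\w` and `\s` shorthands. The sample strings are made up for illustration.

```javascript
// An unescaped dot matches any character; an escaped dot matches only ".".
const looseDot = /^www.google$/.test('wwwXgoogle');
const strictDotMiss = /^www\.google$/.test('wwwXgoogle');
const strictDotHit = /^www\.google$/.test('www.google');
console.log(looseDot, strictDotMiss, strictDotHit); // true false true

// \w covers letters, digits and the underscore, but not the hyphen.
const wordResults = ['snake_case_1', 'kebab-case'].map((s) => /^\w+$/.test(s));
console.log(wordResults); // [ true, false ]

// \s covers spaces, tabs and line breaks.
const spaceResults = ['a b', 'a\tb', 'a\nb', 'ab'].map((s) => /\s/.test(s));
console.log(spaceResults); // [ true, true, true, false ]
```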
## Author
Jack Ritchie is a novice coder based in Bondi, Australia.
Jack enjoys long walks on the beach and coding.
For more information:
http://www.github.com/Jitchie