Skip to content

Instantly share code, notes, and snippets.

@jishanshaikh4
Forked from cobalt88/Regex URL.md
Last active June 15, 2022 17:50
Show Gist options
  • Save jishanshaikh4/fe0fadcffece7cf6965d5265815ab725 to your computer and use it in GitHub Desktop.
Save jishanshaikh4/fe0fadcffece7cf6965d5265815ab725 to your computer and use it in GitHub Desktop.
an example of cmmmon reget

Regex Breakdown : URL

Table of Contents

What is Regex?

Regex stands for Regular Expression, and a regular expression is a string of text and special characters used to algorithmically search, compare, and match defined values or patterns in a string. In layman's terms, Regex is a powerful search, filter, and validation tool. Lets look at some examples to better understand how regex works and what it can be used for.

How it works

Regex can be a little much to take in all at once, and they can range in complexity from something as simple as a regex to match a simple string such as /([a-z])/ which will match any lower case letters from a-z all the way up to extreemely complex searchin and matching patterns that can be thousands of characters in legth.

The following regular expression may look a bit intimidating, but its actually not that complicated, this particular regex will match a valid url or web address. for example: www.example.web | http://www.example.web | https://www.exmaple.web | https://example.web | http://blog.example.web are all valid URL formats, and the following regex example will return all of these as a valid match when compared.

  /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/

First thing to point out is that every regex is contained in a boundary defined by two forward slashes:

  /regex goes here/

there are some options for parameters that go outside of that boundary but this one doesn't use them so we will leave that alone for now, just be aware that they do exist.

Next lets break that long string down into more manageable chunks

       capture group 1    capture group 2   capture group 3  capture group 4  escaped (/)   __end of expression
              |                  |                  |               |           |          |
/   ^   (https?:\/\/)   ?  ([\da-z\.-]+)  \. ([a-z\.] {2,6}) ([\/\w \.-]*)  *  \/   ?      $ /
    |                   |                  |                                |       |
match at beginning   Preceding group    escaped period             match 0 or    optional
of the string only      is optional                                more of the 
                                                                   preceding token

The first thing to look at here is the capture groups. Capture groups are like smaller containers of parameters to search for and match.

Breaking Down The Capture Groups

  • Capture Group 1:
 (https?:\/\/) 

This capture group is is looking specifically for https:// or http:// however, since / is used by regex to do something else normally it has to be "escaped" with a \ to escape a character is to use it as its string character, not as a special regex character. the ? is a special quantifier basically saying that the preceding character is optional by allowing the expression to match zero or one of the preceding character, in this case its referring to the letter s, since both http:// and https:// are valid options for starting a URL string depending on whether or not the endpoint uses a security certificate or enforces https protocols. Additionally the entire capture group is optional since the group is followed directly byt another ? because not every URL has to start with http:// or https://

To summarize: capture group 1 can be broken down as follows:

match all char     Escape the 
groups that        following 
= http             characters
   |               |   |
( http  s ?        \ /  \ / )
          |          |    |
match 0 or 1 of   Escaped characters
the preceding     (/ and / are 'escaped')
character (s)

  • Capture Group 2:
  ([\da-z\.-]+)

This capture group has a very different set of parameters than group 1. Lets start with the [ ] . In a regex the [ ] are used to define a character set or range. For example [a-z] will match any character between a and z (regex is case sensitive however url's are not, which is why it only uses lower case letters in this particular regex).

inside of the [ ] there is also a \d , you may be tempted to think this means the letter d is an escaped character, however it is not. The character d does not signify anything special in regex on its own and therefore is always treated as a character, unless it is preceded by a \

\d is a regex character class that defines a digit, therefore the expression [\d a-z] will match any character that is either a digit or a lower case letter.

Then we have an escaped period: \. a hyphen: - and finally another new quantifier: +

The + quantifier basically means that it will be allowed to match one or more of the preceding token. In this case the preceding token is a character set that matches any string containing more than 1 character comprised of any digit 0-9, any lower case letter, period, or hyphen. Some valid matches can look like: .ab-c9, ., tech-blog., about

Group 2 can be broken down as follows:

Opening of   any lower case    closing of
Char Set     letter from a-z   char set
  |            |                 |
( [  \d       a-z  \.       -    ]  +  )
      |            |        |       |
     Any        escaped  hyphen    plus 
    Digit       period           quantifier

  • Capture Group 3:
  ([a-z\.]{2,6})

Capture group 3 shares a lot of the same functionality as group 2. It has a character ser or character range [a-z] selecting any lower case characters from a through z. After the character set is declared there is another familiar parameter, the escaped period \., and lastly we have a new a new quantifier: { }. The curly brackets are used specifically to declare how many of the preceding token must be matched.

Opener     Separator      Closer
  |           |             |
  {     2     ,      6      }
        |            |
     Minimum       Maximum

In this particular example the above quantifier will match a minimum of 2 and a maximum of 6 of the preceding token [a-z\.] which means any string comprised of a-z and a . will be a valid string or a valid match. For example: abz, .com, qwert., lkjhgf are all considered valid. However if you have more than 6 or less than 2 characters or include numbers or upper case letters the string will be considered invalid.

Keep in mind this quantifier does not always have to be written with 2 values, it can be written in the following ways:

  {1,10} - will match a minimum of 1 of the preceding token and a maximum of 10
  {1,} - will match 1 or more of the preceding token with no maximum
  {1} - will match exactly 1 of the preceding token, no more no less.

Depending on the situation the above quantifier options can be very useful.

In summary, group 3 is broken down as follows:

                Min/Max
 Char Range    quantifier
      |            |
  ( [a-z  \. ]   {2,6}  )
          |            
      Escape following
      character (.)

Capture Group 4:

  ([\/\w \.-]*)

Last but not least we have the 4th capture group in our regex example. This one again has many of the same things we have already covered and a few new things to look at. First up is the character range [\/\w \.-] which we can break down pretty easily by now. \/ is an escaped forward slash, \w is an escaped w, and \. is an escaped ., and -. so the range or set of valid characters in this group are: /, w, ., and -. After the rage/set declaration we have a new quantifier: * which means that the overall string has to have 0 or more of the preceding token. This is a kind of shorthand for {0,}, which would also work, but why write out more when you don't need to. As a side note this entire capture group is required to create a valid match, a minimum of 1 valid token would need to be included in the string being searched. Examples: www., /, ., -

In summary, group 4 can be broken down as follows:

 Escaped characters   Match 0 or more
  ( \,  w,  .)        of the preceding token
    |   |   |            |
( [ \/  \w  \.  - ]      *  )
                |
            Hyphen character

More Stuff

Hopefully by now you are able to understand how this particular regular expression works and you know some of the basics of regex. There is a lot more to regex than this relatively simple example we broke down here but its entirely too much to put in this particular breakdown.

If you are interested in learning more I recommend www.regexr.com as an excellent starting point to your learning path for regex. Also keep an eye out in the future for other tutorials or break-downs of common and or useful regular expressions that I plan to do in the future.

Until next time, Have Fun!

Definitions

  • Regex: stands for regular expression. A regular expression is a series of characters that specifies or defines a search pattern to be executed on a series of text (STRING).

  • Character Class: In the context of regular expressions, a character class is a set of characters enclosed within square brackets. It specifies the characters that will successfully match a single character from a given input string.

  • Anchors: In regex an anchor is a token that does not match any characters, but instead assert something about the string or the matching process.

Example: ^ is an anchor to assert the start of the string, and $ asserts the end of the string.

  • Quantifiers: A regex quantifier is used to determine how many of something should be matched, some examples include * and {2,9}

  • Capture Groups: A regex capture group is a way to treat multiple characters as a single unit or string. They are created by placing the characters to be grouped inside a set of parentheses.

About The Author

This document was autored by Vincent Teune,
GitHub Profile: https://github.com/cobalt88
GitHub Gists: https://gist.github.com/cobalt88
Email: [email protected]

@cobalt88
Copy link

just a heads up, this gist has recently been updated to include a lot more information. The version you have here is incomplete.

@jishanshaikh4
Copy link
Author

jishanshaikh4 commented Jun 15, 2022

just a heads up, this gist has recently been updated to include a lot more information. The version you have here is incomplete.

Your comment will remind me of that. Thanks for the update. Good work. Appreciated!
Edit: updated the gist.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment