@hybee234
Last active December 28, 2023 02:22
Regex Tutorial

Regular Expressions Tutorial

It is commonplace in everyday computer use to use the "find" function to locate a match in a body of text. Asking the computer to "find" text typically means providing literal characters (e.g. "cat") for the computer to search for and highlight in that text.

Regular expressions, often referred to as regex, are a powerful tool that lets the user step outside the boundaries of literal character search and build more complex matching rules.

When thinking about regex, it is best to understand it as a string of specialised characters that describes a pattern for the system to find and match in a given string or body of text.

Regex will feel cryptic to anyone who hasn't been exposed to them, but with some guidance and practice, you'll understand enough to start to make use of them.

Regex is valuable for finding and replacing text (e.g. redacting information in documents), and also in programs where user input into fields and forms needs to be validated (e.g. email checks, mobile number checks, password complexity checks).

Summary

This tutorial covers the key concepts of regex, working towards an understanding of the regex below, which is used to match email addresses:

/^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/

Regex functions used in the email example

Literal and meta characters

Understanding the concept of literal and meta characters is the first step in learning regex. In a typical "find" scenario, the text entered in the search field is composed of "literal" characters, that is, characters that the system will match exactly as they are, in the sequence provided (e.g. "cat").

In addition to literal characters, regex also makes use of "meta characters". Consider these as special characters that represent groups of characters or otherwise change what the system will match against.

For example, a "\d" in a regex matches any digit 0-9. A basic regex of "\d\d\d" asks the system to match 3 digits appearing consecutively in a body of text.

Commonly used meta characters are:

Meta character Description
\d any digit 0-9
\w any word character - a-z, A-Z, 0-9 or underscore (no other symbols)
\s any whitespace character (space, tab, line break)

Literal and meta characters can be used in combination as needed.

For example, a regex of #\d\d\d would match a literal '#' followed by any 3 digits.
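As a minimal sketch of literals and meta characters working together, the JavaScript snippet below runs the #\d\d\d pattern over a made-up sentence. The sample text and variable names are assumptions for illustration, and the g flag (covered later under Flags) is added only so that every match is returned.

```javascript
// Hypothetical sample text, used only for illustration.
const sampleText = "Order #123 shipped, order #45 pending, ref #6789.";

// A literal '#' followed by exactly three digits.
const hashThreeDigits = /#\d\d\d/g;

const hashMatches = sampleText.match(hashThreeDigits);
console.log(hashMatches); // [ '#123', '#678' ]
```

Note that "#45" is skipped because only two digits follow the hash, while "#6789" matches on its first three digits.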

Bracket Expressions

There may be times where you want the system to match any one character from a selection, e.g. perhaps you want the system to match either a word character or an exclamation mark. These are represented by the \w meta character and a literal '!' (exclamation mark). Simply writing the regex as "\w!" will not work, as the system will look for a word character in position 1 followed by an "!" in position 2.

This problem is solved by using square brackets to denote that any one of the characters within the brackets can occupy the character position.

Example 1: The below asks the system to match for a word character or an exclamation mark

[\w!]

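Example 2: The below regex asks the system to match, in sequence:

  1. Any digit 0-9,
  2. Any digit 0-9,
  3. A whitespace character (e.g. a space or tab),
  4. Any word character, or underscore, or exclamation mark, or "at" symbol, or "hash" symbol (the underscore is already covered by \w, so listing it again is harmless but redundant)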
\d\d\s[\w_!@#]
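The sequence above can be tried out with a short JavaScript sketch. The test strings here are invented for illustration; each one is checked with the regex's test method, which returns true when the pattern is found anywhere in the string.

```javascript
// Two digits, one whitespace character, then one character from the bracket expression.
const bracketPattern = /\d\d\s[\w_!@#]/;

// Hypothetical inputs chosen to show one pass or failure each.
const bracketInputs = ["42 !", "42 a", "42 ?", "4 a"];
const bracketResults = bracketInputs.map((input) => bracketPattern.test(input));

console.log(bracketResults); // [ true, true, false, false ]
```

"42 ?" fails because '?' is not listed inside the brackets, and "4 a" fails because only one digit appears before the space.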

Character Classes

Character classes are regex codes that represent a range of characters. The examples of \d (digits), \w (word) and \s (space) above are actually character classes.

Here is the table again with additional Character classes:

Character Class Description
\d any digit character 0-9
\w any word character - a-z, A-Z, 0-9 or underscore (no other symbols)
\s any whitespace character (space, tab, line break)
\D any character that is not a digit 0-9
\W any character that is not a word character (i.e. not a-z, A-Z, 0-9 or underscore)
\S any character that is not whitespace

The below are character classes that must be wrapped within bracket expressions (otherwise they are interpreted as literals)

Character Class Description
[a-z] any lower case character a-z (note you can modify the range)
[A-Z] any upper case character A-Z (note you can modify the range)
[0-9] any digit 0 through 9 (note you can modify the range)

To combine these character classes, write them next to one another inside the same brackets.

e.g. the below regex will match a lower or upper case letter A through Z, or a digit from 0 to 5

[a-zA-Z0-5]
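A quick way to see the combined ranges in action is to test single characters against the pattern. This is only a sketch in JavaScript; the input characters and variable names are chosen for illustration.

```javascript
// One character that is a letter (either case) or a digit from 0 to 5.
const letterOrLowDigit = /[a-zA-Z0-5]/;

// Hypothetical single-character inputs.
const rangeInputs = ["q", "Q", "5", "6", "-"];
const rangeResults = rangeInputs.map((ch) => letterOrLowDigit.test(ch));

console.log(rangeResults); // [ true, true, true, false, false ]
```

The digit 6 falls outside the 0-5 range, and the hyphen is not part of any listed range, so neither matches.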

The full stop "." is a special meta character that matches any character (by default, any character except a line break). To match a "literal" full stop, place a backslash in front of it.

Character Description
\. Literal full stop
. Match any character
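The difference between the two is easiest to see side by side. In this small JavaScript sketch (pattern and variable names are illustrative assumptions), the unescaped full stop accepts any character between "a" and "c", while the escaped one accepts only a real full stop.

```javascript
const anyCharBetween = /a.c/;      // '.' matches any single character
const literalDotBetween = /a\.c/;  // '\.' matches only a full stop

const dotResults = {
  anyCharOnAbc: anyCharBetween.test("abc"),        // true
  anyCharOnDot: anyCharBetween.test("a.c"),        // true
  literalOnAbc: literalDotBetween.test("abc"),     // false
  literalOnDot: literalDotBetween.test("a.c"),     // true
};

console.log(dotResults);
```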

Quantifiers

A regex search will often require matching the same group of characters a certain number of times (e.g. it may be optional (i.e. match zero or one times), it could be a specific number of times, and so on). Attempting to code this in regex one character at a time won't always be possible, and can be tedious and prone to error.

Quantifiers solve this problem. A quantifier specifies how many times the preceding element must be matched. (Note: quantifiers themselves don't match against characters)

Quantifier Description
* Zero or more of the preceding element
+ One or more of the preceding element
? Optional: zero or one of the preceding element
{min,max} Between min and max of the preceding element, inclusive (written without a space after the comma)
{n} Exactly n of the preceding element

Example 1. Match any number of lower case characters, followed by a literal hyphen, followed by any number of lower case characters. Potential matches are "rainbow-coloured", "hello-", "-goodbye"

[a-z]*-[a-z]*

Example 2. Match the literal letters "colo", an optional 'u', a literal 'r', and then any number of word characters. Will match "colour, color, colourful, coloring"

colou?r\w*

Example 3. Match 4 digits. Will match '1234'

\d{4}

Example 4. Match 2 to 6 characters - lower case a-z or literal full stop. Will match '.com', 'ab.net', 'hello.', 'banana'

[a-z\.]{2,6}
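The quantifier examples above can be checked with a short JavaScript sketch. The sample sentence and variable names are assumptions made for illustration, and the g flag (see Flags) is used only to collect every match.

```javascript
// Hypothetical text containing several spellings, plus one non-match ("colr").
const spellingText = "color colour colourful coloring colr";

// 'u' is optional (?), and any number of word characters may follow 'r' (*).
const colourMatches = spellingText.match(/colou?r\w*/g);
console.log(colourMatches); // [ 'color', 'colour', 'colourful', 'coloring' ]

// Exactly four digits: only the first four of "12345" are taken.
const fourDigits = "12345".match(/\d{4}/)[0];
console.log(fourDigits); // 1234

// Between 2 and 6 lower case letters or full stops.
const shortSuffix = ".com".match(/[a-z\.]{2,6}/)[0];
console.log(shortSuffix); // .com
```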

Anchors

There are times where you may want to ensure you are matching from the first and/or up to the last character of a string. Anchors are a way to indicate the position of a match.

Anchor Description
^ The following element must be at the start of a string
$ The preceding element must be at the end of a string

Note that, much like quantifiers, anchors do not match characters themselves. They match a position, which constrains where the neighbouring element is allowed to appear.

Example 1: The regex below must match at the start of a string, and requires at least one character which is a lower case letter, a digit from 0-9, an underscore, a full stop (note the backslash to denote a literal full stop) or a hyphen.

^[a-z0-9_\.-]+
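To illustrate how anchors restrict where a match may occur, here is a brief JavaScript sketch. The input strings are made up, and the second pattern (^\d{4}$) is an extra illustration that is not part of the tutorial's own examples.

```javascript
// Must begin at the start of the string.
const startPattern = /^[a-z0-9_\.-]+/;

const startMatch = "john.doe@home".match(startPattern);
console.log(startMatch[0]); // john.doe

// '@' is not in the bracket expression, so nothing can match at the start.
const noStartMatch = "@john".match(startPattern);
console.log(noStartMatch); // null

// With both anchors, the whole string must be exactly four digits.
const exactlyFourDigits = /^\d{4}$/;
const anchorResults = ["1234", "12345", "a1234"].map((s) => exactlyFourDigits.test(s));
console.log(anchorResults); // [ true, false, false ]
```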

Grouping and Capturing

Regex enables grouping sections of an expression for separate assessment or later use. This is done with the use of parentheses.

Consider the example regex below that matches a typical landline number: an open bracket, 2 digit area code, closed bracket, 4 digits, hyphen and 4 digits.

Note that the parentheses that appear in this example have been turned into literal brackets by the preceding backslash: \( and \)

^\([\d]{2}\)[\d]{4}-[\d]{4}$

The above regex will match a phone number like this: (03)9765-0987

If you happen to want to extract just the digits out in groups for later use (e.g. in a find and replace function or to store the value in a variable), then this can be done using grouping and capturing.

Firstly, place parentheses within your regex to capture the sections of interest. See below (note how the parentheses are excluding the characters that are not of interest)

^\(([\d]{2})\)([\d]{4})-([\d]{4})$

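Here is the same regex spaced out to highlight the subgroup parentheses (the spaces are only a visual aid and are not part of the pattern)

^\(  ([\d]{2})  \)  ([\d]{4})  -  ([\d]{4})  $

The groups can then be referenced as $1, $2, $3 and so on (in order of the opening parentheses from left to right). Note that in many tools $0 refers to the entire match; in JavaScript's replace function the entire match is written as $& instead.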

A good use case for this feature is find and replace, where the original regex finds matches in a body of text and the replacement uses subgroups to reformat or rearrange characters.

With the below replacement, the phone number will appear without any breaks: 0397650987

$1$2$3

You can get more creative and use alternate replacements

e.g. the below will return Area code: 03, Phone: 9765 0987

Area code: $1, Phone: $2 $3
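The two replacements above can be reproduced in JavaScript with the string replace method, as in the sketch below. The variable names are illustrative; the pattern and phone number are the ones used in this section.

```javascript
const phonePattern = /^\(([\d]{2})\)([\d]{4})-([\d]{4})$/;
const phoneNumber = "(03)9765-0987";

// Join the three captured groups with no separators.
const digitsOnly = phoneNumber.replace(phonePattern, "$1$2$3");
console.log(digitsOnly); // 0397650987

// Rearrange the captured groups inside new text.
const labelled = phoneNumber.replace(phonePattern, "Area code: $1, Phone: $2 $3");
console.log(labelled); // Area code: 03, Phone: 9765 0987

// The groups can also be stored in variables via match().
const [, areaCode, firstHalf, secondHalf] = phoneNumber.match(phonePattern);
console.log(areaCode, firstHalf, secondHalf); // 03 9765 0987
```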

Deciphering the email regex

We have reached the point of utilising all you have learnt so far to decipher the email example. Here is the regex again:

`/^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/`

The forward slashes / that appear at the beginning and end of the regex indicate to JavaScript that this is a regex (this suggests the regex is going to be used in JavaScript as part of a validation step).

The regex also begins and ends with the anchors ^ and $, indicating that the entire string, from start to end, must fit the pattern. Combined with bracket expressions that do not include a space, this means a string containing spaces will not match. (This is typically expected in an email as spaces are not allowed)

The "body" of the regex remains:

([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6}) 

At a macro level we can see that there are 3 subgroups contained within parentheses ($1, $2, $3) separated by some literal characters - a literal @ and a literal full stop .

I'll simplify it to the below for ease of navigating through this example

$1@$2\.$3

Let's now address each subgroup:

Subgroup 1

[a-z0-9_\.-]+

Subgroup $1 is a bracket expression that matches for one or more characters that must be one of:

  1. lower case letter (a-z)
  2. digit (0-9)
  3. underscore (_)
  4. full stop (\.)
  5. hyphen (-)

Note: examples of failed matches are capital letters or spaces

Subgroup 2

[\da-z\.-]+

Subgroup $2 is a bracket expression that matches for one or more characters that must be one of:

  1. digit (\d)
  2. lower case letter (a-z)
  3. full stop (\.)
  4. hyphen (-)

Note: examples of failed matches are capital letters and underscore

Subgroup 3

[a-z\.]{2,6}

Subgroup $3 is a bracket expression that matches if there are 2 to 6 characters (inclusive) that must be one of:

  1. lower case letter (a-z)
  2. full stop (\.)

Note: examples of failed matches are if the text is too short (1 character) or too long (beyond 6 characters), or if digits are used

Combining all of the above together, it is expected that the below strings will match

[email protected] [email protected] [email protected]
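As a rough check of the walkthrough, the JavaScript sketch below runs the email regex against a handful of addresses. All of the addresses are invented for illustration (they use the reserved example.com style domains) and are not the tutorial's original samples.

```javascript
const emailPattern = /^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/;

// Hypothetical addresses expected to pass.
const validEmails = ["jane.doe@example.com", "user_1@mail-server.net"];
// Hypothetical addresses expected to fail: capital letter, space, 1-letter ending.
const invalidEmails = ["Jane@example.com", "jane doe@example.com", "jane@example.c"];

const validResults = validEmails.map((e) => emailPattern.test(e));
const invalidResults = invalidEmails.map((e) => emailPattern.test(e));
console.log(validResults);   // [ true, true ]
console.log(invalidResults); // [ false, false, false ]

// The three subgroups are available on the match result.
const emailParts = emailPattern.exec("jane.doe@example.com");
console.log(emailParts[1], emailParts[2], emailParts[3]); // jane.doe example com
```

Keep in mind this pattern is a teaching example: it rejects some real addresses (for instance those with capital letters), so treat it as a starting point rather than a complete validator.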

And there we have it. Hopefully this tutorial has provided a useful introduction to regex concepts and use cases that you can one day use in your projects!

If you'd like to practice writing and testing regex, a great website to use is: https://regex101.com/

Other regex functions

The example regex to identify emails utilises the most common features of Regex. Read on to learn about additional features.

OR Operator

The OR operator is also known as alternation. This regex feature allows coding in options to match against. The options are separated by the pipe character "|" and are usually wrapped in parentheses.

Example 1. If we wanted to specify the options for the end of the email, we can substitute subgroup 3 with alternation.

The below example indicates that the end of the email is either com, net or au

([a-z0-9_\.-]+)@([\da-z\.-]+)\.(com|net|au)

This will match the below emails:

[email protected] [email protected] [email protected]

Note: the parentheses will also automatically turn the alternation feature into a subgroup which can be referenced as needed. In the above example, the parentheses with alternation is subgroup 3 ($3)
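The alternation example can be exercised with a small JavaScript sketch. The addresses below are hypothetical and use example-style domains purely for illustration.

```javascript
const domainChoice = /([a-z0-9_\.-]+)@([\da-z\.-]+)\.(com|net|au)/;

// Hypothetical addresses: three allowed endings and one that is not listed.
const altAddresses = ["jane@example.com", "sam@example.net", "kim@example.com.au", "lee@example.org"];
const altResults = altAddresses.map((address) => domainChoice.test(address));
console.log(altResults); // [ true, true, true, false ]

// Subgroup 3 holds whichever alternative matched.
const altGroups = domainChoice.exec("kim@example.com.au");
console.log(altGroups[3]); // au
```

Because this pattern has no anchors, it would also accept a longer ending that merely starts with one of the options (such as ".community"). Adding ^ and $ as in the full email regex would prevent this.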

Flags

Flags are a special feature that changes the overall behaviour of the regex.

Perhaps the two most common flags are the "Ignore Casing" i and "Global" g flags. The ignore case flag makes the search case insensitive. The global flag is typically used in programming to ask the search to find all occurrences of matches (instead of only the first match).

Flags are appended to the end of the regex, after the forward slash / that denotes the end of the regex.

Note: you can set multiple flags by continuing to append them to the end of the regex

Example: the below regex will search for all occurrences (g) of one or more consecutive letters (a-z) and will ignore case (i)

/[a-z]+/gi

e.g. a string of Hello DOG cat rAinBow will return 4 matches
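To see what each flag contributes, the JavaScript sketch below runs the same pattern with both flags, with no flags, and with g only. The variable names are illustrative; the input string is the one from the example above.

```javascript
const mixedCase = "Hello DOG cat rAinBow";

// Both flags: every run of letters, regardless of case.
const withBothFlags = mixedCase.match(/[a-z]+/gi);
console.log(withBothFlags); // [ 'Hello', 'DOG', 'cat', 'rAinBow' ]

// No flags: only the first run of lower case letters.
const withNoFlags = mixedCase.match(/[a-z]+/);
console.log(withNoFlags[0]); // ello

// g only: every run of lower case letters; capitals split the words.
const withGlobalOnly = mixedCase.match(/[a-z]+/g);
console.log(withGlobalOnly); // [ 'ello', 'cat', 'r', 'in', 'ow' ]
```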

The 6 flags most commonly described for JavaScript are listed below (newer JavaScript engines also add d and v):

Flag Description
i search is case-insensitive: no difference between A and a
g search looks for all matches, not just the first
m multi-line mode
s dotall mode - allows a full stop to match a newline character
u unicode support mode
y sticky mode

More information under the heading "flags" here: https://javascript.info/regexp-introduction

Greedy and Lazy Match

Greedy and lazy matching describe two modes that a quantifier in a regex can run in.

In short, a regex with a greedy match will match as many characters as possible, one with a lazy match will match the least number of characters possible.

To provide an example of a greedy match and lazy match, consider the below body of text:

The "chess-master" decided today was "too hot" to play and "forfeited"

The below regex is a greedy match - a literal double quotation mark ", then at least one (+) of any character (.), finishing with another double quotation mark "

".+"

This will match the below string (from the first double quotation to the last spanning across the entire row)

"chess-master" decided today was "too hot" to play and "forfeited"

To change the regex to a lazy match, a question mark ? is inserted after the quantifier

".+?"

This will return the shortest possible matches (and so match the 3 quotations separately)

"chess-master",
"too hot" and,
"forfeited"

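The greedy and lazy results above can be confirmed with the JavaScript sketch below. The g flag is added so that all matches are returned; variable names are illustrative.

```javascript
const quotedText = 'The "chess-master" decided today was "too hot" to play and "forfeited"';

// Greedy: runs from the first quotation mark to the very last one.
const greedyMatches = quotedText.match(/".+"/g);
console.log(greedyMatches.length); // 1

// Lazy: stops at the nearest closing quotation mark each time.
const lazyMatches = quotedText.match(/".+?"/g);
console.log(lazyMatches); // [ '"chess-master"', '"too hot"', '"forfeited"' ]
```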
Boundaries

Regex provides the ability to align the match to the start and/or end of a word. This is achieved by making use of word boundaries, denoted by \b.

Word boundaries are not matched against themselves but affect the matching of the regex.

Example - the below will match a 5 letter word exactly (and not just the first 5 letters of a longer word)

\b[a-zA-Z]{5}\b

This is particularly useful to filter out matches that are not the whole word.
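A short JavaScript sketch makes the effect of the boundaries visible by running the pattern with and without \b. The sample sentence is invented for illustration.

```javascript
// Hypothetical sentence with words of 5, 5, 5, 4 and 8 letters.
const wordText = "Seven brave birds flew overhead";

// With boundaries: only whole words of exactly 5 letters.
const wholeWords = wordText.match(/\b[a-zA-Z]{5}\b/g);
console.log(wholeWords); // [ 'Seven', 'brave', 'birds' ]

// Without boundaries: the first 5 letters of "overhead" also match.
const anyFiveLetters = wordText.match(/[a-zA-Z]{5}/g);
console.log(anyFiveLetters); // [ 'Seven', 'brave', 'birds', 'overh' ]
```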

Back-references

Back-references are a way for the regex to call on its own subgroups from within the regex itself.

Back-references are written with a backslash (e.g. \1, \2, \3), as opposed to the dollar sign (e.g. $1, $2, $3) that is used when subgroups are referenced outside of the regex.

Back-references can be used to identify accidental double-ups of words. The below example will search for words that appear twice in a row, separated by a whitespace character

(\b[a-zA-Z]+\b)\s\1
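The sketch below tries the back-reference in JavaScript on a made-up sentence. It also shows a stricter variant with a trailing \b; that variant is a suggested adjustment and not part of the original example.

```javascript
// Hypothetical sentence with two accidental double-ups.
const typoText = "This is is a test of the the regex";

const doubledWord = /(\b[a-zA-Z]+\b)\s\1/g;
const doubles = typoText.match(doubledWord);
console.log(doubles); // [ 'is is', 'the the' ]

// Without a trailing boundary, the start of a longer word can count as a repeat.
const looseResult = "the then".match(doubledWord);
console.log(looseResult); // [ 'the the' ]

// Suggested variant: a trailing \b requires the repeat to be a whole word.
const strictResult = "the then".match(/\b([a-zA-Z]+)\s\1\b/g);
console.log(strictResult); // null
```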

See also, Grouping and Capturing

Look-ahead and Look-behind

Look-ahead and look-behind are ways to accept or reject a potential match depending on the characters immediately after or before it, respectively. The characters checked by the look-ahead or look-behind are not included in the match itself.

Example - Lookahead, the below example will only match digits that have an exclamation mark (!) after them.

\d(?=!)

Example - Lookbehind, the below example will only match digits that have a dollar sign ($) before them.

(?<=\$)\d
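Both examples can be tried in JavaScript as sketched below, assuming a runtime that supports look-behind (recent versions of Node.js and current browsers do). The sample sentence is invented for illustration.

```javascript
// Hypothetical text with three digits in different contexts.
const promoText = "Top 5! Only $9 left, 7 days";

// Look-ahead: a digit followed by '!'. The '!' is not part of the match.
const beforeBang = promoText.match(/\d(?=!)/g);
console.log(beforeBang); // [ '5' ]

// Look-behind: a digit preceded by '$'. The '$' is not part of the match.
const afterDollar = promoText.match(/(?<=\$)\d/g);
console.log(afterDollar); // [ '9' ]
```

The digit 7 is matched by neither pattern, since it has no exclamation mark after it and no dollar sign before it.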

For more detailed information, please visit this site: https://javascript.info/regexp-lookahead-lookbehind

Author

This Regex Tutorial was written by Huber. Please let me know if it has been helpful to you!

Visit my GitHub profile here: https://github.com/hybee234
