readable regex

Most regular expression engines have a flag called the ignore-whitespace or free-spacing flag, usually written x. Ruby, PHP, Elixir, and Java all support it, among others -- with Javascript being the outlier here, as usual.

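What it does is make the engine ignore whitespace in the pattern itself: spaces, tabs, and newlines you type into the regex are skipped unless you escape them. In most engines it also treats everything from a # to the end of the line as a comment.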
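
As a minimal sketch of that behavior, here is the same idea in Python, whose re.VERBOSE flag is its version of the free-spacing flag. Python is only my choice for a quick runnable illustration; the pattern and the sample string are made up for this example.

```python
import re

# The same pattern written twice: once compact, once spread out under re.VERBOSE.
compact = re.compile(r"(\d+)-(\d+)")
spaced = re.compile(
    r"""
    (\d+)   # digits before the dash
    -       # a literal dash
    (\d+)   # digits after the dash
    """,
    re.VERBOSE,
)

sample = "room 12-34"  # hypothetical input
compact_groups = compact.search(sample).groups()
spaced_groups = spaced.search(sample).groups()
print(compact_groups == spaced_groups)
```

Both patterns compile to the same matcher; the newlines, indentation, and comments in the second one exist only for the reader.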

So say we wanted to parse some HTML with regular expressions, like this:

<table>
    <thead>
        <tr>
            <th>Id</th>
            <th>Name</th>
            <th>Birthday</th>
        </tr>
    </thead>

    <tbody>
        <tr>
            <td>100</td>
            <td>Harry James Potter</td>
            <td>July 31</td>
        </tr>
        <tr>
            <td>200</td>
            <td>Ron Weasley</td>
            <td>January 12</td>
        </tr>
        <tr>
            <td>201</td>
            <td>Hermione Granger</td>
            <td>July 5</td>
        </tr>
    </tbody>
</table>

For the sake of example, let's pretend we want the id, last name, and birth day sans the month of students born in July, and for some truly horrifying reason we only have the HTML table and can't access the database the info must've come from.

# before the whitespace flag....
~r/<td>(\d+)<\/td>\s*<td>.+\b([a-zA-Z\-]+(?=<\/td>))<\/td>\s*<td>July\ (\d+)<\/td>/

Totally readable, right? No, not really, and I REALLY like regular expressions. And hopefully we just KNOW how to count capture groups. Also hopefully we can read escape slashes when they're all clustered, huh? Also hopefully we never need to look in other tags and if we do boy I hope we remember to close all of them, and remember to change every single instance of td in there....

Here's the same regex, but with named groups -- defined at the outset, like variables -- and comments. Each definition is followed by {0}, which means "match this zero times": the group gets declared without consuming anything, and \g<name> calls it later. (PCRE, and only PHP that I know of, has a variety of verbs you can use as well, including a particularly handy DEFINE directive inside the regex that lets you declare subroutines, etc. But you can fake it like below.)

The (?x) at the start is an inline modifier for the whitespace flag. I like to put it there instead of adding the flag at the end, because odds are high I want it whenever I am writing a multiline regex, and in that case my odds of forgetting the flag are SUPER high.

And because syntax highlighting is best: https://regex101.com/r/gj5oXl/2

~r{(?x)

(?<tag_to_look_inside>    td   ){0}
(?<open>   <  \g<tag_to_look_inside>  >   ){0}
(?<close>  </ \g<tag_to_look_inside>  >   ){0}

\g<open>
  (?<id_number>
     \d+
  )
\g<close>
\s* # consumes any possible whitespace

\g<open>
  .+   # match anything after the opening tag
  \b   # and a word boundary

  # By having the lookahead after the <last_name> group, we tell the engine
  #    that we want it to grab a set of letters with possibly a hyphen which
  #    occurs immediately before a <close>.
  
  (?<last_name>
    [a-zA-Z\-]+ # grab letters or hyphen(s)
  )
  
  # this is the lookahead
  (?=\g<close>)
\g<close>
\s*

\g<open>
  July\s   # <- since we are ignoring whitespace now, it needs to be explicit when we ARE matching it
  (?<numeric_day_of_birth>
    \d+
  )
\g<close>
}
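
To see what the named groups buy you when you actually run the thing, here is a rough Python sketch of the same pattern against the table body from earlier. Two assumptions to flag: Python's built-in re module spells named groups (?P<name>...), and it has no \g<name> subroutine calls, so the <td> and </td> are written out by hand instead of being defined once. The variable names html, july_student, and rows are mine.

```python
import re

html = """
<tbody>
    <tr>
        <td>100</td>
        <td>Harry James Potter</td>
        <td>July 31</td>
    </tr>
    <tr>
        <td>200</td>
        <td>Ron Weasley</td>
        <td>January 12</td>
    </tr>
    <tr>
        <td>201</td>
        <td>Hermione Granger</td>
        <td>July 5</td>
    </tr>
</tbody>
"""

july_student = re.compile(
    r"""(?x)                      # inline free-spacing flag, as in the pattern above
    <td> (?P<id_number> \d+ ) </td>
    \s*
    <td>
        .+ \b                     # anything, then a word boundary...
        (?P<last_name> [a-zA-Z\-]+ )
        (?= </td> )               # ...where the name sits right before the closing tag
    </td>
    \s*
    <td> July \s (?P<numeric_day_of_birth> \d+ ) </td>
    """
)

# Each match becomes a dict keyed by group name, so nobody has to count parentheses.
rows = [m.groupdict() for m in july_student.finditer(html)]
for row in rows:
    print(row["id_number"], row["last_name"], row["numeric_day_of_birth"])
```

The January row is skipped because the third cell has to start with July, and the results come back by name rather than by position.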

There are a few things to know when you use the x flag:

You need to explicitly spell out your whitespace. i.e., ~r{orange apple}x would match "orangeapple" but not "orange apple" -- for "orange apple", you would want ~r{orange\sapple}x, which is pretty unreadable until you remember the point is to ignore whitespace... so add some! Then you get ~r{orange \s apple}x.
Some syntactic tokens can't have whitespace inside them. e.g., orange (?= \s apple) is fine, but orange (? = \s apple) won't compile. A lookahead is (?= whatever you want to be ahead here) -- so the initial (?= has to stay one unit. The same is true for lookbehinds, conditionals, etc.
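
Both rules are easy to check for yourself. The sketch below uses Python's re.VERBOSE as a stand-in for the x flag (my substitution, not the ~r sigil used above); the variable names are only for illustration.

```python
import re

# Unescaped whitespace in the pattern is dropped, so this is really "orangeapple".
no_space = re.compile(r"orange apple", re.VERBOSE)
# An explicit \s, with padding around it for readability, matches the space again.
with_space = re.compile(r"orange \s apple", re.VERBOSE)

matches_joined = no_space.fullmatch("orangeapple") is not None
matches_spaced = no_space.fullmatch("orange apple") is not None
explicit_spaced = with_space.fullmatch("orange apple") is not None

# Whitespace after the "(?=" token is fine...
lookahead = re.compile(r"orange (?= \s apple)", re.VERBOSE)
lookahead_matches = lookahead.match("orange apple") is not None

# ...but whitespace inside the token itself is rejected at compile time.
try:
    re.compile(r"orange (? = \s apple)", re.VERBOSE)
    split_token_compiles = True
except re.error:
    split_token_compiles = False

print(matches_joined, matches_spaced, explicit_spaced,
      lookahead_matches, split_token_compiles)
```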

aleph-naught2tog/regex_whitespace.md
