Blog 2019/9/1
<- previous | index | next ->
updated 2025/4/9 to add a Python demo
When implementing a regex-based lexer / tokenizer, coming up with a regex which matches string literals can be a bit tricky.
Every time I do this, it has been long enough since my previous attempt that I've forgotten the particulars. So this is a note to my future self.
Note: it can be tricky to find the right phrase to put into Google to find good resources for this. Searching for "string literal regex" seems to work well.
You'll probably start with this:
"[^"]*"
A string is:
- the opening quote
- zero or more of:
  - any character other than a quote
- the closing quote
This works for simple strings, like "I said hello to the baker."
However, it breaks for strings which contain escaped quotes,
for example "I said \"Hello!\" to the baker."
This would match two strings:
"I said \"
" to the baker."
Here's an approach which seems to work:
"([^"\\]|\\.)*"
A string is:
- the opening quote
- zero or more of:
  - either:
    - any character other than a quote or backslash
    - a backslash followed by any character
- the closing quote
One way to think of this is that we disallow backslashes unless they are followed by another character.
So, "\"
is not a valid string, while "\\"
, "\t"
and "\""
are valid strings.
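We can verify both the valid and the invalid cases from Python:

```python
import re

# Escape-aware version: a backslash must be followed by another character.
escape_aware = r'"([^"\\]|\\.)*"'

source = r'char* s = "I said \"Hello!\" to the baker.";'
print(re.search(escape_aware, source).group(0))
# "I said \"Hello!\" to the baker."

# A lone trailing backslash cannot form a valid string:
print(re.search(escape_aware, r'"\"'))
# None
```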
But there is one last gotcha: the . doesn't mean "any character",
it actually means "any character other than a newline",
so this regex won't match a string which contains an escaped newline
(a backslash at the end of a line, which is how multi-line string literals are written in C).
The fix is to replace the . ("any character other than a newline")
with [\s\S] ("either a whitespace or non-whitespace character"):
"([^"\\]|\\[\s\S])*"
Let's write a Python program which will break a file up into string and non-string chunks, and print the chunks one per line.
That is, given the file input.c:
char* str = "I said \"hello\" to the cat."; // comment
it should print:
char* str =
"I said \"hello\" to the cat."
; // comment
We can use Python's re.split(), but its default behavior is to exclude the delimiters from the result,
which means we'd end up with:
char* str =
; // comment
However, if the regex uses a capture group, re.split()
will include the captured portion of the delimiters.
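A quick demonstration of the difference, on a one-line input:

```python
import re

text = 'char* str = "hi"; // comment'

# Without a capture group, the matched strings are dropped:
print(re.split(r'"(?:[^"\\]|\\[\s\S])*"', text))
# ['char* str = ', '; // comment']

# With a capture group, the matched strings are kept in the result:
print(re.split(r'("(?:[^"\\]|\\[\s\S])*")', text))
# ['char* str = ', '"hi"', '; // comment']
```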
So we need to make the following changes:
- modify the existing parentheses to indicate they are not a capture group, by using (?:
"(?:[^\"\\]|\\[\s\S])*"
- wrap the entire regex in another set of parentheses to make the whole thing a capture group:
("(?:[^\"\\]|\\[\s\S])*")
- represent it as a Python raw string literal:
r'("(?:[^\"\\]|\\[\s\S])*")'
Here's our demo script, chunkify.py:
#!/usr/bin/env python3
import sys
import re

def string_chunkify(text):
    "break 'text' into string and non-string chunks"
    str_regex = r'("(?:[^\"\\]|\\[\s\S])*")'
    return re.split(str_regex, text, flags=re.MULTILINE)

if __name__ == "__main__":
    with open(sys.argv[-1]) as fd:
        text = fd.read()
    for chunk in string_chunkify(text):
        print(chunk)
And here it is in action:
$ ./chunkify.py input.c
char* str =
"I said \"hello\" to the cat."
; // comment