@cellularmitosis
Last active April 10, 2025 00:22
Matching a string literal using regex

Blog 2019/9/1

updated 2025/4/9 to add a Python demo

When implementing a regex-based lexer / tokenizer, coming up with a regex which matches string literals can be a bit tricky.

Every time I do this, it has been long enough since my previous attempt that I've forgotten the particulars. So this is a note to my future self.

Note: it can be tricky to find the right phrase to put into Google to find good resources for this. Searching for "string literal regex" seems to work well.

The naive string matcher

You'll probably start with this:

"[^"]*"

A string is:

  • the opening quote
  • zero or more of:
    • any character other than a quote
  • the closing quote

This works for simple strings, like "I said hello to the baker.".

However, it breaks for strings which contain other strings, for example "I said \"Hello!\" to the baker.". This would match two strings:

  • "I said \"
  • " to the baker."
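
To see the failure concretely, here is a small sketch using Python's re.findall(). The variable names are mine, chosen only for illustration:

```python
import re

# The naive matcher: a quote, any run of non-quote characters, a quote.
naive = r'"[^"]*"'

# Raw strings, so the backslashes below are literal characters in the text.
simple = r'"I said hello to the baker."'
nested = r'"I said \"Hello!\" to the baker."'

simple_matches = re.findall(naive, simple)
nested_matches = re.findall(naive, nested)

print(simple_matches)  # one match: the whole string
print(nested_matches)  # two matches: '"I said \"' and '" to the baker."'
```

The escaped quotes end the match early because the naive regex has no notion of a backslash.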

A matcher which handles embedded strings

Here's an approach which seems to work:

"([^"\\]|\\.)*"

A string is:

  • the opening quote
  • zero or more of:
    • either:
      • any character other than a quote or backslash
      • a backslash followed by any character
  • the closing quote

One way to think of this is that we disallow backslashes unless they are followed by another character. So, "\" is not a valid string, while "\\", "\t" and "\"" are valid strings.
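
A quick way to check those claims, sketched with Python's re.fullmatch() (the test harness and names are my own, not part of the original note):

```python
import re

# The escape-aware matcher from above.
escaped = r'"([^"\\]|\\.)*"'

# Each candidate is raw source text, with its surrounding quotes included.
candidates = [r'"\"', r'"\\"', r'"\t"', r'"\""']
results = {c: re.fullmatch(escaped, c) is not None for c in candidates}

for candidate, ok in results.items():
    print(candidate, "valid" if ok else "invalid")

# The embedded-quote example is now matched as a single string.
nested = r'"I said \"Hello!\" to the baker."'
whole = re.search(escaped, nested).group(0)
print(whole == nested)  # True
```

Only the first candidate, a lone backslash between quotes, is rejected: the backslash consumes the closing quote as its escaped character, leaving the string unterminated.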

The final matcher

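But there is one last gotcha: the . doesn't mean "any character", it actually means "any character other than a newline". The [^"\\] branch already matches newlines, but the \\. branch does not, so this regex fails on multi-line strings in which a backslash is immediately followed by a newline (a line continuation).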

The fix is to replace . ("any character other than a newline") with [\s\S] ("either a whitespace or non-whitespace character"):

"([^"\\]|\\[\s\S])*"

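Here is a small comparison of the two versions, as a sketch with test strings of my own. Note that the [^"\\] branch matches newlines in both regexes, so they only disagree when a newline directly follows a backslash:

```python
import re

with_dot = r'"([^"\\]|\\.)*"'
final = r'"([^"\\]|\\[\s\S])*"'

# A two-line string with a plain newline in the middle.
plain_newline = '"line one\nline two"'
# A two-line string where the newline is preceded by a backslash.
continued = '"line one\\\nline two"'

def matches(regex, text):
    """Return True if 'regex' matches all of 'text'."""
    return re.fullmatch(regex, text) is not None

print(matches(with_dot, plain_newline))  # True
print(matches(final, plain_newline))     # True
print(matches(with_dot, continued))      # False
print(matches(final, continued))         # True
```

In Python, passing re.DOTALL would also make . match newlines, but [\s\S] has the advantage of working without any flags.
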
Python demo

Let's write a Python program which will break a file up into string and non-string chunks, and print the chunks one per line.

That is, given the file input.c:

char* str = "I said \"hello\" to the cat."; // comment

it should print:

char* str = 
"I said \"hello\" to the cat."
; // comment

We can use Python's re.split(), but its default behavior is to exclude the delimiters from the result, which means we'd end up with:

char* str = 
; // comment

However, if the regex uses a capture group, re.split() will include the captured portion of the delimiters.
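
The difference is easiest to see with a trivial delimiter. This is just an illustration using a comma, not the string regex itself:

```python
import re

text = 'a,b,c'

# Without a capture group, the delimiters are dropped.
without_group = re.split(r',', text)
# With a capture group, each captured delimiter appears in the result.
with_group = re.split(r'(,)', text)

print(without_group)  # ['a', 'b', 'c']
print(with_group)     # ['a', ',', 'b', ',', 'c']
```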

So we need to make the following changes:

  • modify the existing parentheses to indicate they are not a capture group, by using (?:
    • "(?:[^\"\\]|\\[\s\S])*"
  • wrap the entire regex in another set of parentheses to make the whole thing a capture group
    • ("(?:[^\"\\]|\\[\s\S])*")
  • represent it as a Python raw string literal:
    • r'("(?:[^\"\\]|\\[\s\S])*")'

Here's our demo script, chunkify.py:

#!/usr/bin/env python3

import sys
import re

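def string_chunkify(text):
    """Break 'text' into string and non-string chunks."""
    # The outer group is capturing, so re.split() keeps the matched strings.
    str_regex = r'("(?:[^\"\\]|\\[\s\S])*")'
    # No flags needed: re.MULTILINE only affects ^ and $, which aren't used here.
    return re.split(str_regex, text)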

if __name__ == "__main__":
    with open(sys.argv[-1]) as fd:
        text = fd.read()
    for chunk in string_chunkify(text):
        print(chunk)

And here it is in action:

$ ./chunkify.py input.c
char* str = 
"I said \"hello\" to the cat."
; // comment
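
One thing to watch for: re.split() returns empty strings when a match sits at the very start or end of the text, or when two string literals are adjacent, and the demo script would print those as blank lines. A possible tweak, sketched here with a hypothetical helper named nonempty_chunks (not part of the original script):

```python
import re

# Same regex as in chunkify.py.
STR_REGEX = r'("(?:[^\"\\]|\\[\s\S])*")'

def nonempty_chunks(text):
    """Hypothetical variant of string_chunkify() that drops empty chunks."""
    return [chunk for chunk in re.split(STR_REGEX, text) if chunk]

sample = '"a""b"'  # two adjacent string literals, nothing else

raw = re.split(STR_REGEX, sample)
filtered = nonempty_chunks(sample)

print(raw)       # ['', '"a"', '', '"b"', '']
print(filtered)  # ['"a"', '"b"']
```

Whether to drop the empty chunks depends on the use: keeping them preserves the strict alternation of non-string and string chunks, which some callers may rely on.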

Thanks

Thanks to:

ghost commented Dec 8, 2022

Thank you for this!! Saved me so much time!

@janligudzinski
Very cool, thanks! Comes in useful for all sorts of source code processing - I was trying to deobfuscate some code when I ran into this problem.

@Masihtabaei
Thanks so much! 😊
Parsing string literals is one of those things that seem deceptively simple—but once you dive in, it turns out to be surprisingly tricky.

@cellularmitosis
Author

@janligudzinski @Masihtabaei I updated the gist to include a working Python demo :)