Blog 2019/9/1
<- previous | index | next ->
updated 2025/4/9 to add a Python demo
When implementing a regex-based lexer / tokenizer, coming up with a regex which matches string literals can be a bit tricky.
Every time I do this, it has been long enough since my previous attempt that I've forgotten the particulars. So this is a note to my future self.
Note: it can be tricky to find the right phrase to put into Google to find good resources for this. Searching for "string literal regex" seems to work well.
You'll probably start with this:
"[^"]*"
A string is:
- the opening quote
- zero or more of:
  - any character other than a quote
- the closing quote
This works for simple strings, like "I said hello to the baker."
However, it breaks for strings which contain escaped quotes,
for example "I said \"Hello!\" to the baker."
This would match two strings:
"I said \"
" to the baker."
Here's an approach which seems to work:
"([^"\\]|\\.)*"
A string is:
- the opening quote
- zero or more of:
  - either:
    - any character other than a quote or backslash
    - a backslash followed by any character
- the closing quote
One way to think of this is that we disallow backslashes unless they are followed by another character.
So, "\"
is not a valid string, while "\\"
, "\t"
and "\""
are valid strings.
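We can verify both the valid and the invalid cases from Python:

```python
import re

# Escape-aware version: a backslash must be followed by another character.
escape_aware = r'"([^"\\]|\\.)*"'

source = r'char* s = "I said \"Hello!\" to the baker.";'
print(re.search(escape_aware, source).group(0))
# "I said \"Hello!\" to the baker."

# A lone trailing backslash cannot form a valid string:
print(re.search(escape_aware, r'"\"'))
# None
```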
But there is one last gotcha: the . doesn't mean "any character",
it actually means "any character other than a newline",
so this regex won't match a string which contains an escaped newline
(a backslash at the end of a line, which is how multi-line string literals are written in C).
The fix is to replace the . ("any character other than a newline")
with [\s\S] ("either a whitespace or non-whitespace character"):
"([^"\\]|\\[\s\S])*"
Let's write a Python program which will break a file up into string and non-string chunks, and print the chunks one per line.
That is, given the file input.c:
char* str = "I said \"hello\" to the cat."; // comment
it should print:
char* str =
"I said \"hello\" to the cat."
; // comment
We can use Python's re.split(), but its default behavior is to exclude the delimiters from the result,
which means we'd end up with:
char* str =
; // comment
However, if the regex uses a capture group, re.split()
will include the captured portion of the delimiters.
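A quick demonstration of the difference, on a one-line input:

```python
import re

text = 'char* str = "hi"; // comment'

# Without a capture group, the matched strings are dropped:
print(re.split(r'"(?:[^"\\]|\\[\s\S])*"', text))
# ['char* str = ', '; // comment']

# With a capture group, the matched strings are kept in the result:
print(re.split(r'("(?:[^"\\]|\\[\s\S])*")', text))
# ['char* str = ', '"hi"', '; // comment']
```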
So we need to make the following changes:
- modify the existing parentheses to indicate they are not a capture group, by using (?:
"(?:[^\"\\]|\\[\s\S])*"
- wrap the entire regex in another set of parentheses to make the whole thing a capture group:
("(?:[^\"\\]|\\[\s\S])*")
- represent it as a Python raw string literal:
r'("(?:[^\"\\]|\\[\s\S])*")'
Here's our demo script, chunkify.py:
#!/usr/bin/env python3
import sys
import re

def string_chunkify(text):
    "break 'text' into string and non-string chunks"
    str_regex = r'("(?:[^\"\\]|\\[\s\S])*")'
    return re.split(str_regex, text, flags=re.MULTILINE)

if __name__ == "__main__":
    with open(sys.argv[-1]) as fd:
        text = fd.read()
    for chunk in string_chunkify(text):
        print(chunk)
And here it is in action:
$ ./chunkify.py input.c
char* str =
"I said \"hello\" to the cat."
; // comment