@0x5742
Created June 16, 2015 19:18
lightweight regex lexer
import re

# One named group per token type. Alternatives are tried left to right, so the
# catch-all `invalid` group must stay last.
tokenizer = re.compile(r"""
    (?P<string> " (?: [^\\"] | \\. )* " ) |
    (?P<number> 0[Xx][0-9A-Fa-f]+ | [0-9]+ ) |
    (?P<comment> \# .*? (?= \n | $ ) ) |
    (?P<ident> [A-Za-z_] [A-Za-z0-9_]* ) |
    (?P<paren> [()] ) |
    (?P<brace> [{}] ) |
    (?P<bracket> [][] ) |
    (?P<eol> ; ) |
    (?P<operator> [-+*/!=<>^&|.,~?:]+ ) |
    (?P<whitespace> [ \t\n]+ ) |
    (?P<invalid> . )
""", re.VERBOSE | re.MULTILINE)

def lex(s):
    """Yield (kind, text) tokens for s, ending with ('eof', None)."""
    for m in tokenizer.finditer(s):
        # Exactly one named group matches per token; m.lastgroup is its name.
        k = m.lastgroup
        v = m.group(k)
        if k == 'invalid':
            raise SyntaxError("invalid data at position %d" % m.start())
        elif k not in ('whitespace', 'comment'):
            yield (k, v)
    yield ('eof', None)  # makes writing parsers a little bit easier
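
The trailing `('eof', None)` token lets a parser always look at a "current" token without checking for the end of input. The sketch below is one possible consumer, not part of the gist: `parse_call`, `to_int`, and the tiny `name(number, ...)` grammar are hypothetical, and the token list is written by hand in the shape `lex` yields for the input `f(1, 0x2A)`.

```python
def to_int(text):
    # Hypothetical helper: the lexer's number tokens are hex (0x...) or decimal.
    return int(text, 16) if text[:2].lower() == '0x' else int(text, 10)


def parse_call(tokens):
    """Parse `name(number, number, ...)` from (kind, text) tokens ending in eof."""
    it = iter(tokens)
    tok = next(it)

    def advance():
        nonlocal tok
        prev = tok
        # Never read past the eof sentinel, so next() cannot raise StopIteration.
        if tok[0] != 'eof':
            tok = next(it)
        return prev

    def expect(kind, value=None):
        if tok[0] != kind or (value is not None and tok[1] != value):
            raise SyntaxError("expected %s %r, got %r" % (kind, value, tok))
        return advance()

    name = expect('ident')[1]
    expect('paren', '(')
    args = []
    if tok != ('paren', ')'):
        args.append(to_int(expect('number')[1]))
        while tok == ('operator', ','):
            advance()
            args.append(to_int(expect('number')[1]))
    expect('paren', ')')
    expect('eof')
    return name, args


# Tokens in the shape lex("f(1, 0x2A)") yields.
tokens = [
    ('ident', 'f'), ('paren', '('), ('number', '1'), ('operator', ','),
    ('number', '0x2A'), ('paren', ')'), ('eof', None),
]
print(parse_call(tokens))  # ('f', [1, 42])
```

With truncated input such as `f(1`, the parser reaches the `eof` token while expecting `)` and raises `SyntaxError`, rather than running off the end of the token stream.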