@vitkarpov
Created July 6, 2018 19:16
How do I make a tokenizer faster?

Hello software engineers and computer scientists! :)

I'm working on an open-source JavaScript project which has a CSS parser as one of its modules. Node's profiler shows me that the critical part is the regexp used to find the next word boundary.

const RE_WORD_END = /[ \n\t\r\f\(\)\{\}:;@!'"\\\]\[#]|\/(?=\*)/g;

// pos - current boundary
// next - next boundary which the code has to find

RE_WORD_END.lastIndex = pos + 1;
RE_WORD_END.test(css);
if ( RE_WORD_END.lastIndex === 0 ) { // no match: test() reset lastIndex to 0
    next = css.length - 1;
} else {
    next = RE_WORD_END.lastIndex - 2;
}
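
For illustration, here's the same snippet on a tiny made-up input (css = "color:red", pos = 0):

const RE_WORD_END = /[ \n\t\r\f\(\)\{\}:;@!'"\\\]\[#]|\/(?=\*)/g;
const css = 'color:red';
const pos = 0; // the "c" of "color"

RE_WORD_END.lastIndex = pos + 1;
RE_WORD_END.test(css); // matches ":" at index 5, so lastIndex becomes 6
const next = RE_WORD_END.lastIndex === 0 ? css.length - 1 : RE_WORD_END.lastIndex - 2;
// next === 4, the index of "r", the last character of the word "color"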

My first guess was that regexps are slow, so I refactored this part as follows:

const WORD_END_CHARS_CODES = [ /* sorted list of char codes which correspond to word-end symbols */ ];

for (let i = pos + 1; i < css.length; i++) {
    if (binarysearch(WORD_END_CHARS_CODES, css.charCodeAt(i))) {
        return i - 1;
    }
}
return css.length - 1;
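
(binarysearch isn't shown above; it's just a classic binary search over the sorted list, roughly:)

// Classic binary search over a sorted array of char codes;
// returns true when the code is found.
function binarysearch(sortedCodes, code) {
    let lo = 0;
    let hi = sortedCodes.length - 1;
    while (lo <= hi) {
        const mid = (lo + hi) >> 1;
        if (sortedCodes[mid] === code) return true;
        if (sortedCodes[mid] < code) lo = mid + 1;
        else hi = mid - 1;
    }
    return false;
}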

It became slower! (~1.5x slower: 6ms with the regexp vs 10ms with the binary search.)
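
In case it helps, here's roughly the kind of comparison I mean (a simplified sketch with a made-up input, not the project's actual benchmark):

// Simplified sketch: both approaches wrapped in functions and timed over the same input.
const { performance } = require('perf_hooks');

const RE_WORD_END = /[ \n\t\r\f\(\)\{\}:;@!'"\\\]\[#]|\/(?=\*)/g;
// approximate char set mirroring the regexp character class
const WORD_END_CHARS_CODES = Array.from(' \n\t\r\f(){}:;@!\'"\\][#').map(c => c.charCodeAt(0)).sort((a, b) => a - b);

function nextViaRegexp(css, pos) {
    RE_WORD_END.lastIndex = pos + 1;
    RE_WORD_END.test(css);
    return RE_WORD_END.lastIndex === 0 ? css.length - 1 : RE_WORD_END.lastIndex - 2;
}

function nextViaBinarySearch(css, pos) {
    // binarysearch is the helper sketched above
    for (let i = pos + 1; i < css.length; i++) {
        if (binarysearch(WORD_END_CHARS_CODES, css.charCodeAt(i))) return i - 1;
    }
    return css.length - 1;
}

const css = '-webkit-box-sizing'.repeat(3) + ';'; // a longish word ending in a boundary
function bench(name, fn) {
    const start = performance.now();
    for (let i = 0; i < 1e6; i++) fn(css, 0);
    console.log(name, (performance.now() - start).toFixed(1) + 'ms');
}

bench('regexp', nextViaRegexp);
bench('binary search', nextViaBinarySearch);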

Calling all parser ninjas: what approach should I try?

P.S. The project is postcss, in case you're wondering.
