I've been working on a parser for a Haskell-like syntax called Duet. In the implementation I've taken particular care to make an awesome tokenizer and parser that is super helpful to learners.
Jasper Van der Jeugt made a talk about producing good error messages recently, which coincides nicely with my parallel work on this. So I thought I'd also share my results. I don't have time to document in detail my approach, as I'm using that spare time to work on this project, but I can show off some examples.
I'll list some example inputs and parse errors, and compare with GHC's parsing of Haskell:
x =
unexpected end of input
expecting ‘case’, lambda expression (e.g. \x -> x), if expression
(e.g. ‘if p then x else y’), infix expression (e.g. x * y), function
expression, variable (e.g. ‘foo’, ‘id’, etc.), character (e.g. 'a'),
string (e.g. "a"), integer (e.g. 42, 123), decimal (e.g. 42, 123) or
constructor (e.g. Just)
GHC: parse error (possibly incorrect indentation or mismatched brackets)
The above demonstrates the standard Parsec "unexpected/expected" error message, that builds up a set of expectations based on your parser.
x = if
unexpected end of input
expecting condition expresion of if-expression
GHC: parse error (possibly incorrect indentation or mismatched brackets)
This demonstrates specific expectations to sub-parts of an expression.
x = f (x * )
unexpected closing parenthesis ‘)’
expecting right-hand side of ‘*’ operator
This demonstrates customizing the error message based on what the user
has input (the *
operator). Duet doesn't have sections.
x = f 'a b c
unexpected character: you wrote
'a b c…
but only one character is allowed inside single quotes, like this:
'a'
Perhaps you forgot to put the closing single quote?
You may also have meant to use double quotes for text, e.g.
"a b c"
GHC: Syntax error on 'a
This is particularly detailed; it shows what was written, what is expected, and what the user might have meant.
x = f (if x
then undefined
)
unexpected closing parenthesis ‘)’
expecting function arguments, infix operator or ‘else’ keyword for
if-expression
GHC: parse error on input ‘)’
if/else
is a tricky thing to parse and retain decent error messages,
but the above holds up well.
x = f (x*)
unexpected operator ‘*’, there should be spaces before and after operators.
In Duet, all operators must have a space before and afterwards. This
lets us write 10 - -5
without issue.
x = f (x * -5 + 2)
unexpected operator ‘+’. When more than one operator is used
in the same expression, use parentheses, like this:
(x * -5) + ...
Or like this:
x * (-5 + ...)
GHC:
Precedence parsing error
cannot mix ‘*’ [infixl 7] and prefix `-' [infixl 6] in the same infix expression
(This is one of GHC's good parser messages).
In Duet, there is no operator precedence. All operator combinations must be parenthesized. This makes it easier for learners to read code, and easier to edit. It makes the parser simple, but that's not a major leg-up.
In this error message, I use the inputed code to suggest two alternative ways to make their code correct.
x = f 42. x
unexpected " "
expecting decimal component, e.g. 42.0
This is a tokenization message, so is this one:
x = k -k
unexpected "k"
expecting number (e.g. 123) or space after operator ‘-’
This one is particularly good, because it tells you two ways to correct your program.
There are a few key points in achiving a decent parser like this:
- Tokenize before you parse.
- Avoid
try
(also mentioned in Jasper's post, and before that Brent Yorgey's). I usedtry
in my tokenizer on single strings e.g.then
so that if you call a variablethe
, the keyword doesn't greedily try to take it. The parser itself doesn't usetry
at all. - Describe everything you're parsing. That is, wrap everything in
(<?>)
. - Make custom parse conditions to handle common user errors. Parsec gives you a large degree of freedom to do this.
- Test BAD CODE on your tokenizer/parser. Don't just feed correct code into your parser while working. Give it garbage or "nearly right" things and use the feedback you get to guide your writing of the parser.
I think you have to do more than describe the grammar to have a good parser, it should anticipate the user's mistakes.
@FranklinChen imo, relevant error message are too hard for machine learning. a long list of heuristics (built on "tree regexes" or something) would go a long way.