elm-format parsing overhaul doccumentation

Background

Historically the Elm compiler implemented it's parsers using the libraries parsec and indents (in this document parsec is used to refer to both of these libraries), since elm-format is based on the Elm compiler it has inherited this as well. However, in Elm 0.19 the parsing codebase was overhauled, the Elm compiler got it's own parsing primitives and all of the parsers where rewritten to use these primitives instead of parsec. To integrate these changes into elm-format is non trivial and elm-format has continued using parsec.

This situation is not ideal, there is great value in having the parsing logic for the Elm compiler and elm-format be as similar as possible from a maintainability standpoint. Moreover, the primitives in the Elm compiler are more elegant and performant in the context of parsing Elm code. For the long term elm-format needs to use the primitives and build upon the parsers in the Elm compiler!

The parsec-like adapter module

A parsec-like adapter module has been made for elm-format that can be found in elm-format-lib/src/Parse/ParsecAdapter.hs. The API of this module is very similar to that of parsec, but the functions are actually implemented using parsing primitives from the Elm compiler. In this way it has been possible to completely remove the dependency on parsec while making minimal changes to existing parsers in elm-format.

The parsec-like API is similar to parsec to the extent that if you'd like documentation on a function found in the parsec adapter your best bet is to look for the equivalent function in the parsec docs.

The parsec-like API also uses the same parser type used in the parsing primitives from the Elm compiler, so using Elm compiler-style parsers inside parsec-like ones and vice versa is relatively seamless. There is a slight incompatibility in the error types though, parsec-style is limited to one specific error type whereas Elm compiler-style uses multiple different error types. How this is dealt with is detailed in the Adapting the errors section.

At the end of the day, you shouldn't really use or have to pay this module much mind.

Adapting Elm compiler parsers for use in elm-format

For a concrete example of how a parser from the Elm compiler has been adapted I'd recommend looking at this PR that adapts the number parser.

Bellow the process of adapting parsers is described. Note that this only describes how things have been done which can be useful to know, but it's not final in any way, feel free to explore other approaches!

Choosing an appropriate parser

The parsec-like adapter API makes it possible incrementally replace parsec-style parsers with ones from the Elm compiler, still care needs to be taken to adapt parsers in a order that makes sense. For example, it makes sense to replace elm-formats number parser with an adapted number parser from the Elm compiler. The literals parser would then be changed to use this new parser. The opposite does not make sense, to adapt a literal parser from the Elm compiler but use the old char, string and number parsers inside of it. It's in other words best to start with "leaf nodes" if that's an analogy that makes sense. In some cases there isn't even a one-to-one mapping, for example the Elm compiler actually lacks a parser for literals altogether.

Fetching the dependencies

When you've identified a parser you'd like to adapt you need to copy it and any code the it depends on from the Elm compiler. To deal with the dependencies depending on other modules and so on we've taken the approach of commenting out any declaration not needed by the parser we are adapting. The rationale being that in the future we might needs these commented out declaration.

Adapting the parsing logic

There are some things that elm-format parses that the Elm compiler does not. This elm-format specific logic will have to be added to parsers from the Elm compiler. Some additional things that elm-format parses:

Comments. elm-format has way more sophisticated logic for parsing comments than the Elm compiler. It probably won't make sense to adapt any of the comment parsing logic from the Elm compiler, instead a new comment parser would need to be built from scratch.
Newlines. In some places like case statements and let declarations elm-format tracks newline information. Parse.ParsecAdapter.getState and Parse.ParsecAdapter.updateState are used to deal with newline tracking, so if you see these functions in an old elm-format parser then that's a good indication that newline tracking is being used.
Old Elm syntax and lenient parsing. elm-format can parse old versions of Elm, and in many cases it can even parse code with syntax from different versions mixed together. elm-format can also do lenient parsing, if you where to accidentally create a new record with this { foo : 1, bar : "baz" } the Elm compiler would throw a syntax error (you should use = instead of : when creating records) but elm-format handles it. If you forget to add features like this when adapting a parser from the Elm compiler elm-format's test suite will most likely capture this.

Adapting the AST

At the time of writing, the character, string and number parsers are the only parsers from the Elm compiler that have been adapted, and they don't touch the AST that much. But there are some differences we can expect:

GADTs. If you take a look at elm-format-lib/src/AST/V_016.hs and go to the definition for the AST type you will see some code that will look very archaic unless you are familiar with generalized algebraic data types in Haskell. If that's the case then I'd recommend you do some research in whatever way suits you until you get to the point where you can start making small changes to the AST.
Comments. Already discussed in the previous section
Text representation. The Elm compiler represents text (Elm string content, variable names, module names, etc.) as byte arrays whereas elm-format uses regular Strings which are more abstract.

We can also expect minor differences like variant naming and such.

Adapting the errors

The Elm compilers parsers all have their own error types representing all possible syntax errors that parser can encounter. elm-formats parsec-style parsers represent errors using a single error type that's basically a tree-like structure of strings with metadata, not at all as precise! So whenever a parsec-style parser uses an Elm compiler-style parser we need to convert the error. The leaf nodes in a parsec-style error are defined by Message.

-- elm-format-lib/src/Reporting/Error/Syntax.hs

data Message
  = Message     !Prelude.String -- raw message
  | Expect      !Prelude.String -- expecting something
  | UnExpect    !Prelude.String -- unexpected something
  | CharError     Char
  | StringError   String
  | NumberError   Number

Message, Expect and UnExpect are the "regular" parsec leaf nodes whereas CharError, StringError and NumberError are used to attach errors from Elm compiler-style parsers (note that here Char and String refer to error types, not the types from Prelude). So if you wanted to adapt the errors from the case statement parser you might add a new variant CaseError Case. You can refer to elm-format-lib/src/Parse/Literal.hs to get an idea of what this error wrapping looks like inside of a parser.

emmabastas/README.md