kat-curio.md

As someone who worked on parsers it's kind of interesting that you picked this particular version of a parsing problem b/c it involves operator precedence, which introduces a fair bit of complexity. The "canonical" solution to operator precedence is Pratt parsing, a variation of which is used in my language parsers. I also just find that algorithm neat, conceptually. But it is kind of hard to wrap your mind around it b/c of its generality.

Two things about parsers, focusing on Pratt parsers, that I have enjoyed

The latter is where I learned how to do Pratt parsing. The former is just a generally very very good blog about PL stuff.

Not super relevant in this particular example, but I think typically "tokenize" just means split into text pieces, and the conversion of things like "10" to the integer value 10 happens in parsing. That's because in more complex grammars in production languages, parsing numbers isn't super trivial. For example, in JS, the string "99." may be followed either by "3" or by "toString()" which would totally change the parse tree (and maybe even tokenization? probably depends on the implementation). Anyway, over time I've move towards keeping tokenization just dealing with strings and parsing everything in a later parse step. Just a tidbit in case it's interesting. For this problem, your choice seems probably well motivated.

A thing you didn't think about in specifying the problem: are negative numbers allowed? If so, how does "- 9" parse? Does "-9" have a wildly different parse tree than "0-9" or "0+-9"?

I think your "do all the * first then do all the +" approach is a good starting point. It's easy to implement and conceptually simple and obviously correct. It also doesn't need parsing! But it's also not very general (breaks down with parens) and more important is very slow -- It's at least O(mn) for expressions of length m with n levels of operator precedence, because you iterate through m tokens every step, n times.

There are a few different ways to think about a better solution that takes one-pass through the tokens, but they all end up being variations of pratt parsing one way or another -- they all depend on the fact that some operators "bind" more tightly than other operators.

For example, you can also think of this problem as trying to turn the "infix" format like 0 + 18 + 2 * 6 into a "postfix" format like "0 18 + 2 6 * +". There are algorithms to do this, but all of them involve iterating through each number and asking "is the operator on my right side higher precedence than on my left?" and doing something different if so.

My conceptual model for parsing infix operators with precedence is this. We iterate through for each number, and...

a * b + c * d * e           You start grouping operators left to right
(a * b) + c * d * e         B "binds" closer to the left than right because * > +
(a * b) + (c * d) * e       C "binds" closer to the right than left because * > +
(a * b) + ((c * d) * e)     (c * d) "binds" to e again, because as a whole the thing on the right "* e" "binds" closer
                            than the thing on the left "(a * b) +".
((a * b) + ((c * d) * e))   Finally you only have two terms, so they're grouped together.

Pratt parsing is just a more general formulation of this. I think the challenge in this problem is converting this intuitive idea of "binding power" of operations into code.

You didn't linger on it but you mentioned reformatting expressions from its token list or its parse tree, which is a fascinating problem to tinker on if you are so inclined. Pretty-printing is what things like Prettier do, and it's a surprisingly deeply researched problem in CS/programming languages.

As a general comment on these types of problems, I think a lot of problems become clearer to think about if you think about them in terms of the data structures you need at each stage to represent the data the right way (here, string -> list -> tree), and then think about algorithms to take you from one intermediate representation to another. In this case, the question is how to go from a list to branches of a tree that respect operator precedence.

thesephist/kat-curio.md