Streaming Parsers in Strongly Typed Languages

I've been building a number of parsers in Rust lately while studying or doing code challenges. One of my side projects that involves parsing is weechat-notifier. I set about building its parser in the way that felt most intuitive to me as someone who writes primarily JavaScript these days.

Before we get into the mistakes I made, let's talk about the protocol I'm parsing. The WeeChat relay protocol is an interesting one. It has a set of primitives and uses those to dynamically build up types, so you can parse it without any knowledge of the possible data structures. That means libraries can be built robustly enough to support future versions of WeeChat without needing updates themselves!
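To make "types carried on the wire" concrete, here is a minimal sketch of reading tagged objects. It assumes, as I understand the relay protocol, that each top-level object starts with a three-letter type tag (`chr`, `int`, `str`, ...), that integers are 4 bytes big-endian, and that a string is a 4-byte length followed by its bytes, with a length of -1 meaning NULL. The `Value` enum and the function names are hypothetical and cover only three primitives; this is not the code from weechat-notifier.

```rust
/// Cut-down value type for this sketch (hypothetical, three primitives only).
#[derive(Debug, PartialEq)]
pub enum Value {
    Char(char),
    Int(i32),
    Str(Option<String>), // None stands for a NULL string
}

/// Splits off the first `n` bytes, or returns None if the input is too short.
fn split(input: &[u8], n: usize) -> Option<(&[u8], &[u8])> {
    if input.len() < n {
        None
    } else {
        Some(input.split_at(n))
    }
}

/// Reads a 4-byte big-endian signed integer.
fn read_i32(input: &[u8]) -> Option<(i32, &[u8])> {
    let (b, rest) = split(input, 4)?;
    Some((i32::from_be_bytes([b[0], b[1], b[2], b[3]]), rest))
}

/// Reads one object: a 3-byte type tag followed by its payload.
/// Returns the value plus the unread remainder. None covers both
/// "not enough bytes" and "unknown tag"; a real streaming parser
/// would want to tell those two cases apart.
pub fn parse_object(input: &[u8]) -> Option<(Value, &[u8])> {
    let (tag, rest) = split(input, 3)?;
    match tag {
        b"chr" => {
            let (b, rest) = split(rest, 1)?;
            Some((Value::Char(b[0] as char), rest))
        }
        b"int" => {
            let (n, rest) = read_i32(rest)?;
            Some((Value::Int(n), rest))
        }
        b"str" => {
            let (len, rest) = read_i32(rest)?;
            if len < 0 {
                // Assumed NULL marker: a length of -1.
                return Some((Value::Str(None), rest));
            }
            let (bytes, rest) = split(rest, len as usize)?;
            let text = String::from_utf8(bytes.to_vec()).ok()?;
            Some((Value::Str(Some(text)), rest))
        }
        _ => None,
    }
}
```

Calling `parse_object` repeatedly on the remainder walks a buffer without knowing in advance which objects it contains, which is the property that makes the protocol forward compatible.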

There are many other protocols that carry their data structure metadata on the wire with them, but this was the first time I'd built a parser for one of them in a typed language. I set about building up an enum of the primitives and getting tests passing for them, then realized that from there the parsing was nearly done. Since the data in the structures is positional, I could simply throw it in a Vec and be done!

These are the types I came up with and have a working implementation of:

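use std::collections::HashMap; // needed by the Hdata variant below

// A whole message: its id plus the positional data that came with it.
#[derive(Debug)]
pub struct WeechatMessage {
    pub id: String,
    pub data: Vec<WeechatData>,
}

// One variant per protocol primitive, with separate variants for
// null strings and null buffers.
#[derive(PartialEq, Eq, Clone, Debug)]
pub enum WeechatData {
    Char(char),
    Int(i32),
    Long(i64),
    String(String),
    StringNull,
    Buffer(String),
    BufferNull,
    Pointer(String),
    Time(String),
    Array(Vec<WeechatData>),
    Hdata(String, Vec<WeechatData>, Vec<HashMap<String, WeechatData>>),
}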

Pretty simple types! WeechatMessage is just a simple struct, and WeechatData has the minimal set of variants needed to represent the primitives. Unfortunately, the use of multiple Vecs and a HashMap means a lot of checked access in Rust. Code using the resulting data structures was very cumbersome to write: it required a lot of double-checking against the protocol, and the type system didn't really help me at all.
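As an illustration of that checked access, here is a rough sketch of what consumer code can look like. The types are a trimmed copy of the ones above so the snippet compiles on its own, and the message layout (a name at position 0, a count at position 1) is made up for the example rather than taken from the protocol.

```rust
// Trimmed copy of the types above so this sketch is self-contained.
#[derive(PartialEq, Eq, Clone, Debug)]
pub enum WeechatData {
    Int(i32),
    String(String),
    StringNull,
    Array(Vec<WeechatData>),
}

#[derive(Debug)]
pub struct WeechatMessage {
    pub id: String,
    pub data: Vec<WeechatData>,
}

/// Hypothetical consumer: expects data[0] to be a name and data[1] a count.
/// Every position has to be checked for presence and then for its variant,
/// and nothing in the types says which positions hold what.
pub fn name_and_count(msg: &WeechatMessage) -> Option<(&str, i32)> {
    let name = match msg.data.first()? {
        WeechatData::String(s) => s.as_str(),
        _ => return None,
    };
    let count = match msg.data.get(1)? {
        WeechatData::Int(n) => *n,
        _ => return None,
    };
    Some((name, count))
}
```

With a concrete struct such as `struct Thing { name: String, count: i32 }`, both checks would happen once inside the parser and the caller would simply read fields.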

I'm sure the more experienced typed programmers are shaking their heads knowingly, or hissing at the dynamic kids on their lawns or whatever they do for fun. Honestly, this kinda killed the project for me for a couple of months. I had built up a whole parser, and now I needed to throw away much of that code and rebuild it around concrete types so users wouldn't be so burdened. It also made me sad that I'd have to give up future compatibility.

The thought came that I could have an Unknown type that uses the dynamic structure while concrete types are emitted for everything else. This idea was shattered when I realized I'd be bumping the major version every time I moved a type from Unknown into a concrete type. I didn't want to place a different, more treadmill-like burden on my users either.
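A small sketch of why that breaks users, under made-up names: `Message`, `Dynamic`, the `BufferOpened` variant, and the message names used here are all hypothetical stand-ins, not the library's API.

```rust
/// Stand-in for the dynamic message form (hypothetical).
#[derive(Debug)]
pub struct Dynamic {
    pub id: String,
    pub fields: Vec<String>,
}

/// Version 1 of a hypothetical public enum: one concrete type so far,
/// everything else falls into Unknown.
#[derive(Debug)]
pub enum Message {
    BufferOpened { name: String },
    Unknown(Dynamic),
}

/// User code written against version 1.
pub fn describe(msg: &Message) -> String {
    match msg {
        Message::BufferOpened { name } => format!("opened {}", name),
        // The user reaches into the dynamic form for a message the
        // library has no concrete type for yet.
        Message::Unknown(d) if d.id == "_nicklist" => {
            format!("nicklist with {} fields", d.fields.len())
        }
        Message::Unknown(d) => format!("ignored {}", d.id),
    }
}
```

If a later release adds a `Message::Nicklist` variant, this `match` stops being exhaustive and no longer compiles, and the `_nicklist` arm would never fire anyway because those messages no longer arrive as `Unknown`. Either way the user's code has to change, which is what forces the major version bump.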

This morning, as I sat down to my normal Saturday hacking session at my local cafe, I realized I had a better solution. Since the messages all have names, the parser could be instantiated with an optional list of message names to be parsed in the dynamic style. This means users who opt into message types that aren't fully supported yet don't get burned when I update the library.

Thinking about this more, the parser could take two lists, one concrete and one dynamic, and parse and emit only the message types specified. This also means I get to keep my dynamic parser and just build up a concrete parser alongside it, with the two sharing the lower-level parts.
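One way the two-list idea might look, as a minimal sketch: `ParserConfig`, `Mode`, and `mode_for` are hypothetical names, the message names in the usage note are placeholders, and the rule that the concrete list wins when a name appears in both is an assumption of this example.

```rust
use std::collections::HashSet;

/// How a given message should be handled (hypothetical).
#[derive(Debug, PartialEq)]
pub enum Mode {
    Concrete,
    Dynamic,
}

/// Holds the two opt-in lists of message names.
pub struct ParserConfig {
    concrete: HashSet<String>,
    dynamic: HashSet<String>,
}

impl ParserConfig {
    pub fn new(concrete: &[&str], dynamic: &[&str]) -> Self {
        ParserConfig {
            concrete: concrete.iter().map(|s| s.to_string()).collect(),
            dynamic: dynamic.iter().map(|s| s.to_string()).collect(),
        }
    }

    /// Decides how to parse a message by name. None means the caller
    /// did not ask for this message, so it is skipped entirely.
    /// If a name is in both lists, the concrete list wins (an assumption).
    pub fn mode_for(&self, name: &str) -> Option<Mode> {
        if self.concrete.contains(name) {
            Some(Mode::Concrete)
        } else if self.dynamic.contains(name) {
            Some(Mode::Dynamic)
        } else {
            None
        }
    }
}
```

A caller would build it with something like `ParserConfig::new(&["_buffer_opened"], &["_nicklist"])`. Because the dynamic list is an explicit opt-in, adding a concrete type for `_nicklist` later changes nothing for users who keep it in the dynamic list.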

Thoughts and comments are very welcome! I'm still learning to let go of habits built up from years of Python and JavaScript development and could always use pointers.

wraithan/streaming-parser.md
