I've been building a number of parsers in Rust lately while studying or doing code challenges. One of my side projects that involved parsing is weechat-notifier. I set about building the parser in the most intuitive way to me as a primarily JavaScript developer these days.
Before we get into the mistakes I made, let's talk about the protocol I'm parsing. The WeeChat relay protocol is an interesting one. It has a set of primitives and uses those to dynamically build up types. This means you can parse it without any knowledge of the possible data structures, so libraries can be built robustly enough to support future versions of WeeChat without needing an update themselves!
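To make the "types travel with the data" idea concrete, here is a rough sketch of decoding tagged values. It assumes my reading of the relay protocol docs: each object starts with a 3-byte type tag (`chr`, `int`, `str`, ...), integers are 4 bytes big-endian, and strings are a 4-byte length followed by the bytes, with a length of -1 meaning NULL. The `Value` enum and `decode_object` function are hypothetical names for illustration, not the code from weechat-notifier, and only three of the primitives are covered.

```rust
/// Hypothetical, trimmed-down value type covering three primitives.
#[derive(Debug, PartialEq)]
enum Value {
    Char(char),
    Int(i32),
    Str(Option<String>), // None models a NULL string
}

/// Read a big-endian i32 from the front of `input`.
fn read_i32(input: &[u8]) -> Option<(i32, &[u8])> {
    if input.len() < 4 {
        return None;
    }
    let (head, rest) = input.split_at(4);
    Some((i32::from_be_bytes([head[0], head[1], head[2], head[3]]), rest))
}

/// Decode one tagged object, returning the value and the remaining bytes.
/// Returns None on truncated input or a tag this sketch doesn't handle.
fn decode_object(input: &[u8]) -> Option<(Value, &[u8])> {
    let tag = input.get(..3)?;
    let rest = &input[3..];
    match tag {
        b"chr" => {
            let (&byte, rest) = rest.split_first()?;
            Some((Value::Char(byte as char), rest))
        }
        b"int" => {
            let (n, rest) = read_i32(rest)?;
            Some((Value::Int(n), rest))
        }
        b"str" => {
            let (len, rest) = read_i32(rest)?;
            if len < 0 {
                // Assumed convention: negative length means NULL.
                return Some((Value::Str(None), rest));
            }
            let len = len as usize;
            let text = std::str::from_utf8(rest.get(..len)?).ok()?;
            Some((Value::Str(Some(text.to_owned())), &rest[len..]))
        }
        _ => None,
    }
}
```

Because the tag says what comes next, the caller never has to know the shape of the message ahead of time: it can keep calling `decode_object` on the remaining bytes until they run out.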
There are many other protocols that carry their data structure metadata on the wire with them, but this was the first time I'd built a parser for one of them in a typed language. I set about building up an `enum` of the primitives, getting tests passing for them, then realizing from there the parsing was nearly done. Since the data was positional in the structures, I could very simply throw it in a `Vec` and be done!
These are the types I came up with and have a working implementation of:
```rust
use std::collections::HashMap;

#[derive(Debug)]
pub struct WeechatMessage {
    pub id: String,
    pub data: Vec<WeechatData>,
}

#[derive(PartialEq, Eq, Clone, Debug)]
pub enum WeechatData {
    Char(char),
    Int(i32),
    Long(i64),
    String(String),
    StringNull,
    Buffer(String),
    BufferNull,
    Pointer(String),
    Time(String),
    Array(Vec<WeechatData>),
    Hdata(String, Vec<WeechatData>, Vec<HashMap<String, WeechatData>>),
}
```
Pretty simple types! `WeechatMessage` is just a simple struct and `WeechatData` has the minimal set of variants to represent the primitives. Unfortunately, the use of multiple `Vec`s and a `HashMap` means a lot of checked access in Rust. Code using the resulting data structures was very cumbersome to write, requiring a lot of double-checking against the protocol, and the type system didn't really help me at all.
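Here's a small sketch of what that checked access looks like from the user's side. It repeats the `WeechatData` enum from above so it stands alone. The helper `first_short_name`, the `"short_name"` key, and my reading of the third `Hdata` field as "one map per item" are assumptions for illustration.

```rust
use std::collections::HashMap;

#[derive(PartialEq, Eq, Clone, Debug)]
pub enum WeechatData {
    Char(char),
    Int(i32),
    Long(i64),
    String(String),
    StringNull,
    Buffer(String),
    BufferNull,
    Pointer(String),
    Time(String),
    Array(Vec<WeechatData>),
    Hdata(String, Vec<WeechatData>, Vec<HashMap<String, WeechatData>>),
}

/// Hypothetical helper: dig one string field out of the first Hdata item.
/// Every step can fail, and the compiler can't tell us which shape to expect.
fn first_short_name(data: &[WeechatData]) -> Option<&str> {
    match data.first()? {
        // Is the first value even an Hdata?
        WeechatData::Hdata(_path, _pointers, items) => {
            // Is there an item, and does it have the key we hope for?
            match items.first()?.get("short_name")? {
                // Is the value under that key actually a string?
                WeechatData::String(name) => Some(name.as_str()),
                _ => None,
            }
        }
        _ => None,
    }
}
```

That's four separate "might not be there" checks to read one field, and a typo in the key still compiles fine. With a concrete struct, the same read would be a plain field access.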
I'm sure the more experienced typed programmers are shaking their heads knowingly, or hissing at the dynamic kids on their lawns or whatever they do for fun. Honestly this kinda killed the project for me for a couple months. I had built up a whole parser, and now I needed to throw away so much code and rebuild it around concrete types so users wouldn't be so burdened. It also made me sad that I'd have to give up future compatibility.
The thought came that I could have an `Unknown` type that uses the dynamic structure, while emitting concrete types for everything else. This idea got shattered when I realized I'd be bumping the major version every time I moved a type from `Unknown` into a concrete type. I didn't want to place a different, more treadmill-like burden on my users either.
This morning, as I sat down to my normal Saturday hacking session at my local cafe, I realized I had a better solution. Since the messages all have names, the parser could be instantiated with an optional list of message names to be parsed in the dynamic style. This means users who opt into message types that aren't fully supported yet don't get burned when I update the library.
Thinking about this more, the parser could take two lists, one concrete and one dynamic, and parse and emit only the message types specified. This also means I get to keep my dynamic parser and just build up a concrete parser alongside it, with both sharing the lower-level parts.
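A minimal sketch of how that two-list parser might look, assuming nothing about the final API. Every name here (`Parser`, `Parsed`, `ConcreteMessage`, `DynamicMessage`, `parse_concrete`) is hypothetical, the message ids are only examples, and the "concrete" decoding is a placeholder that treats the body as a UTF-8 name rather than real relay data.

```rust
use std::collections::HashSet;

/// Stand-in for the dynamic representation; the real one would hold parsed values.
#[derive(Debug, PartialEq)]
pub struct DynamicMessage {
    pub id: String,
    pub raw: Vec<u8>,
}

/// Hypothetical concrete type for one message the library fully supports.
#[derive(Debug, PartialEq)]
pub enum ConcreteMessage {
    BufferOpened { name: String },
}

#[derive(Debug, PartialEq)]
pub enum Parsed {
    Concrete(ConcreteMessage),
    Dynamic(DynamicMessage),
}

pub struct Parser {
    concrete: HashSet<String>,
    dynamic: HashSet<String>,
}

impl Parser {
    /// The caller opts in per message id, choosing which style it wants.
    pub fn new(concrete: &[&str], dynamic: &[&str]) -> Self {
        Parser {
            concrete: concrete.iter().map(|s| s.to_string()).collect(),
            dynamic: dynamic.iter().map(|s| s.to_string()).collect(),
        }
    }

    /// Returns None for ids the caller didn't ask for, or that fail to parse.
    pub fn parse(&self, id: &str, body: &[u8]) -> Option<Parsed> {
        if self.concrete.contains(id) {
            parse_concrete(id, body).map(Parsed::Concrete)
        } else if self.dynamic.contains(id) {
            Some(Parsed::Dynamic(DynamicMessage {
                id: id.to_string(),
                raw: body.to_vec(),
            }))
        } else {
            None
        }
    }
}

/// Placeholder concrete parser: pretends the body is just a UTF-8 name.
fn parse_concrete(id: &str, body: &[u8]) -> Option<ConcreteMessage> {
    match id {
        "_buffer_opened" => Some(ConcreteMessage::BufferOpened {
            name: String::from_utf8(body.to_vec()).ok()?,
        }),
        _ => None,
    }
}
```

With something like `Parser::new(&["_buffer_opened"], &["_nicklist"])`, a message a user listed as dynamic keeps coming back in the dynamic form even after the library later grows a concrete type for it, which is the property that avoids the major-version treadmill.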
Thoughts and comments are very welcome! I'm still learning to let go of habits built up from years of Python and JavaScript development and could always use pointers.