Blog post on unfolding the BitTorrent peer protocol

i'm doing a bittorrent client in erlang

task: parse a lump of raw binary data from a socket into a list of bittorrent messages

my original implementation worked but was messy; the logic for parsing individual bittorrent messages was tangled with the logic for accumulating them into a list [should show this, ought to be somewhere in github.com/happy4crazy/ebc]

@jamii showed me a better way while reviewing my code: use unfold

recall that a fold in functional programming collapses a list of things into a single value (which is perhaps a list):

foldr :: (a -> b -> b) -> b -> [a] -> b
foldl :: (a -> b -> a) -> a -> [b] -> a

foldr f z [] = z
foldr f z (x:xs) = x `f` (f z xs)

foldl f z [] = z
foldl f z (x:xs) = foldl f (f z x) xs

unfolds do the reverse: produce a list of values from a single value

unfoldr :: (a -> Maybe (b, a)) -> a -> [b]

in this case, single value is the lump of binary data; list is the parsings

in words: unfoldr f x applies f to x repeatedly, either getting a value to accumulate and a new x, or nothing, which stops the accumulation.

erlang version also returns the value of the x we choked on (useful in parsing: you want parsings as well as whatever's left over because perhaps you received a partial message)

unfold(F, X) ->
    unfold(F, X, []).
unfold(F, X, Acc) ->
    case F(X) of
        nothing ->
            {lists:reverse(Acc), X};
        {A, X2} ->
            unfold(F, X2, [A|Acc])
    end.

The power of idioms

code that uses well-known idioms is easier to read: jamie doesn't have to wade through homebrewed tail-recursions; he can trust that the accumulation logic is handled by unfold, and focus instead of the interesting part: how the unfolding function parses a single message

well-known functional idioms are well-known because they have nice algebraic properties, e.g.

map :: (a -> b) -> [a] -> [b]
f :: (a -> b)
map f :: [a] -> [b]
length (map f xs) = length xs
f . f' = id ==> map (f . f') xs = xs

composition: if you need to convert a list to another list, doing so with a map says something interesting: the important part is converting a single element to another single element; the list accumulation stuff is automatic.

Manifest propertes

physicists like to use the phrase "manifestly invariant" to describe formulations of physical laws that "obviously" obey some property

functional idioms make properties manifest, e.g. length invariance with map

writing the parser as an unfold makes manifest that the parsing is context-free: no state is maintained between parsing subsequent messages

it makes manifest that the only interesting part of the parsing is parsing a single message; the accumulation logic is uninteresting

is there anything else to say here? maybe manifest properties for unfold aren't so interesting

Forcing your hand

the thing that really struck me about using unfolds though is how they constrain how you parse a single message: if you want parsing a single message to be unfoldable, it must conform to the unfold interface

map: no contraint on f, f :: a -> b

unfold: not-necessarily obvious constraints, f :: a -> Maybe (b, a)

the funky return type tells you what to do when you fail to parse: return 'nothing'; on success: {Parsing, Rest}

note that when you fail to parse, you don't have to return anything about what you failed to parse; that's handled automatically by unfold

separation of parsing a single message and accumulation details

Code

ebc_peer_protocol:decode/1 is the function we'll unfold

bit syntax in erlang is pretty cool too :)

decode(<<19,
         "BitTorrent protocol",
         Reserved:8/bytes,
         InfoHash:20/bytes,
         PeerId:20/bytes,
         Rest/bytes>>) ->
    {{handshake, Reserved, InfoHash, PeerId}, Rest};
decode(<<0:32,
         Rest/binary>>) ->
    {keep_alive, Rest};
decode(<<Len:4/unit:8,
         Decoding:Len/bytes,
         Rest/bytes>>) ->
    {decode_by_id(Decoding), Rest};
decode(_Bin) ->
    nothing.

decode_by_id(<<0>>) ->
    choke;
decode_by_id(<<1>>) ->
    unchoke;
decode_by_id(<<2>>) ->
    interested;
decode_by_id(<<3>>) ->
    not_interested;
decode_by_id(<<4,
               Index:4/unit:8>>) ->
    {have, Index};
decode_by_id(<<5,
               Bitfield/bytes>>) ->
    {bitfield, Bitfield};
decode_by_id(<<6,
               Index:4/unit:8,
               Begin:4/unit:8,
               Length:4/unit:8>>) ->
    {request, Index, Begin, Length};
decode_by_id(<<7,
               Index:4/unit:8,
               Begin:4/unit:8,
               Block/bytes>>) ->
    {piece, Index, Begin, Block};
decode_by_id(<<8,
               Index:4/unit:8,
               Begin:4/unit:8,
               Length:4/unit:8>>) ->
    {cancel, Index, Begin, Length}.

and here's the actual unfolding (try to ignore the OTP jargon):

handle_info({tcp, _Conn, Data}, State=#s{partial=P}) ->
    {Msgs, NewP} = ebc:unfold(fun ebc_peer_protocol:decode/1, <<P/bytes, Data/bytes>>),
    lists:foreach(fun(Msg) ->
                          ebc_peer_fsm:recv_msg(State#s.fsm, Msg)
                  end,
                  Msgs),
    {noreply, State#s{partial=NewP}};

the process's state includes whatever was left over from parsing the previous lump; this gets tacked onto the beginning of the new lump. unfolding producs a list of Msgs and a new bit of leftover lump, which I store for use when the next lump arrives

cqfd/gist:4049153

The power of idioms

Manifest propertes

Forcing your hand

Code