@dinosaure
Created August 2, 2019 14:18

Mr. MIME - Parse and generate emails

I'm glad to announce the first release of mrmime, a parser and generator of emails. This library provides an OCaml way to analyze and craft an email. The longer-term goal is to build the entire email stack (such as SMTP or IMAP) and then provide tools and unikernels around the email service.

In this article, we show what is currently possible with mrmime and the libraries around it, and what we plan next.

An email parser

Some years ago, I gave [a talk][talk-mrmime] about what an email really is. Besides being a human-readable format (or a rich document, as we said a long time ago), an email has many details which complicate its analysis (and can lead to security lapses).

First of all, email is described mainly by 3 RFCs:

  • [RFC822][rfc822]
  • [RFC2822][rfc2822]
  • [RFC5322][rfc5322]

Even though they remain compatible with each other, some archaeological work is needed to parse emails in the most legacy-tolerant way. In fact, some emails still follow old standards whose authors did not realize (in the 1970s) that parts of the design were bad or ugly.

The latest RFC about email (RFC5322) tries to fix them and provides a better [ABNF][abnf-rfc] to describe the format, but of course it comes with plenty of obsolete rules which still need to be implemented. So, throughout the standard, you find each grammar rule alongside its obsolete version.

A realistic email parser

Of course, respecting the rules described by the RFCs is not enough to analyze every email from the real world. Implementations which generate emails sometimes produce wrong ones. So mrmime is tested against a corpus of 2 billion emails to check that it can parse everything, even if it does not always produce the expected result.

So we adjusted some details of the ABNF to be able to parse these bad emails when the same defect appears multiple times.

An extended email parser

Even if the core definition of email fits in 3 RFCs, that leaves out the internationalization of mail ([RFC6532][rfc6532]), the MIME format ([RFC2045][rfc2045], [RFC2046][rfc2046], [RFC2047][rfc2047], [RFC2049][rfc2049]), the details needed to interoperate with SMTP ([RFC5321][rfc5321]) and other RFCs which add elements to an email, such as S/MIME or the Content-Disposition field.

So we took the most general RFCs and tried to provide an easy way to deal with them. The main difficulty is the multipart parser (anyone who has tried to write an HTTP 1.1 parser knows about that).

A parser usable by others

One proof of concept of the usability of mrmime is ocaml-dkim, which extracts a specific field from your mail and then verifies that the hash and the signature correspond to what is expected.

ocaml-dkim uses the latest implementation of ocaml-dns to request the public key needed to verify the email.

Another point about ocaml-dkim, and the most important one: it is able to verify your email in one pass. Currently, some implementations of DKIM need 2 passes to verify an email (one to extract the DKIM signature, the other to digest some fields and the body).

So we mostly focus on that, in order to later provide a unikernel which acts as an SMTP relay and verifies the emails you receive.

An email generator

OCaml is a good language for making a little DSL which serves our purpose. We took advantage of that to let the user easily craft an email from scratch.

The idea is to build OCaml values and then let the generator produce a stream which can be used, for example, in an SMTP implementation.

This snippet shows how to make a small email header:

#require "mrmime" ;;
#require "ptime.clock.os" ;;

open Mrmime

(* Mailbox without a display name: romain.calascibetta@gmail.com *)
let romain_calascibetta =
  let open Mailbox in
  Local.[ w "romain"; w "calascibetta" ] @ Domain.(domain, [ a "gmail"; a "com" ])

(* Mailbox with a display name: John "D." <john@doe.org> *)
let john_doe =
  let open Mailbox in
  Local.[ w "john" ] @ Domain.(domain, [ a "doe"; a "org" ])
  |> with_name Phrase.(v [ w "John"; w "D." ])

(* Current time, expressed in the GMT zone *)
let now () =
  let open Date in
  of_ptime ~zone:Zone.GMT (Ptime_clock.now ())

(* Unstructured value: words separated by explicit spaces *)
let subject =
  Unstructured.[ v "A"; sp 1; v "Simple"; sp 1; v "Mail" ]

(* Compose the fields into a header *)
let header =
  let open Header in
  Field.(Subject $ subject)
  & Field.(Sender $ romain_calascibetta)
  & Field.(To $ Address.[ mailbox john_doe ])
  & Field.(Date $ now ())
  & empty

let stream = Header.to_stream header

(* Drain the stream and print each chunk until it is exhausted *)
let () =
  let rec go () =
    match stream () with
    | Some buf -> print_string buf ; go ()
    | None -> ()
  in
  go ()

It produces:

Date: 2 Aug 2019 14:10:10 GMT
To: John "D." <john@doe.org>
Sender: romain.calascibetta@gmail.com
Subject: A Simple Mail

78-characters rule

One aspect of email and SMTP is a set of historical rules about how to generate messages. One of them limits the number of bytes per line: a mail generator should emit at most 78 characters per line, that is 80 bytes once the terminating CRLF is counted, and, of course, it should emit the email line by line.

So mrmime has its own encoder which tries to wrap your mail within this limit. It was mostly inspired by [Faraday][faraday] and [Format][format], and uses GADTs to easily describe how to encode/generate parts of an email.
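
To give an idea of what such wrapping involves, here is a minimal sketch using only the OCaml standard library. It is not mrmime's encoder: the `fold` function and its `limit` parameter are hypothetical names chosen for this illustration, and a word longer than the limit is simply emitted as is.

```ocaml
(* Hypothetical helper, not mrmime's encoder: write a header field so that
   no line exceeds [limit] characters (CRLF excluded). *)
let fold ?(limit = 78) (name : string) (words : string list) : string =
  let buf = Buffer.create 128 in
  Buffer.add_string buf name;
  Buffer.add_char buf ':';
  (* [col] tracks the length of the line being written. *)
  let col = ref (String.length name + 1) in
  List.iter
    (fun word ->
      if !col + 1 + String.length word > limit then begin
        (* Folding: CRLF followed by one space continues the same field. *)
        Buffer.add_string buf "\r\n ";
        col := 1
      end else begin
        Buffer.add_char buf ' ';
        incr col
      end;
      Buffer.add_string buf word;
      col := !col + String.length word)
    words;
  Buffer.add_string buf "\r\n";
  Buffer.contents buf
```

Calling `fold ~limit:12 "Subject" ["A"; "Simple"; "Mail"]` splits the field into two lines, the second one starting with a space so that a parser can unfold it.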

A multipart email generator

Of course, the main point is to be able to generate a multipart email, if only to send file attachments. A lot of work went into making parts, composing them under specific Content-Type fields and merging them into one email.

In the end, you can easily make a stream which respects the rules (78 characters per line, emitted line by line) and use it directly in an SMTP implementation.

This is what we did with the project [facteur][facteur]. It's a little command-line tool, in pure OCaml, to send mails with file attachments; for the moment it works only on UNIX operating systems.
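
As a rough picture of what a multipart body looks like on the wire, the sketch below assembles parts with plain strings from the standard library. It does not use mrmime's API: the `part` type and the `multipart` function are assumptions made for this example, and the caller is assumed to pick a boundary which never appears inside a body.

```ocaml
(* Hypothetical types for illustration: a part is a list of header fields
   plus a body which is already encoded. *)
type part = { headers : (string * string) list; body : string }

let multipart ~boundary (parts : part list) : string =
  let buf = Buffer.create 256 in
  (* Every line of an email ends with CRLF. *)
  let line s = Buffer.add_string buf s; Buffer.add_string buf "\r\n" in
  line (Printf.sprintf "Content-Type: multipart/mixed; boundary=\"%s\"" boundary);
  line "";
  List.iter
    (fun { headers; body } ->
      (* Each part starts with the delimiter "--" followed by the boundary. *)
      line ("--" ^ boundary);
      List.iter (fun (k, v) -> line (k ^ ": " ^ v)) headers;
      (* An empty line separates the part's header from its body. *)
      line "";
      line body)
    parts;
  (* The closing delimiter carries a trailing "--". *)
  line ("--" ^ boundary ^ "--");
  Buffer.contents buf
```

A real generator also has to choose a safe boundary and encode each body (base64 or quoted-printable) before composing the parts.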

Behind the forest

Even if you are able to parse and generate an email, some more work is needed before we can give you results.

Indeed, email is a unit of exchange between people, and the biggest challenge is to find a common way to ensure that they understand each other. Encoding is probably the most important piece here: a French guy may want to communicate in latin1 while an American guy still uses ASCII.

Rosetta

To address this problem, we chose to unify all contents to UTF-8, the most general encoding in the world. So, first, thanks to [@dbuenzli][dbuenzli] for [uutf][uutf]; then we wrote some libraries which map an encoded flow to Unicode code points, and we use uutf to normalize the result to UTF-8.

The main goal is to spare the user a headache: even if the contents of the mail are encoded in latin1, we make sure they are translated correctly (and according to the RFCs) to UTF-8.

This project is [rosetta][rosetta] and it comes with:

  • [uuuu][uuuu] for ISO-8859 encodings
  • [coin][coin] for KOI8-{R,U} encodings
  • [yuscii][yuscii] for the UTF-7 encoding
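
To illustrate the kind of mapping involved, here is a minimal sketch for the simplest case, latin1 (ISO-8859-1), written with the standard library only (OCaml 4.06 or later). It does not use rosetta, uuuu or uutf, and the name `latin1_to_utf8` is hypothetical.

```ocaml
(* ISO-8859-1 maps each byte to the Unicode code point of the same value,
   so decoding is a byte-by-byte conversion followed by UTF-8 encoding. *)
let latin1_to_utf8 (s : string) : string =
  let buf = Buffer.create (String.length s * 2) in
  String.iter
    (fun c -> Buffer.add_utf_8_uchar buf (Uchar.of_int (Char.code c)))
    s;
  Buffer.contents buf
```

For instance, the latin1 byte `\xe9` (é) becomes the two UTF-8 bytes `\xc3\xa9`. Other encodings need a real translation table, which is what the libraries above provide.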

Pecu and Base64

Then, bodies can be encoded in 2 ways:

  • A base64 encoding, used to store your file
  • A quoted-printable encoding

The base64 package comes with a sub-package, base64.rfc2045, which handles the special case of encoding a body according to RFC2045 and the SMTP limitation.
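
The special case is mainly about line length: RFC2045 limits base64 lines to 76 characters. The sketch below shows the idea with the standard library only. It is not the API of base64.rfc2045, and the name `base64_rfc2045` is an assumption made for this example.

```ocaml
let alphabet =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

(* Hypothetical helper: base64-encode [input], breaking the output with
   CRLF so that no line is longer than 76 characters. *)
let base64_rfc2045 (input : string) : string =
  let len = String.length input in
  let out = Buffer.create ((len * 4 / 3) + 8) in
  let col = ref 0 in
  let emit c =
    if !col = 76 then begin
      Buffer.add_string out "\r\n";
      col := 0
    end;
    Buffer.add_char out c;
    incr col
  in
  (* Bytes past the end of the input count as zero and are padded below. *)
  let byte i = if i < len then Char.code input.[i] else 0 in
  let i = ref 0 in
  while !i < len do
    (* Pack 3 bytes into 24 bits, then split them into 4 groups of 6 bits. *)
    let n = (byte !i lsl 16) lor (byte (!i + 1) lsl 8) lor byte (!i + 2) in
    emit alphabet.[(n lsr 18) land 63];
    emit alphabet.[(n lsr 12) land 63];
    emit (if !i + 1 < len then alphabet.[(n lsr 6) land 63] else '=');
    emit (if !i + 2 < len then alphabet.[n land 63] else '=');
    i := !i + 3
  done;
  Buffer.contents out
```

Every output character is plain ASCII, which is what makes the result safe for a 7-bit transport.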

Then, pecu was made to encode and decode quoted-printable contents. It was tested and fuzzed, of course, like any other MirageOS library.

These libraries are needed for another historical reason: bytes used to store mail should use only 7 bits instead of 8. This is the purpose of the base64 and quoted-printable encodings, which use only the 7-bit values (0 to 127) of a byte. Again, this limitation comes from the SMTP protocol.
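
For readers unfamiliar with the format, here is a simplified quoted-printable encoder written with the standard library only. It is not pecu's API: the name `quoted_printable` is hypothetical, and the input is treated as raw bytes, so CR and LF are escaped instead of being kept as line breaks.

```ocaml
(* Hypothetical helper: escape every byte which is not printable ASCII as
   "=XX" and insert soft line breaks ("=" then CRLF) to keep each line
   within 76 characters. *)
let quoted_printable (input : string) : string =
  let out = Buffer.create (String.length input * 3) in
  let col = ref 0 in
  let emit token =
    (* Keep one column free for the "=" of a soft line break. *)
    if !col + String.length token > 75 then begin
      Buffer.add_string out "=\r\n";
      col := 0
    end;
    Buffer.add_string out token;
    col := !col + String.length token
  in
  String.iteri
    (fun i c ->
      let last = i = String.length input - 1 in
      match c with
      (* Whitespace stays literal unless it would end the output. *)
      | ' ' | '\t' when not last -> emit (String.make 1 c)
      (* Printable ASCII except '=' is copied as is. *)
      | '!' .. '<' | '>' .. '~' -> emit (String.make 1 c)
      | _ -> emit (Printf.sprintf "=%02X" (Char.code c)))
    input;
  Buffer.contents out
```

With this sketch, the latin1 string `"caf\xe9"` is encoded as `caf=E9`, which stays readable for mostly-ASCII text. That is the usual reason to prefer quoted-printable over base64 for text bodies.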

Conclusion

mrmime can be considered a huge project, since it tries to parse and generate email according to 50 years of usage, several RFCs and legacy rules. So it is still an experimental project. We reached the first version because we are now able to parse many mails and then regenerate them correctly.

Of course, a bug (a malformed mail, a server which does not respect standards or a misuse of our API) can easily appear where we did not test everything. But we felt that it was time to release it and let people use it.

The best feedback about mrmime, and the best source of improvement, is you. So don't be afraid to use it and start hacking your emails with it.
