@chrisdone
Last active August 14, 2023 23:40
Email message parsing in Haskell in 2018


Background: I am moving away from GMail to my own Haskell-based server (SMTP receiver/sender, web-based email client, spam filtering, etc.). All email to @chrisdone.com goes through this server (mx.chrisdone.com) as of today, and [email protected] forwards a copy of everything to it.

This is a summary/tracking document of my efforts to simply parse email messages in Haskell for the SMTP receiver.

The problem: There are many packages on Hackage capable of parsing some or all of an email message, but almost all of them are incomplete in some way: they are old and use String everywhere, have encoding problems, are not streaming, or have bugs.

Without exception, they are all poorly documented.

Summary

  • If you're looking to reliably parse emails in Haskell today, you will be disappointed.
  • Apparently none of them have been used in a real setting.
  • You will have to make correctness vs performance trade-offs.

Ideally, they would all be deprecated, in favor of a package (or one of them) that looks like this:

  • Written in pure Haskell.
  • Is well documented!
  • Has a thorough test suite including full email samples from real servers.
  • Uses modern libraries (time, bytestring, text, attoparsec).
  • Uses attoparsec for:
    • Fast parsing.
    • Streaming parsing.
  • Handles multiparts properly.
  • Provides a SAX-style streaming interface, so that:
    • Message parts can be streamed to file, database, or network.
    • We can have conduit and pipes interfaces.
  • Uses ByteString everywhere, except where another type is more appropriate (e.g. a part which is known to have a text UTF-8 encoding can be decoded into Text).
  • Has a benchmark suite.
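
To make the SAX-style item in the wishlist a bit more concrete, here is a minimal sketch of what such an interface could look like. The `MimeEvent` type, its constructors, and `partSizes` are hypothetical names made up for this illustration; none of the packages reviewed below export them. The consumer only keeps a running byte counter, which is the point: a part's body never has to be held in memory as a whole.

```haskell
import qualified Data.ByteString.Char8 as S8
import Data.List (foldl')

-- Hypothetical event type: a streaming parser would emit these in
-- order instead of building a complete message tree in memory.
data MimeEvent
  = HeaderField S8.ByteString S8.ByteString -- header name and raw value
  | BeginPart                               -- a new MIME part starts
  | BodyChunk S8.ByteString                 -- a slice of the current part's body
  | EndPart                                 -- the current part is complete
  deriving (Eq, Show)

-- Example consumer: the number of body bytes in each part, in order.
-- The state is (bytes seen in the current part, finished part sizes).
partSizes :: [MimeEvent] -> [Int]
partSizes = reverse . snd . foldl' step (0, [])
  where
    step (_, done) BeginPart = (0, done)
    step (n, done) (BodyChunk bs) = (n + S8.length bs, done)
    step (n, done) EndPart = (0, n : done)
    step acc (HeaderField _ _) = acc
```

A real consumer would replace the counter with a file handle or a database write, and a conduit or pipes interface would be a thin wrapper that yields the same events.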

mime

http://hackage.haskell.org/package/mime

(c) 2006-2009 Galois Inc.

I am currently using this package.

Good

  • It can parse everything that I have received so far on my server after a week, from postfix, outlook and gmail servers, and mailing lists (mailman and the kernel), with multiple attachments.

Bad

  • It didn't handle \n as a line separator in QuotedPrintable. I know, this isn't standard, but I received an email like this from Haskell-Cafe. I patched it.
  • It has performance bugs. I found an O(n^2) time complexity bug in the normalizeCRLF function, which caused my server to spin at 100% CPU for minutes at a time while receiving attachments, which in turn meant the mail had to be re-sent for days. I fixed the bug by changing its output type to a Builder.
  • It has no test suite.
  • It was based on String; now it uses Text in a misguided attempt to add correctness. Unfortunately, the user of the library is forced to shoehorn binary data and non-UTF8 data into and out of a Text value. I've already had to work around bugs due to this.
  • It's not a streaming parser.
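
For readers wondering why a Builder helps with the performance bug above: repeatedly appending to a strict value copies everything accumulated so far, which is where quadratic behaviour tends to come from, whereas Builder appends are O(1) and the bytes are written out once at the end. The following is a sketch of that general technique on ByteString, not the actual patch to the mime package (which works on Text); `normalizeLineEndings` and `renderStrict` are names invented for this example.

```haskell
import qualified Data.ByteString.Builder as B
import qualified Data.ByteString.Char8 as S8
import qualified Data.ByteString.Lazy as L

-- Rewrite a bare "\n" or a bare "\r" as "\r\n", leaving existing
-- "\r\n" pairs alone. Each input byte is inspected once and each
-- append to the Builder is O(1).
normalizeLineEndings :: S8.ByteString -> B.Builder
normalizeLineEndings input
  | S8.null input = mempty
  | otherwise =
      let (line, rest) = S8.break (\c -> c == '\r' || c == '\n') input
      in B.byteString line <> case S8.uncons rest of
           Nothing -> mempty
           Just ('\r', rest') -> crlf <> normalizeLineEndings (dropLF rest')
           Just (_, rest') -> crlf <> normalizeLineEndings rest'
  where
    crlf = B.byteString (S8.pack "\r\n")
    -- After a "\r", swallow the "\n" that completes an existing pair.
    dropLF bs = case S8.uncons bs of
      Just ('\n', bs') -> bs'
      _ -> bs

-- Run the Builder once, at the very end.
renderStrict :: B.Builder -> S8.ByteString
renderStrict = L.toStrict . B.toLazyByteString
```

In a server, the Builder could equally be written straight to a handle with `B.hPutBuilder` instead of being rendered to a strict ByteString.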

Example

> :!cat > in.txt
From: John Doe <example@example.com>
MIME-Version: 1.0
Content-Type: multipart/mixed;
        boundary="XXXXboundary text"

This is a multipart message in MIME format.

--XXXXboundary text
Content-Type: text/plain

this is the body text

--XXXXboundary text
Content-Type: text/plain;
Content-Disposition: attachment;
        filename="test.txt"

this is the attachment text

--XXXXboundary text--

> import Codec.MIME.Parse (parseMIMEMessage)
> import qualified Data.Text.IO as T
> fmap parseMIMEMessage (T.readFile "in.txt")
MIMEValue {mime_val_type = Type {mimeType = Multipart Mixed, mimeParams = [MIMEParam {paramName = "boundary", paramValue = "XXXXboundary text"}]}, mime_val_disp = Nothing, mime_val_content = Multi [MIMEValue {mime_val_type = Type {mimeType = Text "plain", mimeParams = []}, mime_val_disp = Nothing, mime_val_content = Single "this is the body text\r\n", mime_val_headers = [MIMEParam {paramName = "content-type", paramValue = "text/plain"}], mime_val_inc_type = True},MIMEValue {mime_val_type = Type {mimeType = Text "plain", mimeParams = []}, mime_val_disp = Just (Disposition {dispType = DispAttachment, dispParams = [Filename "test.txt"]}), mime_val_content = Single "this is the attachment text\r\n", mime_val_headers = [MIMEParam {paramName = "content-type", paramValue = "text/plain;"},MIMEParam {paramName = "content-disposition", paramValue = "attachment;        filename=\"test.txt\""}], mime_val_inc_type = True}], mime_val_headers = [MIMEParam {paramName = "from", paramValue = "John Doe <example@example.com>"},MIMEParam {paramName = "mime-version", paramValue = "1.0"},MIMEParam {paramName = "content-type", paramValue = "multipart/mixed;        boundary=\"XXXXboundary text\""}], mime_val_inc_type = True}

hweblib

http://hackage.haskell.org/package/hweblib

Aycan iRiCAN

Good

  • It's a streaming parser.

Bad

  • It does not seem to actually parse messages:
    > fmap (parseOnly parseMimeHeaders) (S.readFile "/tmp/gmail.txt")
    Right (MimeValue {mvType = Type {mimeType = Text "plain", mimeParams = fromList [("charset","us-ascii")]}, mvDisp = Nothing, mvContent = Multi [], mvHeaders = fromList [], mvIncType = True})
    > fmap (parseOnly parseMimeHeaders) (S.readFile "/tmp/gmail-attachment.txt")
    Right (MimeValue {mvType = Type {mimeType = Text "plain", mimeParams = fromList [("charset","us-ascii")]}, mvDisp = Nothing, mvContent = Multi [], mvHeaders = fromList [], mvIncType = True})

Same example from above:

> :!cat > in.txt
From: John Doe <example@example.com>
MIME-Version: 1.0
Content-Type: multipart/mixed;
        boundary="XXXXboundary text"

This is a multipart message in MIME format.

--XXXXboundary text
Content-Type: text/plain

this is the body text

--XXXXboundary text
Content-Type: text/plain;
Content-Disposition: attachment;
        filename="test.txt"

this is the attachment text

--XXXXboundary text--

> fmap (parseOnly parseMimeHeaders) (S.readFile "in.txt")
Left "string"

email-header

http://hackage.haskell.org/package/email-header

2014-2018 Kyle Raftogianis

Good

  • It's a streaming parser.

Bad

  • It doesn't actually parse a list of headers, just individual header values from a list of [(CI ByteString, ByteString)].
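
The missing step, going from a raw header block to a list of name/value pairs, can be sketched roughly as follows. This is an assumption-laden illustration rather than code from any of the packages here: `splitHeaders` is a made-up name, names are lower-cased instead of being wrapped in `CI` (to avoid depending on the case-insensitive package), and folded lines are unfolded by simply dropping the line break.

```haskell
import qualified Data.ByteString.Char8 as S8
import Data.Char (toLower)

-- Split a raw header block into (lower-cased name, value) pairs.
-- Continuation lines (starting with a space or tab) are joined onto
-- the preceding field. Lines without a colon are dropped.
splitHeaders :: S8.ByteString -> [(S8.ByteString, S8.ByteString)]
splitHeaders = concatMap field . unfold . map stripCR . S8.lines
  where
    -- Accept both "\r\n" and bare "\n" line endings.
    stripCR l = case S8.unsnoc l of
      Just (l', '\r') -> l'
      _ -> l
    isContinuation l = case S8.uncons l of
      Just (c, _) -> c == ' ' || c == '\t'
      Nothing -> False
    unfold [] = []
    unfold (l : ls) =
      let (conts, rest) = span isContinuation ls
      in S8.concat (l : conts) : unfold rest
    -- Split on the first colon only; values may contain colons.
    field l = case S8.break (== ':') l of
      (name, rest)
        | Just (_, value) <- S8.uncons rest ->
            [(S8.map toLower name, S8.dropWhile (== ' ') value)]
      _ -> []
```

The resulting list has the same shape as the input email-header expects, modulo the `CI` wrapper on the names.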

hsemail

http://hackage.haskell.org/package/hsemail

Peter Simons, Ali Abrar, Gero Kriependorf, Marty Pauley

Good

  • It properly parses emails that I received.
  • You can parse from a ByteString.

Bad

  • It uses the archaic old-time package, so you have to convert all the times to the modern time package.
  • It uses String everywhere.
  • It does not parse the MIME bodies, so one has to handle multipart messages manually, which I did for a while before switching to the mime package.
  • It uses parsec, which is neither streaming nor efficient enough for handling megabytes of traffic.
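
Handling multipart bodies manually amounts to splitting the body on the boundary taken from the Content-Type header. A minimal sketch of that step is below; it is not the code I ran on my server, `splitMultipart` is a name invented for this example, and it ignores details such as trailing whitespace on delimiter lines. It is also in-memory, not streaming.

```haskell
import qualified Data.ByteString.Char8 as S8

-- Split a multipart body into its raw parts (part headers included).
-- The preamble before the first delimiter and the epilogue after the
-- closing delimiter are discarded.
splitMultipart :: S8.ByteString -> S8.ByteString -> [S8.ByteString]
splitMultipart boundary = go . dropPreamble . map stripCR . S8.lines
  where
    delimiter = S8.append (S8.pack "--") boundary
    closing = S8.append delimiter (S8.pack "--")
    -- Accept both "\r\n" and bare "\n" line endings.
    stripCR l = case S8.unsnoc l of
      Just (l', '\r') -> l'
      _ -> l
    isMarker l = l == delimiter || l == closing
    dropPreamble = dropWhile (not . isMarker)
    -- Each part runs from one delimiter line to the next marker line.
    go (l : ls)
      | l == delimiter =
          let (part, rest) = break isMarker ls
          in S8.intercalate (S8.pack "\r\n") part : go rest
    go _ = []
```

Each returned part still starts with its own headers, so a nested multipart needs the same function applied again with the inner boundary.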

Example

Prelude Text.Parsec.Rfc2822 Text.Parsec S> fmap (parse message "") (S.readFile "in.txt")
Right
  (Message
     [ From
         [ NameAddr
             { nameAddr_name = Just "John Doe"
             , nameAddr_addr = "example@example.com"
             }
         ]
     , OptionalField "MIME-Version" " 1.0"
     , OptionalField
         "Content-Type"
         " multipart/mixed;\r\n        boundary=\"XXXXboundary text\""
     ]
     "This is a multipart message in MIME format.\r\n\r\n--XXXXboundary text\r\nContent-Type: text/html\r\n\r\nthis is the <b>body</b> text\r\n\r\n--XXXXboundary text\r\nContent-Type: text/plain;\r\nContent-Disposition: attachment;\r\n        filename=\"test.txt\"\r\n\r\nthis is the attachment text\r\n\r\n--XXXXboundary text--\r\n\n\n")

mime-string

http://hackage.haskell.org/package/mime-string

Ian Lynagh

Bad

  • String-based.
  • Not streaming.

emailparse

http://hackage.haskell.org/package/emailparse

Michal Kawalec [email protected]

I tested this out on my server for a little while.

Good

  • It's based on attoparsec, so streaming.
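
What "streaming" means for an attoparsec parser is that input can arrive in chunks: `parse` returns a `Partial` continuation when it runs out of input, and `feed` supplies the next chunk. The sketch below shows this with a deliberately simplified single-header parser. `headerLine` and `runChunks` are hypothetical names for this example, not emailparse's API, and the parser does not handle folded headers. It needs the attoparsec package.

```haskell
import qualified Data.Attoparsec.ByteString.Char8 as A
import qualified Data.ByteString.Char8 as S8
import Data.List (foldl')

-- Simplified parser for one unfolded header line, "Name: value\r\n".
headerLine :: A.Parser (S8.ByteString, S8.ByteString)
headerLine = do
  name <- A.takeWhile1 (\c -> c /= ':' && c /= '\r' && c /= '\n')
  _ <- A.char ':'
  A.skipWhile (== ' ')
  value <- A.takeTill (\c -> c == '\r' || c == '\n')
  A.endOfLine
  pure (name, value)

-- Feed chunks one at a time, as they would arrive from a socket.
-- Empty chunks are skipped because feeding an empty chunk to a
-- Partial result signals end of input.
runChunks :: A.Parser a -> [S8.ByteString] -> A.Result a
runChunks p = foldl' A.feed (A.parse p S8.empty) . filter (not . S8.null)
```

Once the result is `Done`, any further chunks are appended to the leftover input, so the caller can hand the remainder to the next parser.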

Bad

  • Messages (multipart) are yielded as a tree, not streaming; so one cannot write parts to disk/DB in a streaming fashion. All of a 10MB email would have to be loaded into memory.
  • It does not handle nested multipart messages beyond one level of nesting, but it's common for messages to have a nesting form such as [[html,text],[attachment]].
  • Doesn't build; depends on a C library for decoding base64.
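
To illustrate what arbitrary nesting requires, here is a small sketch of a recursive part type and two traversals over it. The `Part` type, `leafTypes`, `depth`, and the `example` value are all hypothetical and only model the [[html,text],[attachment]] shape mentioned above; they are not emailparse's types.

```haskell
import qualified Data.ByteString.Char8 as S8

-- Hypothetical part tree: a leaf carries a content type and a body,
-- a multipart node carries its subtype and any number of children.
data Part
  = Leaf String S8.ByteString
  | Multipart String [Part]
  deriving (Eq, Show)

-- Content types of all leaves, depth-first, at any nesting depth.
leafTypes :: Part -> [String]
leafTypes (Leaf contentType _) = [contentType]
leafTypes (Multipart _ children) = concatMap leafTypes children

-- Nesting depth: 0 for a leaf, 1 + the deepest child for a multipart.
depth :: Part -> Int
depth (Leaf _ _) = 0
depth (Multipart _ children) = 1 + maximum (0 : map depth children)

-- The [[html,text],[attachment]] shape; bodies are placeholders.
example :: Part
example =
  Multipart "mixed"
    [ Multipart "alternative"
        [ Leaf "text/html" (S8.pack "<div>Here's a smaller file</div>")
        , Leaf "text/plain" (S8.pack "Here's a smaller file")
        ]
    , Multipart "mixed"
        [ Leaf "application/octet-stream" (S8.pack "placeholder") ]
    ]
```

A parser that stops after one level would return the inner multipart/alternative as a single text body, which is what the output below shows.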

Example

Problem:

        , emailBodies =
            [ MessageBody
                (EmailMessage
                   { emailHeaders =
                       [ Header
                           { headerName = "Content-Type"
                           , headerContents =
                               "multipart/alternative; boundary=\"000000000000a2a84f05715ab8ef\""
                           }
                       ]
                   , emailBodies =
                       [ TextBody
                           "--000000000000a2a84f05715ab8ef\r\nContent-Type: text/plain; charset=\"UTF-8\"\r\n\r\nHere's a smaller file\r\n\r\n--000000000000a2a84f05715ab8ef\r\nContent-Type: text/html; charset=\"UTF-8\"\r\n\r\n<div dir=\"ltr\">Here&#39;s a smaller file</div>\r\n\r\n--000000000000a2a84f05715ab8ef--\r\n"
                       ]
                   })
            ]
@koflerdavid

Messages (multipart) are yielded as a tree, not streaming; so one cannot write parts to disk/DB in a streaming fashion. All of a 10MB email would have to be loaded into memory.

Shouldn't that be listed under "Bad"?

@tko

tko commented Aug 3, 2018

FWIW you may want to look at https://github.com/jstedfast/gmime for edge cases to handle. Not affiliated, I just recall reading a post (I can't find anymore) how email/mime parsing is a minefield.

@romanofski

romanofski commented Aug 3, 2018

We're currently working on a new library over at https://github.com/purebred-mua/purebred-email in order to use it in purebred (text based MUA which is in development). It will not tick all boxes, since it's work-in-progress, but might be good enough to draw some help to finish it?
Update: Just in case it's not clear: fast parsing is implemented and can already be used.

@chrisdone
Author

Right, it should've been bad. I'll fix it, thanks @koflerdavid.

@chrisdone
Author

Thanks @romanofski, I'll check it out next time I come back to parsing. 👍

@mkawalec

mkawalec commented Aug 7, 2018

I should actually finish emailparse - I wrote it as a PoC some time ago and never got to actually properly cleaning it up. Would you need anything else besides fixing the bad parts @chrisdone?

@frasertweedale

@chrisdone FYI purebred-email is now on Hackage: https://hackage.haskell.org/package/purebred-email
