Background: I am moving away from GMail to my own Haskell-based server (SMTP receiver/sender, web-based email client, spam filtering, etc.). All email to @chrisdone.com goes through this server (mx.chrisdone.com) as of today, and [email protected] forwards a copy of everything to it.
This is a summary/tracking document of my efforts to simply parse email messages in Haskell for the SMTP receiver.
The problem: There are many packages on Hackage capable of parsing some or all of an email message, but almost all of them are incomplete in some way: they are too old and use String everywhere, have encoding problems, are not streaming, or have bugs.
Without exception, they are all poorly documented.
- If you're looking to reliably parse emails in Haskell today, you will be disappointed.
- Apparently none of them have been used in a real setting.
- You will have to make correctness vs performance trade-offs.
Ideally, they would all be deprecated in favor of a single package (possibly one of them) that looks like this:
- Written in pure Haskell.
- Is well documented!
- Has a thorough test suite including full email samples from real servers.
- Uses modern libraries (time, bytestring, text, attoparsec).
- Uses attoparsec for:
  - Fast parsing.
  - Streaming parsing.
- Handles multiparts properly.
- Provides a SAX-style streaming interface, so that:
  - Message parts can be streamed to file, database, or network.
  - We can have conduit and pipes interfaces.
- Uses ByteString for everything except where appropriate (e.g. a part which is known to have a text UTF-8 encoding can be decoded into Text).
- Has a benchmark suite.
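To make the SAX-style item in the list above concrete, here is a minimal sketch of what such an interface might look like. Everything in it (the `Event` type, `foldEvents`, `partSizes`, `sample`) is hypothetical and not taken from any existing package; a real implementation would produce the events incrementally from an attoparsec parser, whereas here a plain list stands in for the event stream.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString.Char8 as S
import Data.List (foldl')

-- | Hypothetical event type: what a SAX-style MIME parser might emit.
data Event
  = PartStart [(S.ByteString, S.ByteString)] -- headers of a new part
  | PartChunk S.ByteString                   -- one piece of the part body
  | PartEnd                                  -- the current part is complete
  deriving (Show, Eq)

-- | Consume events one at a time with a strict left fold, so the
-- consumer never needs the whole message in memory.
foldEvents :: (s -> Event -> s) -> s -> [Event] -> s
foldEvents = foldl'

-- | Example consumer: count the body bytes of each part.
partSizes :: [Event] -> [Int]
partSizes = reverse . foldEvents step []
  where
    step acc (PartStart _) = 0 : acc
    step (n:acc) (PartChunk c) = n + S.length c : acc
    step acc _ = acc

-- | Hand-written event stream standing in for real parser output.
sample :: [Event]
sample =
  [ PartStart [("content-type", "text/plain")]
  , PartChunk "this is "
  , PartChunk "the body"
  , PartEnd
  , PartStart [("content-disposition", "attachment")]
  , PartChunk "attachment"
  , PartEnd
  ]
```

A consumer written against events like these could just as well write each `PartChunk` to a file handle or a database, and the same event type could be wrapped in a conduit or pipes producer.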
http://hackage.haskell.org/package/mime
(c) 2006-2009 Galois Inc.
I am currently using this package.
Good
- It can parse everything that I have received so far on my server after a week, from postfix, outlook and gmail servers, and mailing lists (mailman and the kernel), with multiple attachments.
Bad
- It didn't handle \n as a line separator in QuotedPrintable. I know, this isn't standard, but I received an email like this from Haskell-Cafe. I patched it.
- It has performance bugs. I found an O(n^2) time complexity bug in the normalizeCRLF function, which caused my server to spin at 100% CPU for minutes at a time while receiving attachments, causing the mail to have to be re-sent for days. I fixed the bug by changing its output type to a Builder.
- It has no test suite.
- It was based on String; now it uses Text in a misguided attempt to add correctness. Unfortunately, the user of the library is forced to shoehorn binary data and non-UTF8 data into and out of a Text value. I've already had to work around bugs due to this.
- It's not a streaming parser.
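For readers wondering what a Builder-based fix can look like, here is a minimal sketch of a linear-time line-ending normalizer. It is not the actual patch: `normalizeLineEndings` is a hypothetical name, and the assumed behaviour (bare `\r` and bare `\n` both become `\r\n`, existing `\r\n` is left alone) is my reading of what such a function is for. The point is only that each input chunk is appended to a `Builder` once, instead of repeatedly appending to a strict `Text`, which copies the accumulated result every time.

```haskell
import qualified Data.Text as T
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Builder as B

-- | Rewrite bare "\r" and bare "\n" as "\r\n"; existing "\r\n" is kept.
-- Hypothetical sketch, not the function from the mime package.
normalizeLineEndings :: T.Text -> T.Text
normalizeLineEndings = TL.toStrict . B.toLazyText . go
  where
    crlf = B.fromString "\r\n"

    -- Copy everything up to the next line break, then handle the break.
    go t
      | T.null t = mempty
      | otherwise =
          let (chunk, rest) = T.break isBreak t
          in B.fromText chunk <> ending rest

    -- 'rest' is either empty or starts with '\r' or '\n'.
    ending rest = case T.uncons rest of
      Nothing -> mempty
      Just ('\r', r1) -> case T.uncons r1 of
        Just ('\n', r2) -> crlf <> go r2 -- already CRLF
        _ -> crlf <> go r1               -- bare CR
      Just (_, r1) -> crlf <> go r1      -- bare LF

    isBreak c = c == '\r' || c == '\n'
```

For example, `normalizeLineEndings (T.pack "a\nb\r\nc\rd")` yields `"a\r\nb\r\nc\r\nd"`.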
Example
> :!cat > in.txt
From: John Doe <example@example.com>
MIME-Version: 1.0
Content-Type: multipart/mixed;
 boundary="XXXXboundary text"

This is a multipart message in MIME format.

--XXXXboundary text
Content-Type: text/plain

this is the body text

--XXXXboundary text
Content-Type: text/plain;
Content-Disposition: attachment;
 filename="test.txt"

this is the attachment text

--XXXXboundary text--
> import qualified Data.Text.IO as T
> import Codec.MIME.Parse (parseMIMEMessage)
> fmap parseMIMEMessage (T.readFile "in.txt")
MIMEValue {mime_val_type = Type {mimeType = Multipart Mixed, mimeParams = [MIMEParam {paramName = "boundary", paramValue = "XXXXboundary text"}]}, mime_val_disp = Nothing, mime_val_content = Multi [MIMEValue {mime_val_type = Type {mimeType = Text "plain", mimeParams = []}, mime_val_disp = Nothing, mime_val_content = Single "this is the body text\r\n", mime_val_headers = [MIMEParam {paramName = "content-type", paramValue = "text/plain"}], mime_val_inc_type = True},MIMEValue {mime_val_type = Type {mimeType = Text "plain", mimeParams = []}, mime_val_disp = Just (Disposition {dispType = DispAttachment, dispParams = [Filename "test.txt"]}), mime_val_content = Single "this is the attachment text\r\n", mime_val_headers = [MIMEParam {paramName = "content-type", paramValue = "text/plain;"},MIMEParam {paramName = "content-disposition", paramValue = "attachment; filename=\"test.txt\""}], mime_val_inc_type = True}], mime_val_headers = [MIMEParam {paramName = "from", paramValue = "John Doe <example@example.com>"},MIMEParam {paramName = "mime-version", paramValue = "1.0"},MIMEParam {paramName = "content-type", paramValue = "multipart/mixed; boundary=\"XXXXboundary text\""}], mime_val_inc_type = True}
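The quoted-printable problem listed under "Bad" for this package (a bare `\n` used as a line separator) comes down to how soft line breaks are recognised. The following is a minimal sketch of a lenient decoder, not the patch applied to the package: `decodeQP` is a hypothetical name, and it goes through `String` internally purely for readability, which a production decoder would avoid.

```haskell
import qualified Data.ByteString.Char8 as S
import Data.Char (chr, digitToInt, isHexDigit)

-- | Decode quoted-printable, accepting "=\r\n" (standard) and "=\n"
-- (non-standard, but seen in the wild) as soft line breaks.
-- Hypothetical sketch; malformed escapes are passed through unchanged.
decodeQP :: S.ByteString -> S.ByteString
decodeQP = S.pack . go . S.unpack
  where
    go ('=':'\r':'\n':rest) = go rest -- standard soft line break
    go ('=':'\n':rest) = go rest      -- lenient: bare LF soft line break
    go ('=':a:b:rest)
      | isHexDigit a && isHexDigit b =
          chr (16 * digitToInt a + digitToInt b) : go rest
    go (c:rest) = c : go rest
    go [] = []
```

With this, `decodeQP (S.pack "foo=\nbar=3D=\r\nbaz")` gives `"foobar=baz"`.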
http://hackage.haskell.org/package/hweblib
Aycan iRiCAN
Good
- It's a streaming parser.
Bad
- It does not seem to actually parse messages:
> fmap (parseOnly parseMimeHeaders) (S.readFile "/tmp/gmail.txt")
Right (MimeValue {mvType = Type {mimeType = Text "plain", mimeParams = fromList [("charset","us-ascii")]}, mvDisp = Nothing, mvContent = Multi [], mvHeaders = fromList [], mvIncType = True})
> fmap (parseOnly parseMimeHeaders) (S.readFile "/tmp/gmail-attachment.txt")
Right (MimeValue {mvType = Type {mimeType = Text "plain", mimeParams = fromList [("charset","us-ascii")]}, mvDisp = Nothing, mvContent = Multi [], mvHeaders = fromList [], mvIncType = True})
Same example from above:
> :!cat > in.txt
From: John Doe <example@example.com>
MIME-Version: 1.0
Content-Type: multipart/mixed;
 boundary="XXXXboundary text"

This is a multipart message in MIME format.

--XXXXboundary text
Content-Type: text/plain

this is the body text

--XXXXboundary text
Content-Type: text/plain;
Content-Disposition: attachment;
 filename="test.txt"

this is the attachment text

--XXXXboundary text--
> fmap (parseOnly parseMimeHeaders) (S.readFile "in.txt")
Left "string"
http://hackage.haskell.org/package/email-header
2014-2018 Kyle Raftogianis
Good
- It's a streaming parser.
Bad
- It doesn't actually parse a list of headers, just individual header values from a list of [(CI ByteString, ByteString)].
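The missing step, turning a raw header block into name/value pairs, can be sketched as follows. This is an illustration under stated assumptions, not part of email-header: `headerPairs` is a hypothetical helper, it expects only the header block (everything before the first blank line), and it returns lower-cased `ByteString` names instead of `CI ByteString` so that it needs nothing beyond base and bytestring. Wrapping the names with `mk` from the case-insensitive package should give the list type mentioned above.

```haskell
import qualified Data.ByteString.Char8 as S
import Data.Char (toLower)

-- | Split a raw header block into (lower-cased name, value) pairs.
-- Folded continuation lines (starting with space or tab) are joined to
-- the previous line by removing only the line break.
headerPairs :: S.ByteString -> [(S.ByteString, S.ByteString)]
headerPairs =
  map split . unfold . filter (not . S.null) . map stripCR . S.lines
  where
    -- Drop a trailing '\r' so both CRLF and LF input are accepted.
    stripCR l = maybe l (\(i, c) -> if c == '\r' then i else l) (S.unsnoc l)

    -- Merge each continuation line into the line before it.
    unfold (l:c:rest)
      | isContinuation c = unfold (S.append l c : rest)
    unfold (l:rest) = l : unfold rest
    unfold [] = []

    isContinuation c = not (S.null c) && isWsp (S.head c)
    isWsp ch = ch == ' ' || ch == '\t'

    -- "Name: value" becomes ("name", "value").
    split l =
      let (name, rest) = S.break (== ':') l
      in (S.map toLower name, S.dropWhile isWsp (S.drop 1 rest))
```

Applied to the headers of the sample message, this yields a `content-type` entry whose value is `multipart/mixed; boundary="XXXXboundary text"`, with the folded line joined back on.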
http://hackage.haskell.org/package/hsemail
Peter Simons, Ali Abrar, Gero Kriependorf, Marty Pauley
Good
- It properly parses emails that I received.
- You can parse from a ByteString.
Bad
- It uses the archaic old-time package, so you have to convert all the times to the modern time package.
- It uses String everywhere.
- It does not parse the MIME bodies, so one has to manually handle multipart messages, which I did for a while before switching to the mime package.
- It's using parsec, which is neither streaming nor efficient enough for handling megabytes of traffic.
Example
Prelude Text.Parsec.Rfc2822 Text.Parsec S> fmap (parse message "") (S.readFile "in.txt")
Right
(Message
[ From
[ NameAddr
{ nameAddr_name = Just "John Doe"
, nameAddr_addr = "example@example.com"
}
]
, OptionalField "MIME-Version" " 1.0"
, OptionalField
"Content-Type"
" multipart/mixed;\r\n boundary=\"XXXXboundary text\""
]
"This is a multipart message in MIME format.\r\n\r\n--XXXXboundary text\r\nContent-Type: text/html\r\n\r\nthis is the <b>body</b> text\r\n\r\n--XXXXboundary text\r\nContent-Type: text/plain;\r\nContent-Disposition: attachment;\r\n filename=\"test.txt\"\r\n\r\nthis is the attachment text\r\n\r\n--XXXXboundary text--\r\n\n\n")
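Since hsemail hands back the body as one undivided string, handling multipart messages manually means splitting that body on the boundary yourself. The following is a minimal sketch of that step, not the code I actually ran: `splitParts` is a hypothetical helper, it assumes CRLF line endings, and it does not check that the first delimiter starts at the beginning of a line.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString.Char8 as S

-- | Split a multipart body into its raw parts (part headers plus part
-- body), given the boundary parameter from the Content-Type header.
-- The preamble and epilogue are dropped. Hypothetical sketch.
splitParts :: S.ByteString -> S.ByteString -> [S.ByteString]
splitParts boundary body = go (afterDelim body)
  where
    delim = S.append "--" boundary

    -- Skip to just past the next delimiter; Nothing if there is none.
    afterDelim bs =
      let (_, rest) = S.breakSubstring delim bs
      in if S.null rest
           then Nothing
           else Just (S.drop (S.length delim) rest)

    go Nothing = []
    go (Just rest)
      | "--" `S.isPrefixOf` rest = [] -- closing delimiter "--boundary--"
      | otherwise =
          let content = dropLineEnd rest -- skip the rest of the delimiter line
              -- The CRLF before a delimiter belongs to the delimiter.
              (part, _) = S.breakSubstring (S.append "\r\n" delim) content
          in part : go (afterDelim (S.drop (S.length part) content))

    dropLineEnd = S.drop 1 . S.dropWhile (/= '\n')
```

Each returned part still starts with its own headers, so it has to be parsed again, and a part whose Content-Type is itself multipart needs another call to `splitParts` with the inner boundary.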
http://hackage.haskell.org/package/mime-string
Ian Lynagh
Bad
- String-based.
- Not streaming.
http://hackage.haskell.org/package/emailparse
Michal Kawalec [email protected]
I tested this out on my server for a little while.
Good
- It's based on attoparsec, so streaming.
Bad
- Messages (multipart) are yielded as a tree, not streaming; so one cannot write parts to disk/DB in a streaming fashion. All of a 10MB email would have to be loaded into memory.
- It does not handle nested multipart messages. It handles one level of nesting, but it's common for messages to have e.g. this nesting form: [[html,text],[attachment]]
- Doesn't build; depends on a C library for decoding base64.
Example
Problem:
, emailBodies =
[ MessageBody
(EmailMessage
{ emailHeaders =
[ Header
{ headerName = "Content-Type"
, headerContents =
"multipart/alternative; boundary=\"000000000000a2a84f05715ab8ef\""
}
]
, emailBodies =
[ TextBody
"--000000000000a2a84f05715ab8ef\r\nContent-Type: text/plain; charset=\"UTF-8\"\r\n\r\nHere's a smaller file\r\n\r\n--000000000000a2a84f05715ab8ef\r\nContent-Type: text/html; charset=\"UTF-8\"\r\n\r\n<div dir=\"ltr\">Here's a smaller file</div>\r\n\r\n--000000000000a2a84f05715ab8ef--\r\n"
]
})
]
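In the output above, the inner multipart/alternative part is left as a single unparsed TextBody. What is needed instead is a recursive representation. Here is a minimal sketch of one; the `Part` type, `example` and `leaves` are hypothetical and unrelated to emailparse's actual types.

```haskell
-- | Hypothetical MIME tree: a part is either a leaf or a multipart
-- container holding further parts, to any depth.
data Part
  = Leaf String             -- content type of a leaf, e.g. "text/html"
  | Multipart String [Part] -- subtype ("mixed", "alternative") and children
  deriving (Show, Eq)

-- | The [[html,text],[attachment]] shape from the list above; the
-- attachment's content type is made up for the example.
example :: Part
example =
  Multipart "mixed"
    [ Multipart "alternative" [Leaf "text/html", Leaf "text/plain"]
    , Multipart "mixed" [Leaf "application/octet-stream"]
    ]

-- | Collect every leaf, recursing into nested multiparts.
leaves :: Part -> [String]
leaves (Leaf t) = [t]
leaves (Multipart _ ps) = concatMap leaves ps
```

A parser that stops after one level would report the two inner containers as opaque bodies, whereas `leaves example` reaches all three leaf parts.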
We're currently working on a new library over at https://github.com/purebred-mua/purebred-email in order to use it in purebred (text based MUA which is in development). It will not tick all boxes, since it's work-in-progress, but might be good enough to draw some help to finish it?
Update: Just in case it's not clear: fast parsing is implemented and can already be used.