commit 1ab441c2e580e7498617dedcff90df674743e3e5
Date: 2017-04-11 15:13:15 -0400
fix handling of non-ASCII whitespace in newline munging
Cyrus will only accept mail with CRLF line endings, as is right and
proper for a network service. Meanwhile, postfix sends mail to pipe(8)
transports with LF only, as is right and proper for a unix text pipe.
To translate that, we start by converting incoming newlines from
"generic newline" to CRLF. Email::MIME works with either one, and this
way we're sure that Cyrus gets what it wants. This can be written as:
s/\x0d\x0a|\x0a/\x0d\x0a/g
Perl has a "generic newline" atom, though, which is what we used:
s/\R/\x0d\x0a/g
...which works just fine! \R matches CRLF or CR or LF, which was
sufficient for our needs. So, what's the problem? Well, I sent a
message to a Topicbox group which mentioned my learned colleague
Rob N. ★, and the message got stuck in queue. `ingress` choked on it:
The sequence 0xE2 could not be decoded into Unicode. Whaaa? The
message in question was "text/plain; charset=utf-8" and content encoding
was "8bit"! mutt was sending 8bit and ingress was choking on broken
UTF-8.
What's the connection here? Well:
~$ uni -8 ★
★ - U+02605 - E2 98 85 - BLACK STAR
The star in Rob's name UTF-8 encodes to the three octets above. The
s/// pattern was being applied to the octets, not any decoded form,
because email encoding is too complex to just decode in situ. That
meant it treated each UTF-8 octet as a Unicode character! You might
think this doesn't matter, because CR and LF have the same value as
UTF-8 octets and as Unicode code points, but *in fact*, \R also matches
every other vertical whitespace character.
So, what characters was \R matching against?
~$ uniprops u+00e2
U+00E2 ‹â› \N{LATIN SMALL LETTER A WITH CIRCUMFLEX}
\w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
~$ uniprops u+0098
U+0098 ‹U+0098› \N{START OF STRING}
\pC \p{Cc}
~$ uniprops u+0085
U+0085 ‹U+0085› \N{NEXT LINE}
\s \v \R \pC \p{Cc}
The third octet in the UTF-8 encoding of ★ is 0x85. If interpreted as
U+0085, it's NEXT LINE, a vertical whitespace character, and it was
being turned into CRLF.
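A minimal sketch of the failure, assuming a byte string that holds the
UTF-8 octets of "Rob N. ★" followed by a bare LF. The variable names and
the hex_dump helper are made up for this illustration and are not part
of ingress:

```perl
use strict;
use warnings;

# Hypothetical helper: render a byte string as space-separated hex octets.
sub hex_dump {
    my ($octets) = @_;
    return uc join ' ', unpack '(H2)*', $octets;
}

# Undecoded octets, as they would arrive on STDIN: "Rob N. ", then the
# three octets of U+2605 BLACK STAR, then a bare LF.
my $octets = "Rob N. \xE2\x98\x85\x0a";

# The old normalizer: \R also matches 0x85 when applied to raw octets.
(my $munged = $octets) =~ s/\R/\x0d\x0a/g;

print 'before: ', hex_dump($octets), "\n";
print 'after:  ', hex_dump($munged), "\n";
# before: 52 6F 62 20 4E 2E 20 E2 98 85 0A
# after:  52 6F 62 20 4E 2E 20 E2 98 0D 0A 0D 0A
```

The trailing E2 98 in the "after" line is a truncated UTF-8 sequence,
which is consistent with a decoder complaining about 0xE2.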
With a stricter newline normalizer, everything works again. We only
detected this problem because Rob N. ★ uses a non-ASCII character whose
UTF-8 encoding has an octet whose value corresponds to a vertical
whitespace character in the Unicode character set. Amazing!
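As a rough check of the stricter pattern, here is a sketch that wraps
it in a function and feeds it a message containing the star. The name
normalize_newlines is hypothetical; the commit itself inlines the
substitution:

```perl
use strict;
use warnings;

# Hypothetical wrapper around the stricter substitution.
sub normalize_newlines {
    my ($octets) = @_;
    # Rewrite only CRLF and bare LF. Every other octet, including 0x85,
    # passes through untouched.
    $octets =~ s/\x0d\x0a|\x0a/\x0d\x0a/g;
    return $octets;
}

# Mixed line endings plus the UTF-8 octets of U+2605 BLACK STAR.
my $in  = "Subject: hi\x0a\x0aRob N. \xE2\x98\x85\x0d\x0a";
my $out = normalize_newlines($in);

my $expected = "Subject: hi\x0d\x0a\x0d\x0aRob N. \xE2\x98\x85\x0d\x0a";
print $out eq $expected ? "ok\n" : "not ok\n";
```

One difference from \R worth keeping in mind: this pattern leaves a
lone CR alone, which is fine as long as the input only ever uses LF or
CRLF.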
diff --git a/bin/bounce-ingress b/bin/bounce-ingress
index f69d4b89..a9a71570 100755
--- a/bin/bounce-ingress
+++ b/bin/bounce-ingress
@@ -43,7 +43,7 @@ my ($opt, $usage) = describe_options(
);
my $input = do { local $/; <STDIN> };
-$input =~ s/\R/\r\n/g; # because Cyrus won't store messages with just \n
+$input =~ s/\x0d\x0a|\x0a/\x0d\x0a/g; # because Cyrus won't store messages with just \n
my $topicbox = Topicbox::JMAP::Processor->from_config($opt->config);
diff --git a/bin/ingress b/bin/ingress
index 854f9fd6..fd78b56f 100755
--- a/bin/ingress
+++ b/bin/ingress
@@ -44,7 +44,7 @@ my ($opt, $usage) = describe_options(
);
my $input = do { local $/; <STDIN> };
-$input =~ s/\R/\r\n/g; # because Cyrus won't store messages with just \n
+$input =~ s/\x0d\x0a|\x0a/\x0d\x0a/g; # because Cyrus won't store messages with just \n
my $topicbox = Topicbox::JMAP::Processor->from_config($opt->config);
diff --git a/lib/Topicbox/ImportHelper.pm b/lib/Topicbox/ImportHelper.pm
index 5006d31c..8e3b64b9 100644
--- a/lib/Topicbox/ImportHelper.pm
+++ b/lib/Topicbox/ImportHelper.pm
@@ -355,7 +355,7 @@ sub _read_mbox_batch ($self, $mbox, $n) {
while (my $email_in = $mbox->next_message) {
my $text = $email_in->as_string;
- $text =~ s/\R/\r\n/g;
+ $text =~ s/\x0d\x0a|\x0a/\x0d\x0a/g;
my $email = Email::MIME->new($text);
push @messages, $email;