Created
May 17, 2021 18:54
commit 1ab441c2e580e7498617dedcff90df674743e3e5
Date: 2017-04-11 15:13:15 -0400

fix handling of non-ASCII whitespace in newline munging

Cyrus will only accept mail with CRLF line endings, as is right and
proper for a network service. Meanwhile, postfix sends mail to pipe(8)
transports with LF only, as is right and proper for a unix text pipe.

To translate that, we start by converting incoming newlines from
"generic newline" to CRLF. Email::MIME works with either one, and this
way we're sure that Cyrus gets what it wants. This can be written as:

    s/\x0d\x0a|\x0a/\x0d\x0a/g

Perl has a "generic newline" atom, though, which is what we used:

    s/\R/\x0d\x0a/g

...which works just fine! \R matches CRLF or CR or LF, which was
sufficient for our needs. So, what's the problem? Well, I sent a
message to a Topicbox group which mentioned my learned colleague
Rob N. ★, and the message got stuck in the queue. `ingress` choked on
it: The sequence 0xE2 could not be decoded into Unicode. Whaaa? The
message in question was "text/plain; charset=utf-8" and its
Content-Transfer-Encoding was "8bit"! mutt was sending 8bit and ingress
was choking on broken UTF-8.

What's the connection here? Well:

    ~$ uni -8 ★
    ★ - U+02605 - E2 98 85 - BLACK STAR

The star in Rob's name UTF-8 encodes to the three octets above. The
s/// pattern was being applied to the octets, not any decoded form,
because email encoding is too complex to just decode in situ. That
meant it treated each UTF-8 octet as a Unicode character! You might
think this doesn't matter, because CR and LF are identical in UTF-8 and
Unicode, but *in fact*, \R also matches every vertical whitespace
character.

So, what characters was \R matching against?

    ~$ uniprops u+00e2
    U+00E2 ‹â› \N{LATIN SMALL LETTER A WITH CIRCUMFLEX}
        \w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
    ~$ uniprops u+0098
    U+0098 ‹U+0098› \N{START OF STRING}
        \pC \p{Cc}
    ~$ uniprops u+0085
    U+0085 ‹U+0085› \N{NEXT LINE}
        \s \v \R \pC \p{Cc}

The third octet in the UTF-8 encoding of ★ is 0x85. If interpreted as
U+0085, it's NEXT LINE, a vertical whitespace character, and it was
being turned into CRLF.

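The effect can be reproduced in isolation. What follows is a minimal Perl sketch, not code from the Topicbox tree: the helper names `octets_matching_generic_newline` and `munge_generic` are made up for illustration. It lists which single octets \R matches when an undecoded byte string is treated as characters, then applies the old substitution to the three octets of ★, which should leave E2 98 followed by CRLF.

```perl
use strict;
use warnings;

# Hypothetical helper: every octet value that \R matches when the octet
# is treated as a single character.
sub octets_matching_generic_newline {
    return grep { chr($_) =~ /\A\R\z/ } 0x00 .. 0xFF;
}

# Hypothetical helper: the original substitution, applied to raw octets.
sub munge_generic {
    my ($octets) = @_;
    $octets =~ s/\R/\x0d\x0a/g;
    return $octets;
}

# Show the matching octets in hex.
print join(' ', map { sprintf '0x%02X', $_ } octets_matching_generic_newline()), "\n";

# The UTF-8 octets of U+2605 BLACK STAR, kept as an undecoded byte string.
my $star   = "\xE2\x98\x85";
my $munged = munge_generic($star);

# Dump the result octet by octet to see what happened to 0x85.
print join(' ', map { sprintf '%02X', ord } split //, $munged), "\n";
```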
With a stricter newline normalizer, everything works again. We only
detected this problem because Rob N. ★ uses a non-ASCII character whose
UTF-8 encoding has an octet whose value corresponds to a vertical
whitespace character in the Unicode character set. Amazing!

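As a rough check of the stricter pattern, the sketch below wraps it in a sub and confirms that UTF-8 octets survive normalization. The names `normalize_strict` and `is_valid_utf8` are hypothetical helpers, and using Encode with FB_CROAK as the validity check is an assumption made for this example, not necessarily how `ingress` decodes mail. Note that this pattern leaves a bare CR untouched, unlike \R.

```perl
use strict;
use warnings;
use Encode ();

# Stricter normalizer, as in the patch: only CRLF and bare LF become CRLF.
sub normalize_strict {
    my ($octets) = @_;
    $octets =~ s/\x0d\x0a|\x0a/\x0d\x0a/g;
    return $octets;
}

# Hypothetical helper: true if the octets decode cleanly as UTF-8.
# FB_CROAK makes decode() die on malformed input; we work on a copy.
sub is_valid_utf8 {
    my ($octets) = @_;
    return eval { Encode::decode('UTF-8', $octets, Encode::FB_CROAK); 1 } ? 1 : 0;
}

# A body with a bare LF, an existing CRLF, and the octets of U+2605.
my $body = "Hello Rob N. \xE2\x98\x85\nsecond line\x0d\x0a";
my $out  = normalize_strict($body);

print is_valid_utf8($out) ? "still valid UTF-8\n" : "broken UTF-8\n";
```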
diff --git a/bin/bounce-ingress b/bin/bounce-ingress
index f69d4b89..a9a71570 100755
--- a/bin/bounce-ingress
+++ b/bin/bounce-ingress
@@ -43,7 +43,7 @@ my ($opt, $usage) = describe_options(
 );
 my $input = do { local $/; <STDIN> };
-$input =~ s/\R/\r\n/g; # because Cyrus won't store messages with just \n
+$input =~ s/\x0d\x0a|\x0a/\x0d\x0a/g; # because Cyrus won't store messages with just \n
 my $topicbox = Topicbox::JMAP::Processor->from_config($opt->config);
diff --git a/bin/ingress b/bin/ingress
index 854f9fd6..fd78b56f 100755
--- a/bin/ingress
+++ b/bin/ingress
@@ -44,7 +44,7 @@ my ($opt, $usage) = describe_options(
 );
 my $input = do { local $/; <STDIN> };
-$input =~ s/\R/\r\n/g; # because Cyrus won't store messages with just \n
+$input =~ s/\x0d\x0a|\x0a/\x0d\x0a/g; # because Cyrus won't store messages with just \n
 my $topicbox = Topicbox::JMAP::Processor->from_config($opt->config);
diff --git a/lib/Topicbox/ImportHelper.pm b/lib/Topicbox/ImportHelper.pm
index 5006d31c..8e3b64b9 100644
--- a/lib/Topicbox/ImportHelper.pm
+++ b/lib/Topicbox/ImportHelper.pm
@@ -355,7 +355,7 @@ sub _read_mbox_batch ($self, $mbox, $n) {
 while (my $email_in = $mbox->next_message) {
 my $text = $email_in->as_string;
- $text =~ s/\R/\r\n/g;
+ $text =~ s/\x0d\x0a|\x0a/\x0d\x0a/g;
 my $email = Email::MIME->new($text);
 push @messages, $email;