mmcclimon · May 17, 2021 18:54
diff --git a/gistfile1.txt b/gistfile1.txt
 commit 1ab441c2e580e7498617dedcff90df674743e3e5
 Date:   2017-04-11 15:13:15 -0400

    fix handling of non-ASCII whitespace in newline munging
    
    Cyrus will only accept mail with CRLF line endings, as is right and
    proper for a network service.  Meanwhile, postfix sends mail to pipe(8)
    transports with LF only, as is right and proper for a unix text pipe.
    
    To translate that, we start by converting incoming newlines from
    "generic newline" to CRLF.  Email::MIME works with either one, and this
    way we're sure that Cyrus gets what it wants.  This can be written as:
    
      s/\x0d\x0a|\x0a/\x0d\x0a/g
    
    Perl has a "generic newline" atom, though, which is what we used:
    
      s/\R/\x0d\x0a/g
    
    ...which works just fine!  \R matches CRLF or CR or LF, which was
    sufficient for our needs.  So, what's the problem?  Well, I sent a
    message to a Topicbox group which mentioned my learned colleague
    Rob N. ★, and the message got stuck in queue.  `ingress` choked on it:
    The sequence 0xE2 could not be decoded into Unicode.  Whaaa?  The
    message in question was "text/plain; charset=utf-8" and content encoding
    was "8bit"!  mutt was sending 8bit and ingress was choking on broken
    UTF-8.
    
    What's the connection here?  Well:
    
      ~$ uni -8 ★
      ★ - U+02605 - E2 98 85 - BLACK STAR
    
    The star in Rob's name UTF-8 encodes to the three octets above.  The
    s/// pattern was being applied to the octets, not any decoded form,
    because email encoding is too complex to just decode in situ.  That
    meant it treated each UTF-8 octet as a Unicode character!  You might
    think this doesn't matter, because CR and LF are single octets in
    UTF-8 with the same values as their Unicode code points, but *in
    fact*, \R also matches every vertical whitespace character.
    
    So, what characters was \R matching against?
    
      ~$ uniprops u+00e2
      U+00E2 ‹â› \N{LATIN SMALL LETTER A WITH CIRCUMFLEX}
          \w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
    
      ~$ uniprops u+0098
      U+0098 ‹U+0098› \N{START OF STRING}
          \pC \p{Cc}
    
      ~$ uniprops u+0085
      U+0085 ‹U+0085› \N{NEXT LINE}
          \s \v \R \pC \p{Cc}
    
    The third octet in the UTF-8 encoding of ★ is 0x85.  If interpreted as
    U+0085, it's NEXT LINE, a vertical whitespace character, and it was
    being turned into CRLF.
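
As a rough illustration of the paragraph above (a sketch, not code from this repository), the following Perl applies \R to the undecoded octets of ★ and dumps the result as hex. The `hex_dump` helper and the sample string are assumptions made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper for this example: render a byte string as hex octets.
sub hex_dump {
  my ($octets) = @_;
  return join ' ', map { sprintf '%02X', ord($_) } split //, $octets;
}

# "Rob N. " plus the UTF-8 octets of U+2605 BLACK STAR, then a bare LF.
# This is a byte string; nothing here has been decoded.
my $octets = "Rob N. \xE2\x98\x85\x0a";

# The old normalizer: \R matches 0x85 (read as U+0085 NEXT LINE) as well
# as the trailing LF, so both are rewritten to CRLF.
(my $munged = $octets) =~ s/\R/\x0d\x0a/g;

print "before: ", hex_dump($octets), "\n";
print "after:  ", hex_dump($munged), "\n";
```

In the "after" dump the 0x85 octet is gone, leaving E2 98 followed by a CRLF. That is no longer valid UTF-8, which is consistent with the decoding error quoted earlier.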
    
    With a stricter newline normalizer, everything works again.  We only
    detected this problem because Rob N. ★ uses a non-ASCII character whose
    UTF-8 encoding has an octet whose value corresponds to a vertical
    whitespace character in the Unicode character set.  Amazing!
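
For comparison, here is a minimal sketch of the stricter pattern applied to the same kind of input. The pattern is the one this commit switches to; the `normalize_newlines` wrapper and the sample message are assumptions for illustration only.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the pattern adopted in this commit.
# It matches only CRLF or a bare LF, so other octets pass through untouched.
sub normalize_newlines {
  my ($octets) = @_;
  $octets =~ s/\x0d\x0a|\x0a/\x0d\x0a/g;
  return $octets;
}

# One LF-terminated line containing the UTF-8 octets of U+2605 BLACK STAR,
# followed by a line that already ends in CRLF.
my $in  = "To: Rob N. \xE2\x98\x85\x0aSubject: hi\x0d\x0a";
my $out = normalize_newlines($in);

# The star's octets survive, and the bare LF becomes CRLF.
print $out eq "To: Rob N. \xE2\x98\x85\x0d\x0aSubject: hi\x0d\x0a"
  ? "star preserved\n"
  : "star damaged\n";

# An existing CRLF is replaced with itself, so a second pass changes nothing.
print normalize_newlines($out) eq $out ? "idempotent\n" : "not idempotent\n";
```

Unlike \R, this pattern also leaves a lone CR alone, which is part of what makes it stricter.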

 diff --git a/bin/bounce-ingress b/bin/bounce-ingress
 index f69d4b89..a9a71570 100755
 --- a/bin/bounce-ingress
 +++ b/bin/bounce-ingress
 @@ -43,7 +43,7 @@ my ($opt, $usage) = describe_options(
 );
 
 my $input = do { local $/; <STDIN> };
 -$input =~ s/\R/\r\n/g; # because Cyrus won't store messages with just \n
 +$input =~ s/\x0d\x0a|\x0a/\x0d\x0a/g; # because Cyrus won't store messages with just \n
 
 my $topicbox = Topicbox::JMAP::Processor->from_config($opt->config);
 
 diff --git a/bin/ingress b/bin/ingress
 index 854f9fd6..fd78b56f 100755
 --- a/bin/ingress
 +++ b/bin/ingress
 @@ -44,7 +44,7 @@ my ($opt, $usage) = describe_options(
 );
 
 my $input = do { local $/; <STDIN> };
 -$input =~ s/\R/\r\n/g; # because Cyrus won't store messages with just \n
 +$input =~ s/\x0d\x0a|\x0a/\x0d\x0a/g; # because Cyrus won't store messages with just \n
 
 my $topicbox = Topicbox::JMAP::Processor->from_config($opt->config);
 
 diff --git a/lib/Topicbox/ImportHelper.pm b/lib/Topicbox/ImportHelper.pm
 index 5006d31c..8e3b64b9 100644
 --- a/lib/Topicbox/ImportHelper.pm
 +++ b/lib/Topicbox/ImportHelper.pm
 @@ -355,7 +355,7 @@ sub _read_mbox_batch ($self, $mbox, $n) {
 
   while (my $email_in = $mbox->next_message) {
     my $text = $email_in->as_string;
 -    $text =~ s/\R/\r\n/g;
 +    $text =~ s/\x0d\x0a|\x0a/\x0d\x0a/g;
     my $email = Email::MIME->new($text);
     push @messages, $email;