Created
August 5, 2017 23:31
HTML::HTML5::Parser charset issue (debian Bug report #750946)
#!/usr/bin/env perl
# Regarding https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=750946
# There are at least two issues with the code in the bugreport.
# The first looks like a bug in HTML::HTML5::Parser (or its
# dependencies) that is mis-recognizing the charset of the file being
# opened.
#
# However, the code included in the bugreport also has a bug in it:
# even with a properly loaded $doc object (as in this case from a
# string literal), calling `print $doc->toString()` won't work as
# expected because it returns a byte string and STDOUT has been
# configured to utf8 encode all output. If STDOUT remains configured
# with the UTF-8 encoding layer, the bytes must be decoded to a
# character string before printing to STDOUT:
use strict;
use HTML::HTML5::Parser;
use Encode qw(encode_utf8 decode_utf8);
use utf8; # for the characters in the script.
binmode STDOUT, ':encoding(UTF-8)'; # for stdout.
my $parser = HTML::HTML5::Parser->new;
my $doc = $parser->parse_string(encode_utf8(<<"END"));
<?xml version="1.0" encoding="utf-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>title</title>
</head>
<body>
<p>é↓</p>
</body>
</html>
END
print "Charset: '", $parser->charset($doc), "'\n";
my $bytes = $doc->toString();
my $str = decode_utf8($bytes);
print $str;
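The second issue described in the header comment can be reproduced without HTML::HTML5::Parser at all. The following is a minimal sketch using only the core Encode module; the variable names are illustrative and not taken from the bug report. It assumes that the `:encoding(UTF-8)` layer on STDOUT treats each element of the string it is handed as a character and encodes it, so a string that already holds UTF-8 bytes gets encoded a second time.

```perl
use strict;
use warnings;
use utf8;                                 # the literal below contains non-ASCII characters
use Encode qw(encode_utf8 decode_utf8);

my $chars = "é↓";                         # character string: 2 characters
my $bytes = encode_utf8($chars);          # byte string, like the return value of toString()

# Simulate what an :encoding(UTF-8) output layer does to each kind of string.
my $double  = encode_utf8($bytes);                 # bytes encoded a second time (mojibake)
my $correct = encode_utf8(decode_utf8($bytes));    # decoded first, so encoded exactly once

printf "chars=%d bytes=%d double=%d correct=%d\n",
    length($chars), length($bytes), length($double), length($correct);
# prints: chars=2 bytes=5 double=10 correct=5
```

The doubled length is the signature of double encoding: every byte of the original UTF-8 sequence is at or above 0x80, so each one becomes a two-byte sequence on the second pass. Decoding before printing, as the script above does, avoids this.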
I think this might fix the problem, but I'm not at all familiar with the HTML5 parser code. So while it still passes its test suite, I have no idea if this might break things.