With lxml 4.5.0:
❯ python
Python 3.9.1 (default, Feb 5 2021, 17:04:50)
[Clang 12.0.0 (clang-1200.0.32.29)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from lxml import etree
>>> from io import StringIO
>>> etree.parse(StringIO('<h2>👺</h2>'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "src/lxml/etree.pyx", line 3519, in lxml.etree.parse
  File "src/lxml/parser.pxi", line 1856, in lxml.etree._parseDocument
  File "src/lxml/parser.pxi", line 1876, in lxml.etree._parseMemoryDocument
  File "src/lxml/parser.pxi", line 1757, in lxml.etree._parseDoc
  File "src/lxml/parser.pxi", line 1068, in lxml.etree._BaseParser._parseUnicodeDoc
  File "src/lxml/parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 711, in lxml.etree._handleParseResult
  File "src/lxml/parser.pxi", line 640, in lxml.etree._raiseParseError
  File "<string>", line 1
lxml.etree.XMLSyntaxError: Char 0x0 out of allowed range, line 1, column 2
>>>
With lxml 4.6.1:
❯ python
Python 3.9.1 (default, Feb 5 2021, 17:04:50)
[Clang 12.0.0 (clang-1200.0.32.29)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from lxml import etree
>>> from io import StringIO
>>> etree.parse(StringIO('<h2>👺</h2>'))
<lxml.etree._ElementTree object at 0x10ea66f40>
>>>
This bug was closed, but I just ran into it myself, and the test snippet still fails for me as well.
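In case it helps anyone compare environments, here is a minimal sketch (not part of the original report) that prints the lxml and libxml2 versions and runs the same snippet twice: once as `str` through `StringIO`, as in the sessions above, and once as UTF-8 bytes through `BytesIO`. The names `SNIPPET` and `try_parse` are made up for this example. Trying the bytes path rests on my assumption that the failure is specific to `str` input; I have not confirmed that, so treat it as a diagnostic rather than a fix.

```python
from io import BytesIO, StringIO

from lxml import etree

# Same document as in the sessions above.
SNIPPET = '<h2>👺</h2>'


def try_parse(source):
    """Parse a file-like object; return ('ok', root text) or ('error', message)."""
    try:
        return ('ok', etree.parse(source).getroot().text)
    except etree.XMLSyntaxError as exc:
        return ('error', str(exc))


# Version details that are worth quoting when reporting the problem.
print('lxml:', etree.LXML_VERSION)
print('libxml2 (runtime):', etree.LIBXML_VERSION)
print('libxml2 (compiled):', etree.LIBXML_COMPILED_VERSION)

# str input (the case from the sessions above), then the same text as UTF-8 bytes.
print('str input:  ', try_parse(StringIO(SNIPPET)))
print('bytes input:', try_parse(BytesIO(SNIPPET.encode('utf-8'))))
```

If the `str` line reports an error while the bytes line parses, that would at least narrow the problem down to how this build handles Python unicode strings.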