@karlcow
Created February 10, 2021 13:50
lxml.etree.XMLSyntaxError: Char 0x0 out of allowed range

With lxml 4.5.0

❯ python
Python 3.9.1 (default, Feb  5 2021, 17:04:50) 
[Clang 12.0.0 (clang-1200.0.32.29)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from lxml import etree
>>> from io import StringIO
>>> etree.parse(StringIO('<h2>👺</h2>'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "src/lxml/etree.pyx", line 3519, in lxml.etree.parse
  File "src/lxml/parser.pxi", line 1856, in lxml.etree._parseDocument
  File "src/lxml/parser.pxi", line 1876, in lxml.etree._parseMemoryDocument
  File "src/lxml/parser.pxi", line 1757, in lxml.etree._parseDoc
  File "src/lxml/parser.pxi", line 1068, in lxml.etree._BaseParser._parseUnicodeDoc
  File "src/lxml/parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 711, in lxml.etree._handleParseResult
  File "src/lxml/parser.pxi", line 640, in lxml.etree._raiseParseError
  File "<string>", line 1
lxml.etree.XMLSyntaxError: Char 0x0 out of allowed range, line 1, column 2
>>> 

With lxml 4.6.1

❯ python                         
Python 3.9.1 (default, Feb  5 2021, 17:04:50) 
[Clang 12.0.0 (clang-1200.0.32.29)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from lxml import etree
>>> from io import StringIO
>>> etree.parse(StringIO('<h2>👺</h2>'))
<lxml.etree._ElementTree object at 0x10ea66f40>
>>>

This is bug 1902364
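
The failing traceback goes through `_parseUnicodeDoc`, i.e. the code path lxml uses when it is handed a Python `str`. A possible workaround, as a minimal sketch and assuming the failure is specific to that path, is to give the parser UTF-8 bytes instead. The helper name `as_utf8_stream` is made up for illustration, and the lxml import is guarded so the snippet also runs where lxml is not installed:

```python
from io import BytesIO


def as_utf8_stream(markup: str) -> BytesIO:
    """Encode a str document as UTF-8 and wrap it in a file-like object."""
    return BytesIO(markup.encode("utf-8"))


stream = as_utf8_stream("<h2>👺</h2>")

try:
    from lxml import etree  # third-party; skipped when unavailable
except ImportError:
    etree = None

if etree is not None:
    # Parsing bytes instead of str avoids the str-specific code path
    # (assumption: that path is where the error originates).
    tree = etree.parse(stream)
    print(tree.getroot().text)
```

Whether this avoids the error on an affected setup has not been verified here; it only changes which input path the parser takes.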

@reagle

reagle commented Aug 4, 2021

They closed this bug, but I just encountered this myself and see this test snippet still fails as well.

@karlcow
Author

karlcow commented Aug 4, 2021

@reagle

Python 3.9.1 (default, Feb  5 2021, 17:04:50) 
[Clang 12.0.0 (clang-1200.0.32.29)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from lxml import etree
>>> from io import StringIO
>>> etree.parse(StringIO('<h2>👺</h2>'))
<lxml.etree._ElementTree object at 0x10327cec0>
>>> import lxml
>>> lxml.__version__
'4.6.1'

@reagle

reagle commented Aug 4, 2021

@karlcow, I didn't realize it was you! :) On macOS 11.5 20G71 arm64, I'm running python '3.9.6/[Clang 12.0.5 (clang-1205.0.22.9)]' and lxml '4.6.3' and get the error unfortunately.

@karlcow
Author

karlcow commented Aug 4, 2021

Hehe. Long time no see.
Ah, interesting. I wonder if this is something else then, because everything you have is more recent than what I have.

>>> from lxml import html, etree
>>> import sys
>>> print("%-20s: %s" % ('Python', sys.version_info))
Python              : sys.version_info(major=3, minor=9, micro=1, releaselevel='final', serial=0)
>>> print("%-20s: %s" % ('lxml.etree', etree.LXML_VERSION))
lxml.etree          : (4, 6, 1, 0)
>>> print("%-20s: %s" % ('libxml used', etree.LIBXML_VERSION))
libxml used         : (2, 9, 10)
>>> print("%-20s: %s" % ('libxml compiled', etree.LIBXML_COMPILED_VERSION))
libxml compiled     : (2, 9, 10)
>>> print("%-20s: %s" % ('libxslt used', etree.LIBXSLT_VERSION))
libxslt used        : (1, 1, 34)
>>> print("%-20s: %s" % ('libxslt compiled', etree.LIBXSLT_COMPILED_VERSION))
libxslt compiled    : (1, 1, 34)
>>> import locale
>>> locale.getlocale()
('fr_FR', 'UTF-8')

Hmm, another test outside of Python, in the terminal, to see whether it's an issue with libxml.
The error message comes from libxml.

echo '<h2>👺</h2>' > /tmp/xml_char.xml
xmllint --memory --html /tmp/xml_char.xml

And I get:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<h2>&#128122;</h2>
</body></html>

No error. This is the correct emoji.
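
A quick way to confirm that the numeric character reference in the xmllint output really is the same character, using only the standard library (a small sketch, not part of the original test):

```python
import html

emoji = "👺"
reference = "&#128122;"  # as printed by xmllint above

# The code point of the emoji, in decimal and in hex.
print(ord(emoji), hex(ord(emoji)))

# Decoding the reference gives back the original character.
print(html.unescape(reference) == emoji)
```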

So maybe it's something in Daniel Veillard's libxml2.

xmllint --version
xmllint: using libxml version 20904
   compiled with: Threads Tree Output Push Reader Patterns Writer SAXv1 FTP HTTP DTDValid HTML Legacy C14N Catalog XPath XPointer XInclude ICU ISO8859X Unicode Regexps Automata Expr Schemas Schematron Modules Debug Zlib 

/usr/bin/xsltproc --version
Using libxml 20904, libxslt 10129 and libexslt 817
xsltproc was compiled against libxml 20904, libxslt 10129 and libexslt 817
libxslt 10129 was compiled against libxml 20904
libexslt 817 was compiled against libxml 20904
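
The command-line tools print versions as a single integer (major * 10000 + minor * 100 + micro), while lxml reports tuples. A small sketch to convert one into the other so the two can be compared; `decode_version` is a hypothetical helper name:

```python
def decode_version(number: int) -> tuple:
    """Split an integer such as 20904 into (major, minor, micro)."""
    major, rest = divmod(number, 10000)
    minor, micro = divmod(rest, 100)
    return (major, minor, micro)


for name, number in [("libxml", 20904), ("libxslt", 10129), ("libexslt", 817)]:
    print("%-10s: %s" % (name, decode_version(number)))
```

So the `xmllint` on the command line here is libxml 2.9.4, which is not the same as the 2.9.10 that lxml reports inside Python above.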

@reagle

reagle commented Aug 5, 2021

Thank you for that. My xmllint and xsltproc on the command line are the same and work correctly. I suspect the problem is that, within Python, I have:

libxml used/compiled         : (2, 9, 4)
libxslt used/compiled        : (1, 1, 29)

I'm not sure how to untangle that yet given my use of Homebrew and pip3. Perhaps it's related to this:

╰─➤  brin libxml2                                                                             1 ↵
libxml2: stable 2.9.12 (bottled), HEAD [keg-only]
GNOME XML library
http://xmlsoft.org/
/opt/homebrew/Cellar/libxml2/2.9.12 (282 files, 11.3MB)
  Built from source on 2021-07-06 at 11:42:44
From: https://github.com/Homebrew/homebrew-core/blob/HEAD/Formula/libxml2.rb
License: MIT
==> Dependencies
Build: [email protected] ✔
Required: readline ✔
==> Options
--HEAD
	Install HEAD version
==> Caveats
libxml2 is keg-only, which means it was not symlinked into /opt/homebrew,
because macOS already provides this software and installing another version in
parallel can cause all kinds of trouble.

If you need to have libxml2 first in your PATH, run:
  echo 'export PATH="/opt/homebrew/opt/libxml2/bin:$PATH"' >> ~/.zshrc

For compilers to find libxml2 you may need to set:
  export LDFLAGS="-L/opt/homebrew/opt/libxml2/lib"
  export CPPFLAGS="-I/opt/homebrew/opt/libxml2/include"

For pkg-config to find libxml2 you may need to set:
  export PKG_CONFIG_PATH="/opt/homebrew/opt/libxml2/lib/pkgconfig"

@karlcow
Author

karlcow commented Aug 5, 2021

❯ brew info libxml2
libxml2: stable 2.9.12 (bottled), HEAD [keg-only]
GNOME XML library
http://xmlsoft.org/
Not installed
From: https://github.com/Homebrew/homebrew-core/blob/HEAD/Formula/libxml2.rb
License: MIT
==> Dependencies
Build: [email protected] ✘
Required: readline ✔
==> Options
--HEAD
	Install HEAD version
==> Caveats
libxml2 is keg-only, which means it was not symlinked into /usr/local,
because macOS already provides this software and installing another version in
parallel can cause all kinds of trouble.

==> Analytics
install: 51,320 (30 days), 175,981 (90 days), 637,723 (365 days)
install-on-request: 30,204 (30 days), 99,449 (90 days), 373,165 (365 days)
build-error: 0 (30 days)

Ah, indeed, there is a difference here.

You: Build: [email protected]
Me: Build: [email protected]

@karlcow
Author

karlcow commented Aug 5, 2021

❯ which python3
/usr/local/bin/python3

❯ which -a python3
/usr/local/bin/python3
/Library/Frameworks/Python.framework/Versions/3.8/bin/python3
/usr/local/bin/python3
/usr/bin/python3

A bit of a mess too. :)

❯ /usr/local/bin/python3 -V
Python 3.9.1
❯ /Library/Frameworks/Python.framework/Versions/3.8/bin/python3 -V
Python 3.8.3
❯ /usr/bin/python3 -V
Python 3.8.2
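
With several interpreters around, it may help to ask each one which executable is running and where it would load lxml from. A minimal sketch using only the standard library; `locate` is a hypothetical helper, and it returns `None` when the module is not installed for that interpreter:

```python
import importlib.util
import sys


def locate(module_name: str):
    """Return the file a top-level module would be imported from, or None."""
    spec = importlib.util.find_spec(module_name)
    return None if spec is None else spec.origin


print("interpreter     :", sys.executable)
print("lxml loaded from:", locate("lxml"))
```

Running this with each of the `python3` paths listed above should show whether they share one lxml install or each have their own.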
