Created August 4, 2012 15:21
Silly lxml bug in Python
>>> from lxml import etree
>>> xml = u'<?xml version="1.0" encoding="utf-8" ?><foo><bar/></foo>'
>>> etree.XML(xml)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "lxml.etree.pyx", line 2736, in lxml.etree.XML (src/lxml/lxml.etree.c:54437)
  File "parser.pxi", line 1569, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:82685)
ValueError: Unicode strings with encoding declaration are not supported.
>>> etree.HTML(xml)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "lxml.etree.pyx", line 2708, in lxml.etree.HTML (src/lxml/lxml.etree.c:54160)
  File "parser.pxi", line 1569, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:82685)
ValueError: Unicode strings with encoding declaration are not supported.
>>> lxml.etree.__version__
u'2.3.3'
>>> xml = u"<foo><bar/></foo>"
>>> etree.HTML(xml)
<Element html at 0x105364870>
>>> etree.XML(xml)
<Element foo at 0x105395a00>
It's a bug, I agree. And @kennethreitz doesn't seem to intend to fix his part either: https://github.com/kennethreitz/requests/issues/465
Anyway, the lxml bug above is easily worked around by encoding the string to bytes first:
>>> from lxml import etree
>>> xml = u'<?xml version="1.0" encoding="utf-8" ?><foo><bar/></foo>'
>>> xml = xml.encode('utf-8')  # added line: pass bytes encoded to match the declared encoding
>>> etree.XML(xml)
<Element foo at 0x5b44c90>
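The same idea can be wrapped in a small helper that encodes the string with whatever encoding the declaration names, so that the bytes and the declaration agree. This is only a sketch: `encode_for_parser` is a hypothetical name (not part of lxml), the regex only looks at a declaration at the very start of the string, and the code assumes Python 3, where `str` is unicode.

```python
import re

# Captures the encoding name from a leading XML declaration, if there is one.
_DECL_ENCODING = re.compile(
    r"^<\?xml[^>]*?encoding\s*=\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']"
)


def encode_for_parser(text, default="utf-8"):
    """Return `text` as bytes, encoded with its declared encoding.

    Hypothetical helper, not an lxml API. Falls back to `default`
    when the string carries no encoding declaration.
    """
    match = _DECL_ENCODING.match(text)
    encoding = match.group(1) if match else default
    return text.encode(encoding)


xml = u'<?xml version="1.0" encoding="utf-8" ?><foo><bar/></foo>'
data = encode_for_parser(xml)
# `data` is a bytes object and can be passed to etree.XML(data).
```

Hard-coding `'utf-8'` as in the one-liner above is fine when the declaration is known; reading the encoding from the declaration only matters when the documents come from mixed sources.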
@kernc we won't fix it because it isn't a bug. If you're using requests to get this string then the following should always work:
import requests
from lxml import etree
r = requests.get('http://example.com')
elem = etree.XML(r.content)
If you instead use r.text, that is when you'll run into problems. On the other hand, from this gist, it seems clear this is something with lxml and not requests: one call with a unicode string fails, while another with a different unicode string works. And from the error message and the discussion on Launchpad, it seems like this is intentional.
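Going by the transcript in the gist, the only difference between the failing and the working unicode input is the XML declaration. If bytes are not an option, for example when only `r.text` is at hand, one possible workaround is to drop the declaration before parsing. A minimal sketch, assuming Python 3; `strip_xml_declaration` is a hypothetical helper, not an lxml or requests function:

```python
import re

# A leading XML declaration such as <?xml version="1.0" encoding="utf-8" ?>.
# The \s after "xml" keeps other processing instructions (e.g. xml-stylesheet) intact.
_XML_DECL = re.compile(r"^\s*<\?xml\s[^>]*\?>")


def strip_xml_declaration(text):
    """Remove a leading XML declaration from a unicode string, if present."""
    return _XML_DECL.sub("", text, count=1)


xml = u'<?xml version="1.0" encoding="utf-8" ?><foo><bar/></foo>'
fragment = strip_xml_declaration(xml)
print(fragment)  # <foo><bar/></foo>
```

The resulting string has the same shape as the declaration-free input that parsed successfully in the gist. Passing `r.content` remains the simpler route whenever the raw bytes are available.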
Posted a comment on lxml bug report https://bugs.launchpad.net/lxml/+bug/613302
Not sure why it was set as Won't Fix even with the given explanation.