@karlcow
Created August 4, 2012 15:21
Silly lxml bug in Python
>>> from lxml import etree
>>> xml = u'<?xml version="1.0" encoding="utf-8" ?><foo><bar/></foo>'
>>> etree.XML(xml)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "lxml.etree.pyx", line 2736, in lxml.etree.XML (src/lxml/lxml.etree.c:54437)
File "parser.pxi", line 1569, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:82685)
ValueError: Unicode strings with encoding declaration are not supported.
>>> etree.HTML(xml)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "lxml.etree.pyx", line 2708, in lxml.etree.HTML (src/lxml/lxml.etree.c:54160)
File "parser.pxi", line 1569, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:82685)
ValueError: Unicode strings with encoding declaration are not supported.
>>> etree.__version__
u'2.3.3'
>>> xml = u"<foo><bar/></foo>"
>>> etree.HTML(xml)
<Element html at 0x105364870>
>>> etree.XML(xml)
<Element foo at 0x105395a00>
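
The session above suggests that the same unicode string parses once the encoding declaration is gone. A minimal sketch of that idea, using only the standard library: the helper name `strip_xml_declaration` and its regular expression are illustrative assumptions, not part of lxml, and the sketch does not handle a byte order mark or anything else placed before the declaration.

```python
import re

# Leading XML declaration, e.g. <?xml version="1.0" encoding="utf-8" ?>
# (hypothetical pattern written for this example, not taken from lxml)
_XML_DECLARATION = re.compile(r'^<\?xml\s[^>]*\?>\s*')


def strip_xml_declaration(text):
    """Return text without its leading XML declaration, if it has one."""
    return _XML_DECLARATION.sub('', text, count=1)


xml = u'<?xml version="1.0" encoding="utf-8" ?><foo><bar/></foo>'
print(strip_xml_declaration(xml))  # <foo><bar/></foo>
```

The returned string can then be handed to etree.XML() or etree.HTML() as in the last two calls of the session. Note that this drops the whole declaration, including any version or standalone information it carried.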
@karlcow
Author

karlcow commented Aug 4, 2012

I posted a comment on the lxml bug report: https://bugs.launchpad.net/lxml/+bug/613302
I'm not sure why it was marked Won't Fix, even with the explanation given there.

@kernc

kernc commented Feb 7, 2013

It's a bug, I agree. And @kennethreitz doesn't seem to intend to fix his part either: https://github.com/kennethreitz/requests/issues/465

Anyway, the lxml bug above is easily worked around with:

>>> from lxml import etree
>>> xml = u'<?xml version="1.0" encoding="utf-8" ?><foo><bar/></foo>'
>>> xml = xml.encode('utf-8')  # added line: pass bytes in the declared encoding instead of unicode
>>> etree.XML(xml)
<Element foo at 0x5b44c90>

@sigmavirus24

@kernc we won't fix it because it isn't a bug. If you're using requests to get this string then the following should always work:

import requests
from lxml import etree

r = requests.get('http://example.com')
elem = etree.XML(r.content)

If you use r.text instead, that is when you'll run into problems, because r.text is a decoded unicode string while r.content is the raw bytes. On the other hand, from this gist it seems clear that this is lxml behavior and has nothing to do with requests: one call with a unicode string fails while another succeeds. And judging from the error and the discussion on Launchpad, it seems this is intentional.
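
If only the decoded text is at hand (the r.text case), one option is to turn it back into bytes before parsing. The following is a rough sketch under stated assumptions: `to_parser_bytes` and its regular expression are hypothetical names made up for this example, and it assumes that re-encoding with the encoding named in the declaration (or UTF-8 when there is none) is acceptable for the document in question.

```python
import re

# Captures the encoding named in a leading XML declaration,
# e.g. "utf-8" in <?xml version="1.0" encoding="utf-8" ?>
# (hypothetical pattern written for this example)
_DECLARED_ENCODING = re.compile(
    r'^<\?xml\s[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def to_parser_bytes(document, default='utf-8'):
    """Return bytes that agree with the document's own encoding declaration."""
    if isinstance(document, bytes):
        return document  # already bytes, e.g. r.content
    match = _DECLARED_ENCODING.match(document)
    encoding = match.group(1) if match else default
    # Raises LookupError for an encoding name Python does not know, and
    # UnicodeEncodeError if a character cannot be represented in it.
    return document.encode(encoding)


text = u'<?xml version="1.0" encoding="utf-8" ?><foo><bar/></foo>'
data = to_parser_bytes(text)
print(isinstance(data, bytes))  # True
```

The result can be passed to etree.XML() in place of the unicode string. When the raw bytes are available, as with r.content, using them directly remains the simpler choice.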
