@pyrocat101
Created June 1, 2013 10:25
Unescape HTML entities to str (not unicode!)
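import re
import htmlentitydefs


def unescape(text):
    """Replace HTML entities in text, returning a byte str (Python 2)."""
    def fixup(m):
        text = m.group(0)
        if text[:2] == "&#":
            # character reference
            try:
                if text[:3] == "&#x":
                    return chr(int(text[3:-1], 16))
                else:
                    return chr(int(text[2:-1]))
            except (ValueError, OverflowError):
                # malformed number, or code point outside chr()'s 0-255 range
                pass
        else:
            # named entity
            try:
                text = chr(htmlentitydefs.name2codepoint[text[1:-1]])
            except (KeyError, ValueError):
                # unknown name, or code point outside chr()'s 0-255 range
                pass
        return text  # leave as is
    return re.sub(r"&#?\w+;", fixup, text)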
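
The snippet above targets Python 2, where `chr()` returns a byte `str` and only accepts values 0-255. As a rough sketch of how the same regex-and-callback approach might look on Python 3 (an assumption, not part of the original gist), `htmlentitydefs` becomes `html.entities` and `chr()` returns text for any valid code point:

```python
import re
from html.entities import name2codepoint


def unescape(text):
    """Replace HTML character references and named entities in text."""
    def fixup(m):
        ref = m.group(0)
        try:
            if ref[:3] == "&#x":
                # hexadecimal character reference, e.g. &#x42;
                return chr(int(ref[3:-1], 16))
            if ref[:2] == "&#":
                # decimal character reference, e.g. &#65;
                return chr(int(ref[2:-1]))
            # named entity, e.g. &amp;
            return chr(name2codepoint[ref[1:-1]])
        except (KeyError, ValueError, OverflowError):
            # unknown name, malformed number, or invalid code point
            return ref  # leave as is
    return re.sub(r"&#?\w+;", fixup, text)


print(unescape("&lt;b&gt; &amp; &#65;&#x42; &euro; &bogus;"))
# prints: <b> & AB € &bogus;
```

On Python 3.4 and later, the standard library's `html.unescape()` is probably the simpler choice unless unknown references must be left untouched in exactly this way.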