@tcg
Last active September 5, 2018 15:18
Decoding HTML entities to UTF-8 text in Python 2
def html_entity_decoder(s):
    """
    Decodes HTML entities like `&amp;` and `&ldquo;` into their
    plain-text counterparts.

    NOTE: This is Python2 specific, and will require changes when
    porting to Python3.

    Args:
        s: string of HTML to filter.

    Returns:
        The provided string, with decoded HTML entities where present.
    """
    from HTMLParser import HTMLParser
    p = HTMLParser()
    # The HTMLParser instance has an "internal" (undocumented) method
    # called `unescape` that converts all HTML entities into their
    # Unicode codepoint equivalents.
    # When it replaces an entity, the result is a `unicode` object, built
    # from the tables in `htmlentitydefs`, which is what the HTMLParser
    # uses internally.
    # See: https://github.com/python/cpython/blob/2.7/Lib/HTMLParser.py#L447
    # And: https://docs.python.org/2/library/htmllib.html#module-htmlentitydefs
    # So here, we'll explicitly encode the returned string as
    # UTF-8, before returning it:
    return p.unescape(s).encode("utf-8")
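
The docstring notes that this function needs changes for Python 3. Below is a minimal sketch of what such a port might look like, assuming Python 3.4 or later, where the standard library provides `html.unescape`. The function name `html_entity_decoder_py3` and the choice to return `str` rather than UTF-8 bytes are assumptions made for illustration and are not part of the original gist.

```python
import html


def html_entity_decoder_py3(s):
    """Decode HTML entities in `s` (hypothetical Python 3 port)."""
    # html.unescape() returns a str of Unicode code points, so no
    # explicit UTF-8 encoding step is needed for text handling.
    return html.unescape(s)


decoded = html_entity_decoder_py3("Fish &amp; &ldquo;chips&rdquo;")
print(decoded)

# Encode only if bytes are required, as in the Python 2 version.
utf8_bytes = decoded.encode("utf-8")
```

In Python 3 text is Unicode by default, so encoding is usually best left to the point where the text is written to a file or socket.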