Last active
September 5, 2018 15:18
Decoding HTML entities to UTF-8 text in Python 2
def html_entity_decoder(s):
    """
    Decodes HTML entities like `&amp;` and `&ldquo;` into their
    plain-text counterparts.

    NOTE: This is Python 2 specific, and will require changes when
    porting to Python 3.

    Args:
        s: string of HTML to filter.

    Returns:
        The provided string, encoded as UTF-8, with HTML entities
        decoded where present.
    """
    from HTMLParser import HTMLParser
    p = HTMLParser()
    # The HTMLParser instance has an "internal" method called `unescape`
    # that will convert all HTML entities into their Unicode codepoint
    # equivalent.
    # When it decodes an entity, the result is a `unicode` object built
    # from the code points in `htmlentitydefs`, which is what the
    # HTMLParser uses internally. It is not a UTF-8 byte string.
    # See: https://github.com/python/cpython/blob/2.7/Lib/HTMLParser.py#L447
    # And: https://docs.python.org/2/library/htmllib.html#module-htmlentitydefs
    # So here, we'll explicitly encode the returned string as
    # UTF-8, before returning it:
    return p.unescape(s).encode("utf-8")
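The docstring notes that this will need changes for Python 3. A minimal sketch of what that port might look like, assuming the standard-library `html.unescape` function (available since Python 3.4) is an acceptable replacement for `HTMLParser().unescape`. The name `html_entity_decoder_py3` is hypothetical and not part of the original gist.

```python
import html


def html_entity_decoder_py3(s):
    """
    Hypothetical Python 3 counterpart of `html_entity_decoder`.

    Args:
        s: str of HTML to filter.

    Returns:
        UTF-8 encoded bytes with HTML entities decoded where present.
    """
    # `html.unescape` returns a `str` (Unicode text) in Python 3, so we
    # encode it explicitly to mirror the Python 2 version's return value.
    return html.unescape(s).encode("utf-8")
```

Most Python 3 code would probably keep the `str` returned by `html.unescape` and encode only at an I/O boundary. The `.encode("utf-8")` call is kept here only to match the original function's behaviour.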