@LouisdeBruijn
Last active December 31, 2020 10:22
import html
import json


def unescape_html(text: str) -> str:
    """Converts any HTML entities found in text to their textual representation.

    :param text: utterance that may contain HTML entities
    :type text: str

    Example of HTML entities found during annotations
        html_entities = [("&nbsp;", " ")
            , ("&amp;", "&")
            , ("&gt;", ">")
            , ("&lt;", "<")
            , ("&le;", "≤")
            , ("&ge;", "≥")]

    :return: utterance without HTML entities
    :rtype: str
    """
    return html.unescape(text)


s = "Ik wil de te naamstelling van &nbsp; mijn betaalrekening &amp; pas aanpassen Mej. \u2014-&gt; Mw."
# json.dumps escapes non-ASCII characters by default (ensure_ascii=True)
json_dumped_s = json.dumps(unescape_html(s))

print(json_dumped_s)
# >>> "Ik wil de te naamstelling van \u00a0 mijn betaalrekening & pas aanpassen Mej. \u2014-> Mw."

# json.loads turns the escape sequences back into the actual characters
print(json.loads(json_dumped_s))
# >>> Ik wil de te naamstelling van   mijn betaalrekening & pas aanpassen Mej. —-> Mw.
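
The output above still contains a non-breaking space (U+00A0) where `&nbsp;` used to be, and `json.dumps` shows it as `\u00a0` because it escapes non-ASCII characters by default. If that is unwanted, one possible follow-up step is sketched below. The helper name `unescape_and_normalize` is hypothetical and not part of the original snippet; it assumes that collapsing all whitespace runs into single spaces is acceptable for the utterances being processed.

```python
import html
import json


def unescape_and_normalize(text: str) -> str:
    """Hypothetical helper: unescape HTML entities, then collapse whitespace.

    str.split() without arguments splits on any Unicode whitespace,
    including the non-breaking space that "&nbsp;" unescapes to.
    """
    unescaped = html.unescape(text)
    return " ".join(unescaped.split())


cleaned = unescape_and_normalize("betaalrekening &nbsp; &amp; pas \u2014-&gt; Mw.")

# ensure_ascii=False keeps non-ASCII characters readable instead of \uXXXX escapes
print(json.dumps(cleaned, ensure_ascii=False))
```

Whether to normalize whitespace at all depends on the downstream use: if character offsets from annotations must line up with the original text, keeping the non-breaking space may be the safer choice.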