@amrakm
Created April 29, 2022 11:34
Clean text from HTML tags
import re

# As recommended by @freylis, compile the pattern once at module level
# rather than on every call.
CLEANR = re.compile('<.*?>')
# Some HTML text also contains entities that are not enclosed in angle
# brackets, such as '&nbsp;'. To strip those as well, use this pattern instead:
# CLEANR = re.compile('<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')


def cleanhtml(raw_html):
    """Return raw_html with everything matching CLEANR removed."""
    cleantext = re.sub(CLEANR, '', raw_html)
    return cleantext
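
The difference between the two patterns is easiest to see on a small input. The following is a minimal sketch, not part of the original gist: the names `TAGS_ONLY`, `TAGS_AND_ENTITIES`, `strip_markup`, and `sample` are made up for illustration, and only the two regular expressions come from the snippet above.

```python
import html
import re

# Pattern used by the gist: removes anything that looks like a tag.
TAGS_ONLY = re.compile('<.*?>')
# Alternative from the gist's comment: also removes character entities.
TAGS_AND_ENTITIES = re.compile('<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')


def strip_markup(raw_html, pattern=TAGS_ONLY):
    """Hypothetical helper: delete every match of `pattern` from `raw_html`."""
    return pattern.sub('', raw_html)


sample = '<p>Tom&nbsp;&amp;&nbsp;Jerry</p>'

# Tags are removed, entities are left untouched.
print(strip_markup(sample))                     # Tom&nbsp;&amp;&nbsp;Jerry

# Tags and entities are both deleted (not decoded).
print(strip_markup(sample, TAGS_AND_ENTITIES))  # TomJerry

# If the entities should become real characters instead of disappearing,
# one option is to strip tags first and then decode with the standard library.
decoded = html.unescape(strip_markup(sample))
print(repr(decoded))
```

Note that the entity pattern deletes entities rather than translating them, so `&amp;` vanishes instead of becoming `&`. Whether that is acceptable depends on the text being cleaned; decoding with `html.unescape` is shown here only as one possible alternative.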