@codeboy
Last active July 29, 2021 01:36
Remove HTML tags + entities from string in Python + Django
import re

from django.utils.html import strip_tags


def parse_text(text, patterns=None):
    """
    Delete all HTML tags and the listed HTML entities.

    :param text (str): given text
    :param patterns (dict): regex pattern -> replacement pairs for re.sub;
        if given, they replace the default patterns rather than extend them
    :return str: final text

    Usage:
    >>> parse_text('<div class="super"><p>Hello&ldquo;&rdquo;!&nbsp;&nbsp;</p>&lsquo;</div>')
    'Hello!'
    """
    # Default patterns: curly double/single quotes and non-breaking spaces.
    base_patterns = {
        '&[rl]dquo;': '',
        '&[rl]squo;': '',
        '&nbsp;': '',
    }
    patterns = patterns or base_patterns
    # strip_tags removes the markup but leaves entities such as &nbsp; in place.
    final_text = strip_tags(text)
    for pattern, repl in patterns.items():
        final_text = re.sub(pattern, repl, final_text)
    return final_text
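
For trying the same idea outside a Django project, here is a minimal sketch that swaps Django's strip_tags for a small stand-in built on the standard library's html.parser. The names _TagStripper, strip_tags_stdlib and parse_text_stdlib are hypothetical and not part of the original snippet, and the stand-in is not guaranteed to match Django's behavior on malformed markup.

```python
import re
from html.parser import HTMLParser


class _TagStripper(HTMLParser):
    """Collect text nodes and drop tags (hypothetical stand-in for strip_tags)."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &nbsp; as literal text,
        # so the regex patterns below can still match them.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        # Re-emit named entities unchanged, e.g. &ldquo;
        self.parts.append('&%s;' % name)

    def handle_charref(self, name):
        # Re-emit numeric entities unchanged, e.g. &#160;
        self.parts.append('&#%s;' % name)


def strip_tags_stdlib(text):
    """Return text with HTML tags removed and entities left untouched."""
    stripper = _TagStripper()
    stripper.feed(text)
    stripper.close()
    return ''.join(stripper.parts)


def parse_text_stdlib(text, patterns=None):
    """Same flow as parse_text above, without the Django dependency."""
    base_patterns = {
        '&[rl]dquo;': '',
        '&[rl]squo;': '',
        '&nbsp;': '',
    }
    patterns = patterns or base_patterns
    final_text = strip_tags_stdlib(text)
    for pattern, repl in patterns.items():
        final_text = re.sub(pattern, repl, final_text)
    return final_text


if __name__ == '__main__':
    sample = '<div class="super"><p>Hello&ldquo;&rdquo;!&nbsp;&nbsp;</p>&lsquo;</div>'
    print(parse_text_stdlib(sample))
    # Custom patterns replace the defaults: here &nbsp; becomes a plain space.
    print(parse_text_stdlib('<b>a&nbsp;b</b>', {'&nbsp;': ' '}))
```

As in the original, a patterns argument replaces the defaults entirely, so any entity not named in the custom dict is left in the output.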