Last active
July 29, 2021 01:36
Remove HTML tags and entities from a string in Python + Django
import re

from django.utils.html import strip_tags


def parse_text(text, patterns=None):
    """
    Delete all HTML tags and the listed HTML entities.

    :param text (str): given text
    :param patterns (dict): regex pattern -> replacement, applied with re.sub
    :return str: final text

    usage like:
    parse_text('<div class="super"><p>Hello&ldquo;&rdquo;!&nbsp;</p>&lsquo;</div>')
    >>> Hello!
    """
    # Entities removed by default: curly double/single quotes and non-breaking spaces.
    base_patterns = {
        '&[rl]dquo;': '',
        '&[rl]squo;': '',
        '&nbsp;': '',
    }
    patterns = patterns or base_patterns
    # strip_tags removes the markup but leaves entities such as &nbsp; in place.
    final_text = strip_tags(text)
    for pattern, repl in patterns.items():
        final_text = re.sub(pattern, repl, final_text)
    return final_text
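
The default patterns above only cover a few named entities. As a rough sketch of how the same strip-then-substitute idea could be extended, the variant below removes any named or numeric entity with one catch-all regex. It is not part of the original gist: the names `strip_tags_and_entities`, `_TextExtractor` and `ANY_ENTITY` are hypothetical, and a small `html.parser.HTMLParser` subclass stands in for Django's `strip_tags` so the example runs without Django installed.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Hypothetical stdlib stand-in for strip_tags: keeps text, drops tags."""

    def __init__(self):
        # convert_charrefs=False leaves entities undecoded so they can be
        # matched by the regex patterns afterwards.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        # Re-emit named entities such as &nbsp; unchanged.
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        # Re-emit numeric entities such as &#8220; or &#x201C; unchanged.
        self.parts.append(f"&#{name};")


# Assumed catch-all pattern: decimal, hexadecimal, or named entity.
ANY_ENTITY = r"&(?:#\d+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);"


def strip_tags_and_entities(text, patterns=None):
    """Remove tags, then apply each regex pattern -> replacement pair."""
    patterns = patterns or {ANY_ENTITY: ""}
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()
    result = "".join(parser.parts)
    for pattern, repl in patterns.items():
        result = re.sub(pattern, repl, result)
    return result
```

Passing a custom dict, for example `{r'&amp;': '&'}`, replaces selected entities instead of deleting them; this mirrors the `patterns` argument of `parse_text`.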