Created August 20, 2024 17:26
import regex
import unicodedata

# Precompile the pattern for removing unwanted characters. Do this once,
# outside of any loop, because compiling is comparatively expensive.
# Matches runs of punctuation (P), symbols (S), marks (M), control/other (C),
# and separators (Z).
remove_pattern = regex.compile(r'[\p{P}\p{S}\p{M}\p{C}\p{Z}]+', regex.UNICODE)


def clean_unicode_text(text):
    # Normalize the Unicode text (NFKD: compatibility decomposition)
    normalized_text = unicodedata.normalize('NFKD', text)
    # Replace each run of unwanted characters with one space, then trim
    cleaned_text = remove_pattern.sub(' ', normalized_text).strip()
    return cleaned_text
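
To illustrate what the function above does to typical input, here is a minimal sketch that mirrors the same logic using only the standard library, so it runs without the third-party `regex` package. The name `clean_unicode_text_stdlib` and the `drop_marks` parameter are hypothetical and not part of the original gist. The sketch assumes that checking the first letter of `unicodedata.category()` is equivalent to the `\p{P}\p{S}\p{M}\p{C}\p{Z}` classes, which may differ at the edges if the two libraries ship different Unicode versions.

```python
import unicodedata

# Major general-category classes the original pattern targets:
# P (punctuation), S (symbols), M (marks), C (control/other), Z (separators).
UNWANTED_CLASSES = frozenset("PSMCZ")


def clean_unicode_text_stdlib(text, drop_marks=False):
    """Hypothetical stdlib-only counterpart of clean_unicode_text.

    drop_marks is an assumed extra option: when True, combining marks are
    deleted outright instead of being turned into a space.
    """
    normalized = unicodedata.normalize("NFKD", text)
    pieces = []
    in_run = False  # True while inside a run of unwanted characters
    for ch in normalized:
        major = unicodedata.category(ch)[0]
        if drop_marks and major == "M":
            continue  # delete the mark without breaking the word
        if major in UNWANTED_CLASSES:
            if not in_run:
                pieces.append(" ")  # one space per run, like `+` in the regex
                in_run = True
        else:
            pieces.append(ch)
            in_run = False
    return "".join(pieces).strip()


print(clean_unicode_text_stdlib("Hello,   world!"))        # Hello world
print(clean_unicode_text_stdlib("\ufb01le"))               # file
print(clean_unicode_text_stdlib("na\u00efve caf\u00e9"))   # nai ve cafe
print(clean_unicode_text_stdlib("na\u00efve caf\u00e9", drop_marks=True))  # naive cafe
```

Note the third call: because NFKD splits accented letters into a base letter plus a combining mark, and marks are replaced with a space, a word such as "naïve" comes out as "nai ve". If that is not wanted, deleting marks instead of spacing them (the assumed `drop_marks=True` path) keeps the word intact.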