@dyerrington
Created August 20, 2024 17:26
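import unicodedata

import regex  # third-party package (pip install regex); the stdlib `re` module does not support \p{...}

# Precompile the pattern once at module level so it is not rebuilt on every call.
# It matches runs of punctuation (P), symbols (S), marks (M), control/other (C),
# and separators (Z).
remove_pattern = regex.compile(r'[\p{P}\p{S}\p{M}\p{C}\p{Z}]+', regex.UNICODE)


def clean_unicode_text(text):
    # Normalize to NFKD: compatibility decomposition, which splits accented
    # letters into a base letter followed by combining marks.
    normalized_text = unicodedata.normalize('NFKD', text)
    # Replace each run of unwanted characters with a single space, then trim.
    # Note: a combining mark inside a word also becomes a space, so an accented
    # letter in the middle of a word splits that word in two.
    cleaned_text = remove_pattern.sub(' ', normalized_text).strip()
    return cleaned_text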
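
For environments where the third-party `regex` package is not installed, the same idea can be sketched with the standard library alone. This is only an approximation of the function above, not the author's code: the name `clean_unicode_text_stdlib` and the constant `UNWANTED_MAJOR_CATEGORIES` are hypothetical, and results may differ slightly if `regex` and Python's `unicodedata` ship different Unicode versions.

```python
import unicodedata

# Major Unicode general categories treated as unwanted (assumed to mirror
# the \p{P}\p{S}\p{M}\p{C}\p{Z} character class used above).
UNWANTED_MAJOR_CATEGORIES = frozenset("PSMCZ")


def clean_unicode_text_stdlib(text):
    """Stdlib-only sketch: NFKD-normalize, then collapse each run of
    unwanted characters into a single space and trim the result."""
    normalized = unicodedata.normalize('NFKD', text)
    pieces = []
    in_run = False  # True while we are inside a run of unwanted characters
    for ch in normalized:
        # category() returns a two-letter code such as 'Po' or 'Mn';
        # the first letter is the major category.
        if unicodedata.category(ch)[0] in UNWANTED_MAJOR_CATEGORIES:
            if not in_run:
                pieces.append(' ')
                in_run = True
        else:
            pieces.append(ch)
            in_run = False
    return ''.join(pieces).strip()


print(clean_unicode_text_stdlib("Hello,   world!"))  # Hello world
print(clean_unicode_text_stdlib("Café, déjà-vu!"))   # Cafe de ja vu
```

The second call shows a side effect of substituting a space for combining marks: after NFKD, "déjà" becomes "de ja". If that is not wanted, one option is to delete marks (category M) rather than replace them with a space.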