Quick start to cleaning text data

| What | How |
|------|-----|
| Fix Unicode encoding errors | `import ftfy`<br>`ftfy.fix_text('✔ No problems')` |
| Remove specific words | `s = s.replace("foo", "")` |
| Remove punctuation | `import string`<br>`s = s.translate(str.maketrans('', '', string.punctuation))` |
| Remove hyperlinks | `import re`<br>`s = re.sub(r"https?://\S+", "", s)` |
| Remove numbers | `import re`<br>`s = re.sub(r"\b[0-9]+\b\s*", "", s)` |
| Remove extra spaces, tabs, and line breaks | `s = " ".join(s.split())` |
| Lowercase | `s = s.lower()` |
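Putting the table together, here is a minimal sketch of one possible pipeline. The step selection and order are assumptions, not a recommendation, and `ftfy` is a third-party package (`pip install ftfy`):

```python
import re
import string

import ftfy  # third-party: pip install ftfy


def clean(s: str) -> str:
    s = ftfy.fix_text(s)                # fix Unicode encoding errors
    s = re.sub(r"https?://\S+", "", s)  # remove hyperlinks
    s = re.sub(r"\b[0-9]+\b\s*", "", s) # remove standalone numbers
    s = s.translate(str.maketrans('', '', string.punctuation))  # remove punctuation
    s = " ".join(s.split())             # collapse spaces, tabs, and line breaks
    return s.lower()                    # lowercase
```

Note that removing hyperlinks before punctuation matters in this sketch: stripping punctuation first would mangle the URLs so the pattern no longer matches.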

I cannot tell you which steps to include or leave out. That is an empirical question that depends on the domain and the goals of the project.

I cannot tell you how to order the steps either. One option is to run the same cleaning code several times to make sure all "dirty" items are removed. I often run the same cleaning code before and after tokenization/lemmatization so that "dirty" items created by tokenization or lemmatization are also removed; see the sketch below.
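As a sketch of that clean → tokenize → clean pattern, reusing the `clean` function sketched above; the whitespace tokenizer and the `lemmatize` stand-in are placeholders for whatever tokenizer and lemmatizer you actually use:

```python
def lemmatize(token: str) -> str:
    # Placeholder: substitute your real lemmatizer (e.g., from NLTK or spaCy).
    return token


def clean_tokens(text: str) -> list[str]:
    text = clean(text)                   # first cleaning pass on the raw string
    tokens = text.split()                # stand-in whitespace tokenizer
    tokens = [lemmatize(t) for t in tokens]
    tokens = [clean(t) for t in tokens]  # second pass on each token
    return [t for t in tokens if t]      # drop tokens emptied by cleaning
```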

I suggest testing your cleaning code. I frame it as a binary classification problem: I want to avoid false positives (too much cleaning) and false negatives (too little cleaning).

```python
import string

def clean_text(s: str) -> str:
    s = s.translate(str.maketrans('', '', string.punctuation))  # remove ASCII punctuation
    s = " ".join(s.split())  # collapse spaces, tabs, and line breaks
    s = s.lower()
    return s

assert clean_text("     ' #HELLO!!!! \n   ''") == 'hello'
assert clean_text("     ’ #HELLO!!!! \n   ’")  == '’ hello ’'  # punctuation-like characters outside string.punctuation are not removed
```
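If the second case counts as a false negative for your project, one option (a sketch, not part of the original gist) is to drop every character in a Unicode punctuation category instead of only `string.punctuation`:

```python
import unicodedata

def remove_unicode_punctuation(s: str) -> str:
    # Unicode general categories starting with "P" cover punctuation,
    # including curly quotes like '’' that string.punctuation misses.
    return "".join(ch for ch in s if not unicodedata.category(ch).startswith("P"))

assert remove_unicode_punctuation("’ #HELLO!!!! ’") == " HELLO "
```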

Many of these options are pure-Python string operations, which are relatively slow on large corpora. If performance matters, do the one-time setup work outside the per-document loop, as in the sketch below.
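For example, a minimal sketch of hoisting the setup (`str.maketrans` and `re.compile` are standard-library calls; no speedup is measured here):

```python
import re
import string

# One-time setup, hoisted out of the loop.
PUNCT_TABLE = str.maketrans('', '', string.punctuation)
URL_RE = re.compile(r"https?://\S+")

def clean_fast(s: str) -> str:
    s = URL_RE.sub("", s)         # precompiled pattern: no re-parsing per call
    s = s.translate(PUNCT_TABLE)  # precomputed table: no rebuild per call
    return " ".join(s.split()).lower()
```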
