What | How |
---|---|
Fix Unicode encoding errors | `import ftfy; ftfy.fix_text('✔ No problems')` |
Remove specific words | `s = s.replace("foo", "")` |
Remove punctuation | `import string; s = s.translate(str.maketrans('', '', string.punctuation))` |
Remove hyperlinks | `import re; s = re.sub(r"https?://\S+", "", s)` |
Remove numbers | `import re; s = re.sub(r"\b[0-9]+\b\s*", "", s)` |
Remove extra spaces, tabs, and line breaks | `s = " ".join(s.split())` |
Lowercase | `s = s.lower()` |
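As a minimal sketch of how a few of these rows chain together (the sample string is made up):

```python
import re
import string

s = "Visit https://example.com for 42 tips!  \n"

s = re.sub(r"https?://\S+", "", s)                          # remove hyperlinks
s = re.sub(r"\b[0-9]+\b\s*", "", s)                         # remove standalone numbers
s = s.translate(str.maketrans('', '', string.punctuation))  # remove punctuation
s = " ".join(s.split())                                     # collapse whitespace
s = s.lower()                                               # lowercase

print(s)  # -> visit for tips
```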
I cannot tell you which steps to include or leave out. That is an empirical question that depends on the domain and the goals of the project.
I also cannot tell you how to order the steps. One option is to run the same cleaning code several times to make sure all "dirty" items are removed. I often run the same cleaning code before and after tokenization/lemmatization so that "dirty" items created by tokenization or lemmatization are removed.
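A minimal sketch of that ordering; the tokenizer here is a placeholder for whatever tokenization/lemmatization your pipeline actually uses:

```python
def clean(s: str) -> str:
    # the same cleaning code, run before and after tokenization
    return " ".join(s.lower().split())

def tokenize(s: str) -> list[str]:
    # placeholder; swap in your tokenizer/lemmatizer of choice
    return s.split()

text = "  Running   RUNS  "
tokens = tokenize(clean(text))       # clean once before tokenization
tokens = [clean(t) for t in tokens]  # clean again afterwards
tokens = [t for t in tokens if t]    # drop tokens emptied by cleaning
print(tokens)  # -> ['running', 'runs']
```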
I suggest testing your cleaning code. I frame it as a binary classification problem: I want to avoid false positives (too much cleaning) and false negatives (too little cleaning).
```python
import string

def clean_text(s: str) -> str:
    s = s.translate(str.maketrans('', '', string.punctuation))
    s = " ".join(s.split())
    s = s.lower()
    return s

assert clean_text(" ' #HELLO!!!! \n ''") == 'hello'
assert clean_text(" ’ #HELLO!!!! \n ’") == '’ hello ’'  # Other punctuation-like characters that are not removed
```
Many of these options are plain Python string methods, which are relatively slow when applied one document at a time over a large corpus.
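If that becomes a bottleneck, one option is to build the regexes and translation table once and reuse them across documents; a minimal sketch (the function name `clean_text_fast` is mine):

```python
import re
import string

# compiled once, reused for every document
URL_RE = re.compile(r"https?://\S+")
NUM_RE = re.compile(r"\b[0-9]+\b\s*")
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def clean_text_fast(s: str) -> str:
    s = URL_RE.sub("", s)
    s = NUM_RE.sub("", s)
    s = s.translate(PUNCT_TABLE)
    return " ".join(s.split()).lower()
```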