What | How |
---|---|
Fix Unicode encoding errors | `import ftfy; ftfy.fix_text('✔ No problems')` |
Remove specific words | `s = s.replace("foo", "")` |
Remove punctuation | `import string; s = s.translate(str.maketrans('', '', string.punctuation))` |
Remove hyperlinks | `import re; s = re.sub(r"https?://\S+", "", s)` |
Remove numbers | `import re; s = re.sub(r"\b[0-9]+\b\s*", "", s)` |
Remove extra spaces, tabs, and line breaks | `s = " ".join(s.split())` |
Lowercase | `s = s.lower()` |
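As a minimal sketch of how a few of these rows chain together (the sample string is made up):

```python
import re
import string

s = "Visit https://example.com for 42 tips!  \n"

s = re.sub(r"https?://\S+", "", s)                          # remove hyperlinks
s = re.sub(r"\b[0-9]+\b\s*", "", s)                         # remove standalone numbers
s = s.translate(str.maketrans('', '', string.punctuation))  # remove punctuation
s = " ".join(s.split())                                     # collapse whitespace
s = s.lower()                                               # lowercase

print(s)  # -> visit for tips
```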
I cannot tell you which steps to include or leave out. That is an empirical question that depends on the domain and the goals of the project.
I also cannot tell you how to order the steps. One option is to run the same cleaning code several times to make sure all "dirty" items are removed. I often run the same cleaning code before and after tokenization/lemmatization so that "dirty" items created by tokenization or lemmatization are removed.
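A minimal sketch of that ordering; the tokenizer here is a placeholder for whatever tokenization/lemmatization your pipeline actually uses:

```python
def clean(s: str) -> str:
    # the same cleaning code, run before and after tokenization
    return " ".join(s.lower().split())

def tokenize(s: str) -> list[str]:
    # placeholder; swap in your tokenizer/lemmatizer of choice
    return s.split()

text = "  Running   RUNS  "
tokens = tokenize(clean(text))       # clean once before tokenization
tokens = [clean(t) for t in tokens]  # clean again afterwards
tokens = [t for t in tokens if t]    # drop tokens emptied by cleaning
print(tokens)  # -> ['running', 'runs']
```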
I suggest testing your cleaning code. I frame it as a binary classification problem: I want to avoid false positives (too much cleaning) and false negatives (too little cleaning).
```python
import string

def clean_text(s: str) -> str:
    s = s.translate(str.maketrans('', '', string.punctuation))
    s = " ".join(s.split())
    s = s.lower()
    return s

assert clean_text(" ' #HELLO!!!! \n ''") == 'hello'
assert clean_text(" ’ #HELLO!!!! \n ’") == '’ hello ’'  # Other punctuation-like characters that are not removed
```
Many of these options are plain Python string methods, which are relatively slow when applied one document at a time over a large corpus.
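If that becomes a bottleneck, one option is to build the regexes and translation table once and reuse them across documents; a minimal sketch (the function name `clean_text_fast` is mine):

```python
import re
import string

# compiled once, reused for every document
URL_RE = re.compile(r"https?://\S+")
NUM_RE = re.compile(r"\b[0-9]+\b\s*")
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def clean_text_fast(s: str) -> str:
    s = URL_RE.sub("", s)
    s = NUM_RE.sub("", s)
    s = s.translate(PUNCT_TABLE)
    return " ".join(s.split()).lower()
```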