thibaut-d / clean_corpus.R
Created January 14, 2022 22:21
Data cleaning function in R
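# Assumption: the replace_*() helpers below come from the textclean package
library(textclean)

clean_corpus = function(x){
# Collapse redundant whitespace, including line breaks such as \n, into single spaces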
x = replace_white(x)
# Replace or remove non-ASCII characters
x = replace_non_ascii(x)
# Expand contractions, e.g. "you're" becomes "you are"
x = replace_contraction(x)
# Shorten elongated words, e.g. "heyyyyy" becomes "hey"
x = replace_word_elongation(x)
# Replace emoji with plain-text descriptions