Created
November 23, 2021 12:11
andrea-dagostino/4235ac562eacdac7fbf5505c4018b3e8
posts/raggruppamento-testuale-con-tf-idf
import re

import nltk
from nltk.corpus import stopwords

# Requires NLTK data: the "stopwords" corpus and the "punkt" tokenizer models
# (some newer NLTK releases ask for "punkt_tab" instead), e.g. nltk.download("stopwords").


def preprocess_text(text: str, remove_stopwords: bool) -> str:
    """Sanitize a string by:

    - removing links
    - removing special characters
    - removing numbers
    - removing stopwords (optional)
    - converting to lowercase
    - removing excess whitespace

    Args:
        text (str): the input text you want to clean
        remove_stopwords (bool): whether or not to remove stopwords
    Returns:
        str: the cleaned text
    """
    # remove links
    text = re.sub(r"http\S+", "", text)
    # replace every run of non-letter characters (special chars, numbers) with a space
    text = re.sub(r"[^A-Za-z]+", " ", text)
    # remove stopwords
    if remove_stopwords:
        # 1. tokenize
        tokens = nltk.word_tokenize(text)
        # 2. drop stopwords; build the set once instead of reloading the list per token
        english_stopwords = set(stopwords.words("english"))
        tokens = [w for w in tokens if w.lower() not in english_stopwords]
        # 3. join back together
        text = " ".join(tokens)
    # return the text in lowercase, stripped of leading and trailing whitespace
    text = text.lower().strip()
    return text
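
To see what each cleaning step does to a concrete string, here is a minimal sketch of the same pipeline that runs without NLTK. It is not the gist's function: `preprocess_text_light`, `DEMO_STOPWORDS`, and the sample sentence are hypothetical names introduced for illustration, the stopword set is a tiny hand-picked stand-in for NLTK's English list, and `str.split` replaces `nltk.word_tokenize`.

```python
import re

# Hypothetical, deliberately tiny stopword set for illustration only;
# the original function uses NLTK's full English stopword list.
DEMO_STOPWORDS = frozenset(
    {"a", "an", "and", "the", "is", "of", "to", "in", "for", "at", "on", "more"}
)


def preprocess_text_light(text: str, remove_stopwords: bool = True) -> str:
    """Dependency-free approximation of the cleaning steps (assumed, not the author's code)."""
    # remove links
    text = re.sub(r"http\S+", "", text)
    # replace every run of non-letter characters (digits, punctuation) with one space
    text = re.sub(r"[^A-Za-z]+", " ", text)
    # lowercase, then split on whitespace (stands in for nltk.word_tokenize)
    tokens = text.lower().split()
    if remove_stopwords:
        tokens = [t for t in tokens if t not in DEMO_STOPWORDS]
    # joining with single spaces also removes leading/trailing whitespace
    return " ".join(tokens)


sample = "Read MORE at https://example.com: the 3 best tips for TF-IDF clustering!"
cleaned = preprocess_text_light(sample)
kept = preprocess_text_light(sample, remove_stopwords=False)
print(cleaned)  # read best tips tf idf clustering
print(kept)     # read more at the best tips for tf idf clustering
```

Note that the letter-only regex splits hyphenated terms such as "TF-IDF" into two tokens and discards accented characters, so text in languages other than English would likely need a different pattern.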