Last active
December 20, 2019 16:20
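from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# NOTE: `tokenizer` and `text` are not defined in this snippet; both must exist before the call below.
# The NLTK 'stopwords' and 'wordnet' corpora must also be downloaded.
def text_processing(input_text):
    """Tokenize, lemmatize, and drop English stopwords from input_text."""
    tokens = tokenizer.tokenize(input_text)
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(i) for i in tokens]
    stops = set(stopwords.words('english'))
    values = [i for i in tokens if i not in stops]
    # Drop leftovers that are likely lemmatizer artifacts ("was" -> "wa", "us" -> "u").
    weird = ["wa", "u"]
    values = [i for i in values if i not in weird]
    return values

values = text_processing(text)
print("The number of unique words is: " + str(len(set(values))))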
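
The snippet calls `tokenizer.tokenize(...)` and passes in `text`, but defines neither. The sketch below shows one way they might be supplied, assuming only that the tokenizer needs a `.tokenize()` method returning a list of word strings. `SimpleWordTokenizer` and the sample sentence are hypothetical stand-ins, not part of NLTK or of the original gist, which may well have used an NLTK tokenizer instead. Both names would need to be defined before `text_processing(text)` is called.

```python
import re


class SimpleWordTokenizer:
    """Hypothetical stand-in for the undefined `tokenizer` (not an NLTK class)."""

    _word = re.compile(r"\w+")

    def tokenize(self, input_text):
        # Lowercase first so tokens can match a lowercase stopword list,
        # then keep runs of word characters and drop punctuation.
        return self._word.findall(input_text.lower())


tokenizer = SimpleWordTokenizer()
text = "The cats were chasing us, and it was fun."  # illustrative input only
tokens = tokenizer.tokenize(text)
print(tokens)
```

Lowercasing inside the tokenizer is a design choice made here, not something the original states: the stopword filter compares tokens by exact string match, so capitalized words such as "The" would otherwise pass through unfiltered.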