@jmrobles
Created August 15, 2020 09:19
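def lemmatizer(doc):
    """
    This takes in a doc of tokens from the NER and lemmatizes them.
    Pronouns (like "I" and "you") get lemmatized to '-PRON-', so I'm removing those.
    """
    # `nlp` is assumed to be a spaCy pipeline loaded elsewhere at module level.
    # Note: '-PRON-' is the pronoun lemma produced by spaCy 2.x.
    lemmas = [token.lemma_ for token in doc if token.lemma_ != '-PRON-']
    # Rebuild a Doc from the joined lemmas so later steps still receive tokens.
    return nlp.make_doc(' '.join(lemmas))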
def remove_stopwords(doc):
    """
    This will remove stopwords, punctuation, whitespace-only tokens and digits.
    Use token.text to return strings, which we'll need for Gensim.
    """
    return [
        token.text
        for token in doc
        if not token.is_stop
        and not token.is_punct
        and not token.is_digit
        and token.text.strip() != ''
    ]
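
The filter in remove_stopwords can be tried without a spaCy model installed. The sketch below is only an illustration: `FakeToken` and `keep_content_tokens` are hypothetical names, not part of spaCy or of this gist, and `FakeToken` mimics only the four token attributes the filter reads.

```python
from dataclasses import dataclass


@dataclass
class FakeToken:
    """Hypothetical stand-in exposing only the Token attributes used by the filter."""
    text: str
    is_stop: bool = False
    is_punct: bool = False
    is_digit: bool = False


def keep_content_tokens(tokens):
    # Same conditions as remove_stopwords: drop stop words, punctuation,
    # digits and whitespace-only tokens, and return plain strings.
    return [
        t.text
        for t in tokens
        if not t.is_stop
        and not t.is_punct
        and not t.is_digit
        and t.text.strip() != ''
    ]


sample = [
    FakeToken("the", is_stop=True),
    FakeToken("cat"),
    FakeToken(",", is_punct=True),
    FakeToken(" "),
    FakeToken("42", is_digit=True),
    FakeToken("sleep"),
]
print(keep_content_tokens(sample))  # ['cat', 'sleep']
```

The result is a list of strings rather than a Doc, which is the tokenized form Gensim expects for each document.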