Last active
January 25, 2023 10:19
Removing Punctuation and Stop Words nltk
import string
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
import re

def preprocess(sentence):
    sentence = sentence.lower()
    tokenizer = RegexpTokenizer(r'\w+')  # keep runs of word characters, dropping punctuation
    tokens = tokenizer.tokenize(sentence)
    filtered_words = [w for w in tokens if w not in stopwords.words('english')]
    return " ".join(filtered_words)

sentence = "At eight o'clock on Thursday morning Arthur didn't feel very good. French-Fries"
print(preprocess(sentence))
Convert stopwords.words('english') to a set before using it in line 11. It is currently a list, so every membership test scans the whole list, which is incredibly slow for large documents.
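A minimal sketch of that suggestion, not the gist author's code. To keep it runnable without the NLTK corpus download, it assumes the stop words are passed in as an argument and uses re.findall(r"\w+") as a stand-in for RegexpTokenizer(r'\w+'). The names preprocess_fast and SAMPLE_STOP_WORDS are hypothetical.

```python
import re

def preprocess_fast(sentence, stop_words):
    # Build the set once per call; set membership is a hash lookup
    # rather than a scan over the whole stop word list.
    stop_set = set(stop_words)
    # Stand-in for RegexpTokenizer(r'\w+'): keep runs of word characters.
    tokens = re.findall(r"\w+", sentence.lower())
    return " ".join(w for w in tokens if w not in stop_set)

# Hypothetical sample list; with NLTK installed you would pass
# stopwords.words('english') instead.
SAMPLE_STOP_WORDS = ["at", "on", "o"]

print(preprocess_fast("At eight o'clock on Thursday morning", SAMPLE_STOP_WORDS))
# eight clock thursday morning
```

When processing many sentences, the set could be built once outside the function and reused.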
Alternatively, use filter():
filtered_words = filter(lambda token: token not in stopwords.words('english'), tokens)
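A small sketch of how the filter() form behaves, under stated assumptions: the token list and stop word set are hypothetical sample data standing in for the tokenizer output and NLTK's English list. In Python 3, filter() returns a lazy iterator rather than a list, so wrap it in list() if the result is needed more than once.

```python
# Hypothetical sample data, standing in for the tokenizer output
# and for a set built from stopwords.words('english').
tokens = ["arthur", "didn", "t", "feel", "very", "good"]
stop_set = {"didn", "t", "very"}

# filter() is lazy in Python 3; list() materializes the result.
filtered_words = list(filter(lambda token: token not in stop_set, tokens))

print(" ".join(filtered_words))  # arthur feel good
```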