@abevieiramota
Created March 8, 2015 10:56
freqdist archive
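import requests
from bs4 import BeautifulSoup
import nltk

# Needs the NLTK tokenizer and POS-tagger data, which nltk.download() can fetch.
URL = "https://archive.org/stream/TheH.P.LovecraftNovellas/H.P.Lovecraft-TheCaseOfCharlesDexterWard.txt"
# Penn Treebank noun tags:
# https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
PERMITTED_TAGS = {'NN', 'NNS', 'NNP', 'NNPS'}

req = requests.get(URL)
req.raise_for_status()  # stop early on an HTTP error
# Name the parser explicitly; bs4 warns when it has to guess one.
soup = BeautifulSoup(req.text, 'html.parser')
# This assumes the book text is in the first <pre> element of the page.
pre = soup.find_all('pre')[0]
text = pre.get_text()
# Tag every token with its part of speech, then keep only the nouns.
tagged = nltk.pos_tag(nltk.word_tokenize(text))
permitted_words = [word for word, tag in tagged if tag in PERMITTED_TAGS]
fd = nltk.FreqDist(permitted_words)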
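
The script stops once fd is built, so here is a minimal offline sketch of the filter-and-count step and of reading off the most frequent nouns. It rests on assumptions: the tagged list is hypothetical hand-written data in the (word, tag) shape that nltk.pos_tag returns, and collections.Counter stands in for nltk.FreqDist so the example runs without NLTK or a network connection.

```python
from collections import Counter

# Hypothetical hand-tagged tokens, shaped like the output of nltk.pos_tag.
tagged = [
    ("Ward", "NNP"), ("was", "VBD"), ("a", "DT"), ("student", "NN"),
    ("of", "IN"), ("old", "JJ"), ("books", "NNS"), ("and", "CC"),
    ("Ward", "NNP"), ("read", "VBD"), ("books", "NNS"),
    ("about", "IN"), ("Ward", "NNP"),
]

# Same noun tags as the script above.
PERMITTED_TAGS = {"NN", "NNS", "NNP", "NNPS"}

# Keep only the words whose tag marks them as a noun.
nouns = [word for word, tag in tagged if tag in PERMITTED_TAGS]

# Counter is a stand-in here for nltk.FreqDist.
counts = Counter(nouns)

# The two most frequent nouns, as (word, count) pairs.
top = counts.most_common(2)
print(top)
```

With the real script, fd.most_common(n) should give the same kind of (word, count) list for the whole book.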