@acrymble
Created July 5, 2011 19:28
Python HTML to frequency pairs
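#html-to-freq.py
# obo is a local helper module (obo.py) that must be importable from here.
import urllib.request
import obo

url = 'http://www.oldbaileyonline.org/print.jsp?div=t17800628-33'

# Download the page. Decoding as UTF-8 is an assumption about its encoding.
response = urllib.request.urlopen(url)
html = response.read().decode('utf-8')
# Strip the markup, lowercase the text, and split it into words.
text = obo.stripTags(html).lower()
wordlist = obo.stripNonAlphaNum(text)
# Count the words, sort the frequency pairs, and print one per line.
dictionary = obo.wordListToFreqDict(wordlist)
sorteddict = obo.sortFreqDict(dictionary)
for s in sorteddict:
    print(str(s))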
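The script depends on obo, a helper module that is not included in this gist. As a rough guide to what the four calls are expected to do, here is a minimal Python 3 stand-in. The function names match the calls above, but the bodies are assumptions for illustration, not the author's obo.py: tags are removed with a simple regular expression, words are runs of alphanumeric characters, and the sorted result is a list of (count, word) tuples with the highest count first.

```python
import re


def stripTags(html):
    """Replace anything between '<' and '>' with a space (assumed behaviour)."""
    return re.sub(r'<[^>]*>', ' ', html)


def stripNonAlphaNum(text):
    """Return the runs of word characters in text, dropping punctuation."""
    return re.findall(r'\w+', text)


def wordListToFreqDict(wordlist):
    """Map each word to the number of times it occurs in wordlist."""
    counts = {}
    for word in wordlist:
        counts[word] = counts.get(word, 0) + 1
    return counts


def sortFreqDict(freqdict):
    """Return (count, word) tuples sorted from most to least frequent."""
    pairs = [(count, word) for word, count in freqdict.items()]
    return sorted(pairs, reverse=True)


# Small offline check on a hypothetical snippet of HTML.
sample = '<p>The cat and the dog.</p><p>The End</p>'
words = stripNonAlphaNum(stripTags(sample).lower())
freqs = wordListToFreqDict(words)
ranked = sortFreqDict(freqs)
for pair in ranked:
    print(pair)
```

A regular expression is a crude way to strip markup and will leave script or style contents in place; for real pages an HTML parser would be a safer choice.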