@acrymble
Created July 5, 2011 19:16
HTML to a List of Words
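#html-to-list1.py
# Download an Old Bailey trial transcript and reduce it to a list of lowercase words.
import urllib.request
import obo  # helper module that supplies stripTags and stripNonAlphaNum

url = 'http://www.oldbaileyonline.org/print.jsp?div=t17800628-33'
response = urllib.request.urlopen(url)
html = response.read().decode('utf-8')  # bytes -> str; assumes the page is UTF-8
# Remove the HTML markup and lowercase the remaining text.
text = obo.stripTags(html).lower()
# Split the text into a list of words.
wordlist = obo.stripNonAlphaNum(text)
# Show the first 500 words.
print(wordlist[0:500])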
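
The script depends on an `obo` module that is not included in this gist. The sketch below shows roughly what its two helpers might do, so the pipeline can be tried offline on a small string. The names `strip_tags` and `strip_non_alphanum` are hypothetical stand-ins, and their behaviour (dropping everything between `<` and `>`, then splitting on runs of non-word characters) is an assumption, not the actual `obo` code.

```python
import re

def strip_tags(html):
    """Return html with everything between '<' and '>' removed (assumed behaviour)."""
    inside = False
    chars = []
    for ch in html:
        if ch == '<':
            inside = True          # entering a tag
        elif ch == '>' and inside:
            inside = False         # leaving a tag
        elif not inside:
            chars.append(ch)       # keep text that is outside tags
    return ''.join(chars)

def strip_non_alphanum(text):
    """Split text on runs of non-word characters and drop empty strings."""
    return [w for w in re.split(r'\W+', text) if w]

# Made-up sample markup, not taken from the Old Bailey page.
sample = '<p>The Prisoner was <b>indicted</b> for theft.</p>'
words = strip_non_alphanum(strip_tags(sample).lower())
print(words)
```

A character-by-character scan like this is only adequate for simple markup; for real pages with scripts, comments, or entities, a proper HTML parser would be a safer choice.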