@davidrichards
Created May 25, 2015 20:14
Term Demo
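try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib import urlopen  # Python 2, as originally written

from bs4 import BeautifulSoup
import textmining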
urls = [
    'http://www.webmd.com/cancer/childhood-leukemia-symptoms-treatments',
    'http://kidshealth.org/parent/medical/cancer/cancer_leukemia.html',
    'http://www.nlm.nih.gov/medlineplus/childhoodleukemia.html',
]
matrix = textmining.TermDocumentMatrix()
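def add_document(m, url):
    """Fetch url, strip the markup, and add the page text to matrix m."""
    html = urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly
    text = soup.get_text()
    m.add_doc(text)
    return True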
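
The matrix that `add_document` fills records how often each term occurs in each document. A minimal standard-library sketch of that idea follows, using made-up in-memory strings in place of the fetched pages. The name `build_term_document_matrix` and the regex tokenizer are hypothetical, chosen for illustration; this is not how `textmining` implements it.

```python
import re
from collections import Counter


def build_term_document_matrix(docs):
    """Return (sorted vocabulary, rows); rows[i][j] counts term j in document i."""
    # Lowercase and keep runs of letters; a deliberately simple tokenizer.
    counts = [Counter(re.findall(r"[a-z]+", doc.lower())) for doc in docs]
    vocab = sorted(set().union(*counts))
    # A Counter returns 0 for terms a document does not contain.
    rows = [[c[term] for term in vocab] for c in counts]
    return vocab, rows


# Made-up sample text standing in for the downloaded pages.
docs = [
    "Leukemia is a cancer of the blood",
    "Treatment of leukemia in children",
]
vocab, rows = build_term_document_matrix(docs)
```

Each row is one document and each column one term, so terms shared by both documents (here `leukemia` and `of`) are the columns that are non-zero in every row.
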
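import numpy as np
import pandas

# Assumes matrix.csv holds one row per document and one column per term
terms = pandas.read_csv('matrix.csv')
data = terms.values
# Factor the transposed (terms x documents) matrix; NumPy returns the
# third factor already transposed, so V here is V^T
U, s, V = np.linalg.svd(data.T)
S = np.diag(s)  # singular values as a square diagonal matrix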
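
To multiply the three factors back together, or to keep only the strongest k dimensions (the usual next step in latent semantic analysis), their shapes have to be compatible. With NumPy's default full SVD, `U` is square while `np.diag(s)` is smaller, so the product does not line up for a non-square input. A small sketch, assuming a reduced SVD (`full_matrices=False`) is acceptable; the 4x3 count matrix and `k = 2` are made up for illustration:

```python
import numpy as np

# Made-up terms x documents counts (4 terms, 3 documents).
A = np.array([
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 2.0],
])

# Reduced SVD: U is 4x3, s has 3 values, Vt is 3x3 (already transposed).
U, s, Vt = np.linalg.svd(A, full_matrices=False)
S = np.diag(s)

# The product of the three factors reproduces A up to rounding error.
reconstructed = U @ S @ Vt

# Keep only the k largest singular values for a rank-k approximation.
k = 2
A_k = U[:, :k] @ S[:k, :k] @ Vt[:k, :]

# Each document as a k-dimensional vector (one row per document).
doc_vectors = (S[:k, :k] @ Vt[:k, :]).T
```

`A_k` has the same shape as `A` but rank at most `k`, and the rows of `doc_vectors` can be compared with cosine similarity to judge how closely two documents match in the reduced space.
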