@michael-erasmus
Created November 2, 2015 23:56
TF-IDF keywords using Graphlab
import re
import graphlab
# Strip HTML tags from the raw document body
docs['words'] = docs['body'].apply(lambda doc: re.sub(r"<[^>]*>", "", doc))
# Lowercase, then replace punctuation, digits and other non-word characters with spaces
docs['words'] = docs['words'].apply(lambda doc: re.sub(r"[\W\d]", " ", doc.lower().strip()))
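# Convert to an SFrame (docs is presumably a pandas DataFrame up to this point)
docs = graphlab.SFrame(docs)
# Per-document dictionary of word -> occurrence count
docs['word_counts'] = graphlab.text_analytics.count_words(docs['words'])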
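# Score each word by TF-IDF, reusing the word counts computed above; the result is read via its 'docs' column
docs_tfidf = graphlab.text_analytics.tf_idf(docs['word_counts'])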
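# Keep the 10 highest-scoring words per document (slice [:10]; [1:10] would skip the top word and keep only 9)
docs['top10'] = docs_tfidf['docs'].apply(lambda t: " ".join(sorted(t, key=t.get, reverse=True)[:10]))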
docs.save("billing_key_words.csv", format='csv')
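
For readers without GraphLab installed, here is a minimal pure-Python sketch of the same pipeline. It assumes the common `tf * log(N / df)` weighting, which may differ from GraphLab's exact formula; `clean` and `top_keywords` are hypothetical helper names, not part of any library.

```python
import math
import re
from collections import Counter


def clean(text):
    """Strip HTML tags, lowercase, and replace non-word characters and digits with spaces."""
    text = re.sub(r"<[^>]*>", "", text)
    return re.sub(r"[\W\d]", " ", text.lower().strip())


def top_keywords(bodies, k=10):
    """Return the k highest-scoring TF-IDF terms of each document as a space-joined string."""
    counts = [Counter(clean(body).split()) for body in bodies]
    n_docs = float(len(counts))
    # Document frequency: the number of documents that contain each term
    doc_freq = Counter(term for c in counts for term in c)
    results = []
    for c in counts:
        # Assumed weighting: raw term count times log(N / df)
        scores = {t: tf * math.log(n_docs / doc_freq[t]) for t, tf in c.items()}
        ranked = sorted(scores, key=scores.get, reverse=True)
        results.append(" ".join(ranked[:k]))
    return results
```

Calling `top_keywords(list_of_html_strings)` returns one keyword string per input document, analogous to the `top10` column above. Under this weighting, words that occur in every document score zero, so they sort to the end of the ranking.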