{
"metadata": {
"name": "gensim_tutorial"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": "# Using LDA, Hierarchical LDA, and LSI in gensim \n\nThis notebook is modeled directly after the awesome gensim tutorials at http://radimrehurek.com/gensim"
},
{
"cell_type": "code",
"collapsed": false,
"input": "from gensim import corpora, models, similarities\n\n# uncomment the next two lines if you want to see the logging statements (pretty useful while hacking)\n#import logging\n#logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 1
},
{
"cell_type": "code",
"collapsed": false,
"input": "documents = [\"Human machine interface for lab abc computer applications\",\n \"A survey of user opinion of computer system response time\",\n \"The EPS user interface management system\",\n \"System and human system engineering testing of EPS\",\n \"Relation of user perceived response time to error measurement\",\n \"The generation of random binary unordered trees\",\n \"The intersection graph of paths in trees\",\n \"Graph minors IV Widths of trees and well quasi ordering\",\n \"Graph minors A survey\"]\n\nstoplist = set('for a of the and to in'.split())\ntexts = [[word for word in document.lower().split() if word not in stoplist] for document in documents ]\n\nall_tokens = sum(texts, [])\ntokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)\n\ntexts = [[word for word in text if word not in tokens_once] for text in texts]\n\nprint(texts)",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "[['human', 'interface', 'computer'], ['survey', 'user', 'computer', 'system', 'response', 'time'], ['eps', 'user', 'interface', 'system'], ['system', 'human', 'system', 'eps'], ['user', 'response', 'time'], ['trees'], ['graph', 'trees'], ['graph', 'minors', 'trees'], ['graph', 'minors', 'survey']]\n"
}
],
"prompt_number": 2
},
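{
"cell_type": "markdown",
"metadata": {},
"source": "Side note (a sketch, not part of the original tutorial): `all_tokens.count(word)` rescans the whole token list for every unique word. The next cell does the same once-only filtering in a single pass with `collections.Counter`; the names `raw_texts`, `token_counts`, and `texts_via_counter` are just illustrative."
},
{
"cell_type": "code",
"collapsed": false,
"input": "from collections import Counter\n\n# the same once-only filtering as above, done in a single pass over the tokens\nraw_texts = [[word for word in document.lower().split() if word not in stoplist] for document in documents]\ntoken_counts = Counter(token for text in raw_texts for token in text)\ntexts_via_counter = [[word for word in text if token_counts[word] > 1] for text in raw_texts]\n\n# should print True: both approaches keep exactly the same tokens\nprint(texts_via_counter == texts)",
"language": "python",
"metadata": {},
"outputs": []
},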
{
"cell_type": "code",
"collapsed": false,
"input": "# this creates a dictionary mapping tokens in the corpus to integer ids \ndictionary = corpora.Dictionary(texts)\nprint(dictionary.items())\n\n# persist the dictionary if you want\n# dictionary.save('/tmp/deerwester.dict')",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "[(11, 'minors'), (10, 'graph'), (5, 'system'), (9, 'trees'), (8, 'eps'), (0, 'computer'), (4, 'survey'), (7, 'user'), (1, 'human'), (6, 'time'), (2, 'interface'), (3, 'response')]\n"
}
],
"prompt_number": 3
},
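{
"cell_type": "markdown",
"metadata": {},
"source": "A quick sketch of how to inspect the dictionary directly: `len(dictionary)` gives the vocabulary size, and `dictionary.token2id` holds the token-to-id mapping that `doc2bow` uses below."
},
{
"cell_type": "code",
"collapsed": false,
"input": "# peek at the mapping the Dictionary keeps internally\nprint(len(dictionary))            # number of unique tokens\nprint(dictionary.token2id)        # token -> integer id\nprint(dictionary.token2id['computer'])",
"language": "python",
"metadata": {},
"outputs": []
},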
{
"cell_type": "code",
"collapsed": false,
"input": "# creating a new document vector using the vocabulary of the token dictionary we created above\nnew_doc = \"user human program response\"\nnew_vec = dictionary.doc2bow(new_doc.lower().split())\nprint(new_vec) # the word \"response\" does not appear in the dictionary and is ignored",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "[(1, 1), (3, 1), (7, 1)]\n"
}
],
"prompt_number": 4
},
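{
"cell_type": "markdown",
"metadata": {},
"source": "If the raw ids are hard to read, the dictionary can translate a bag-of-words vector back into tokens; a small sketch:"
},
{
"cell_type": "code",
"collapsed": false,
"input": "# translate the (token_id, count) pairs back into readable tokens\n# dictionary[token_id] returns the token string for a known id\nprint([(dictionary[token_id], count) for token_id, count in new_vec])",
"language": "python",
"metadata": {},
"outputs": []
},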
{
"cell_type": "code",
"collapsed": false,
"input": "# create a corpus from the documents \ncorpus = [dictionary.doc2bow(text) for text in texts]\ncorpora.MmCorpus.serialize('/tmp/deerwester.mm', corpus) # store to disk, for later use",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 5
},
{
"cell_type": "code",
"collapsed": false,
"input": "# deserializing (just to show how gensim does it)\n# token dict\nexisting_dictionary = corpora.Dictionary.load('/tmp/deerwester.dict')\n\nexisting_corpus = corpora.MmCorpus('/tmp/deerwester.mm')",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 6
},
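{
"cell_type": "markdown",
"metadata": {},
"source": "A list of vectors is fine for nine documents, but gensim only needs an object that yields one bag-of-words vector per iteration, so a large corpus never has to fit in memory. A sketch of such a streaming corpus (the class name `MyCorpus` is just illustrative):"
},
{
"cell_type": "code",
"collapsed": false,
"input": "# a memory-friendly corpus: yields one bag-of-words vector per document\n# instead of materialising the whole corpus as an in-memory list\nclass MyCorpus(object):\n    def __init__(self, docs, dictionary):\n        self.docs = docs\n        self.dictionary = dictionary\n\n    def __iter__(self):\n        for doc in self.docs:\n            # tokens missing from the dictionary (stopwords, hapaxes) are simply ignored\n            yield self.dictionary.doc2bow(doc.lower().split())\n\nstreamed_corpus = MyCorpus(documents, dictionary)\nfor vector in streamed_corpus:\n    print(vector)",
"language": "python",
"metadata": {},
"outputs": []
},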
{
"cell_type": "code",
"collapsed": false,
"input": "# create a tfidf transformation from our corpus of counts\ntfidf = models.TfidfModel(existing_corpus)\n\n# now we can transform new vectors with the tfidf model\ndoc_bow = [(0, 1), (1, 1)] # the doc - ['human', 'computer']\nprint(tfidf[doc_bow])",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "[(0, 0.7071067811865476), (1, 0.7071067811865476)]\n"
}
],
"prompt_number": 7
},
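{
"cell_type": "markdown",
"metadata": {},
"source": "Trained transformations can be persisted and reloaded just like dictionaries and corpora; a brief sketch (the `/tmp` path is only an example):"
},
{
"cell_type": "code",
"collapsed": false,
"input": "# persist the trained tfidf model and load it back\ntfidf.save('/tmp/deerwester.tfidf')\nloaded_tfidf = models.TfidfModel.load('/tmp/deerwester.tfidf')\nprint(loaded_tfidf[doc_bow]) # same weights as above",
"language": "python",
"metadata": {},
"outputs": []
},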
{
"cell_type": "markdown",
"metadata": {},
"source": "**note from the gensim tutorials** \nTransformations always convert between two specific vector spaces. The same vector space (= the same set of feature ids) must be used for training as well as for subsequent vector transformations. Failure to use the same input feature space, such as applying a different string preprocessing, using different feature ids, or using bag-of-words input vectors where TfIdf vectors are expected, will result in feature mismatch during transformation calls and consequently in either garbage output and/or runtime exceptions."
},
{
"cell_type": "code",
"collapsed": false,
"input": "# now let's transform the corpus into the tfidf space\ncorpus_tfidf = tfidf[corpus]\nfor doc in corpus_tfidf:\n print(doc) ",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "[(0, 0.5773502691896257), (1, 0.5773502691896257), (2, 0.5773502691896257)]\n[(0, 0.44424552527467476), (3, 0.44424552527467476), (4, 0.44424552527467476), (5, 0.3244870206138555), (6, 0.44424552527467476), (7, 0.3244870206138555)]\n[(2, 0.5710059809418182), (5, 0.4170757362022777), (7, 0.4170757362022777), (8, 0.5710059809418182)]\n[(1, 0.49182558987264147), (5, 0.7184811607083769), (8, 0.49182558987264147)]\n[(3, 0.6282580468670046), (6, 0.6282580468670046), (7, 0.45889394536615247)]\n[(9, 1.0)]\n[(9, 0.7071067811865475), (10, 0.7071067811865475)]\n[(9, 0.5080429008916749), (10, 0.5080429008916749), (11, 0.695546419520037)]\n[(4, 0.6282580468670046), (10, 0.45889394536615247), (11, 0.6282580468670046)]\n"
}
],
"prompt_number": 8
},
{
"cell_type": "code",
"collapsed": false,
"input": "# cool, let's train standard LDA with 3 topics\n# see: https://github.com/piskvorky/gensim/blob/develop/gensim/models/ldamodel.py#L183\nlda = models.LdaModel(corpus_tfidf, id2word=dictionary, num_topics=3, passes=20)\nlda.show_topics()",
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 9,
"text": "['0.136*user + 0.132*interface + 0.125*response + 0.125*time + 0.120*computer + 0.080*human + 0.077*system + 0.076*eps + 0.040*survey + 0.030*trees',\n '0.182*system + 0.169*survey + 0.128*eps + 0.122*human + 0.051*minors + 0.050*computer + 0.050*time + 0.050*response + 0.050*graph + 0.050*user',\n '0.273*trees + 0.214*graph + 0.176*minors + 0.049*survey + 0.036*human + 0.036*eps + 0.036*system + 0.036*computer + 0.036*time + 0.036*interface']"
}
],
"prompt_number": 9
},
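{
"cell_type": "markdown",
"metadata": {},
"source": "Besides `show_topics()`, the trained model can be applied to individual documents to get their topic mixture; a short sketch (the unseen example sentence is arbitrary):"
},
{
"cell_type": "code",
"collapsed": false,
"input": "# topic mixture for the first document: a list of (topic_id, probability) pairs\n# (topics with very small probability may be omitted)\nprint(lda[tfidf[corpus[0]]])\n\n# an unseen document works the same way: bag-of-words, then tfidf, then lda\nunseen_bow = dictionary.doc2bow('human computer interaction'.lower().split())\nprint(lda[tfidf[unseen_bow]])",
"language": "python",
"metadata": {},
"outputs": []
},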
{
"cell_type": "code",
"collapsed": false,
"input": "# prep the similarity matrix object\nlda_index = similarities.MatrixSimilarity(lda[corpus]) # transform corpus to LDA space and index it\n\n# now simulate a user query\n# a new query \nquery = 'computer user'\nquery_bow = dictionary.doc2bow(query.lower().split())\nquery_lda = lda[query_bow] # map the query into the LDA space\n\n# now we can check which doc is most similar\nsims = lda_index[query_lda] # perform a similarity query against the corpus\n# order decending\nsims = sorted(enumerate(sims), key=lambda item: -item[1])\n# print with the scores, and the original text\nfor idx, score in sims:\n print('doc_index: {} - score: {}\\n original doc: {}'.format(idx, score, documents[idx]))",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "doc_index: 0 - score: 0.998648226261\n original doc: Human machine interface for lab abc computer applications\ndoc_index: 4 - score: 0.998213946819\n original doc: Relation of user perceived response time to error measurement\ndoc_index: 2 - score: 0.998129725456\n original doc: The EPS user interface management system\ndoc_index: 1 - score: 0.985193908215\n original doc: A survey of user opinion of computer system response time\ndoc_index: 5 - score: 0.398836731911\n original doc: The generation of random binary unordered trees\ndoc_index: 8 - score: 0.302747189999\n original doc: Graph minors A survey\ndoc_index: 6 - score: 0.296425819397\n original doc: The intersection graph of paths in trees\ndoc_index: 7 - score: 0.251685649157\n original doc: Graph minors IV Widths of trees and well quasi ordering\ndoc_index: 3 - score: 0.233852759004\n original doc: System and human system engineering testing of EPS\n"
}
],
"prompt_number": 10
},
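{
"cell_type": "markdown",
"metadata": {},
"source": "The similarity index can also be saved and reloaded instead of being rebuilt for every session; a minimal sketch (the path is just an example):"
},
{
"cell_type": "code",
"collapsed": false,
"input": "# persist the index and load it back\nlda_index.save('/tmp/deerwester_lda.index')\nloaded_index = similarities.MatrixSimilarity.load('/tmp/deerwester_lda.index')\nprint(loaded_index[query_lda]) # same (unsorted) similarity scores as above",
"language": "python",
"metadata": {},
"outputs": []
},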
{
"cell_type": "code",
"collapsed": false,
"input": "# same thing with hierarchical LDA - note there is NO num_topics parameter\n# see: http://radimrehurek.com/gensim/models/hdpmodel.html\nhlda = models.HdpModel(corpus_tfidf, id2word=dictionary)\n\nimport logging\nlogging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)\n# enable logging to see the printed topics\nhlda.print_topics(topics=10, topn=10)",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stderr",
"text": "2014-05-23 12:54:05,467 : INFO : topic 0: 0.178*minors + 0.161*interface + 0.154*time + 0.103*user + 0.097*graph + 0.085*system + 0.070*human + 0.057*eps + 0.042*trees + 0.025*survey\n"
},
{
"output_type": "stream",
"stream": "stderr",
"text": "2014-05-23 12:54:05,468 : INFO : topic 1: 0.402*system + 0.211*interface + 0.111*human + 0.076*survey + 0.057*trees + 0.048*computer + 0.037*eps + 0.030*response + 0.025*time + 0.003*graph\n"
},
{
"output_type": "stream",
"stream": "stderr",
"text": "2014-05-23 12:54:05,468 : INFO : topic 2: 0.277*response + 0.193*user + 0.131*eps + 0.083*computer + 0.078*survey + 0.067*trees + 0.054*time + 0.047*human + 0.033*minors + 0.029*graph\n"
},
{
"output_type": "stream",
"stream": "stderr",
"text": "2014-05-23 12:54:05,468 : INFO : topic 3: 0.420*computer + 0.206*trees + 0.131*interface + 0.064*response + 0.048*minors + 0.034*survey + 0.032*human + 0.019*graph + 0.019*time + 0.016*user\n"
},
{
"output_type": "stream",
"stream": "stderr",
"text": "2014-05-23 12:54:05,469 : INFO : topic 4: 0.203*human + 0.192*response + 0.166*minors + 0.115*system + 0.095*trees + 0.091*graph + 0.048*user + 0.047*interface + 0.020*computer + 0.016*time\n"
},
{
"output_type": "stream",
"stream": "stderr",
"text": "2014-05-23 12:54:05,469 : INFO : topic 5: 0.204*interface + 0.198*trees + 0.104*survey + 0.099*human + 0.091*computer + 0.085*response + 0.078*user + 0.070*minors + 0.029*system + 0.024*eps\n"
},
{
"output_type": "stream",
"stream": "stderr",
"text": "2014-05-23 12:54:05,469 : INFO : topic 6: 0.409*interface + 0.226*survey + 0.110*system + 0.071*eps + 0.059*human + 0.050*response + 0.024*trees + 0.021*graph + 0.017*user + 0.005*time\n"
},
{
"output_type": "stream",
"stream": "stderr",
"text": "2014-05-23 12:54:05,470 : INFO : topic 7: 0.271*time + 0.187*minors + 0.111*computer + 0.106*human + 0.095*trees + 0.083*graph + 0.050*survey + 0.034*eps + 0.030*response + 0.029*interface\n"
},
{
"output_type": "stream",
"stream": "stderr",
"text": "2014-05-23 12:54:05,470 : INFO : topic 8: 0.262*interface + 0.192*computer + 0.190*minors + 0.118*user + 0.078*trees + 0.064*human + 0.031*response + 0.029*graph + 0.021*system + 0.009*eps\n"
},
{
"output_type": "stream",
"stream": "stderr",
"text": "2014-05-23 12:54:05,470 : INFO : topic 9: 0.193*minors + 0.187*trees + 0.123*user + 0.116*time + 0.080*survey + 0.077*eps + 0.071*human + 0.061*interface + 0.057*response + 0.029*system\n"
}
],
"prompt_number": 11
},
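{
"cell_type": "markdown",
"metadata": {},
"source": "Like the other gensim models, the HDP model can be saved, reloaded, and applied to new documents; a short sketch (the query string and `/tmp` path are just examples):"
},
{
"cell_type": "code",
"collapsed": false,
"input": "# persist the HDP model and load it back\nhlda.save('/tmp/deerwester.hdp')\nloaded_hlda = models.HdpModel.load('/tmp/deerwester.hdp')\n\n# applying the loaded model to a vector gives (topic_id, weight) pairs\nprint(loaded_hlda[tfidf[dictionary.doc2bow('graph trees minors'.split())]])",
"language": "python",
"metadata": {},
"outputs": []
},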
{
"cell_type": "code",
"collapsed": false,
"input": "# prep the similarity matrix object\nhlda_index = similarities.MatrixSimilarity(hlda[corpus]) # transform corpus to HLDA space and index it\n\n# now simulate a user query\n# a new query \nquery = 'graph user response'\nquery_bow = dictionary.doc2bow(query.lower().split())\nquery_hlda = hlda[query_bow] # map the query into the LDA space\n\n# now we can check which doc is most similar\nsims = hlda_index[query_hlda] # perform a similarity query against the corpus\n# order decending\nsims = sorted(enumerate(sims), key=lambda item: -item[1])\n# print with the scores, and the original text\nfor idx, score in sims:\n print('doc_index: {} - score: {}\\n original doc: {}'.format(idx, score, documents[idx]))",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stderr",
"text": "2014-05-23 12:54:05,477 : WARNING : scanning corpus to determine the number of features (consider setting `num_features` explicitly)\n"
},
{
"output_type": "stream",
"stream": "stderr",
"text": "2014-05-23 12:54:05,492 : INFO : creating matrix for 9 documents and 9 features\n"
},
{
"output_type": "stream",
"stream": "stdout",
"text": "doc_index: 4 - score: 0.999995589256\n original doc: Relation of user perceived response time to error measurement\ndoc_index: 1 - score: 0.95928555727\n original doc: A survey of user opinion of computer system response time\ndoc_index: 5 - score: 0.215702205896\n original doc: The generation of random binary unordered trees\ndoc_index: 6 - score: 0.161945030093\n original doc: The intersection graph of paths in trees\ndoc_index: 8 - score: 0.140924185514\n original doc: Graph minors A survey\ndoc_index: 7 - score: 0.140730023384\n original doc: Graph minors IV Widths of trees and well quasi ordering\ndoc_index: 0 - score: 0.13894803822\n original doc: Human machine interface for lab abc computer applications\ndoc_index: 2 - score: 0.129468992352\n original doc: The EPS user interface management system\ndoc_index: 3 - score: 0.100748173892\n original doc: System and human system engineering testing of EPS\n"
}
],
"prompt_number": 12
},
{
"cell_type": "code",
"collapsed": false,
"input": "# cool, now let's train LSI with 3 topics\nlsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=3) # initialize an LSI transformation\nlsi_index = similarities.MatrixSimilarity(lsi[corpus]) # transform corpus to LSI space and index it\n\n# a new query \nquery = 'computer user'\nquery_bow = dictionary.doc2bow(query.lower().split())\nquery_lsi = lsi[query_bow] # convert the query to LSI space\n\n# now we can check which doc is most similar\nsims = lsi_index[query_lsi] # perform a similarity query against the corpus\n# order decending\nsims = sorted(enumerate(sims), key=lambda item: -item[1])\n# print with the scores, and the original text\nfor idx, score in sims:\n print('doc_index: {} - score: {}\\n original doc: {}'.format(idx, score, documents[idx]))",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stderr",
"text": "2014-05-23 12:54:05,516 : INFO : using serial LSI version on this node\n"
},
{
"output_type": "stream",
"stream": "stderr",
"text": "2014-05-23 12:54:05,517 : INFO : updating model with new documents\n"
},
{
"output_type": "stream",
"stream": "stderr",
"text": "2014-05-23 12:54:05,517 : INFO : preparing a new chunk of documents\n"
},
{
"output_type": "stream",
"stream": "stderr",
"text": "2014-05-23 12:54:05,518 : INFO : using 100 extra samples and 2 power iterations\n"
},
{
"output_type": "stream",
"stream": "stderr",
"text": "2014-05-23 12:54:05,518 : INFO : 1st phase: constructing (12, 103) action matrix\n"
},
{
"output_type": "stream",
"stream": "stderr",
"text": "2014-05-23 12:54:05,519 : INFO : orthonormalizing (12, 103) action matrix\n"
},
{
"output_type": "stream",
"stream": "stderr",
"text": "2014-05-23 12:54:05,520 : INFO : 2nd phase: running dense svd on (12, 9) matrix\n"
},
{
"output_type": "stream",
"stream": "stderr",
"text": "2014-05-23 12:54:05,521 : INFO : computing the final decomposition\n"
},
{
"output_type": "stream",
"stream": "stderr",
"text": "2014-05-23 12:54:05,521 : INFO : keeping 3 factors (discarding 31.801% of energy spectrum)\n"
},
{
"output_type": "stream",
"stream": "stderr",
"text": "2014-05-23 12:54:05,522 : INFO : processed documents up to #9\n"
},
{
"output_type": "stream",
"stream": "stderr",
"text": "2014-05-23 12:54:05,522 : INFO : topic #0(1.594): 0.703*\"trees\" + 0.538*\"graph\" + 0.402*\"minors\" + 0.187*\"survey\" + 0.061*\"system\" + 0.060*\"response\" + 0.060*\"time\" + 0.058*\"user\" + 0.049*\"computer\" + 0.035*\"interface\"\n"
},
{
"output_type": "stream",
"stream": "stderr",
"text": "2014-05-23 12:54:05,523 : INFO : topic #1(1.476): -0.460*\"system\" + -0.373*\"user\" + -0.332*\"eps\" + -0.328*\"interface\" + -0.320*\"time\" + -0.320*\"response\" + -0.293*\"computer\" + -0.280*\"human\" + -0.171*\"survey\" + 0.161*\"trees\"\n"
},
{
"output_type": "stream",
"stream": "stderr",
"text": "2014-05-23 12:54:05,523 : INFO : topic #2(1.191): 0.456*\"time\" + 0.456*\"response\" + -0.352*\"eps\" + -0.340*\"human\" + -0.318*\"interface\" + -0.277*\"system\" + 0.272*\"survey\" + 0.213*\"user\" + -0.183*\"trees\" + 0.114*\"minors\"\n"
},
{
"output_type": "stream",
"stream": "stderr",
"text": "2014-05-23 12:54:05,524 : WARNING : scanning corpus to determine the number of features (consider setting `num_features` explicitly)\n"
},
{
"output_type": "stream",
"stream": "stderr",
"text": "2014-05-23 12:54:05,524 : INFO : creating matrix for 9 documents and 3 features\n"
},
{
"output_type": "stream",
"stream": "stdout",
"text": "doc_index: 1 - score: 0.977386057377\n original doc: A survey of user opinion of computer system response time\ndoc_index: 4 - score: 0.869656026363\n original doc: Relation of user perceived response time to error measurement\ndoc_index: 2 - score: 0.719187498093\n original doc: The EPS user interface management system\ndoc_index: 0 - score: 0.592006981373\n original doc: Human machine interface for lab abc computer applications\ndoc_index: 3 - score: 0.545888721943\n original doc: System and human system engineering testing of EPS\ndoc_index: 8 - score: 0.299895048141\n original doc: Graph minors A survey\ndoc_index: 7 - score: -0.00787871703506\n original doc: Graph minors IV Widths of trees and well quasi ordering\ndoc_index: 6 - score: -0.0641105175018\n original doc: The intersection graph of paths in trees\ndoc_index: 5 - score: -0.135604828596\n original doc: The generation of random binary unordered trees\n"
}
],
"prompt_number": 13
}
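,
{
"cell_type": "markdown",
"metadata": {},
"source": "To make the LSI transformation more concrete: each document becomes a point in a 3-dimensional latent space, and the scores above are cosine similarities between those points (which is why some of them are negative). A small sketch printing each document's LSI coordinates:"
},
{
"cell_type": "code",
"collapsed": false,
"input": "# each document is projected onto num_topics=3 latent dimensions;\n# MatrixSimilarity compares documents by cosine similarity in this space\nfor doc, text in zip(corpus, documents):\n    print(text)\n    print(lsi[tfidf[doc]])",
"language": "python",
"metadata": {},
"outputs": []
}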
],
"metadata": {}
}
]
}