In a blog post I wrote about the Python package lda, I used the pre-processed data included with the lda package for the example. I have since received many questions about the document-term matrix, the titles, and the vocabulary: where do they come from? This gist uses the textmining package on a tiny three-document corpus to (hopefully) help answer these questions.
To install textmining use pip (create a virtual environment first, if you'd like):
$ pip install textmining
The script can be run from the command line with:
$ python lda_textmine_ex.py
The output should look like:
**These are the 'documents', making up our 'corpus':
document 1: John and Bob are brothers.
document 2: John went to the store. The store was closed.
document 3: Bob went to the store too.
-- In real applications, these 'documents' might be read from files, websites, etc.
**These are the 'document titles':
title 1: Brothers.
title 2: John to the store.
title 3: Bob to the store.
-- In real applications, these 'titles' might be the file name, the story title, webpage title, etc.
** The textmining packages is one tool for creating the 'document-term' matrix, 'vocabulary', etc.
You can write your own, if needed.
** Output produced by the textmining package...
* The 'document-term' matrix
type(X): <type 'numpy.ndarray'>
shape: (3, 12)
X:
[[1 0 1 0 1 0 1 1 0 0 0 0]
 [0 2 0 1 0 1 0 1 1 1 2 0]
 [0 1 0 1 0 0 1 0 0 1 1 1]]
-- Notice there are 3 rows, for 3 'documents' and
12 columns, for 12 'vocabulary' words
-- The number of rows and columns depends on the number of documents
and number of unique words in -all- documents
* The 'vocabulary':
type(vocab): <type 'tuple'>
len(vocab): 12
vocab:
('and', 'the', 'brothers', 'to', 'are', 'closed', 'bob', 'john', 'was', 'went', 'store', 'too')
-- These are the 12 words in the vocabulary
-- Often common 'stop' words, like 'and', 'the', 'to', etc are
filtered out -before- creating the document-term matrix and vocab
* Again, the 'titles' for this 'corpus':
type(titles): <type 'tuple'>
len(titles): 3
titles:
('Brothers.', 'John to the store.', 'Bob to the store.')
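The output notes that you can write your own tool if needed. As a rough illustration of what that involves, here is a minimal pure-Python sketch that builds a document-term matrix and vocabulary for the same three documents. It is not how the textmining package works internally: `tokenize` and `build_document_term_matrix` are names made up for this example, the tokenizing rule (lowercase, letters only) is an assumption, and the vocabulary is sorted alphabetically, so the column order differs from the output above even though the counts agree.

```python
import re
from collections import Counter

# The same three toy documents and titles shown in the output above.
docs = (
    "John and Bob are brothers.",
    "John went to the store. The store was closed.",
    "Bob went to the store too.",
)
titles = ("Brothers.", "John to the store.", "Bob to the store.")


def tokenize(text):
    """Lowercase the text and keep runs of letters (an assumed, simple rule)."""
    return re.findall(r"[a-z]+", text.lower())


def build_document_term_matrix(documents):
    """Return (X, vocab), where X[i][j] counts vocab[j] in documents[i]."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    # Alphabetical vocabulary: a choice made here for reproducibility.
    vocab = tuple(sorted(set().union(*counts)))
    X = [[c[word] for word in vocab] for c in counts]
    return X, vocab


X, vocab = build_document_term_matrix(docs)
print(vocab)
for title, row in zip(titles, X):
    print(title, row)
```

Each row of `X` lines up with one title and each column with one vocabulary word. If the result is meant for lda, `numpy.array(X)` turns the nested list into an integer array with the same (3, 12) shape as above.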
Hopefully this gives a sense of how a set of documents (a corpus) relates to the document-term matrix, the vocabulary, and the titles mentioned in the original post.
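As a follow-up to the stop-word note in the output, the sketch below shows how filtering common words before counting shrinks the vocabulary and the matrix. The stop list is a tiny hand-picked set for this toy corpus only (an assumption, not the textmining package's own list), and `content_tokens` is a hypothetical helper name.

```python
import re
from collections import Counter

# Hand-picked stop words for this toy corpus only (illustrative assumption).
STOP_WORDS = {"and", "are", "the", "to", "too", "was"}

docs = (
    "John and Bob are brothers.",
    "John went to the store. The store was closed.",
    "Bob went to the store too.",
)


def content_tokens(text, stop_words=STOP_WORDS):
    """Tokenize as lowercase letter runs, dropping any stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in stop_words]


# Count the remaining words per document, then build vocab and matrix.
counts = [Counter(content_tokens(doc)) for doc in docs]
vocab = tuple(sorted(set().union(*counts)))
X = [[c[word] for word in vocab] for c in counts]

print(vocab)  # 6 words remain instead of 12
for row in X:
    print(row)
```

With these six stop words removed, the matrix has 3 rows and 6 columns, and the remaining columns carry the words that actually distinguish the documents.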