Last active
August 29, 2015 13:56
How to find the topic with the highest proportion in a set of documents (after a topic model has been generated with the R package mallet)
Which documents belong to each topic?
Documents don't belong to a single topic; each document has a distribution
over topics.
But we can find the topic with the highest proportion for each document.
That top-ranking topic might be called the 'topic' for the document, but note
that all documents contain all topics in varying proportions.
Assume that we start with `topic_docs` from the output of the mallet package
as it has been used here: https://github.com/benmarwick/dayofarchaeology/blob/master/004_generate_topic_model.r
The code below treats each column of `topic_docs` as a document and each row as a topic.
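To follow along without running the full model, a small stand-in for `topic_docs` can be built by hand. This is only a sketch: it assumes one row per topic and one column per document, which is what the column-wise code in this file expects, and the name `toy_topic_docs`, the document names, and the proportions are all invented for illustration.

```{r}
# hypothetical toy data: 3 topics (rows) x 4 documents (columns)
toy_topic_docs <- data.frame(
  doc.a = c(0.7, 0.2, 0.1),
  doc.b = c(0.1, 0.3, 0.6),
  doc.c = c(0.2, 0.5, 0.3),
  doc.d = c(0.6, 0.3, 0.1)
)
# each column is a distribution over topics, so each should sum to 1
colSums(toy_topic_docs)
# the top topic for each document is the row index of the column maximum
sapply(toy_topic_docs, which.max)
```

Here `doc.a` and `doc.d` share topic 1 as their top topic, even though every document still has a non-zero proportion of all three topics.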
A little bit of cleaning first.
```{r}
# remove any NAs (na.omit drops every row that contains an NA)
topic_docs <- na.omit(topic_docs)
# add doc name if empty (documents are the columns)
colnames(topic_docs)[colnames(topic_docs) == ""] <- 'no.name'
```
This will create a data frame with one row per document, giving the number of
the topic with the highest proportion for each document. It will also give the
value of that highest proportion.
```{r}
# for each document (column): the highest topic proportion and the
# row index (topic number) where that maximum occurs
max.topics <- data.frame(max.topic.prop = as.numeric(sapply(topic_docs, max)),
                         max.topic.num = apply(topic_docs, 2, which.max))
# clean up document names to use later
# (the row names come from the column names of topic_docs)
max.topics$document <- gsub('[[:punct:]]', ".", rownames(max.topics))
max.topics$document <- gsub('[[:space:]]', ".", max.topics$document)
rownames(max.topics) <- NULL
```
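As a cross-check, the same table can be sketched with base R's `max.col()`, which returns the column index of the maximum in each row of a matrix. This is an alternative to the code above, not a replacement for it. It assumes the topics-as-rows layout, so the data is transposed first; `toy`, `m`, and `check` are hypothetical names, and the numbers are made up.

```{r}
# hypothetical toy data: 3 topics (rows) x 3 documents (columns)
toy <- data.frame(doc.a = c(0.7, 0.2, 0.1),
                  doc.b = c(0.1, 0.3, 0.6),
                  doc.c = c(0.2, 0.5, 0.3))
# transpose so that documents are rows and topics are columns
m <- t(as.matrix(toy))
check <- data.frame(document = rownames(m),
                    # ties.method = "first" matches which.max, which also
                    # returns the first maximum when there are ties
                    max.topic.num = max.col(m, ties.method = "first"),
                    max.topic.prop = apply(m, 1, max))
rownames(check) <- NULL
check
```

Note that the default `ties.method` of `max.col()` is `"random"`, so it needs to be set explicitly if the result should agree with `which.max()` when two topics tie.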
It's convenient to sort the table by topic number to see which documents
have the same top topic.
```{r}
library(dplyr)
max.topics <- arrange(max.topics, max.topic.num)
```
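Once the table is sorted, it can also help to tally how many documents share each top topic. The sketch below uses base R's `table()` on a hand-made stand-in for `max.topics`; the name `toy_max_topics` and its values are hypothetical, chosen only to mirror the three columns created above.

```{r}
# hypothetical stand-in for the max.topics table built above
toy_max_topics <- data.frame(
  max.topic.prop = c(0.7, 0.6, 0.5, 0.6),
  max.topic.num  = c(1L, 3L, 2L, 1L),
  document       = c("doc.a", "doc.b", "doc.c", "doc.d")
)
# number of documents whose top-ranking topic is each topic
topic_counts <- table(toy_max_topics$max.topic.num)
topic_counts
```

Topics that are never the top-ranking topic for any document do not appear in this tally at all.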
We can also easily subset the documents by topic (using the dplyr package again).
```{r}
# keep only the documents whose top-ranking topic is topic 20
filter(max.topics, max.topic.num == 20)
```