@benmarwick
Last active August 29, 2015 13:56
How to find the topic with the highest proportion in a set of documents (after a topic model has been generated with the R package mallet)
Which documents belong to each topic?
Documents don't belong to a single topic; each document has a distribution
over topics.
But we can find the topic with the highest proportion for each document.
That top-ranking topic might be called the 'topic' for the document, but note
that every document contains every topic, in varying proportions.
Assume that we start with `topic_docs` from the output of the mallet package,
as it has been used here: https://github.com/benmarwick/dayofarchaeology/blob/master/004_generate_topic_model.r
The code below treats each column of `topic_docs` as a document and each row as a topic.
A little bit of cleaning first.
```{r}
# drop any rows that contain NAs
topic_docs <- na.omit(topic_docs)
# give a placeholder name to any document column whose name is empty
colnames(topic_docs)[names(topic_docs) == ""] <- 'no.name'
```
This will create a data frame with one row per document, giving the number of
the topic with the highest proportion in that document and the value of that
proportion.
```{r}
max.topics <- data.frame(
  # highest topic proportion in each document (column)
  max.topic.prop = as.numeric(sapply(topic_docs, max)),
  # row index of that proportion, i.e. the topic number
  max.topic.num = apply(topic_docs, 2, which.max))
# clean up document names to use later
max.topics$document <- gsub('[[:punct:]]', ".", rownames(max.topics))
max.topics$document <- gsub('[[:space:]]', ".", max.topics$document)
rownames(max.topics) <- NULL
```
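To see what this step produces, here is a minimal sketch on a made-up table. It assumes, as the code above implies, that `topic_docs` has one column per document and one row per topic. The names `toy_docs` and `toy_max` and all the values are invented for illustration and are not output from mallet.
```{r}
# hypothetical stand-in for topic_docs: rows are topics, columns are documents
toy_docs <- data.frame(doc.a = c(0.7, 0.2, 0.1),
                       doc.b = c(0.1, 0.3, 0.6),
                       doc.c = c(0.2, 0.5, 0.3))
# each column is a distribution over topics, so it should sum to 1
colSums(toy_docs)
# same two steps as above: the highest proportion, and the row (topic) it sits in
toy_max <- data.frame(
  max.topic.prop = as.numeric(sapply(toy_docs, max)),
  max.topic.num = apply(toy_docs, 2, which.max))
toy_max
```
In this toy table the top topic is 1 for `doc.a`, 3 for `doc.b` and 2 for `doc.c`, and the document names end up as the row names of `toy_max`, which is why the real code reads them back with `rownames(max.topics)`.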
It's convenient to sort the table by topic number to see which documents
have the same top topic.
```{r}
library(dplyr)
max.topics <- arrange(max.topics, max.topic.num)
```
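A quick tally of how many documents share each top topic can complement the sorted table. This is only a sketch using base R's `table()` on a made-up stand-in for `max.topics`. The name `toy_tops` and its values are hypothetical.
```{r}
# hypothetical stand-in for max.topics, with invented values
toy_tops <- data.frame(max.topic.prop = c(0.7, 0.6, 0.5, 0.4),
                       max.topic.num = c(1, 3, 1, 2))
# number of documents whose top-ranking topic is each topic number
topic_counts <- table(toy_tops$max.topic.num)
topic_counts
```
With the real data, the same call would be `table(max.topics$max.topic.num)`.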
We can also easily subset the documents by topic (using the dplyr package again).
```{r}
# keep only the documents whose top-ranking topic is topic 20
filter(max.topics, max.topic.num == 20)
```