benmarwick · August 29, 2015 13:56
diff --git a/docs-per-topic.rmd b/docs-per-topic.rmd
 Which documents belong to each topic?

 Documents don't belong to a single topic. Instead, each document has a
 distribution over all of the topics.

 But we can find the topic with the highest proportion for each document.
 That top-ranking topic might be called the 'topic' for the document, but note
 that all docs contain all topics in varying proportions.
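
 Here is a small illustration with made-up numbers. The object `toy` and its
 values are hypothetical, not model output. It uses the layout that the code
 further down implies for `topic_docs`: one column per document, one row per topic.

 ```{r}
 # hypothetical toy data: 3 topics (rows) by 2 documents (columns)
 toy <- data.frame(doc.a = c(0.7, 0.2, 0.1),
                   doc.b = c(0.3, 0.3, 0.4))
 # every document has some share of every topic, and the shares sum to 1
 toy.sums <- colSums(toy)
 toy.sums
 # row number of the top-ranking topic for each document
 toy.top <- sapply(toy, which.max)
 toy.top
 ```

 So `doc.b` would get topic 3 as its 'topic', even though topics 1 and 2 are
 not far behind.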

 Assume that we start with `topic_docs` from the output of the mallet package
 as it has been used here: https://github.com/benmarwick/dayofarchaeology/blob/master/004_generate_topic_model.r

 A little bit of cleaning first.
 ```{r}
 # remove any NAs
 topic_docs <- na.omit(topic_docs)
 # add doc name if empty
 colnames(topic_docs)[names(topic_docs) == ""] <- 'no.name'
 ```

 This will create a dataframe with each row as a document, and give the topic 
 number with the highest proportion for each document. It will also give the
 value of the highest proportion. 
 ```{r}
 # columns of topic_docs are documents, rows are topics
 max.topics <- data.frame(max.topic.prop = as.numeric(sapply(topic_docs, max)),
                          max.topic.num = apply(topic_docs, 2, which.max))
 # clean up document names to use later
 max.topics$document <- gsub('[[:punct:]]', ".", rownames(max.topics))
 max.topics$document <- gsub('[[:space:]]', ".", max.topics$document)
 rownames(max.topics) <- NULL
 ```
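
 To see what that chunk produces, here is a minimal sketch on made-up data.
 The objects `toy_docs` and `toy.max` and all their values are hypothetical.
 The only assumption carried over is that columns are documents and rows are topics.

 ```{r}
 # hypothetical toy input: 3 topics (rows) by 3 documents (columns)
 toy_docs <- data.frame(c(0.7, 0.2, 0.1),
                        c(0.3, 0.3, 0.4),
                        c(0.1, 0.8, 0.1))
 names(toy_docs) <- c("post 1", "post-2", "post_3")
 # highest proportion and the row number of the topic that has it
 toy.max <- data.frame(max.topic.prop = as.numeric(sapply(toy_docs, max)),
                       max.topic.num = apply(toy_docs, 2, which.max))
 # same name clean-up: punctuation and spaces become dots
 toy.max$document <- gsub('[[:punct:]]', ".", rownames(toy.max))
 toy.max$document <- gsub('[[:space:]]', ".", toy.max$document)
 rownames(toy.max) <- NULL
 toy.max
 ```

 Note that `max.topic.num` is a row position in the input, so it only matches
 the model's topic number if no rows were dropped beforehand.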

 It's convenient to sort the table by topic number to see which documents
 have the same top topic.

 ```{r}
 library(dplyr)
 max.topics <- arrange(max.topics, max.topic.num)
 ```
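
 Once the table is sorted, a quick tally shows how many documents share each
 top topic. This is only a sketch on hypothetical data: `toy.tally` and its
 values are made up as a stand-in for `max.topics`, and `count()` comes from dplyr.

 ```{r}
 library(dplyr)
 # hypothetical stand-in for max.topics
 toy.tally <- data.frame(max.topic.prop = c(0.7, 0.4, 0.8, 0.5),
                         max.topic.num = c(1, 3, 2, 3),
                         document = c("post.1", "post.2", "post.3", "post.4"))
 # number of documents per top-ranking topic
 topic.counts <- count(toy.tally, max.topic.num)
 topic.counts
 ```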

 We can also easily subset the documents by topic (using the dplyr package again).

 ```{r}
 filter(max.topics, max.topic.num == 20)
 ```