Quick look at 'Off Our Backs' with some of the text mining functions in the R package [JSTORr][id1]
========================================================
Here I demonstrate some basic text mining of a selection of articles from 'Off Our Backs' that are held by JSTOR. I was motivated by [this blog post][id]. I first tried with all articles of one or more pages, but that resulted in some corrupt data (empty CSV files), so the data here are all articles of two or more pages. The methods come from the R package [JSTORr][id1] and use the article as the basic unit of analysis. The motivation for this is that [JSTOR's DFR][id2] data are provided at the article level, which does yield interesting results at the scale of many thousands of articles. However, this is a coarser resolution than corpus linguistics methods that use the word as the unit of analysis, so the results here might not compare well with those methods on a small sample of articles.
```{r, echo=TRUE, message=FALSE, cache=TRUE, warning = FALSE, results='hide'}
# prepare environment for analysis
# assign working directory (i.e. the location of the
# downloaded and unzipped DFR archive)
wd <- "C:/Users/marwick/Downloads/2013.6.27.j7MTwUWC_OOB"
setwd(wd)
# load package
# see https://github.com/UW-ARCHY-textual-macroanalysis-lab/JSTORr
# for details on how to download and get started
require(JSTORr)
# bring JSTOR DFR data into R
unpack1grams <- JSTOR_unpack1grams()
# process the data to remove unclean text, OCR errors, etc.
nouns <- JSTOR_dtmofnouns(unpack1grams, POStag = FALSE)
# I've already removed the most common stopwords, but
# let's remove a few specific to this archive
OOB_stopwords <- c("woman", "women", "people",
                   "will", "said", "many", "much",
                   "time", "world", "movement",
                   "page", "conference", "issues",
                   "feminist")
nouns <- nouns[, !(nouns$dimnames$Terms %in% OOB_stopwords) ]
```
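As a quick sanity check (a minimal sketch using base R only), we can confirm the custom stopwords are gone from the matrix:
```{r, echo=TRUE, message=FALSE, cache=TRUE}
# sanity check: none of the custom stopwords should remain as terms
any(OOB_stopwords %in% nouns$dimnames$Terms)
```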
The document term matrix contains `r length(unpack1grams$wordcounts$dimnames$Docs)` documents and `r length(unpack1grams$wordcounts$dimnames$Terms)` words. This is a relatively small collection for the methods used here, so we'll proceed with caution.
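To get a feel for the matrix itself (a rough sketch, assuming `wordcounts` is a `tm` document term matrix backed by a `slam` simple triplet matrix, as the `$dimnames` subsetting above suggests), we can check its dimensions and sparsity:
```{r, echo=TRUE, message=FALSE, cache=TRUE}
# dimensions: documents x terms
dim(unpack1grams$wordcounts)
# proportion of cells with a non-zero count (the $v slot holds
# only the non-zero entries of the simple triplet matrix)
length(unpack1grams$wordcounts$v) / prod(dim(unpack1grams$wordcounts))
```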
\newpage
Let's see which words are most frequent over time.
```{r, echo=TRUE, message=FALSE, fig.height=6, fig.width=8, cache=TRUE, results='hide', warning = FALSE, fig.cap="Most frequent words over time"}
freq <- JSTOR_freqwords(unpack1grams, nouns)
```
\newpage
Looks like an interesting change in the word *black*: it's very frequent from 1970 to 1995, so let's have another look at that. We have fewer articles (i.e. fewer dots) after 1997, after which *lesbian*, *rights* and *children* become more frequent. Why would that be? The mean frequency per article of *black*, indicated by the blue line, doesn't change much.
```{r, echo=TRUE, message=FALSE, fig.height=4, cache=TRUE, results='hide', warning = FALSE, fig.cap="Frequency of *black* over time"}
JSTOR_1word(unpack1grams, "black")
```
\newpage
Let's see how the use of the word *black* changes over time by investigating strongly correlated words. The table of words that follows may repay more detailed study.
```{r, echo=TRUE, message=FALSE, fig.height=8, fig.width=10, cache=TRUE, warning = FALSE, fig.cap="Words correlated with *black* over time", results='hide'}
black <- JSTOR_findassocs(unpack1grams, nouns,
                          "black", n = 5, topn = 15)
```
\newpage
Here's the table of highly correlated words:
```{r, echo=TRUE, message=FALSE, cache=TRUE, warning = FALSE, results='asis'}
require(pander)
set.caption('Words correlated with *black* (top 15 words per year interval)')
black_tab <- black[black$words != "", ]
pandoc.table(black_tab, style="rmarkdown")
```
\newpage
We can see if there appear to be clusters amongst the documents containing the word *black*.
```{r, echo=TRUE, message=FALSE, cache=TRUE, warning = FALSE, results='hide', include=FALSE}
black_clus <- JSTOR_clusterbywords(nouns, 'black')
```
![k-means clusters of documents containing *black*](/Users/marwick/Downloads/2013.6.27.j7MTwUWC_OOB/figure/bla.png)
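As a quick check on the clustering (a sketch, assuming the `kmeans` element returned by `JSTOR_clusterbywords` is a standard `stats::kmeans` object), we can tabulate how many documents fall into each cluster:
```{r, echo=TRUE, message=FALSE, cache=TRUE}
# sketch: number of documents in each k-means cluster
# assumes black_clus$kmeans is a standard stats::kmeans fit
table(black_clus$kmeans$cluster)
```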
\newpage
We can see what words define cluster membership; check out cluster 42.
```{r, echo=TRUE, message=FALSE, cache=TRUE}
black_clus$kmeans
```
\newpage
And we can see what words are most responsible for variation along the two axes.
```{r, echo=TRUE, message=FALSE, cache=TRUE}
contrib <- data.frame(black_clus$PCA$var$contrib)
# x-axis: twenty words with the biggest contributions to Dim.1
row.names(contrib[with(contrib, order(-Dim.1)), ][1:20, ])
# y-axis: twenty words with the biggest contributions to Dim.2
row.names(contrib[with(contrib, order(-Dim.2)), ][1:20, ])
```
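A bar chart makes these contributions easier to compare (a rough sketch using base graphics, assuming the PCA comes from FactoMineR, whose `var$contrib` values are percentages):
```{r, echo=TRUE, message=FALSE, fig.height=5, cache=TRUE}
# sketch: bar chart of the 20 biggest contributors to the x-axis
top20 <- head(contrib[order(-contrib$Dim.1), "Dim.1", drop = FALSE], 20)
barplot(top20$Dim.1, names.arg = rownames(top20),
        las = 2, cex.names = 0.7,
        ylab = "contribution to Dim.1 (%)")
```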
\newpage
Let's have a look at how a few words vary over time. How many times does the word 'essentialism' occur? In this set of articles that are at least two pages long, this word occurs `r sum(unpack1grams$wordcounts[,unpack1grams$wordcounts$dimnames$Terms == "essentialism"])` times. We can plot the change in frequency of this word over time.
```{r, echo=TRUE, message=FALSE, fig.height=4, cache=TRUE, results='hide', warning = FALSE, fig.cap="Frequency of *essentialism* over time"}
JSTOR_1word(unpack1grams, "essentialism", se=FALSE)
```
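The inline count above can be wrapped in a small helper for reuse with other words (a minimal sketch; `count_word` is a hypothetical convenience function, not part of JSTORr):
```{r, echo=TRUE, message=FALSE, cache=TRUE}
# hypothetical helper: total count of any 1-gram across the archive
count_word <- function(unpack, word) {
  sum(unpack$wordcounts[, unpack$wordcounts$dimnames$Terms == word])
}
count_word(unpack1grams, "essentialism")
```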
\newpage
We can plot the top words correlated with 'essentialism' over the duration of the archive.
```{r, echo=TRUE, message=FALSE, fig.height=8, fig.width=8, cache=TRUE, warning = FALSE, fig.cap="Words correlated with *essentialism* over time", results='hide'}
essentialism <- JSTOR_findassocs(unpack1grams, nouns,
                                 "essentialism", n = 5, topn = 15)
```
\newpage
We can also get a table of some of these highly correlated words. Many of the odd-looking words are due to OCR errors or hyphenation.
```{r, echo=TRUE, message=FALSE, cache=TRUE, warning = FALSE, results='asis'}
require(pander)
set.caption('Words correlated with *essentialism* (top 15 words per year interval)')
essentialism_tab <- essentialism[essentialism$words != "", ]
pandoc.table(essentialism_tab, style="rmarkdown")
```
\newpage
Now let's try 'biological', which occurs `r sum(unpack1grams$wordcounts[,unpack1grams$wordcounts$dimnames$Terms == "biological"])` times in the archive. We can plot the frequency of this word over time and plot and tabulate highly correlated words.
```{r, echo=TRUE, message=FALSE, fig.height=4, cache=TRUE, warning = FALSE, fig.cap="Frequency of *biological* over time"}
JSTOR_1word(unpack1grams, "biological", se=FALSE)
```
\newpage
```{r, echo=TRUE, message=FALSE, fig.height=8, fig.width=8, cache=TRUE, warning = FALSE, fig.cap="Words correlated with *biological* over time", results='hide'}
biological <- JSTOR_findassocs(unpack1grams, nouns,
                               "biological", n = 5, topn = 15)
```
\newpage
```{r, echo=TRUE, message=FALSE, cache=TRUE, warning = FALSE, results='asis'}
require(pander)
set.caption('Words correlated with *biological* (top 15 words per year interval)')
biological_tab <- biological[biological$words != "", ]
pandoc.table(biological_tab, style="rmarkdown")
```
\newpage
We can investigate two-grams such as *biological determinism*, *biological explanation* and *biological difference*.
```{r, echo=TRUE, message=FALSE, cache=TRUE, warning = FALSE, results='hide', tidy=FALSE}
# bring 2-grams into R
unpack2grams <- JSTOR_unpack2grams()
# make a vector of bigrams
twogs <- c("biological determinism", "biological explanation",
           "biological difference", "hormonal determinism")
# calculate abundance of bigrams
require(plyr)
# sum each bigram's count across all documents; returning a bare
# number lets ldply stack the results into a single one-column frame
counts <- ldply(twogs, function(i)
  sum(unpack2grams$bigrams[,
    unpack2grams$bigrams$dimnames$Terms == i]))
rownames(counts) <- twogs
names(counts) <- "count"
```
These terms are not frequent in this archive, so they are of little use with the methods in this package.
```{r, echo=TRUE, message=FALSE, cache=TRUE, warning = FALSE, results='asis'}
require(pander)
set.caption('Frequency of select bigrams')
pandoc.table(counts, style="rmarkdown")
```
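To find which *biological ...* bigrams do occur here (a quick sketch using base R over the bigram matrix's terms):
```{r, echo=TRUE, message=FALSE, cache=TRUE}
# sketch: all bigrams in the archive that start with 'biological'
biol_bigrams <- grep("^biological ",
                     unpack2grams$bigrams$dimnames$Terms, value = TRUE)
head(biol_bigrams, 20)
```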
\newpage
We can look at the correlation of *essentialism* and *biological*. They are positively correlated, as expected, but there are very few data points. This would be more interesting with a larger archive.
```{r, echo=TRUE, message=FALSE, cache=TRUE, warning = FALSE, results='asis', fig.cap="Correlation of *biological* and *essentialism* over time"}
JSTOR_2wordcor(unpack1grams, "essentialism", "biological")
```
\newpage
We can see if there appear to be clusters amongst the documents containing the word *biological*.
```{r, echo=TRUE, message=FALSE, cache=TRUE, warning = FALSE, results='hide', include=FALSE}
biological_clus <- JSTOR_clusterbywords(nouns, 'biological')
```
![k-means clusters of documents containing *biological*](/Users/marwick/Downloads/2013.6.27.j7MTwUWC_OOB/figure/pca.png)
\newpage
We can see what words define cluster membership; check out cluster 10.
```{r, echo=TRUE, message=FALSE, cache=TRUE}
biological_clus$kmeans
```
\newpage
And we can see what words are most responsible for variation along the two axes.
```{r, echo=TRUE, message=FALSE, cache=TRUE}
contrib <- data.frame(biological_clus$PCA$var$contrib)
# x-axis: twenty words with the biggest contributions to Dim.1
row.names(contrib[with(contrib, order(-Dim.1)), ][1:20, ])
# y-axis: twenty words with the biggest contributions to Dim.2
row.names(contrib[with(contrib, order(-Dim.2)), ][1:20, ])
```
So there's a short and quick demonstration of some of the basic functions in the JSTORr package. All the code is included here so the entire analysis can be replicated by anyone with a basic familiarity with R. Since you've already got a topic modelling workflow, I won't bother with that here.
Since this area of inquiry is a bit unfamiliar to me, I'm not going to hazard any interpretations, but I hope there's something intriguing in there for you. The methods in the [JSTORr][id1] package are probably at their most effective with a larger archive of articles over a longer time period. Perhaps if several similar journals were put together we might get some more reliable data.
---------------------
Ben Marwick, June 2013
```{r, echo=FALSE, message=FALSE, eval=FALSE}
# This chunk runs the code and generates the PDF;
# it will not appear in the PDF itself
# Load packages
setwd(wd) # assumes wd has been set earlier in the doc
require(knitr)
require(markdown)
# process .md and .pdf files (including smart punctuation and grey background of code blocks)
filen <- "OOD" # name of this markdown file without suffix
knit(paste0(filen,".md"))
system(paste0("pandoc -s ", paste0(filen,"-out.md"), " -t latex -o ", paste0(filen,".pdf"), " --highlight-style=tango -S"))
```
[id]: http://historyinthecity.blogspot.com/2013/06/lessons-in-antconc-3-with-heather.html
[id1]: https://github.com/UW-ARCHY-textual-macroanalysis-lab/JSTORr
[id2]: http://dfr.jstor.org/