Quick look at 'Off Our Backs' with some of the text mining functions in the R package [JSTORr][id1]
========================================================
Here I demonstrate some basic text mining of a selection of articles from 'Off Our Backs' that are held by JSTOR. I was motivated by [this blog post][id]. I first tried with all articles of one or more pages, but that resulted in some corrupt data (empty CSV files), so the data here are all articles of two or more pages. The methods come from the R package [JSTORr][id1] and use the article as the basic unit of analysis. The motivation for this is that [JSTOR's DFR][id2] data are provided at the article level, which does yield interesting results at the scale of many thousands of articles. However, this is a coarser resolution than corpus linguistics methods that use the word as the unit of analysis, so the results here might not compare well with those methods on a small sample of articles.
```{r, echo=TRUE, message=FALSE, cache=TRUE, warning = FALSE, results='hide'}
# prepare environment for analysis
# assign working directory (i.e. the location of the
# downloaded and unzipped DFR archive)
wd <- "C:/Users/marwick/Downloads/2013.6.27.j7MTwUWC_OOB"
setwd(wd)
# load package
# see https://github.com/UW-ARCHY-textual-macroanalysis-lab/JSTORr
# for details on how to download and get started
require(JSTORr)
# bring JSTOR DFR data into R
unpack1grams <- JSTOR_unpack1grams()
# process the data to remove unclean text, OCR errors, etc.
nouns <- JSTOR_dtmofnouns(unpack1grams, POStag = FALSE)
# I've already removed the most common stopwords, but
# let's remove a few specific to this archive
OOB_stopwords <- c("woman", "women", "people",
                   "will", "said", "many", "much",
                   "time", "world", "movement",
                   "page", "conference", "issues",
                   "feminist")
nouns <- nouns[, !(nouns$dimnames$Terms %in% OOB_stopwords) ]
```
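As a quick sanity check (a minimal sketch using base R only), we can confirm the custom stopwords are gone from the matrix:
```{r, echo=TRUE, message=FALSE, cache=TRUE}
# sanity check: none of the custom stopwords should remain as terms
any(OOB_stopwords %in% nouns$dimnames$Terms)
```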
The document term matrix contains `r length(unpack1grams$wordcounts$dimnames$Docs)` documents and `r length(unpack1grams$wordcounts$dimnames$Terms)` words. This is a relatively small collection for the methods used here, so we'll proceed with caution.
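To get a feel for the matrix itself (a rough sketch, assuming `wordcounts` is a `tm` document term matrix backed by a `slam` simple triplet matrix, as the `$dimnames` subsetting above suggests), we can check its dimensions and sparsity:
```{r, echo=TRUE, message=FALSE, cache=TRUE}
# dimensions: documents x terms
dim(unpack1grams$wordcounts)
# proportion of cells with a non-zero count (the $v slot holds
# only the non-zero entries of the simple triplet matrix)
length(unpack1grams$wordcounts$v) / prod(dim(unpack1grams$wordcounts))
```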
\newpage
Let's see which words are most frequent over time.
```{r, echo=TRUE, message=FALSE, fig.height=6, fig.width=8, cache=TRUE, results='hide', warning = FALSE, fig.cap="Most frequent words over time"}
freq <- JSTOR_freqwords(unpack1grams, nouns)
```
\newpage
Looks like an interesting change in the word *black*: it's very frequent from 1970 to 1995, so let's have another look at that. We have fewer articles (i.e. fewer dots) after 1997, after which *lesbian*, *rights* and *children* become more frequent. Why would that be? The mean frequency per article of *black*, indicated by the blue line, doesn't change much.
```{r, echo=TRUE, message=FALSE, fig.height=4, cache=TRUE, results='hide', warning = FALSE, fig.cap="Frequency of *black* over time"}
JSTOR_1word(unpack1grams, "black")
```
\newpage
Let's see how the use of the word *black* changes over time by investigating strongly correlated words. The table of words that follows may repay more detailed study.
```{r, echo=TRUE, message=FALSE, fig.height=8, fig.width=10, cache=TRUE, warning = FALSE, fig.cap="Words correlated with *black* over time", results='hide'}
black <- JSTOR_findassocs(unpack1grams, nouns,
                          "black", n = 5, topn = 15)
```
\newpage
Here's the table of highly correlated words:
```{r, echo=TRUE, message=FALSE, cache=TRUE, warning = FALSE, results='asis'}
require(pander)
set.caption('Words correlated with *black* (top 15 words per year interval)')
black_tab <- black[black$words != "", ]
pandoc.table(black_tab, style="rmarkdown")
```
\newpage
We can see if there appear to be clusters amongst the documents containing the word *black*.
```{r, echo=TRUE, message=FALSE, cache=TRUE, warning = FALSE, results='hide', include=FALSE}
black_clus <- JSTOR_clusterbywords(nouns, 'black')
```
![k-means clusters of documents containing *black*](/Users/marwick/Downloads/2013.6.27.j7MTwUWC_OOB/figure/bla.png)
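As a quick check on the clustering (a sketch, assuming the `kmeans` element returned by `JSTOR_clusterbywords` is a standard `stats::kmeans` object), we can tabulate how many documents fall into each cluster:
```{r, echo=TRUE, message=FALSE, cache=TRUE}
# sketch: number of documents in each k-means cluster
# assumes black_clus$kmeans is a standard stats::kmeans fit
table(black_clus$kmeans$cluster)
```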
\newpage
We can see what words define cluster membership; check out cluster 42.
```{r, echo=TRUE, message=FALSE, cache=TRUE}
black_clus$kmeans
```
\newpage
And we can see what words are most responsible for variation along the two axes.
```{r, echo=TRUE, message=FALSE, cache=TRUE}
contrib <- data.frame(black_clus$PCA$var$contrib)
# x-axis: twenty words with the biggest contributions to Dim.1
row.names(contrib[with(contrib, order(-Dim.1)), ][1:20, ])
# y-axis: twenty words with the biggest contributions to Dim.2
row.names(contrib[with(contrib, order(-Dim.2)), ][1:20, ])
```
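A bar chart makes these contributions easier to compare (a rough sketch using base graphics, assuming the PCA comes from FactoMineR, whose `var$contrib` values are percentages):
```{r, echo=TRUE, message=FALSE, fig.height=5, cache=TRUE}
# sketch: bar chart of the 20 biggest contributors to the x-axis
top20 <- head(contrib[order(-contrib$Dim.1), "Dim.1", drop = FALSE], 20)
barplot(top20$Dim.1, names.arg = rownames(top20),
        las = 2, cex.names = 0.7,
        ylab = "contribution to Dim.1 (%)")
```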
\newpage
Let's have a look at how a few words vary over time. How many times does the word 'essentialism' occur? In this set of articles that are at least two pages long, this word occurs `r sum(unpack1grams$wordcounts[,unpack1grams$wordcounts$dimnames$Terms == "essentialism"])` times. We can plot the change in frequency of this word over time.
```{r, echo=TRUE, message=FALSE, fig.height=4, cache=TRUE, results='hide', warning = FALSE, fig.cap="Frequency of *essentialism* over time"}
JSTOR_1word(unpack1grams, "essentialism", se=FALSE)
```
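The inline count above can be wrapped in a small helper for reuse with other words (a minimal sketch; `count_word` is a hypothetical convenience function, not part of JSTORr):
```{r, echo=TRUE, message=FALSE, cache=TRUE}
# hypothetical helper: total count of any 1-gram across the archive
count_word <- function(unpack, word) {
  sum(unpack$wordcounts[, unpack$wordcounts$dimnames$Terms == word])
}
count_word(unpack1grams, "essentialism")
```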
\newpage
We can plot the top words correlated with 'essentialism' over the duration of the archive.
```{r, echo=TRUE, message=FALSE, fig.height=8, fig.width=8, cache=TRUE, warning = FALSE, fig.cap="Words correlated with *essentialism* over time", results='hide'}
essentialism <- JSTOR_findassocs(unpack1grams, nouns,
                                 "essentialism", n = 5, topn = 15)
```
\newpage
We can also get a table of some of these highly correlated words. Many of the odd-looking words are due to OCR errors or hyphenation.
```{r, echo=TRUE, message=FALSE, cache=TRUE, warning = FALSE, results='asis'}
require(pander)
set.caption('Words correlated with *essentialism* (top 15 words per year interval)')
essentialism_tab <- essentialism[essentialism$words != "", ]
pandoc.table(essentialism_tab, style="rmarkdown")
```
\newpage
Now let's try 'biological', which occurs `r sum(unpack1grams$wordcounts[,unpack1grams$wordcounts$dimnames$Terms == "biological"])` times in the archive. We can plot the frequency of this word over time and plot and tabulate highly correlated words.
```{r, echo=TRUE, message=FALSE, fig.height=4, cache=TRUE, warning = FALSE, fig.cap="Frequency of *biological* over time"}
JSTOR_1word(unpack1grams, "biological", se=FALSE)
```
\newpage
```{r, echo=TRUE, message=FALSE, fig.height=8, fig.width=8, cache=TRUE, warning = FALSE, fig.cap="Words correlated with *biological* over time", results='hide'}
biological <- JSTOR_findassocs(unpack1grams, nouns,
                               "biological", n = 5, topn = 15)
```
\newpage
```{r, echo=TRUE, message=FALSE, cache=TRUE, warning = FALSE, results='asis'}
require(pander)
set.caption('Words correlated with *biological* (top 15 words per year interval)')
biological_tab <- biological[biological$words != "", ]
pandoc.table(biological_tab, style="rmarkdown")
```
\newpage
We can investigate two-grams such as *biological determinism*, *biological explanation* and *biological difference*.
```{r, echo=TRUE, message=FALSE, cache=TRUE, warning = FALSE, results='hide', tidy=FALSE}
# bring 2-grams into R
unpack2grams <- JSTOR_unpack2grams()
# make a vector of bigrams
twogs <- c("biological determinism", "biological explanation",
           "biological difference", "hormonal determinism")
# calculate abundance of bigrams
require(plyr)
# sum each bigram's count across all documents; returning a bare
# number lets ldply stack the results into a single one-column frame
counts <- ldply(twogs, function(i)
  sum(unpack2grams$bigrams[,
    unpack2grams$bigrams$dimnames$Terms == i]))
rownames(counts) <- twogs
names(counts) <- "count"
```
These terms are not frequent in this archive, so they are of little use with the methods in this package.
```{r, echo=TRUE, message=FALSE, cache=TRUE, warning = FALSE, results='asis'}
require(pander)
set.caption('Frequency of select bigrams')
pandoc.table(counts, style="rmarkdown")
```
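To find which *biological ...* bigrams do occur here (a quick sketch using base R over the bigram matrix's terms):
```{r, echo=TRUE, message=FALSE, cache=TRUE}
# sketch: all bigrams in the archive that start with 'biological'
biol_bigrams <- grep("^biological ",
                     unpack2grams$bigrams$dimnames$Terms, value = TRUE)
head(biol_bigrams, 20)
```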
\newpage
We can look at the correlation of *essentialism* and *biological*. They are positively correlated, as expected, but there are very few data points. This would be more interesting with a larger archive.
```{r, echo=TRUE, message=FALSE, cache=TRUE, warning = FALSE, results='asis', fig.cap="Correlation of *biological* and *essentialism* over time"}
JSTOR_2wordcor(unpack1grams, "essentialism", "biological")
```
\newpage
We can see if there appear to be clusters amongst the documents containing the word *biological*.
```{r, echo=TRUE, message=FALSE, cache=TRUE, warning = FALSE, results='hide', include=FALSE}
biological_clus <- JSTOR_clusterbywords(nouns, 'biological')
```
![k-means clusters of documents containing *biological*](/Users/marwick/Downloads/2013.6.27.j7MTwUWC_OOB/figure/pca.png)
\newpage
We can see what words define cluster membership; check out cluster 10.
```{r, echo=TRUE, message=FALSE, cache=TRUE}
biological_clus$kmeans
```
\newpage
And we can see what words are most responsible for variation along the two axes.
```{r, echo=TRUE, message=FALSE, cache=TRUE}
contrib <- data.frame(biological_clus$PCA$var$contrib)
# x-axis: twenty words with the biggest contributions to Dim.1
row.names(contrib[with(contrib, order(-Dim.1)), ][1:20, ])
# y-axis: twenty words with the biggest contributions to Dim.2
row.names(contrib[with(contrib, order(-Dim.2)), ][1:20, ])
```
So there's a short and quick demonstration of some of the basic functions in the JSTORr package. All the code is included here so the entire analysis can be replicated by anyone with a basic familiarity with R. Since you've already got a topic modelling workflow, I won't bother with that here.
Since this area of inquiry is a bit unfamiliar to me, I'm not going to hazard any interpretations, but I hope there's something intriguing in there for you. The methods in the [JSTORr][id1] package are probably at their most effective with a larger archive of articles over a longer time period. Perhaps if several similar journals were put together we might get some more reliable data.
---------------------
Ben Marwick, June 2013
```{r, echo=FALSE, message=FALSE, eval=FALSE}
# This chunk runs the code and generates the PDF;
# it will not appear in the PDF itself
# Load packages
setwd(wd) # assumes wd has been set earlier in the doc
require(knitr)
require(markdown)
# process .md and .pdf files (including smart punctuation and grey background of code blocks)
filen <- "OOD" # name of this markdown file without suffix
knit(paste0(filen,".md"))
system(paste0("pandoc -s ", paste0(filen,"-out.md"), " -t latex -o ", paste0(filen,".pdf"), " --highlight-style=tango -S"))
```
[id]: http://historyinthecity.blogspot.com/2013/06/lessons-in-antconc-3-with-heather.html
[id1]: https://github.com/UW-ARCHY-textual-macroanalysis-lab/JSTORr
[id2]: http://dfr.jstor.org/