---
title: "Using Gensim in R"
author: "Adam Lauretig"
date: "3/17/2018"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Introduction

[Gensim](https://radimrehurek.com/gensim/) is a powerful Python library for text modeling. It incorporates a variety of models, many of which are not available in R. However, the recently developed [reticulate](https://cran.r-project.org/web/packages/reticulate/index.html) package provides a solution for the R user who wants to dip their toes in without learning Python. It allows the user to call Python code that behaves like R code, and to seamlessly pass R and Python objects back and forth. In this document, I will show how to install `gensim`, call it from `reticulate`, estimate word embeddings, and perform vector arithmetic.
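As a minimal illustration of that round trip (a sketch assuming NumPy is available, as it is with an Anaconda install; this is not part of the original workflow):

```{r, eval = FALSE}
library(reticulate)
np <- import("numpy")     # import a Python module into R
x <- np$array(c(1, 2, 3)) # the R vector is converted to a NumPy array...
mean(x)                   # ...and the result converts back, usable as ordinary R data
```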
# Setup

I use Python 3.6, as distributed with [Anaconda](https://www.anaconda.com/download/#macos). Once Anaconda is installed, I install Gensim from the command line, using pip. To do this, I type

```{bash, eval = FALSE}
pip install gensim
```

at the terminal.
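To confirm the install succeeded before going further, reticulate can check for the module from within R (a quick check, not part of the original workflow):

```{r, eval = FALSE}
reticulate::py_module_available("gensim") # TRUE if Python and gensim are visible to R
```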
I assume you are using recent versions of R and RStudio. To install the reticulate package from CRAN:
```{r, eval = FALSE}
install.packages("reticulate")
```
We'll also use the quanteda and stringr packages. To install them:
```{r, eval = FALSE}
install.packages(c("quanteda", "stringr"))
```
# Loading Gensim

Importing Gensim with reticulate is very similar to loading an R package:

```{r, eval = TRUE}
library(reticulate)
gensim <- import("gensim") # import the gensim library
Word2Vec <- gensim$models$Word2Vec # extract the Word2Vec model
multiprocessing <- import("multiprocessing") # for parallel processing
```
In gensim, we extract the `Word2Vec` object from the `models` object, using the `$` operator. Thanks to reticulate, the object-oriented nature of Python is translated into something R users will recognize, and we can treat `Word2Vec` as we would any other R function.
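As an optional aside, the imported object is still a Python class underneath, and reticulate can show its underlying documentation directly from the R console:

```{r, eval = FALSE}
class(Word2Vec)   # a wrapped Python object, e.g. "python.builtin.type"
py_help(Word2Vec) # browse the underlying Python docstring from R
```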
# Prepping the data

As an example, we'll use the text of inauguration speeches from the `quanteda` package. We first extract the text into a character vector with 58 elements, one per speech. We'll lowercase all of the text, remove punctuation, and then split each speech into a vector of tokens.
```{r, eval = TRUE}
library(quanteda)
library(stringr)
txt_to_use <- quanteda::data_corpus_inaugural$documents$texts # one string per speech
txt_to_use <- tolower(txt_to_use) # lowercase everything
txt_to_use <- stringr::str_replace_all(txt_to_use, "[[:punct:]]", "") # strip punctuation
txt_to_use <- stringr::str_replace_all(txt_to_use, "\n", " ") # newlines to spaces
txt_to_use <- str_split(txt_to_use, " ") # a list of token vectors, one per speech
```
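A quick sanity check (not in the original, but cheap) confirms the structure gensim expects, a list with one character vector of tokens per document:

```{r, eval = FALSE}
length(txt_to_use)        # 58 speeches
head(txt_to_use[[1]], 10) # the first few tokens of the first address
```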
# Creating Word2Vec

In Python, unlike R, we create the model we want to run *before* we run it, supplying the various parameters it will take. We'll create an object called `basemodel`, which uses the skip-gram with negative sampling implementation of *word2vec*. We'll use a window size of 5, considering words within five words on either side of a target word. We'll do 3 sweeps through the data, though in practice you should do more. We'll tell gensim to use skip-gram (`sg`) with negative sampling (`negative`), rather than the hierarchical softmax. Finally, we'll use a dimensionality of 25 for the embeddings, but again, in practice you should probably use more.
```{r, eval= TRUE}
basemodel <- Word2Vec(
  workers = 1L, # using 1 core
  window = 5L, # 5 words on either side of the target word
  iter = 3L, # iter = sweeps of SGD through the data; more is better
  sg = 1L, # skip-gram rather than CBOW
  hs = 0L, negative = 1L, # negative sampling rather than hierarchical softmax
  size = 25L # dimensionality of the embedding vectors
)
```
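Because `basemodel` is a wrapped Python object, the parameters we just set can be read back with the same `$` operator (a small check, not in the original):

```{r, eval = FALSE}
basemodel$window # 5
basemodel$iter   # 3
```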
# Training the model

To train the model, we'll first build a vocabulary from the inaugural speeches we cleaned earlier. We'll then call the `train` method on `basemodel`, the way you would call a function in R.
```{r, eval=TRUE}
basemodel$build_vocab(sentences = txt_to_use)
basemodel$train(
  sentences = txt_to_use,
  epochs = basemodel$iter,
  total_examples = basemodel$corpus_count)
```
# Examining the Results

We can examine the output from the model; for a single word, this produces a vector of length 25.

```{r, eval = TRUE}
basemodel$wv$word_vec("united")
```
A raw vector, however, isn't particularly informative on its own. Instead, thanks to `reticulate`'s ability to communicate between R and Python, we can bring the vectors into R and calculate cosine similarity (a measure of word similarity).
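For reference, the quantity being computed between a query vector $u$ and each row $m_i$ of the embedding matrix is

$$\text{sim}(u, m_i) = \frac{u \cdot m_i}{\lVert u \rVert \, \lVert m_i \rVert},$$

which is what the `tcrossprod` lines in the function below implement.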
```{r, eval = TRUE}
library(Matrix)
embeds <- basemodel$wv$syn0 # the matrix of word vectors, one row per word
rownames(embeds) <- basemodel$wv$index2word
# function for cosine similarity: returns the ten words closest to vec1
closest_vector <- function(vec1, mat1){
  vec1 <- Matrix(vec1, nrow = 1, ncol = length(vec1))
  mat1 <- Matrix(mat1)
  mat_magnitudes <- rowSums(mat1^2)
  vec_magnitudes <- rowSums(vec1^2)
  sim <- (t(tcrossprod(vec1, mat1)/
    (sqrt(tcrossprod(vec_magnitudes, mat_magnitudes)))))
  sim2 <- matrix(sim, dimnames = list(rownames(sim)))
  w <- sim2[order(-sim2), , drop = FALSE]
  w[1:10, ]
}
closest_vector(embeds["united", ], embeds)
closest_vector(embeds["united", ] + embeds["states", ], embeds)
```
This result isn't bad for such a small corpus with relatively few vectors. We can even do more complicated vector arithmetic:

```{r}
closest_vector(embeds["american", ] - embeds["war", ], embeds)
```
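The classic word2vec demonstrations combine subtraction and addition in a single query; the same pattern works here, though the word choices below are purely illustrative and results on 58 speeches will be noisy:

```{r, eval = FALSE}
# e.g., "government" minus "law" plus "peace"; illustrative words only
closest_vector(embeds["government", ] - embeds["law", ] + embeds["peace", ], embeds)
```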
# Conclusion

Overall, this is an introduction to `reticulate`, and to estimating word embeddings with gensim. I showed how to prep text, estimate embeddings, and perform vector arithmetic on these embeddings.
Comment: Thanks for this script. @proverbs2323, I solved that problem by removing the `sentences =` argument in the `build_vocab()` and `train()` calls. Apparently newer versions of gensim have changed the parameter names.