---
title: "Using Gensim in R"
author: "Adam Lauretig"
date: "3/17/2018"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Introduction
[Gensim](https://radimrehurek.com/gensim/) is a powerful Python library for text modeling. It incorporates a variety of models, many of which are not available in R. However, the recently developed [reticulate](https://cran.r-project.org/web/packages/reticulate/index.html) package provides a solution for the R user who wants to dip their toes in without learning Python. It allows the user to call Python code as though it were R code, and to seamlessly pass objects between R and Python. In this document, I will show how to install `gensim`, call it from `reticulate`, estimate word embeddings, and perform vector arithmetic.
# Setup
I use Python 3.6, as distributed with [Anaconda](https://www.anaconda.com/download/#macos). Once Anaconda is installed, I install gensim at the command line using pip. To do this, I type
```{bash, eval = FALSE}
pip install gensim
```
at the terminal.
I assume you are using recent versions of R and RStudio. To install the reticulate package from CRAN:
```{r, eval = FALSE}
install.packages("reticulate")
```
We'll also use the quanteda and stringr packages; to install them:
```{r, eval = FALSE}
install.packages(c("quanteda", "stringr"))
```
# Loading Gensim
Importing gensim with reticulate is very similar to loading an R package:
```{r, eval = TRUE}
library(reticulate)
gensim <- import("gensim") # import the gensim library
Word2Vec <- gensim$models$Word2Vec # Extract the Word2Vec model
multiprocessing <- import("multiprocessing") # For parallel processing
```
In gensim, we extract the `Word2Vec` object from the `models` module, using the `$` operator. Thanks to reticulate, the object-oriented nature of Python is translated into something R users can recognize, and we can treat `Word2Vec` as we would any other R function.
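If `import("gensim")` fails, check which Python installation reticulate is bound to, since gensim must be installed for that interpreter. A minimal check, using standard reticulate helpers:
```{r, eval = FALSE}
py_config() # shows the Python binary reticulate is bound to
py_module_available("gensim") # TRUE if gensim is importable from that Python
```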
# Prepping the data
As an example, we'll model the text of inauguration speeches from the `quanteda` package. We first extract the text we'll use into a character vector, which has 58 elements. We'll lowercase the text, remove punctuation and newlines, and then split each speech into a character vector of tokens, yielding a list with one element per speech.
```{r, eval = TRUE}
library(quanteda)
library(stringr)
txt_to_use <- quanteda::data_corpus_inaugural$documents$texts
# (in quanteda >= 2.0, the corpus internals changed;
# use as.character(data_corpus_inaugural) instead)
txt_to_use <- tolower(txt_to_use) # lowercase everything
txt_to_use <- stringr::str_replace_all(txt_to_use, "[[:punct:]]", "") # strip punctuation
txt_to_use <- stringr::str_replace_all(txt_to_use, "\n", " ") # newlines to spaces
txt_to_use <- str_split(txt_to_use, " ") # one vector of tokens per speech
```
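It's worth a quick sanity check that the data are in the shape gensim expects: a list with one tokenized document per element.
```{r, eval = FALSE}
length(txt_to_use) # one element per speech; 58 here
head(txt_to_use[[1]], 10) # the first ten tokens of the first speech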
# Creating Word2vec
In Python, unlike R, we create the model we want to run *before* we run it, supplying it with the various parameters it will take. We'll create an object called `basemodel`, which uses the skip-gram with negative sampling implementation of *word2vec*. We'll use a window size of 5, considering words within five words on each side of a target word. We'll make 3 sweeps through the data, but in practice, you should do more. We'll tell gensim to use the skip-gram architecture (`sg = 1`) with negative sampling (`negative = 1`) rather than the hierarchical softmax (`hs = 0`). Finally, we'll use a dimensionality of 25 for the embeddings, but again, in practice, you should probably use more.
```{r, eval = TRUE}
basemodel <- Word2Vec(
  workers = 1L, # using 1 core
  window = 5L, # five words on either side of the target
  iter = 3L, # iter = sweeps of SGD through the data; more is better
  sg = 1L, # 1 = skip-gram architecture
  hs = 0L, negative = 1L, # negative sampling instead of hierarchical softmax
  size = 25L # dimensionality of the embeddings
)
```
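These argument names match the gensim releases current when this was written. In gensim 4.0 and later, `size` was renamed `vector_size` and `iter` was renamed `epochs` (see the comments at the end of this document), so the equivalent call would look roughly like this:
```{r, eval = FALSE}
# gensim >= 4.0 renamed two of the arguments used above:
# size -> vector_size, iter -> epochs
basemodel <- Word2Vec(
  workers = 1L,
  window = 5L,
  epochs = 3L,
  sg = 1L,
  hs = 0L, negative = 1L,
  vector_size = 25L
)
```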
# Training the model
To train the model, we'll first build a vocabulary from the inaugural speeches we cleaned earlier. We'll then call the `train` method on `basemodel` with the `$` operator, the way you would call a function in R.
```{r, eval=TRUE}
basemodel$build_vocab(sentences = txt_to_use)
basemodel$train(
  sentences = txt_to_use,
  epochs = basemodel$iter,
  total_examples = basemodel$corpus_count)
```
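A quick check that training saw what we expect, using the same model attributes that appear later in this document:
```{r, eval = FALSE}
length(basemodel$wv$index2word) # number of words in the vocabulary
basemodel$corpus_count # number of documents gensim counted; 58 here
```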
# Examining the Results
We can examine the output from the model; asking for a single word's embedding returns a vector of length 25.
```{r, eval = TRUE}
basemodel$wv$word_vec("united") # in gensim >= 4.0: basemodel$wv$get_vector("united")
```
But that isn't particularly informative on its own. Instead, thanks to `reticulate`'s ability to pass objects between R and Python, we can bring the vectors into R and calculate cosine similarity (a measure of word similarity).
```{r, eval = TRUE}
library(Matrix)
embeds <- basemodel$wv$syn0 # the embedding matrix, one row per word
# (in gensim >= 4.0, syn0 was renamed: basemodel$wv$vectors)
rownames(embeds) <- basemodel$wv$index2word
# cosine similarity between one word vector and every row of the
# embedding matrix, returning the ten most similar words
closest_vector <- function(vec1, mat1){
  vec1 <- Matrix(vec1, nrow = 1, ncol = length(vec1))
  mat1 <- Matrix(mat1)
  mat_magnitudes <- rowSums(mat1^2) # squared norm of each word vector
  vec_magnitudes <- rowSums(vec1^2) # squared norm of the query vector
  sim <- (t(tcrossprod(vec1, mat1) /
    (sqrt(tcrossprod(vec_magnitudes, mat_magnitudes)))))
  sim2 <- matrix(sim, dimnames = list(rownames(sim)))
  w <- sim2[order(-sim2), , drop = FALSE] # sort, most similar first
  w[1:10, ]
}
closest_vector(embeds["united", ], embeds)
closest_vector(embeds["united", ] + embeds["states", ], embeds)
```
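As an aside, gensim's `KeyedVectors` object has its own built-in nearest-neighbour query, `most_similar`, which performs an analogous cosine-similarity lookup on the Python side and should give results comparable to `closest_vector` above:
```{r, eval = FALSE}
basemodel$wv$most_similar("united", topn = 10L)
# vector arithmetic via positive/negative word lists
basemodel$wv$most_similar(positive = list("united", "states"), topn = 10L)
```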
This result isn't bad for such a small corpus with so few embedding dimensions. We can even do more complicated vector arithmetic:
```{r}
closest_vector(embeds["american", ] - embeds["war", ], embeds)
```
# Conclusion
Overall, this was an introduction to `reticulate` and to estimating word embeddings with gensim: I showed how to prep text, estimate embeddings, and perform vector arithmetic on those embeddings.
@proverbs2323 commented Aug 31, 2020

Thanks for the good work, but when I run this chunk:

```r
basemodel$build_vocab(sentences = txt_to_use)
basemodel$train(
  sentences = txt_to_use,
  epochs = basemodel$iter,
  total_examples = basemodel$corpus_count)
```

I got this error message:

```
Error in py_call_impl(callable, dots$args, dots$keywords) :
  RuntimeError: you must first build vocabulary before training the model
```

@emoro commented May 28, 2021

Thanks for this script. @proverbs2323, I solved that problem by removing the `sentences =` keyword in the `build_vocab` and `train` functions. Apparently the new version of gensim has changed the names of the parameters.
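If I'm reading the gensim 4.0 changes right (`sentences` became `corpus_iterable` in `build_vocab` and `train`, and the model's `iter` attribute became `epochs`), the training chunk becomes something like:

```r
basemodel$build_vocab(txt_to_use)
basemodel$train(
  txt_to_use,
  epochs = basemodel$epochs,
  total_examples = basemodel$corpus_count)
```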
