#' A word vector model is a giant matrix of words, where each word is paired with a numeric array that represents
#' the semantic meaning of that word. This is useful because it lets us discover relationships and analogies between
#' words programmatically. The classic example is that "king" minus "man" plus "woman" is most similar to "queen".
# function definition --------------------------------------------------------------------------
# input: character vector of raw lines from the .txt file; output: data.frame with one named column per word
proc_pretrained_vec <- function(p_vec) {

    # initialize space for the values and the names of each word in the vocab
    vals <- vector(mode = "list", length = length(p_vec))
    names <- character(length(p_vec))

    # loop through to gather the values and the name of each word
    for(i in 1:length(p_vec)) {
        if(i %% 1000 == 0) {print(i)}
        this_vec <- p_vec[i]
        this_vec_unlisted <- unlist(strsplit(this_vec, " "))
        this_vec_values <- as.numeric(this_vec_unlisted[-1])  # everything after the first token is a numeric value
        this_vec_name <- this_vec_unlisted[1]                  # the first token is the word itself

        vals[[i]] <- this_vec_values
        names[[i]] <- this_vec_name
    }

    # convert the list of values to a data.frame and attach the words as column names
    glove <- data.frame(vals)
    names(glove) <- names

    return(glove)
}
# using the function -------------------------------------------------------------------------
# here we are reading in the unzipped, raw, GloVe pre-trained word vector file (.txt)
# all you have to change is the file path to where your GloVe file has been unzipped
g6b_300 <- scan(file = "LARGE_FILES_pre_trained/glove.6B.300d.txt", what = "", sep = "\n")

# call the function to convert the raw GloVe vector to a data.frame (extra lines are for wall-time reporting)
t_temp <- Sys.time()
glove.300 <- proc_pretrained_vec(g6b_300)  # this is the actual function call
(t_elap_temp <- paste0(round(as.numeric(Sys.time() - t_temp, units = "mins"), digits = 2), " minutes"))

print(dim(glove.300))
# [1]    300 400000
# NOTES: ------------------------------------------------------------------------------------------
#' I chose the 6-billion-token, 300-dimension-per-word, 400k-vocabulary word vectors, which is why
#' the dimensions of this data.frame are 300 rows by 400k columns.
#'
#' Each column is a different word's numeric vector representation. It can be useful to t(glove.300)
#' to transpose it into a matrix for some calculations, like sim2 from the text2vec package.
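# a minimal sketch of that transpose step ("word_mat" is just an illustrative name):
# rows become words and columns become dimensions (400000 x 300), which is the orientation
# that row-wise similarity functions like text2vec::sim2 expect
word_mat <- t(as.matrix(glove.300))
dim(word_mat)
# [1] 400000    300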
# BONUS MATERIAL: definition for finding similar word vectors ----------------------------------------------
# let's have some fun with this and try out the most common examples
# this section requires the "text2vec" library
# install.packages("text2vec")   # uncomment and execute this if you don't have that package

find_sim_wvs <- function(this_wv, all_wvs, top_n_res = 40) {
    # this_wv will be a numeric vector; all_wvs will be a data.frame with words as columns and dimensions as rows
    require(text2vec)

    this_wv_mat <- matrix(this_wv, ncol = length(this_wv), nrow = 1)
    all_wvs_mat <- as.matrix(all_wvs)

    # sim2 compares rows, so transpose all_wvs_mat if its words aren't already on the rows
    if(dim(this_wv_mat)[[2]] != dim(all_wvs_mat)[[2]]) {
        print("switching dimensions on the all_wvs matrix")
        all_wvs_mat <- t(all_wvs_mat)
    }

    cos_sim <- sim2(x = all_wvs_mat, y = this_wv_mat, method = "cosine", norm = "l2")
    sorted_cos_sim <- sort(cos_sim[,1], decreasing = TRUE)
    return(head(sorted_cos_sim, top_n_res))
}

# try out the function - we're hoping that "queen" will be in the top 5 results here
this_word_vector <- glove.300[['king']] - glove.300[['man']] + glove.300[['woman']]
find_sim_wvs(this_word_vector, glove.300, top_n_res = 5)

# "flock is to geese as ___________ is to bison"  (hoping for "herd")
# funny... "buffalo" tends to gravitate towards the city while "bison" stays with the animal
my_wv <- glove.300[['flock']] - glove.300[['geese']] + glove.300[['buffalo']]  # all cities, because of "Buffalo, NY"
find_sim_wvs(my_wv, glove.300, top_n_res = 10)

my_wv <- glove.300[['flock']] - glove.300[['geese']] + glove.300[['bison']]    # here we go, this gives us the "herd" we're looking for
find_sim_wvs(my_wv, glove.300, top_n_res = 10)
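# one more optional sanity check that isn't an analogy: nearest neighbors of a single word
# ("frog" is just an arbitrary choice here; any word in the 400k vocabulary will work)
find_sim_wvs(glove.300[['frog']], glove.300, top_n_res = 10)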
Hey @PD1994,
Haven't had a chance to test this old code in a while, so results may vary. I know one naive strategy people have used when working with sentences is to take the element-wise average of all of the word vectors within the sentence (so if you're using 300-dimension word vectors, you'd end up with one 300-dimension vector that represents your sentence); there's a rough sketch of that below. I would imagine you could do something similar for bigrams and trigrams. You'd also have to think about at what point (if at all) you want to remove stop words. That decision gets a bit trickier once you start working with bigrams and trigrams.
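For concreteness, here's a rough and untested sketch of that averaging idea, reusing glove.300 and find_sim_wvs from the gist above (the sentence and the very naive whitespace tokenization are just for illustration):

# naive sentence embedding: element-wise average of the word vectors in the sentence
sentence_tokens <- unlist(strsplit(tolower("the king and queen sat down"), " "))
sentence_tokens <- sentence_tokens[sentence_tokens %in% names(glove.300)]   # drop out-of-vocabulary tokens
sentence_wv <- rowMeans(glove.300[, sentence_tokens, drop = FALSE])         # one 300-dimension sentence vector
find_sim_wvs(sentence_wv, glove.300, top_n_res = 10)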
I think the next step up in complexity (and presumably accuracy) would be dipping into sequence-based models like LSTMs and BERT. Those topics go far beyond the simple gist I wrote here, though. Huggingface might be a good resource to check out if you're looking for some state-of-the-art NLP. Hope that helps!
-Taylor
Hi @tjvananne,
thank you very much for your reply, I appreciate any help or hints. Yes, I had already thought about that variant too, but I'm not sure it really gets at the point of my analysis. Also, I want to use bigrams both as input and as output. I have trained a model myself where this works quite well, but how to do it with the pre-trained models is still unclear to me.
Thanks also for your hint about other, more complex options. I have already toyed with the idea of using BERT. However, I have only been using R for a very short time, which makes me worried that I won't be able to implement it. Do you have any helpful resources or code for this?
As an alternative, I have also thought about ELMo, for which I have already found some pre-trained models...
Best regards and thanks again.
Hello, since the data is way too big for my computer, I used g6b_300 <- data.table::fread('glove_s300.txt', data.table = F, encoding = 'UTF-8', header = F) to read it. But when I run glove.300 <- proc_pretrained_vec(g6b_300), R returns this error: Error in strsplit(this_vec, " ") : non-character argument.
There's another topic worth mentioning. I got this data: Wikipedia 2014 + Gigaword 5 (6B tokens, 400K vocab, uncased, 300d vectors, 822 MB download) from https://github.com/stanfordnlp/GloVe. Is that correct?
Edit: searching for answers, I have come to suspect that all columns of g6b_300 would have to be of character type (proc_pretrained_vec calls strsplit, which needs character input, but fread has already parsed the values into separate numeric columns). I will check this out, but it doesn't sound right.
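Edit 2: an alternative I might try, since fread has already split each line into columns, is to skip proc_pretrained_vec entirely and build the data.frame directly (untested):

# V1 holds the words; the remaining 300 columns hold the numeric values
glove.300 <- as.data.frame(t(as.matrix(g6b_300[, -1])))
names(glove.300) <- g6b_300[[1]]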
Hi @tjvananne,
thank you very much for your post. I am still relatively inexperienced in R and am therefore looking for some help. Is there also a way to use the pre-trained model for bigrams and trigrams, and how would one have to change the code for this?
Thanks in advance and best regards