- State “DONE” from “TODO” [2016-12-13 Tue 09:34]
cite:goldberg-2014-word2-explain
- The words and the contexts come from distinct vocabularies, so that the vector associated with the word =dog= is different from the vector associated with the context =dog=.
- Maximizing the objective will result in good embeddings \(v_w\), \(∀ w ∈ V\), in the sense that similar words will have similar vectors.
- To prevent all the vectors from having the same value, one way is to present the model with some \((w, c)\) pairs for which \(p(D=1| w, c; θ)\) must be low, i.e. pairs that are not in the data. This is achieved by negative sampling (the resulting objective is written out, with a sketch, at the end of this note).
- Unlike the original Skip-gram model, this formulation does not model \(p(c| w)\) but instead models a quantity related to the joint distribution of \(w\) and \(c\).
- The model is non-convex when the word and context representations are learned jointly. If we fix the word representations and learn only the context representations, or fix the context representations and learn only the word representations, the model reduces to logistic regression, which is convex (see the logistic-regression sketch at the end of this note).
- Sampling details (a preprocessing sketch follows this list):
  - The parameter \(k\) denotes the maximal window size. For each word in the corpus, an actual window size \(k'\) is sampled uniformly from \(1, …, k\).
  - Words appearing fewer than =min-count= times are not considered as either words or contexts.
  - Frequent words are down-sampled, because frequent words are less informative. Importantly, these words are removed from the text before generating the contexts, which has the effect of increasing the effective window size for certain words. This gives another explanation for the down-sampling's effectiveness: the effective window grows to include context words which are both content-full and linearly far away from the focus word, thus making the similarities more topical.
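A minimal sketch of these preprocessing steps, assuming the corpus is given as a plain list of tokens. The function and parameter names are mine, and the discard probability \(1 - \sqrt{t/f(w)}\) is the sub-sampling formula reported for word2vec; the released implementation uses slightly different constants, so treat the details as illustrative rather than as the cited note's prescription.

#+begin_src python
import random
from collections import Counter

def build_training_pairs(tokens, max_window=5, min_count=5, subsample_t=1e-5, seed=0):
    """Sketch of word2vec-style preprocessing: min-count filtering,
    down-sampling of frequent words, and dynamic window sizes."""
    rng = random.Random(seed)
    counts = Counter(tokens)
    total = len(tokens)

    # 1. Words appearing fewer than min_count times are dropped entirely,
    #    so they serve neither as words nor as contexts.
    kept = [w for w in tokens if counts[w] >= min_count]

    # 2. Frequent words are down-sampled *before* contexts are generated,
    #    with discard probability 1 - sqrt(t / f(w)).
    def keep(w):
        f = counts[w] / total
        p_discard = max(0.0, 1.0 - (subsample_t / f) ** 0.5)
        return rng.random() >= p_discard

    kept = [w for w in kept if keep(w)]

    # 3. For each focus word, sample the actual window size k' uniformly
    #    from 1..max_window (randint is inclusive on both ends), then emit
    #    (word, context) pairs.
    pairs = []
    for i, w in enumerate(kept):
        k = rng.randint(1, max_window)
        for j in range(max(0, i - k), min(len(kept), i + k + 1)):
            if j != i:
                pairs.append((w, kept[j]))
    return pairs
#+end_src

Because the down-sampling removes tokens before the windows are generated, surviving words that were far apart in the original text can fall inside the same window, which is the effective-window-size effect noted above.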
The distributional hypothesis states that words in similar contexts have similar meanings. The objective above clearly tries to increase the quantity \(v_w^{\top} v_c\) for good word-context pairs, and to decrease it for bad ones. Intuitively, this means that words which share many contexts will be similar to each other (note also that contexts sharing many words will be similar to each other). This is, however, very hand-wavy.
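Making the objective explicit helps here. The negative-sampling objective derived in the note cited above takes the form

\[ \arg\max_{\theta} \sum_{(w,c) \in D} \log \sigma(v_c \cdot v_w) + \sum_{(w,c) \in D'} \log \sigma(-v_c \cdot v_w) \]

where \(D\) is the set of observed word-context pairs and \(D'\) the set of sampled negative pairs. Below is a minimal numpy sketch of this quantity; the matrix names =W= and =C=, the toy data, and the uniform negative sampling are illustrative assumptions (word2vec draws negatives from a smoothed unigram distribution).

#+begin_src python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_objective(W, C, positive_pairs, negative_pairs):
    """Value of the negative-sampling objective.

    W[i] holds the word vector v_w of word i; C[j] holds the context vector
    v_c of context j.  The two matrices are distinct, matching the separate
    word and context vocabularies above.
    """
    pos = sum(np.log(sigmoid(W[w] @ C[c])) for w, c in positive_pairs)
    neg = sum(np.log(sigmoid(-(W[w] @ C[c]))) for w, c in negative_pairs)
    return pos + neg

# Toy usage: random vectors, a few observed pairs (D), and k negatives per
# observed pair (D'), drawn uniformly here for simplicity.
rng = np.random.default_rng(0)
n_words, n_contexts, dim, k = 10, 10, 4, 2
W = rng.normal(scale=0.1, size=(n_words, dim))
C = rng.normal(scale=0.1, size=(n_contexts, dim))
D = [(0, 1), (0, 2), (3, 4)]
D_neg = [(w, int(rng.integers(n_contexts))) for w, _ in D for _ in range(k)]
print(sgns_objective(W, C, D, D_neg))
#+end_src

Maximizing this pushes \(v_w^{\top} v_c\) up for pairs in \(D\) and down for pairs in \(D'\), which is exactly the behaviour described in the paragraph above.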
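The convexity remark above can also be made concrete: with the context vectors held fixed, each observed or sampled pair \((w, c)\) becomes a labelled example whose feature vector is \(v_c\), and learning \(v_w\) is an ordinary logistic regression. A small sketch, using scikit-learn purely for illustration (the estimator, its default regularisation, and the toy indices are assumptions, not anything prescribed in the cited note):

#+begin_src python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
dim = 4
C = rng.normal(size=(20, dim))          # fixed context vectors

# Training data for ONE word w: contexts it was observed with (label 1)
# and sampled negative contexts (label 0).
pos_idx = [1, 2, 5, 7]
neg_idx = [0, 3, 4, 6, 8, 9]
X = C[pos_idx + neg_idx]                # features are the fixed v_c
y = [1] * len(pos_idx) + [0] * len(neg_idx)

# With C fixed, fitting v_w is a convex logistic-regression problem
# (scikit-learn's default L2 penalty keeps it convex); the learned
# coefficient vector plays the role of v_w.
clf = LogisticRegression(fit_intercept=False).fit(X, y)
v_w = clf.coef_[0]
print(v_w)
#+end_src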
cite:turian-2010-word-repres
cite:levy-2014-depend-based
bibliographystyle:unsrt bibliography:references.bib