- State “DONE” from “TODO” [2016-12-13 Tue 09:34]
cite:goldberg-2014-word2-explain
- The words and the contexts come from distinct vocabularies, so that the vector associated with the word =dog= is different from the vector associated with the context =dog=.
- Maximizing the objective will result in good embeddings \(v_w\), \(∀ w ∈ V\), in the sense that similar words will have similar vectors.
- To prevent all the vectors from having the same value, one way is to present the model with some \((w, c)\) pairs for which \(p(D=1| w, c; θ)\) must be low, i.e. pairs that are not in the data. This is achieved by negative sampling (the resulting objective is written out, with a sketch, at the end of this note).
- Unlike the original Skip-gram model, this formulation does not model \(p(c| w)\) but instead models a quantity related to the joint distribution of \(w\) and \(c\).
- The model is non-convex when the word and context representations are learned jointly. If we fix the word representations and learn only the context representations, or fix the context representations and learn only the word representations, the model reduces to logistic regression, which is convex (see the logistic-regression sketch at the end of this note).
- Sampling details (a preprocessing sketch follows this list):
  - The parameter \(k\) denotes the maximal window size. For each word in the corpus, an actual window size \(k'\) is sampled uniformly from \(1, …, k\).
  - Words appearing fewer than =min-count= times are not considered as either words or contexts.
  - Frequent words are down-sampled, because frequent words are less informative. Importantly, these words are removed from the text before generating the contexts, which has the effect of increasing the effective window size for certain words. This gives another explanation for the down-sampling's effectiveness: the effective window grows to include context words which are both content-full and linearly far away from the focus word, thus making the similarities more topical.
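A minimal sketch of these preprocessing steps, assuming the corpus is given as a plain list of tokens. The function and parameter names are mine, and the discard probability \(1 - \sqrt{t/f(w)}\) is the sub-sampling formula reported for word2vec; the released implementation uses slightly different constants, so treat the details as illustrative rather than as the cited note's prescription.

#+begin_src python
import random
from collections import Counter

def build_training_pairs(tokens, max_window=5, min_count=5, subsample_t=1e-5, seed=0):
    """Sketch of word2vec-style preprocessing: min-count filtering,
    down-sampling of frequent words, and dynamic window sizes."""
    rng = random.Random(seed)
    counts = Counter(tokens)
    total = len(tokens)

    # 1. Words appearing fewer than min_count times are dropped entirely,
    #    so they serve neither as words nor as contexts.
    kept = [w for w in tokens if counts[w] >= min_count]

    # 2. Frequent words are down-sampled *before* contexts are generated,
    #    with discard probability 1 - sqrt(t / f(w)).
    def keep(w):
        f = counts[w] / total
        p_discard = max(0.0, 1.0 - (subsample_t / f) ** 0.5)
        return rng.random() >= p_discard

    kept = [w for w in kept if keep(w)]

    # 3. For each focus word, sample the actual window size k' uniformly
    #    from 1..max_window (randint is inclusive on both ends), then emit
    #    (word, context) pairs.
    pairs = []
    for i, w in enumerate(kept):
        k = rng.randint(1, max_window)
        for j in range(max(0, i - k), min(len(kept), i + k + 1)):
            if j != i:
                pairs.append((w, kept[j]))
    return pairs
#+end_src

Because the down-sampling removes tokens before the windows are generated, surviving words that were far apart in the original text can fall inside the same window, which is the effective-window-size effect noted above.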
The distributional hypothesis states that words in similar contexts have similar meanings. The objective above clearly tries to increase the quantity \(v_w^{\top} v_c\) for good word-context pairs, and to decrease it for bad ones. Intuitively, this means that words which share many contexts will be similar to each other (note also that contexts sharing many words will be similar to each other). This is, however, very hand-wavy.
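Making the objective explicit helps here. The negative-sampling objective derived in the note cited above takes the form

\[ \arg\max_{\theta} \sum_{(w,c) \in D} \log \sigma(v_c \cdot v_w) + \sum_{(w,c) \in D'} \log \sigma(-v_c \cdot v_w) \]

where \(D\) is the set of observed word-context pairs and \(D'\) the set of sampled negative pairs. Below is a minimal numpy sketch of this quantity; the matrix names =W= and =C=, the toy data, and the uniform negative sampling are illustrative assumptions (word2vec draws negatives from a smoothed unigram distribution).

#+begin_src python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_objective(W, C, positive_pairs, negative_pairs):
    """Value of the negative-sampling objective.

    W[i] holds the word vector v_w of word i; C[j] holds the context vector
    v_c of context j.  The two matrices are distinct, matching the separate
    word and context vocabularies above.
    """
    pos = sum(np.log(sigmoid(W[w] @ C[c])) for w, c in positive_pairs)
    neg = sum(np.log(sigmoid(-(W[w] @ C[c]))) for w, c in negative_pairs)
    return pos + neg

# Toy usage: random vectors, a few observed pairs (D), and k negatives per
# observed pair (D'), drawn uniformly here for simplicity.
rng = np.random.default_rng(0)
n_words, n_contexts, dim, k = 10, 10, 4, 2
W = rng.normal(scale=0.1, size=(n_words, dim))
C = rng.normal(scale=0.1, size=(n_contexts, dim))
D = [(0, 1), (0, 2), (3, 4)]
D_neg = [(w, int(rng.integers(n_contexts))) for w, _ in D for _ in range(k)]
print(sgns_objective(W, C, D, D_neg))
#+end_src

Maximizing this pushes \(v_w^{\top} v_c\) up for pairs in \(D\) and down for pairs in \(D'\), which is exactly the behaviour described in the paragraph above.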
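The convexity remark above can also be made concrete: with the context vectors held fixed, each observed or sampled pair \((w, c)\) becomes a labelled example whose feature vector is \(v_c\), and learning \(v_w\) is an ordinary logistic regression. A small sketch, using scikit-learn purely for illustration (the estimator, its default regularisation, and the toy indices are assumptions, not anything prescribed in the cited note):

#+begin_src python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
dim = 4
C = rng.normal(size=(20, dim))          # fixed context vectors

# Training data for ONE word w: contexts it was observed with (label 1)
# and sampled negative contexts (label 0).
pos_idx = [1, 2, 5, 7]
neg_idx = [0, 3, 4, 6, 8, 9]
X = C[pos_idx + neg_idx]                # features are the fixed v_c
y = [1] * len(pos_idx) + [0] * len(neg_idx)

# With C fixed, fitting v_w is a convex logistic-regression problem
# (scikit-learn's default L2 penalty keeps it convex); the learned
# coefficient vector plays the role of v_w.
clf = LogisticRegression(fit_intercept=False).fit(X, y)
v_w = clf.coef_[0]
print(v_w)
#+end_src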
cite:turian-2010-word-repres
cite:levy-2014-depend-based
bibliographystyle:unsrt bibliography:references.bib