- Introduces techniques to learn word vectors from large text datasets.
- Can be used to find similar words (semantically, syntactically, etc.).
- Link to the paper
- Link to open source implementation
- Computational complexity is defined in terms of the number of parameters accessed during model training.
- Proportional to E*T*Q
- E - Number of training epochs
- T - Number of words in training set
- Q - Depends on the model architecture (a rough worked example of the total cost follows below)
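A minimal sketch of how the E*T*Q cost behaves; all of the values below are hypothetical placeholders rather than figures from the paper, and the per-model Q formulas are given in the bullets that follow.

```python
# Hypothetical illustration of the E * T * Q training cost.
# The values below are made-up examples, not figures from the paper.

def training_cost(epochs: int, train_words: int, q_per_example: int) -> int:
    """Total parameter accesses during training, proportional to E * T * Q."""
    return epochs * train_words * q_per_example

E = 3              # number of training epochs
T = 1_000_000      # number of words in the training set (hypothetical)
Q = 50_000         # per-example cost; depends on the model architecture (see below)

print(f"parameter accesses ~ {training_cost(E, T, Q):,}")
```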
- Feedforward Neural Net Language Model (NNLM): a probabilistic model with input, projection, hidden and output layers.
- Input layer encodes the N previous words using 1-of-V encoding (V is the vocabulary size).
- Input layer is projected to a projection layer P with dimensionality N*D (D is the dimensionality of each word vector).
- Hidden layer (of size H) is used to compute the probability distribution over all V words in the vocabulary.
- Complexity per training example: Q = N*D + N*D*H + H*V
- Q can be reduced by using hierarchical softmax with the vocabulary stored as a Huffman binary tree, which cuts the dominant H*V term to roughly H*log2(V) (compared in the sketch below).
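A rough back-of-the-envelope comparison of the NNLM's per-example cost with a full softmax versus a hierarchical softmax over a Huffman-coded vocabulary; the parameter values are illustrative, not the paper's experimental settings.

```python
import math

# Illustrative sizes only (not the paper's experimental settings).
N = 10          # previous words used as context
D = 500         # dimensionality of the word vectors
H = 500         # hidden layer size
V = 1_000_000   # vocabulary size

# Full softmax output layer: the H*V term dominates.
q_full = N * D + N * D * H + H * V

# Hierarchical softmax over a Huffman-coded vocabulary replaces the
# H*V term with roughly H*log2(V) output evaluations.
q_hier = N * D + N * D * H + H * math.log2(V)

print(f"full softmax:         Q ~ {q_full:,.0f}")
print(f"hierarchical softmax: Q ~ {q_hier:,.0f}")
```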
- Recurrent Neural Net Language Model (RNNLM): similar to the NNLM minus the projection layer; it has only input, hidden and output layers.
- Complexity per training example: Q = H*H + H*V (the word representations have the same dimensionality as the hidden layer H; one step is sketched below).
- Hierarchical softmax and Huffman tree can be used here as well.
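A minimal numpy sketch of a single RNNLM step, showing where the H*H (recurrent) and H*V (output) terms come from; the sizes and randomly initialized weights are purely illustrative.

```python
import numpy as np

H, V = 100, 10_000          # hidden size and vocabulary size (illustrative)
rng = np.random.default_rng(0)

W_xh = rng.normal(scale=0.1, size=(H, H))   # current word vector -> hidden (word vectors have size H here)
W_hh = rng.normal(scale=0.1, size=(H, H))   # previous hidden state -> hidden: the H*H term
W_hy = rng.normal(scale=0.1, size=(V, H))   # hidden -> output over V words: the H*V term

def rnnlm_step(x_t, h_prev):
    """One step: new hidden state (short-term memory) and scores over the vocabulary."""
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev)   # nonlinear hidden layer
    scores = W_hy @ h_t                         # ~H*log2(V) instead, with hierarchical softmax
    return h_t, scores

h, s = rnnlm_step(rng.normal(size=H), np.zeros(H))
```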
- Nonlinear hidden layer causes most of the complexity.
- NNLMs can be successfully trained in two steps:
- Learn continuous word vectors using simple models.
- N-gram NNLM trained over the word vectors.
- Continuous Bag-of-Words (CBOW) Model: similar to the feedforward NNLM.
- No nonlinear hidden layer.
- Projection layer shared for all words and order of words does not influence projection.
- Log-linear classifier uses a window of words to predict the middle word.
- Complexity per training example: Q = N*D + D*log2(V) (a forward pass is sketched below).
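A minimal numpy sketch of a CBOW prediction under these assumptions: a toy vocabulary size, randomly initialized weights, and a full softmax where the paper uses hierarchical softmax.

```python
import numpy as np

V, D = 5_000, 100                 # toy vocabulary size and vector dimensionality
rng = np.random.default_rng(0)

embeddings = rng.normal(scale=0.1, size=(V, D))   # shared projection layer
output_w   = rng.normal(scale=0.1, size=(V, D))   # output weights (full softmax here)

def cbow_predict(context_ids):
    """Average the context word vectors (order is ignored) and predict the middle word."""
    h = embeddings[context_ids].mean(axis=0)      # projection + averaging: the N*D part of Q
    scores = output_w @ h                         # hierarchical softmax would make this ~D*log2(V)
    p = np.exp(scores - scores.max())
    return p / p.sum()

# Eight surrounding word ids (four before, four after) predict the middle word.
probs = cbow_predict(np.array([10, 42, 7, 99, 3, 250, 18, 4]))
print(probs.argmax())
```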
- Continuous Skip-gram Model: similar to the Continuous Bag-of-Words Model but uses the middle word of the window to predict the remaining words in the window.
- Distant words are given less weight by sampling fewer distant words.
- Complexity per training example: Q = C*(D + D*log2(V)), where C is the maximum distance of the words from the middle word.
- Given C, for each training word a random R is chosen in the range 1 to C.
- R words from history (previous words) and R words from future (next words) are marked as target outputs, and the model is trained (pair generation is sketched below).
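A short sketch of how skip-gram training pairs could be generated with the randomly reduced window size R; the sentence and parameter values are made up for illustration.

```python
import random

def skipgram_pairs(words, C=5, seed=0):
    """Yield (center, context) pairs. Drawing R in [1, C] per center word means
    distant words are sampled less often, giving them less weight."""
    rng = random.Random(seed)
    for i, center in enumerate(words):
        R = rng.randint(1, C)                                     # random window size for this word
        for j in range(max(0, i - R), min(len(words), i + R + 1)):
            if j != i:
                yield center, words[j]                            # center word predicts this context word

sentence = "the quick brown fox jumps over the lazy dog".split()
print(list(skipgram_pairs(sentence))[:5])
```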
- Skip-gram beats all other models on semantic accuracy tasks (e.g. relating Athens with Greece).
- Continuous Bag-of-Words Model outperforms the other models on syntactic accuracy tasks (e.g. relating great with greater), with skip-gram just behind in performance.
- Skip-gram architecture combined with RNNLMs outperforms RNNLMs (and other models) on the Microsoft Research Sentence Completion Challenge.
- The model can learn relationships like "Queen is to King as Woman is to Man". This allows algebraic operations like Vector("King") - Vector("Man") + Vector("Woman"), whose result is closest to Vector("Queen") (sketched below).
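A minimal sketch of that vector arithmetic using cosine similarity; the vectors here are random placeholders, so only real trained vectors would actually return "queen".

```python
import numpy as np

rng = np.random.default_rng(0)
words = ["king", "queen", "man", "woman", "athens", "greece"]
vectors = {w: rng.normal(size=50) for w in words}   # placeholders for trained word vectors

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def nearest(target, exclude):
    """Word whose vector is closest (by cosine similarity) to the target vector."""
    return max((w for w in vectors if w not in exclude),
               key=lambda w: cosine(vectors[w], target))

# Vector("King") - Vector("Man") + Vector("Woman") should land nearest to Vector("Queen").
query = vectors["king"] - vectors["man"] + vectors["woman"]
print(nearest(query, exclude={"king", "man", "woman"}))
```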