- This paper pre-dated papers like GloVe and Word2Vec and proposed an architecture that:
- combines local and global context while learning word embeddings to capture word semantics.
- learns multiple embeddings per word to account for homonymy and polysemy.
- Link to the paper
- Given a word sequence s (local context) and a document d in which the sequence occurs (global context), learn word representations while learning to discriminate the correct last word in s from other words.
- g(s, d) - scoring function giving the likelihood of the correct sequence.
- g(s^w, d) - scoring function giving the likelihood of s with its last word replaced by a word w.
- Objective - g(s, d) > g(s^w, d) + 1 for any other word w, i.e. the correct sequence should outscore any corrupted one by a margin of 1 (sketched below).
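To make the objective concrete, here is a minimal sketch of the margin-1 (hinge) ranking loss, assuming g is available as a callable scoring function; the helper names, the vocabulary argument, and the number of sampled negatives are illustrative choices, not details from the paper.

```python
import random

def ranking_hinge_loss(score_correct, score_corrupt):
    """Margin-1 ranking loss: zero only when g(s, d) > g(s^w, d) + 1."""
    return max(0.0, 1.0 - score_correct + score_corrupt)

def window_loss(g, s, d, vocab, num_negatives=10):
    """Loss for one window s in document d: replace the last word of s with
    randomly sampled words w and penalize any corrupted sequence that scores
    within margin 1 of the correct one."""
    score_correct = g(s, d)
    total = 0.0
    for _ in range(num_negatives):
        w = random.choice(vocab)          # random corrupting word
        s_corrupt = list(s[:-1]) + [w]    # s with its last word replaced by w
        total += ranking_hinge_loss(score_correct, g(s_corrupt, d))
    return total
```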
Two scoring components (neural networks) to capture:
- Local Context
- Map word sequence s into an ordered list of vectors x = [x1, ..., xm].
- xi - embedding corresponding to ith word in the sequence.
- Compute the local score, score_l, using a neural network (with one hidden layer) over x.
- Preserves word order and syntactic information.
- Global Context
- Map document d to an ordered list of word embeddings, d = (d1, ..., dk).
- Compute c, the weighted average of all word vectors in the document.
- The paper uses idf scores as the weights for the word vectors.
- Concatenate c with the vector of the last word in s to form the input to the global scoring network.
- Compute the global score, score_g, using a neural network (with two hidden layers) over this concatenated input.
- The global score captures information similar to bag-of-words features.
- Final score = score_l + score_g (see the sketch after this list).
- Train the weights of the hidden layers and the word embeddings.
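A minimal numpy sketch of the two scoring components, assuming word embeddings are looked up elsewhere and passed in as vectors; the parameter names, the tanh nonlinearity, and the weighting details are illustrative choices, not necessarily the paper's exact formulation.

```python
import numpy as np

def local_score(window_vecs, W1, b1, w2, b2):
    """score_l: one-hidden-layer network over the concatenated window
    embeddings x = [x1, ..., xm]; concatenation preserves word order and
    syntactic information."""
    x = np.concatenate(window_vecs)
    h = np.tanh(W1 @ x + b1)
    return w2 @ h + b2

def global_score(doc_vecs, idf_weights, last_word_vec, Wg1, bg1, Wg2, bg2, wg3, bg3):
    """score_g: two-hidden-layer network over the concatenation of c (the
    idf-weighted average of all word vectors in the document, a bag-of-words
    style summary) and the embedding of the last word in s."""
    w = np.asarray(idf_weights, dtype=float)
    c = (w[:, None] * np.stack(doc_vecs)).sum(axis=0) / w.sum()
    x = np.concatenate([c, last_word_vec])
    h1 = np.tanh(Wg1 @ x + bg1)
    h2 = np.tanh(Wg2 @ h1 + bg2)
    return wg3 @ h2 + bg3

def score(window_vecs, doc_vecs, idf_weights, local_params, global_params):
    """Total score used in the ranking objective: score = score_l + score_g."""
    return (local_score(window_vecs, *local_params)
            + global_score(doc_vecs, idf_weights, window_vecs[-1], *global_params))
```

Both the weights of these networks and the word embeddings themselves are trained against the ranking objective above.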
Multiple Word Prototypes
Words can have different meanings in different contexts, and these are difficult to capture when we train only one vector per word.
Solution - train multiple vectors per word to capture the different meanings.
Approach
- Gather all the fixed-size context windows for all occurrences of a given word.
- Compute a context vector for each occurrence by taking a weighted average of the vectors of the words in its context window.
- Cluster the context vectors using spherical k-means.
- Each word occurrence in the corpus is re-labeled to its associated cluster.
- To find the similarity between a pair of words (w, w'):
- For each pair of clusters i and j (corresponding to w and w' respectively), compute the distance between the cluster centers of i and j, weighted by the product of the probability of w belonging to cluster i and the probability of w' belonging to cluster j, given their respective contexts.
- Average this value over all k^2 cluster pairs (see the sketch after this list).
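A sketch of the multi-prototype pipeline under stated assumptions: the function names are mine, unit-normalizing the vectors and running standard k-means is used as a simple stand-in for spherical k-means, and cosine similarity stands in for the distance between cluster centers.

```python
import numpy as np
from sklearn.cluster import KMeans

def context_vector(window_vecs, idf_weights):
    """Weighted average of the word vectors in one fixed-size context window."""
    w = np.asarray(idf_weights, dtype=float)
    return (w[:, None] * np.stack(window_vecs)).sum(axis=0) / w.sum()

def cluster_contexts(context_vecs, k=10, seed=0):
    """Cluster all context vectors of a word into k sense clusters.
    Normalizing to unit length + standard k-means approximates spherical
    k-means (which clusters by cosine similarity)."""
    X = np.stack(context_vecs)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    km = KMeans(n_clusters=k, random_state=seed).fit(X)
    return km  # cluster_centers_ give per-sense prototypes; labels_ relabel each occurrence

def avg_context_similarity(centers_w, probs_w, centers_v, probs_v):
    """Context-aware similarity of words w and v: for every cluster pair (i, j),
    weight the similarity of the cluster centers by p(w in i | context) *
    p(v in j | context), then average over all k^2 pairs."""
    def cos(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    total = 0.0
    for i, ci in enumerate(centers_w):
        for j, cj in enumerate(centers_v):
            total += probs_w[i] * probs_v[j] * cos(ci, cj)
    return total / (len(centers_w) * len(centers_v))
```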
Training Dataset
- Wikipedia corpus
Parameters
- 10-word windows
- 100 hidden units
- No weight regularization
- 10 different word embeddings learnt for words having multiple meanings.
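For reference, the hyperparameters above gathered into a small config object (a sketch; the field names are mine, not from the paper):

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    window_size: int = 10        # 10-word local context windows
    hidden_units: int = 100      # hidden units in the scoring networks
    weight_decay: float = 0.0    # no weight regularization
    num_prototypes: int = 10     # embeddings learnt per multi-sense word
```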
Evaluation Datasets
- WordSim-353
- 353 pairs of nouns
- word pairs are presented without any context
- contains human similarity judgements for each pair of words
- The paper contributed a new dataset
- captures human similarity judgements on pairs of words in the context of a sentence
- consists of verbs and adjectives along with nouns
- for details on how the dataset is constructed, refer to the paper
Performance
- The proposed model achieves a higher correlation with human scores than models using only the local or only the global context.
- Performance can be improved further by removing stop words.
- The multi-prototype approach (multiple vectors per word) benefits the model on tasks where the context is also given.