Summary of the paper "Improving Word Representations via Global Context and Multiple Word Prototypes"

Improving Word Representations via Global Context and Multiple Word Prototypes

Introduction

  • This paper pre-dated models like GloVe and word2vec and proposed an architecture that
    • combines local and global context while learning word embeddings, to better capture word semantics.
    • learns multiple embeddings per word to account for homonymy and polysemy.
  • Link to the paper

Global Context-Aware Neural Language Model

Training Objective

  • Given a word sequence s (local context) and the document d in which the sequence occurs (global context), learn word representations while learning to discriminate the correct last word of s from other words.
  • g(s, d) - scoring function giving the likelihood that s is a correct sequence.
  • g(s^w, d) - scoring function where the last word in s is replaced by a word w.
  • Objective - g(s, d) > g(s^w, d) + 1 for any other word w, i.e. a margin-based ranking criterion (see the loss sketch below).
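A minimal sketch of this ranking objective as a hinge loss, assuming a generic scoring callable g and random sampling of replacement words w; the sampling scheme and function names here are illustrative, not prescribed by the paper:

```python
import numpy as np

def ranking_loss(g, s, d, vocab, rng, num_corruptions=10):
    """Margin-based ranking loss: average of max(0, 1 - g(s, d) + g(s_w, d))
    over corrupted sequences s_w whose last word is replaced by a sampled w."""
    correct_score = g(s, d)
    losses = []
    for _ in range(num_corruptions):
        w = rng.choice(vocab)            # random replacement word (illustrative sampling)
        s_w = s[:-1] + [w]               # replace the last word of the sequence
        losses.append(max(0.0, 1.0 - correct_score + g(s_w, d)))
    return float(np.mean(losses))
```

Minimising this loss pushes the score of the correct sequence at least a margin of 1 above the score of any corrupted sequence, which is exactly the inequality stated in the objective above.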

Architecture

  • Two scoring components (neural networks) to capture:

    • Local Context
      • Map the word sequence s to an ordered list of embedding vectors x = [x_1, ..., x_m].
      • x_i - embedding corresponding to the ith word in the sequence.
      • Compute the local score score_l using a neural network (with one hidden layer) over x.
      • Preserves word order and syntactic information.
    • Global Context
      • Map the document d to an ordered list of word embeddings, d = (d_1, ..., d_k).
      • Compute c, the weighted average of all word vectors in the document.
      • The paper uses idf weighting for the word vectors.
      • x̄ = concatenation of c and the embedding of the last word in s.
      • Compute the global score score_g using a neural network (with two hidden layers) over x̄.
      • Similar to bag-of-words features.
    • The final score is score = score_l + score_g (a minimal sketch of both networks follows this list).
    • Train the weights of the hidden layers and the word embeddings.
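A minimal sketch of the two scoring networks, assuming tanh hidden layers and 50-dimensional embeddings; the dimensions, initialisation, and function names are illustrative rather than the paper's exact implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_layer(n_in, n_out):
    """One fully connected layer: small random weights and a zero bias."""
    return rng.normal(scale=0.01, size=(n_out, n_in)), np.zeros(n_out)

def mlp(x, layers):
    """Feed-forward net: tanh hidden layers, linear scalar output."""
    h = x
    for W, b in layers[:-1]:
        h = np.tanh(W @ h + b)
    W, b = layers[-1]
    return (W @ h + b).item()

def local_score(window_vectors, layers):
    """score_l: one hidden layer over the concatenated window embeddings,
    which preserves word order."""
    return mlp(np.concatenate(window_vectors), layers)

def global_score(doc_vectors, idf_weights, last_word_vec, layers):
    """score_g: two hidden layers over [c; x_m], where c is the idf-weighted
    average of all word vectors in the document."""
    w = np.asarray(idf_weights, dtype=float)
    c = (w[:, None] * np.asarray(doc_vectors)).sum(axis=0) / w.sum()
    return mlp(np.concatenate([c, last_word_vec]), layers)

# Illustrative wiring: 10-word windows and 100 hidden units as in the training
# section below; the 50-dimensional embeddings are an assumption.
emb_dim, window = 50, 10
local_layers = [init_layer(emb_dim * window, 100), init_layer(100, 1)]
global_layers = [init_layer(2 * emb_dim, 100), init_layer(100, 100), init_layer(100, 1)]
# score(s, d) = local_score(...) + global_score(...)
```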

Multi-Prototype Neural Language Model

  • Words can have different meanings in different contexts which are difficult to capture when we train only one vector per word.

  • Solution - train multiple vectors per word to capture the different meanings.

  • Approach

    • Gather all the fixed-size context windows for all occurrences of a given word.
    • Compute the context vector for each occurrence as the weighted average of the word vectors in its context window.
    • Cluster the context vectors using spherical k-means.
    • Each word occurrence in the corpus is re-labeled to its associated cluster.
    • To find the similarity between a pair of words (w, w'):
      • For each pair of clusters (i, j), where i belongs to w and j belongs to w', compute the distance between the cluster centers and weight it by the product of the probabilities of w belonging to i and w' belonging to j, given their respective contexts.
      • Average this value over the k² pairs (see the sketch after this list).
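A minimal sketch of the clustering step and the context-weighted similarity described above; the way cluster-membership probabilities are obtained here (softmax over cosine similarity to the cluster centers) is an assumption for illustration:

```python
import numpy as np

def spherical_kmeans(context_vectors, k, iters=20, seed=0):
    """Cluster unit-normalised context vectors by cosine similarity."""
    rng = np.random.default_rng(seed)
    X = context_vectors / np.linalg.norm(context_vectors, axis=1, keepdims=True)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = (X @ centers.T).argmax(axis=1)        # assign to the most similar center
        for j in range(k):
            members = X[labels == j]
            if len(members):
                c = members.mean(axis=0)
                centers[j] = c / np.linalg.norm(c)     # re-normalise the new center
    return centers, labels

def cluster_probs(context_vec, centers):
    """Assumed membership probabilities: softmax over cosine similarity of the
    word's current context to each cluster center."""
    sims = centers @ (context_vec / np.linalg.norm(context_vec))
    e = np.exp(sims - sims.max())
    return e / e.sum()

def contextual_similarity(centers_w, probs_w, centers_v, probs_v):
    """Average over all k^2 cluster pairs of the similarity between cluster
    centers, weighted by the membership probabilities of each word."""
    k = len(centers_w)
    total = 0.0
    for i in range(k):
        for j in range(k):
            cos = centers_w[i] @ centers_v[j]          # centers are unit vectors
            total += probs_w[i] * probs_v[j] * cos
    return total / (k * k)
```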

Training

  • Dataset

    • Wikipedia corpus
  • Parameters

    • 10-word windows
    • 100 hidden units
    • No weight regularization
    • 10 different word embeddings learnt for words having multiple meanings.

Evaluation

  • Dataset

    • WordSim-353
      • 353 pairs of nouns
      • words represented without context
      • contains human similarity judgements on pairs of words
    • The paper contributed a new dataset
      • captures human similarity judgements on pairs of words in the context of a sentence
      • consists of verbs and adjectives along with nouns
      • for details on how the dataset is constructed, refer to the paper
  • Performance

    • The proposed model achieves a higher correlation with human scores than models using only the local or the global context (correlation computed as in the sketch after this list).
    • Performance can be improved by removing the stop words.
    • Using multi-prototype approach (multiple vectors for the same word) benefits the model on the tasks where the context is also given.
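A short sketch of the evaluation protocol, assuming Spearman rank correlation between model similarity scores and the human judgements (the standard metric for WordSim-353-style benchmarks); similarity_fn is a placeholder for the model's word-similarity function:

```python
from scipy.stats import spearmanr

def evaluate_word_similarity(pairs, human_scores, similarity_fn):
    """Spearman rank correlation between model similarities and human judgements."""
    model_scores = [similarity_fn(w1, w2) for w1, w2 in pairs]
    rho, _ = spearmanr(model_scores, human_scores)
    return rho
```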

Comments

  • This work predated the more general word embedding models like word2vec and GloVe. While this model performs well on intrinsic evaluation tasks like word similarity, it is outperformed by the more recent and general models on downstream tasks like NER.