- The paper describes an unsupervised approach to train a generic, distributed sentence encoder.
- It also describes a vocabulary expansion method to encode words not seen at training time.
- [Link to the paper](https://arxiv.org/abs/1506.06726)
- Train an encoder-decoder model where the encoder maps the input sentence to a sentence vector and the decoder generates the sentences surrounding the original sentence.
- The model is called skip-thoughts and the encoded vectors are called skip-thought vectors.
- Analogous to the skip-gram model: just as skip-gram uses surrounding words to learn word representations, skip-thoughts uses surrounding sentences to learn sentence representations.
- Training data is in the form of sentence tuples (previous sentence, current sentence, next sentence), as in the sketch below.
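A minimal sketch, assuming the corpus has already been split into an ordered list of sentences, of how such (previous, current, next) tuples can be formed; the helper name and example sentences are illustrative, not from the paper.

```python
def make_triples(sentences):
    """Yield (previous, current, next) tuples from an ordered sentence list."""
    for i in range(1, len(sentences) - 1):
        yield sentences[i - 1], sentences[i], sentences[i + 1]


corpus = [
    "I got back home.",
    "I could see the cat on the steps.",
    "This was strange.",
]
for prev_s, cur_s, next_s in make_triples(corpus):
    print(prev_s, "|", cur_s, "|", next_s)
```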
- Encoder
- RNN Encoder with GRU.
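A minimal NumPy sketch of a plain GRU encoder step; the parameter names and the inclusion of bias terms are this sketch's choices, not the paper's notation. The hidden state after the last word of the sentence is used as the sentence vector.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, p):
    """One GRU step; p maps parameter names (this sketch's) to matrices/vectors."""
    r = sigmoid(p["Wr"] @ x_t + p["Ur"] @ h_prev + p["br"])          # reset gate
    z = sigmoid(p["Wz"] @ x_t + p["Uz"] @ h_prev + p["bz"])          # update gate
    h_bar = np.tanh(p["W"] @ x_t + p["U"] @ (r * h_prev) + p["b"])   # proposed state
    return (1.0 - z) * h_prev + z * h_bar                            # new hidden state

def encode(word_vectors, p, hidden_dim=2400):
    """Run the GRU over a sentence; the final hidden state is the sentence vector."""
    h = np.zeros(hidden_dim)
    for x_t in word_vectors:
        h = gru_step(x_t, h, p)
    return h
```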
- Decoder
- RNN Decoder with conditional GRU.
- Conditioned on encoder output.
- Extra matrices introduced to bias the update gate, reset gate and hidden state, given the encoder output.
- Vocabulary matrix (V) - Weight matrix having one row (vector) for each word in the vocabulary.
- Separate decoders are used for the previous and the next sentence; they share only the vocabulary matrix V.
- Given the decoder hidden state h at a given time step, the encoder output, and the words generated so far, the probability of choosing w as the next word is proportional to exp(v_w · h), where v_w is the row of V corresponding to w (see the sketch below).
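A minimal NumPy sketch of one conditional-GRU decoder step and the resulting next-word distribution; the parameter names (Cr, Cz, C for the conditioning matrices, etc.) are this sketch's, and the code illustrates the idea rather than reproducing the paper's implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def cond_gru_decoder_step(x_t, h_prev, h_enc, p, V):
    """One decoder step; x_t is the embedding of the previously generated word,
    h_enc is the encoder output, and Cr, Cz, C bias the reset gate, update gate
    and proposed state with h_enc.  V is the shared vocabulary matrix."""
    r = sigmoid(p["Wr"] @ x_t + p["Ur"] @ h_prev + p["Cr"] @ h_enc)   # reset gate
    z = sigmoid(p["Wz"] @ x_t + p["Uz"] @ h_prev + p["Cz"] @ h_enc)   # update gate
    h_bar = np.tanh(p["W"] @ x_t + p["U"] @ (r * h_prev) + p["C"] @ h_enc)
    h_t = (1.0 - z) * h_prev + z * h_bar
    probs = softmax(V @ h_t)      # P(next word = w) ∝ exp(v_w · h_t)
    return h_t, probs
```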
- Objective
- Sum of the log-probabilities of the words in the previous and the next sentence, conditioned on the encoder output, as written out below.
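Written out, with h_i denoting the encoding of sentence s_i and w_{i±1}^t the t-th word of the neighbouring sentences, the objective for a tuple (s_{i-1}, s_i, s_{i+1}) is:

```latex
\sum_{t} \log P\left(w_{i+1}^{t} \mid w_{i+1}^{<t}, h_{i}\right)
\;+\;
\sum_{t} \log P\left(w_{i-1}^{t} \mid w_{i-1}^{<t}, h_{i}\right)
```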
- Take a word embedding model, such as word2vec, that has been trained to induce representations over a much larger vocabulary, so that embeddings exist for the words the encoder is likely to encounter.
- Learn a matrix W such that encoder_embedding(word) ≈ W · word2vec(word), i.e. a linear mapping fit by linear regression (L2 loss) over the words common to the word2vec and encoder vocabularies.
- Use W to generate encoder-space embeddings for words that were not seen during encoder training (a sketch follows).
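A minimal NumPy sketch of the expansion step, with random placeholder data and illustrative dimensions (300-d word2vec vectors; the encoder's word-embedding size is an assumption here):

```python
import numpy as np

rng = np.random.default_rng(0)
d_w2v, d_enc, n_shared = 300, 620, 1000     # dimensions are illustrative

X_w2v = rng.normal(size=(n_shared, d_w2v))  # word2vec vectors of shared words
Y_enc = rng.normal(size=(n_shared, d_enc))  # encoder embeddings of the same words

# Least-squares solution of X_w2v @ W ≈ Y_enc (linear regression with an L2 loss).
W, *_ = np.linalg.lstsq(X_w2v, Y_enc, rcond=None)

def expand(w2v_vector):
    """Map the word2vec vector of an unseen word into the encoder's embedding space."""
    return w2v_vector @ W
```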
- Trained on the BookCorpus dataset, which contains books from 16 genres.
- uni-skip
- Unidirectional encoder producing 2400-dimensional sentence vectors.
- bi-skip
- Bidirectional model with a forward encoder (sentence given in the correct order) and a backward encoder (sentence given in reverse order) of 1200 dimensions each; their outputs are concatenated.
- combine-skip
- Concatenation of the uni-skip and bi-skip vectors, giving a 4800-dimensional representation (see the sketch below).
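A small sketch of how the three variants relate, assuming hypothetical encode_uni, encode_fwd and encode_bwd helpers that return the respective sentence vectors:

```python
import numpy as np

def bi_skip(words, encode_fwd, encode_bwd):
    """1200-d forward vector + 1200-d backward vector (words reversed) = 2400-d."""
    return np.concatenate([encode_fwd(words), encode_bwd(words[::-1])])

def combine_skip(words, encode_uni, encode_fwd, encode_bwd):
    """2400-d uni-skip vector + 2400-d bi-skip vector = 4800-d combine-skip."""
    return np.concatenate([encode_uni(words),
                           bi_skip(words, encode_fwd, encode_bwd)])
```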
- Initialization
- Recurrent matrices - orthogonal initialization.
- Non-recurrent matrices - uniform distribution in [-0.1, 0.1].
- Mini-batches of size 128.
- Gradient Clipping at norm = 10.
- Adam optimizer.
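A minimal PyTorch sketch of these training choices; the framework, module shapes and the placeholder loss are this sketch's assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

encoder = nn.GRU(input_size=620, hidden_size=2400, batch_first=True)

for name, param in encoder.named_parameters():
    if "weight_hh" in name:                 # recurrent matrices
        nn.init.orthogonal_(param)
    elif "weight" in name:                  # non-recurrent matrices
        nn.init.uniform_(param, -0.1, 0.1)
    else:                                   # biases
        nn.init.zeros_(param)

optimizer = torch.optim.Adam(encoder.parameters())

def train_step(batch):
    """batch: (128, seq_len, 620) mini-batch of word vectors."""
    optimizer.zero_grad()
    output, h_n = encoder(batch)
    loss = h_n.pow(2).mean()                # placeholder; the real loss comes from the decoders
    loss.backward()
    torch.nn.utils.clip_grad_norm_(encoder.parameters(), max_norm=10.0)
    optimizer.step()
    return loss.item()

train_step(torch.randn(128, 30, 620))       # one illustrative step on random data
```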
- After learning skip-thoughts, the model is frozen and the encoder is used as a feature extractor only.
- The vectors are evaluated with linear models on the following tasks (a minimal linear-probe sketch appears after the evaluation results below):
- Semantic relatedness
- Given a sentence pair, predict how closely related the two sentences are.
- The skip-thoughts method outperforms all systems from the SemEval 2014 competition and is outperformed only by dependency tree-LSTMs.
- Adding features learned from an image-sentence embedding model trained on COCO boosts performance and brings it on par with dependency tree-LSTMs.
- Paraphrase detection
- skip-thoughts outperform recursive nets with dynamic pooling when no hand-crafted features are used.
- skip-thoughts combined with basic pairwise statistics produce results comparable to state-of-the-art systems that rely on complicated features and hand-engineering.
- MS COCO dataset
- Task
- Image annotation - Given an image, rank the sentences on the basis of how well they describe the image.
- Image search - Given a caption, find the image that is being described.
- Though the system does not outperform the baseline systems in all cases, the results do indicate that skip-thought vectors can capture image descriptions without having to learn their representations from scratch.
- Classification benchmarks
- skip-thoughts perform about as well as bag-of-words baselines but are outperformed by methods whose sentence representations are learned for the task at hand.
- Combining skip-thoughts with bi-gram Naive Bayes (NB) features improves performance.
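A minimal sketch of the linear-evaluation protocol mentioned above, assuming a hypothetical encode() helper that returns frozen combine-skip vectors; the classifier, data and labels here are illustrative, not the paper's exact setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def encode(sentences):
    """Placeholder for the frozen skip-thought encoder (hypothetical helper);
    it would return one 4800-d combine-skip vector per sentence."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(sentences), 4800))

train_sents = ["a gripping, well-acted film .", "flat and unconvincing ."]
train_labels = [1, 0]

X_train = encode(train_sents)            # features stay frozen;
clf = LogisticRegression(max_iter=1000)  # only the linear classifier is trained
clf.fit(X_train, train_labels)

print(clf.predict(encode(["a surprisingly moving story ."])))
```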
- Variants to be explored include:
- Fine-tuning the encoder-decoder model on the downstream task instead of freezing the weights.
- Deep encoders and decoders.
- Larger context windows.
- Encoding and decoding paragraphs.
- Other encoder architectures, such as convnets.