- The paper describes an unsupervised approach to train a generic, distributed sentence encoder.
- It also describes a vocabulary expansion method to encode words not seen at training time.
- [Link to the paper](https://arxiv.org/abs/1506.06726)
- Train an encoder-decoder model where the encoder maps the input sentence to a sentence vector and the decoder generates the sentences surrounding the original sentence.
- The model is called skip-thoughts and the encoded vectors are called skip-thought vectors.
- Analogous to the skip-gram model: just as skip-gram uses surrounding words to learn word representations, skip-thoughts uses surrounding sentences to learn sentence representations.
- Training data is in the form of sentence tuples (previous sentence, current sentence, next sentence), as in the sketch below.
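A minimal sketch, assuming the corpus has already been split into an ordered list of sentences, of how such (previous, current, next) tuples can be formed; the helper name and example sentences are illustrative, not from the paper.

```python
def make_triples(sentences):
    """Yield (previous, current, next) tuples from an ordered sentence list."""
    for i in range(1, len(sentences) - 1):
        yield sentences[i - 1], sentences[i], sentences[i + 1]


corpus = [
    "I got back home.",
    "I could see the cat on the steps.",
    "This was strange.",
]
for prev_s, cur_s, next_s in make_triples(corpus):
    print(prev_s, "|", cur_s, "|", next_s)
```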
- Encoder
- RNN Encoder with GRU.
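A minimal NumPy sketch of a plain GRU encoder step; the parameter names and the inclusion of bias terms are this sketch's choices, not the paper's notation. The hidden state after the last word of the sentence is used as the sentence vector.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, p):
    """One GRU step; p maps parameter names (this sketch's) to matrices/vectors."""
    r = sigmoid(p["Wr"] @ x_t + p["Ur"] @ h_prev + p["br"])          # reset gate
    z = sigmoid(p["Wz"] @ x_t + p["Uz"] @ h_prev + p["bz"])          # update gate
    h_bar = np.tanh(p["W"] @ x_t + p["U"] @ (r * h_prev) + p["b"])   # proposed state
    return (1.0 - z) * h_prev + z * h_bar                            # new hidden state

def encode(word_vectors, p, hidden_dim=2400):
    """Run the GRU over a sentence; the final hidden state is the sentence vector."""
    h = np.zeros(hidden_dim)
    for x_t in word_vectors:
        h = gru_step(x_t, h, p)
    return h
```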
- Decoder
- RNN Decoder with conditional GRU.
- Conditioned on encoder output.
- Extra matrices introduced to bias the update gate, reset gate and hidden state, given the encoder output.
- Vocabulary matrix (V) - Weight matrix having one row (vector) for each word in the vocabulary.
- Separate decoders are used for the previous and the next sentence; they share only the vocabulary matrix V.
- Given the decoder hidden state h at a given time step, the encoder output, and the words generated so far, the probability of choosing w as the next word is proportional to exp(v_w · h), where v_w is the row of V corresponding to w (see the sketch below).
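A minimal NumPy sketch of one conditional-GRU decoder step and the resulting next-word distribution; the parameter names (Cr, Cz, C for the conditioning matrices, etc.) are this sketch's, and the code illustrates the idea rather than reproducing the paper's implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def cond_gru_decoder_step(x_t, h_prev, h_enc, p, V):
    """One decoder step; x_t is the embedding of the previously generated word,
    h_enc is the encoder output, and Cr, Cz, C bias the reset gate, update gate
    and proposed state with h_enc.  V is the shared vocabulary matrix."""
    r = sigmoid(p["Wr"] @ x_t + p["Ur"] @ h_prev + p["Cr"] @ h_enc)   # reset gate
    z = sigmoid(p["Wz"] @ x_t + p["Uz"] @ h_prev + p["Cz"] @ h_enc)   # update gate
    h_bar = np.tanh(p["W"] @ x_t + p["U"] @ (r * h_prev) + p["C"] @ h_enc)
    h_t = (1.0 - z) * h_prev + z * h_bar
    probs = softmax(V @ h_t)      # P(next word = w) ∝ exp(v_w · h_t)
    return h_t, probs
```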
- Objective
- Sum of the log-probabilities of the words in the previous and the next sentence, conditioned on the encoder output, as written out below.
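Written out, with h_i denoting the encoding of sentence s_i and w_{i±1}^t the t-th word of the neighbouring sentences, the objective for a tuple (s_{i-1}, s_i, s_{i+1}) is:

```latex
\sum_{t} \log P\left(w_{i+1}^{t} \mid w_{i+1}^{<t}, h_{i}\right)
\;+\;
\sum_{t} \log P\left(w_{i-1}^{t} \mid w_{i-1}^{<t}, h_{i}\right)
```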
- Take a word embedding model, such as word2vec, that has been trained to induce representations over a much larger vocabulary, so that embeddings exist for the words the encoder is likely to encounter.
- Learn a matrix W such that encoder_embedding(word) ≈ W · word2vec(word), i.e. a linear mapping fit by linear regression (L2 loss) over the words common to the word2vec and encoder vocabularies.
- Use W to generate encoder-space embeddings for words that were not seen during encoder training (a sketch follows).
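A minimal NumPy sketch of the expansion step, with random placeholder data and illustrative dimensions (300-d word2vec vectors; the encoder's word-embedding size is an assumption here):

```python
import numpy as np

rng = np.random.default_rng(0)
d_w2v, d_enc, n_shared = 300, 620, 1000     # dimensions are illustrative

X_w2v = rng.normal(size=(n_shared, d_w2v))  # word2vec vectors of shared words
Y_enc = rng.normal(size=(n_shared, d_enc))  # encoder embeddings of the same words

# Least-squares solution of X_w2v @ W ≈ Y_enc (linear regression with an L2 loss).
W, *_ = np.linalg.lstsq(X_w2v, Y_enc, rcond=None)

def expand(w2v_vector):
    """Map the word2vec vector of an unseen word into the encoder's embedding space."""
    return w2v_vector @ W
```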
- Trained on the BookCorpus dataset, which contains books from 16 genres.
- uni-skip
- Unidirectional encoder producing 2400-dimensional sentence vectors.
- bi-skip
- Bidirectional model with a forward encoder (sentence given in the correct order) and a backward encoder (sentence given in reverse order) of 1200 dimensions each; their outputs are concatenated.
- combine-skip
- Concatenation of the uni-skip and bi-skip vectors, giving a 4800-dimensional representation (see the sketch below).
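A small sketch of how the three variants relate, assuming hypothetical encode_uni, encode_fwd and encode_bwd helpers that return the respective sentence vectors:

```python
import numpy as np

def bi_skip(words, encode_fwd, encode_bwd):
    """1200-d forward vector + 1200-d backward vector (words reversed) = 2400-d."""
    return np.concatenate([encode_fwd(words), encode_bwd(words[::-1])])

def combine_skip(words, encode_uni, encode_fwd, encode_bwd):
    """2400-d uni-skip vector + 2400-d bi-skip vector = 4800-d combine-skip."""
    return np.concatenate([encode_uni(words),
                           bi_skip(words, encode_fwd, encode_bwd)])
```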
- Initialization
- Recurrent matrices - orthogonal initialization.
- Non-recurrent matrices - uniform distribution in [-0.1, 0.1].
- Mini-batches of size 128.
- Gradient Clipping at norm = 10.
- Adam optimizer.
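A minimal PyTorch sketch of these training choices; the framework, module shapes and the placeholder loss are this sketch's assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

encoder = nn.GRU(input_size=620, hidden_size=2400, batch_first=True)

for name, param in encoder.named_parameters():
    if "weight_hh" in name:                 # recurrent matrices
        nn.init.orthogonal_(param)
    elif "weight" in name:                  # non-recurrent matrices
        nn.init.uniform_(param, -0.1, 0.1)
    else:                                   # biases
        nn.init.zeros_(param)

optimizer = torch.optim.Adam(encoder.parameters())

def train_step(batch):
    """batch: (128, seq_len, 620) mini-batch of word vectors."""
    optimizer.zero_grad()
    output, h_n = encoder(batch)
    loss = h_n.pow(2).mean()                # placeholder; the real loss comes from the decoders
    loss.backward()
    torch.nn.utils.clip_grad_norm_(encoder.parameters(), max_norm=10.0)
    optimizer.step()
    return loss.item()

train_step(torch.randn(128, 30, 620))       # one illustrative step on random data
```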
- After learning skip-thoughts, the model is frozen and the encoder is used as a feature extractor only.
- The vectors are evaluated with linear models on the following tasks (a minimal linear-probe sketch appears after the evaluation results below):
- Semantic relatedness
- Given a sentence pair, predict how closely related the two sentences are.
- The skip-thoughts method outperforms all systems from the SemEval 2014 competition and is outperformed only by dependency tree-LSTMs.
- Adding features learned from an image-sentence embedding model trained on COCO boosts performance and brings it on par with dependency tree-LSTMs.
- Paraphrase detection
- skip-thoughts outperform recursive nets with dynamic pooling when no hand-crafted features are used.
- skip-thoughts combined with basic pairwise statistics produce results comparable to state-of-the-art systems that rely on complicated features and hand-engineering.
- MS COCO dataset
- Task
- Image annotation - Given an image, rank the sentences on the basis of how well they describe the image.
- Image search - Given a caption, find the image that is being described.
- Though the system does not outperform the baseline systems in all cases, the results do indicate that skip-thought vectors can capture image descriptions without having to learn their representations from scratch.
- Classification benchmarks
- skip-thoughts perform about as well as bag-of-words baselines but are outperformed by methods whose sentence representations are learned for the task at hand.
- Combining skip-thoughts with bi-gram Naive Bayes (NB) features improves performance.
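A minimal sketch of the linear-evaluation protocol mentioned above, assuming a hypothetical encode() helper that returns frozen combine-skip vectors; the classifier, data and labels here are illustrative, not the paper's exact setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def encode(sentences):
    """Placeholder for the frozen skip-thought encoder (hypothetical helper);
    it would return one 4800-d combine-skip vector per sentence."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(sentences), 4800))

train_sents = ["a gripping, well-acted film .", "flat and unconvincing ."]
train_labels = [1, 0]

X_train = encode(train_sents)            # features stay frozen;
clf = LogisticRegression(max_iter=1000)  # only the linear classifier is trained
clf.fit(X_train, train_labels)

print(clf.predict(encode(["a surprisingly moving story ."])))
```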
- Variants to be explored include:
- Fine-tuning the encoder-decoder model on the downstream task instead of freezing the weights.
- Deep encoders and decoders.
- Larger context windows.
- Encoding and decoding paragraphs.
- Other encoder architectures, such as convnets.