
Up to Speed on Transformers

Introduction

This is for people who already understand some neural networks and want to get up to speed on Transformers. We assume you know the following:

  • Fully connected neural network
  • Convolutional Neural Network (CNN)
  • Recurrent Neural Network (RNN)
  • Sequence to sequence (using say CTC loss)
  • Basics of auto-differentiation and back-propagation

There will not be much description here. Instead, we will link to research papers, articles, tutorials, demos, etc., and provide short summaries with pseudo-code snippets.

RNNs

We first take a step back and look at RNNs and how far they go. RNNs allow us to work on sequences of inputs and outputs of possibly different lengths.

  • 1 => 1 image classification (don't need RNN)
  • 1 => n e.g. description of a 64x64 image
  • n => 1 e.g. sentiment of a tweet
  • n => n
    • asynchronous e.g. text translation
    • synchronous e.g. video frame labeling

A good overview seems to be by Karpathy.

  • from 2015
  • about image captioning with RNNs
  • links to the article and a GitHub repository
  • character-level language models based on multi-layer LSTMs

Basic Math of Vanilla RNN

  • $h_t = \mathrm{tanh}(W_{hh}h_{t-1} + W_{ih}x_t)$
  • $y_t = \mathrm{tanh}(W_{ho}h_{t})$
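
As a quick sanity check, here is a minimal NumPy sketch of one timestep implementing the two equations above. The sizes (26-dimensional one-hot inputs, 100 hidden units) and the toy "hello" input are illustrative assumptions, not taken from any particular paper.

```python
# A minimal sketch of one vanilla-RNN timestep, matching the two equations above.
# Sizes (n_in, n_hid, n_out) are assumed for illustration.
import numpy as np

n_in, n_hid, n_out = 26, 100, 26            # e.g. one-hot letters in and out
rng = np.random.default_rng(0)
W_ih = rng.normal(0, 0.01, (n_hid, n_in))   # input  -> hidden
W_hh = rng.normal(0, 0.01, (n_hid, n_hid))  # hidden -> hidden
W_ho = rng.normal(0, 0.01, (n_out, n_hid))  # hidden -> output

def rnn_step(x_t, h_prev):
    """One timestep: h_t = tanh(W_hh h_{t-1} + W_ih x_t), y_t = tanh(W_ho h_t)."""
    h_t = np.tanh(W_hh @ h_prev + W_ih @ x_t)
    y_t = np.tanh(W_ho @ h_t)
    return h_t, y_t

h = np.zeros(n_hid)
for x in np.eye(n_in)[[7, 4, 11, 11, 14]]:  # "hello" as one-hot rows
    h, y = rnn_step(x, h)
```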

Example: Char-by-Char Predictor

  • Input is a sequence of letters, each encoded as a one-hot vector over the 26 letters.
  • Output is the same sequence shifted by one position, i.e. the network predicts the next letter.
  • Output activation is a 26-way softmax. Loss is cross-entropy.
  • Multi-layer LSTM RNN (see the sketch after this list)
  • Applications include:
    • Single author like Shakespeare
    • Wikipedia with markup
    • Math with Latex
    • Linux source code
  • How does it compare with an $n$-gram model?

    Generating English a character at a time -- not so impressive, the RNN needs to learn the previous n letters, for a rather small n, and that's it. However, the code-generation example is very impressive. Why? because of the context awareness. Note that in all of the posted examples, the code is well indented, the braces and brackets are correctly nested, and even the comments start and end correctly. This is not something that can be achieved by simply looking at the previous $n$ letters.
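
Below is a hedged sketch of the char-by-char predictor in PyTorch. The vocabulary size, hidden size, number of LSTM layers, and the toy batch are assumptions for illustration; an embedding layer stands in for explicit one-hot vectors (equivalent up to a linear map), and the softmax is folded into `nn.CrossEntropyLoss`.

```python
# A minimal sketch of a char-by-char predictor: multi-layer LSTM, 26-way
# softmax output, cross-entropy loss, target = input shifted by one position.
# All sizes below are assumptions for illustration.
import torch
import torch.nn as nn

class CharLSTM(nn.Module):
    def __init__(self, vocab=26, hidden=256, layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)           # letter -> vector
        self.lstm = nn.LSTM(hidden, hidden, layers, batch_first=True)
        self.head = nn.Linear(hidden, vocab)               # logits over 26 letters

    def forward(self, x, state=None):
        h, state = self.lstm(self.embed(x), state)
        return self.head(h), state                         # softmax lives inside the loss

model = CharLSTM()
loss_fn = nn.CrossEntropyLoss()                            # 26-way softmax + NLL

# One toy batch: target is the input sequence shifted left by one position.
seq = torch.randint(0, 26, (8, 65))                        # (batch, time + 1)
x, y = seq[:, :-1], seq[:, 1:]
logits, _ = model(x)
loss = loss_fn(logits.reshape(-1, 26), y.reshape(-1))
loss.backward()
```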

References

Encoder-Decoder Model for RNNs

  • Sequence to Sequence Learning with Neural Networks (Ilya Sutskever, Oriol Vinyals, Quoc V. Le)
  • Centered around building a fixed-length context vector to encapsulate the whole input (a minimal sketch follows this list):

    use one LSTM to read the input sequence, one timestep at a time, to obtain large fixed-dimensional vector representation, and then to use another LSTM to extract the output sequence from that vector. (Figure 1 from Sutskever et al.)

  • A useful property of the LSTM is that it learns to map an input sentence of variable length into a fixed-dimensional vector representation.
  • Our actual models differ from the above description in three important ways:
    1. We used two different LSTMs: one for the input sequence and another for the output sequence, because doing so increases the number of model parameters at negligible computational cost and makes it natural to train the LSTM on multiple language pairs simultaneously [18].
    2. We found that deep LSTMs significantly outperformed shallow LSTMs, so we chose an LSTM with four layers.
    3. We found it extremely valuable to reverse the order of the words of the input sentence. The simple trick of reversing the words in the source sentence is one of the key technical contributions of this work.
  • WMT’14 English to French MT
  • Used a fixed vocabulary for both languages.
  • Beam-search decoding after training to maximize $\sum \log p(T|S)$
  • Used deep LSTMs with 4 layers (each for encoder and decoder), with 1000 cells at each layer and 1000 dimensional word embeddings, with an input vocabulary of 160,000 and an output vocabulary of 80,000. Thus the deep LSTM uses 8000 real numbers to represent a sentence.
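
A hedged sketch of this encoder-decoder setup in PyTorch: two separate deep LSTMs, the source sequence read in reverse, and the encoder's final (hidden, cell) state used as the fixed-size context that seeds the decoder. Vocabulary and layer sizes are toy values, not the paper's 4 layers of 1000 cells, and beam-search decoding is omitted.

```python
# A sketch of a Sutskever-style sequence-to-sequence model: separate encoder
# and decoder LSTMs, source reversed, encoder's final state as the context.
# Sizes and the toy batch are assumptions for illustration.
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab=1000, tgt_vocab=1000, dim=128, layers=2):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, dim)
        self.tgt_embed = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.LSTM(dim, dim, layers, batch_first=True)
        self.decoder = nn.LSTM(dim, dim, layers, batch_first=True)
        self.head = nn.Linear(dim, tgt_vocab)

    def forward(self, src, tgt_in):
        src = src.flip(1)                                # "reverse the source" trick
        _, context = self.encoder(self.src_embed(src))   # context = (h_n, c_n)
        out, _ = self.decoder(self.tgt_embed(tgt_in), context)
        return self.head(out)                            # logits; train on sum log p(T|S)

model = Seq2Seq()
src = torch.randint(0, 1000, (4, 12))                    # toy source batch
tgt = torch.randint(0, 1000, (4, 10))                    # toy target batch (teacher forcing)
logits = model(src, tgt[:, :-1])
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 1000), tgt[:, 1:].reshape(-1))
```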

Attention Model

  • Neural Machine Translation by Jointly Learning to Align and Translate (Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio)
  • Motivation to improve on the encoder-decoder approach:

    A potential issue with this encoder–decoder approach is that a neural network needs to be able to compress all the necessary information of a source sentence into a fixed-length vector. This may make it difficult for the neural network to cope with long sentences, especially those that are longer than the sentences in the training corpus.

  • Circumvention (see the attention sketch after the quote below):

    an extension to the encoder–decoder model which learns to align and translate jointly. Each time the proposed model generates a word in a translation, it (soft-)searches for a set of positions in a source sentence where the most relevant information is concentrated. The model then predicts a target word based on the context vectors associated with these source positions and all the previous generated target words.
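
A minimal sketch of that soft-search in PyTorch: additive (Bahdanau-style) attention scores every encoder position against the current decoder state, and the context vector is the attention-weighted sum of the encoder outputs. Layer names and dimensions here are assumptions for illustration, not the paper's exact parameterization.

```python
# Additive attention: score each source position against the decoder state,
# softmax over source positions, and take the weighted sum as the context.
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    def __init__(self, enc_dim=128, dec_dim=128, attn_dim=64):
        super().__init__()
        self.W_enc = nn.Linear(enc_dim, attn_dim, bias=False)
        self.W_dec = nn.Linear(dec_dim, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, dec_state, enc_outputs):
        # dec_state: (batch, dec_dim); enc_outputs: (batch, src_len, enc_dim)
        scores = self.v(torch.tanh(self.W_enc(enc_outputs)
                                   + self.W_dec(dec_state).unsqueeze(1)))  # (B, T, 1)
        alpha = torch.softmax(scores, dim=1)            # alignment weights over source
        context = (alpha * enc_outputs).sum(dim=1)      # (batch, enc_dim)
        return context, alpha.squeeze(-1)

attn = AdditiveAttention()
context, alpha = attn(torch.randn(4, 128), torch.randn(4, 7, 128))
```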
