- Neural Network with a recurrent attention model over a large external memory.
- Continuous form of the Memory Network, but with end-to-end training, so it can be applied to more domains.
- Can be seen as an extension of RNNSearch that performs multiple hops (computational steps) over the memory per output symbol.
- Link to the paper.
- Link to the implementation.
- The model takes as input x1, ..., xn (to be stored in memory) and a query q, and outputs an answer a.
- Each input xi is embedded into a D-dimensional space using embedding matrix A to obtain the memory vectors mi.
- Query is also embedded using matrix B to obtain internal state u.
- Compute the match between u and each memory mi in the embedding space, followed by a softmax, to obtain a probability vector p over the inputs (pi = Softmax(uTmi)).
- Each xi maps to an output vector ci (using embedding matrix C).
- Output o = sum over i of pici, i.e. the output vectors ci weighted by pi.
- The sum of the output vector o and the internal state u is passed through the weight matrix W, followed by a softmax, to produce the predicted answer.
- A, B, C and W are learnt by minimizing cross entropy loss.
- For layers above the first layer, input uk+1 = uk + ok.
- Each layer has its own Ak and Ck - with constraints.
- At the final layer, the predicted answer a = Softmax(W(oK + uK)) (the full forward pass is sketched below).
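A minimal numpy sketch of the forward pass described above, first for a single hop and then for K stacked hops with per-layer Ak and Ck. Matrix names follow the notes; the shapes, the bag-of-words inputs and the softmax helper are assumptions of this sketch, not the paper's reference implementation.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def memn2n_single_hop(x_bow, q_bow, A, B, C, W):
    """Single-hop forward pass.

    x_bow : (n, V) bag-of-words vectors for the n input sentences
    q_bow : (V,)   bag-of-words vector for the query
    A, C  : (V, d) input / output embedding matrices
    B     : (V, d) query embedding matrix
    W     : (d, V_ans) answer prediction matrix
    """
    m = x_bow @ A                 # memory vectors m_i
    c = x_bow @ C                 # output vectors c_i
    u = q_bow @ B                 # internal state u
    p = softmax(m @ u)            # match between u and each m_i
    o = p @ c                     # response: sum_i p_i * c_i
    return softmax((o + u) @ W)   # predicted answer distribution

def memn2n_k_hops(x_bow, q_bow, A_list, C_list, B, W):
    """K hops with per-layer A^k, C^k: u^{k+1} = u^k + o^k, answer = Softmax(W(o^K + u^K))."""
    u = q_bow @ B
    for A, C in zip(A_list, C_list):
        p = softmax((x_bow @ A) @ u)
        o = p @ (x_bow @ C)
        u = u + o                 # equals o^K + u^K after the last hop
    return softmax(u @ W)
```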
-
Adjacent
- Output embedding of one layer is the input embedding of the next, i.e. Ak+1 = Ck (see the sketch after this list).
- W^T = CK (the answer prediction matrix is tied to the final output embedding).
- B = A1
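A small sketch of how the adjacent constraints could be wired up, reusing the conventions of the sketch above; the shapes, initialisation and variable names are assumptions.

```python
import numpy as np

V, d, K = 50, 20, 3   # vocab size, embedding dim, number of hops (assumed values)

# Allocate K + 1 embedding matrices E^1 .. E^{K+1};
# layer k then uses A^k = E^k and C^k = E^{k+1}, so A^{k+1} = C^k holds by construction.
E = [0.1 * np.random.randn(V, d) for _ in range(K + 1)]

A_list = E[:-1]        # A^1 .. A^K
C_list = E[1:]         # C^1 .. C^K
B = A_list[0]          # B = A^1
W = C_list[-1].T       # prediction matrix tied to the final output embedding (W^T = C^K)
```

These A_list, C_list, B and W can be passed straight to the K-hop sketch above.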
-
Layer-wise (RNN-like)
- Same input and output embeddings across layers, i.e. A1 = A2 = ... = AK and C1 = C2 = ... = CK.
- A linear mapping H is added to the update of u between hops.
- uk+1 = Huk + ok.
- H is also learnt.
- Think of this as a traditional RNN with 2 outputs
- Internal output - used to attend over the memory.
- External output - the predicted result
- u becomes the hidden state.
- p is the internal output which, combined with C, is used to update the hidden state (see the sketch below).
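A sketch of the layer-wise variant under the same assumed shapes: one shared A and C, a learned d x d matrix H applied to the state between hops, and the final prediction still taken from oK + uK.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def memn2n_layerwise(x_bow, q_bow, A, C, B, W, H, hops=3):
    """Layer-wise tying: A and C shared across hops, u^{k+1} = H u^k + o^k."""
    m = x_bow @ A                  # memories are identical at every hop
    c = x_bow @ C
    u = q_bow @ B                  # hidden state of the "RNN"
    o = np.zeros_like(u)
    for k in range(hops):
        if k > 0:
            u = u @ H + o          # state update between hops (H is learned)
        p = softmax(m @ u)         # internal output: attention over the memory
        o = p @ c
    return softmax((o + u) @ W)    # external output: the predicted answer
```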
-
Related architectures
- RNN - Memory is stored only as the state of the network, which is unstable over long temporal contexts.
- LSTM - Locks in the network state using local memory cells but still fails over longer temporal contexts.
- Memory Networks - Uses global memory.
- Bidirectional RNN (attention model) - Uses a small neural network with a sophisticated gated architecture to select useful hidden states but, unlike MemN2N, performs only a single pass over the memory.
-
Bag-of-words representation
- Input sentences and questions are embedded as a bag of words.
- Cannot capture the order of the words.
-
Position Encoding
- Takes the order of the words within a sentence into account by reweighting each word embedding according to its position (both sentence representations are sketched below).
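A sketch of both sentence representations. The bag-of-words case simply sums the word embeddings; position encoding reweights each word embedding with the factor lkj = (1 - j/J) - (k/d)(1 - 2j/J) given in the paper (j over words, k over embedding dimensions). Function names and shapes are assumptions.

```python
import numpy as np

def position_encoding(J, d):
    """l[j, k] = (1 - j/J) - (k/d) * (1 - 2j/J), with 1-based j and k as in the paper."""
    l = np.zeros((J, d))
    for j in range(1, J + 1):
        for k in range(1, d + 1):
            l[j - 1, k - 1] = (1 - j / J) - (k / d) * (1 - 2 * j / J)
    return l

def embed_sentence(word_ids, A, use_position_encoding=True):
    """BoW: m_i = sum_j A x_ij.   PE: m_i = sum_j l_j * (A x_ij) (elementwise product)."""
    word_vecs = A[word_ids]                                   # (J, d): one embedding per word
    if use_position_encoding:
        word_vecs = word_vecs * position_encoding(len(word_ids), A.shape[1])
    return word_vecs.sum(axis=0)                              # word order matters only with PE
```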
-
Temporal Encoding
- Temporal information is encoded by a learned matrix TA, and the memory vectors are modified as mi = sum_j(Axij) + TA(i), where TA(i) is the i-th row of TA (see the sketch below).
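A sketch of the temporal-encoding step, assuming sentences are indexed by their position in time and TA has one learned row per memory slot (the paper uses an analogous TC on the output side).

```python
import numpy as np

def embed_memories(sentences, A, T_A):
    """m_i = sum_j A x_ij + T_A(i).

    sentences : list of word-id lists (index i encodes the sentence's position in time)
    A         : (V, d) word embedding matrix
    T_A       : (num_slots, d) learned temporal embedding, one row per memory slot
    """
    return np.stack([A[s].sum(axis=0) + T_A[i] for i, s in enumerate(sentences)])
```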
-
Random Noise
- Dummy (empty) memories are randomly added at training time to regularize TA (see the snippet below).
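A snippet of the random-noise trick: at training time a fraction of empty "dummy" sentences (10% in the paper) is inserted at random positions so that TA does not overfit to absolute positions. The function name is an assumption.

```python
import random

def add_random_noise(sentences, frac=0.1):
    """Insert empty dummy memories at random positions (training time only)."""
    noisy = list(sentences)
    for _ in range(int(frac * len(sentences))):
        noisy.insert(random.randrange(len(noisy) + 1), [])    # [] = empty sentence
    return noisy
```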
-
Linear Start (LS) training
- All softmax layers except the final one are removed at the start of training and reinserted once the validation loss stops decreasing (see the sketch below).
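A sketch of how the attention step changes under linear start: the softmax on the internal attention is simply skipped during the initial linear phase. The flag and its switching logic are assumptions.

```python
import numpy as np

def attention(m, u, linear_phase):
    """Raw inner products during linear start, ordinary softmax afterwards."""
    scores = m @ u
    if linear_phase:
        return scores                       # no softmax on the internal attention
    e = np.exp(scores - scores.max())
    return e / e.sum()

# A training loop would start with linear_phase=True and switch it to False
# (reinserting the softmax) the first time the validation loss stops decreasing.
```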
-
Results
- Best MemN2N models are close to supervised models in performance.
- Position Encoding improves over bag-of-words approach.
- Linear Start helps to avoid local minima.
- Random Noise gives a small yet consistent boost in performance.
- More computational hops leads to improved performance.
- For the Language Modelling task, some hops concentrate on recent words while other hops have a broader attention span over all memory locations.