- Neural Network with a recurrent attention model over a large external memory.
- Continuous form of the Memory Network, but with end-to-end training, so it can be applied to more domains.
- Can be seen as an extension of RNNSearch that performs multiple hops (computational steps) over the memory per output symbol.
- Link to the paper.
- Link to the implementation.
- The model takes as input x1, ..., xn (to be stored in memory) and a query q, and outputs an answer a.
- Each input xi is embedded into a D-dimensional space using embedding matrix A to obtain the memory vectors mi.
- Query is also embedded using matrix B to obtain internal state u.
- Compute the match between u and each memory mi in the embedding space, followed by a softmax, to obtain a probability vector p over the inputs (pi = Softmax(uTmi)).
- Each xi maps to an output vector ci (using embedding matrix C).
- Output o = sum over i of pici, i.e. the output vectors ci weighted by pi.
- The sum of the output vector o and the internal state u is passed through the weight matrix W, followed by a softmax, to produce the predicted answer.
- A, B, C and W are learnt by minimizing cross entropy loss.
- For layers above the first layer, input uk+1 = uk + ok.
- Each layer has its own Ak and Ck - with constraints.
- At the final layer, the predicted answer a = Softmax(W(oK + uK)) (the full forward pass is sketched below).
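A minimal numpy sketch of the forward pass described above, first for a single hop and then for K stacked hops with per-layer Ak and Ck. Matrix names follow the notes; the shapes, the bag-of-words inputs and the softmax helper are assumptions of this sketch, not the paper's reference implementation.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def memn2n_single_hop(x_bow, q_bow, A, B, C, W):
    """Single-hop forward pass.

    x_bow : (n, V) bag-of-words vectors for the n input sentences
    q_bow : (V,)   bag-of-words vector for the query
    A, C  : (V, d) input / output embedding matrices
    B     : (V, d) query embedding matrix
    W     : (d, V_ans) answer prediction matrix
    """
    m = x_bow @ A                 # memory vectors m_i
    c = x_bow @ C                 # output vectors c_i
    u = q_bow @ B                 # internal state u
    p = softmax(m @ u)            # match between u and each m_i
    o = p @ c                     # response: sum_i p_i * c_i
    return softmax((o + u) @ W)   # predicted answer distribution

def memn2n_k_hops(x_bow, q_bow, A_list, C_list, B, W):
    """K hops with per-layer A^k, C^k: u^{k+1} = u^k + o^k, answer = Softmax(W(o^K + u^K))."""
    u = q_bow @ B
    for A, C in zip(A_list, C_list):
        p = softmax((x_bow @ A) @ u)
        o = p @ (x_bow @ C)
        u = u + o                 # equals o^K + u^K after the last hop
    return softmax(u @ W)
```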
-
Adjacent
- Output embedding of one layer is the input embedding of the next, i.e. Ak+1 = Ck (see the sketch after this list).
- W^T = CK (the answer prediction matrix is tied to the final output embedding).
- B = A1
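A small sketch of how the adjacent constraints could be wired up, reusing the conventions of the sketch above; the shapes, initialisation and variable names are assumptions.

```python
import numpy as np

V, d, K = 50, 20, 3   # vocab size, embedding dim, number of hops (assumed values)

# Allocate K + 1 embedding matrices E^1 .. E^{K+1};
# layer k then uses A^k = E^k and C^k = E^{k+1}, so A^{k+1} = C^k holds by construction.
E = [0.1 * np.random.randn(V, d) for _ in range(K + 1)]

A_list = E[:-1]        # A^1 .. A^K
C_list = E[1:]         # C^1 .. C^K
B = A_list[0]          # B = A^1
W = C_list[-1].T       # prediction matrix tied to the final output embedding (W^T = C^K)
```

These A_list, C_list, B and W can be passed straight to the K-hop sketch above.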
-
Layer-wise (RNN-like)
- Same input and output embeddings across layers, i.e. A1 = A2 = ... = AK and C1 = C2 = ... = CK.
- A linear mapping H is added to the update of u between hops.
- uk+1 = Huk + ok.
- H is also learnt.
- Think of this as a traditional RNN with 2 outputs
- Internal output - used to attend over the memory.
- External output - the predicted result
- u becomes the hidden state.
- p is the internal output which, combined with C, is used to update the hidden state (see the sketch below).
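A sketch of the layer-wise variant under the same assumed shapes: one shared A and C, a learned d x d matrix H applied to the state between hops, and the final prediction still taken from oK + uK.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def memn2n_layerwise(x_bow, q_bow, A, C, B, W, H, hops=3):
    """Layer-wise tying: A and C shared across hops, u^{k+1} = H u^k + o^k."""
    m = x_bow @ A                  # memories are identical at every hop
    c = x_bow @ C
    u = q_bow @ B                  # hidden state of the "RNN"
    o = np.zeros_like(u)
    for k in range(hops):
        if k > 0:
            u = u @ H + o          # state update between hops (H is learned)
        p = softmax(m @ u)         # internal output: attention over the memory
        o = p @ c
    return softmax((o + u) @ W)    # external output: the predicted answer
```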
-
Related architectures
- RNN - Memory is stored only as the state of the network, which is unstable over long temporal contexts.
- LSTM - Locks in the network state using local memory cells but still fails over longer temporal contexts.
- Memory Networks - Uses global memory.
- Bidirectional RNN (attention model) - Uses a small neural network with a sophisticated gated architecture to select useful hidden states but, unlike MemN2N, performs only a single pass over the memory.
-
Bag-of-words representation
- Input sentences and questions are embedded as a bag of words.
- Cannot capture the order of the words.
-
Position Encoding
- Takes the order of the words within a sentence into account by reweighting each word embedding according to its position (both sentence representations are sketched below).
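A sketch of both sentence representations. The bag-of-words case simply sums the word embeddings; position encoding reweights each word embedding with the factor lkj = (1 - j/J) - (k/d)(1 - 2j/J) given in the paper (j over words, k over embedding dimensions). Function names and shapes are assumptions.

```python
import numpy as np

def position_encoding(J, d):
    """l[j, k] = (1 - j/J) - (k/d) * (1 - 2j/J), with 1-based j and k as in the paper."""
    l = np.zeros((J, d))
    for j in range(1, J + 1):
        for k in range(1, d + 1):
            l[j - 1, k - 1] = (1 - j / J) - (k / d) * (1 - 2 * j / J)
    return l

def embed_sentence(word_ids, A, use_position_encoding=True):
    """BoW: m_i = sum_j A x_ij.   PE: m_i = sum_j l_j * (A x_ij) (elementwise product)."""
    word_vecs = A[word_ids]                                   # (J, d): one embedding per word
    if use_position_encoding:
        word_vecs = word_vecs * position_encoding(len(word_ids), A.shape[1])
    return word_vecs.sum(axis=0)                              # word order matters only with PE
```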
-
Temporal Encoding
- Temporal information is encoded by a learned matrix TA, and the memory vectors are modified as mi = sum_j(Axij) + TA(i), where TA(i) is the i-th row of TA (see the sketch below).
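A sketch of the temporal-encoding step, assuming sentences are indexed by their position in time and TA has one learned row per memory slot (the paper uses an analogous TC on the output side).

```python
import numpy as np

def embed_memories(sentences, A, T_A):
    """m_i = sum_j A x_ij + T_A(i).

    sentences : list of word-id lists (index i encodes the sentence's position in time)
    A         : (V, d) word embedding matrix
    T_A       : (num_slots, d) learned temporal embedding, one row per memory slot
    """
    return np.stack([A[s].sum(axis=0) + T_A[i] for i, s in enumerate(sentences)])
```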
-
Random Noise
- Dummy (empty) memories are randomly added at training time to regularize TA (see the snippet below).
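A snippet of the random-noise trick: at training time a fraction of empty "dummy" sentences (10% in the paper) is inserted at random positions so that TA does not overfit to absolute positions. The function name is an assumption.

```python
import random

def add_random_noise(sentences, frac=0.1):
    """Insert empty dummy memories at random positions (training time only)."""
    noisy = list(sentences)
    for _ in range(int(frac * len(sentences))):
        noisy.insert(random.randrange(len(noisy) + 1), [])    # [] = empty sentence
    return noisy
```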
-
Linear Start (LS) training
- All softmax layers except the final one are removed at the start of training and reinserted once the validation loss stops decreasing (see the sketch below).
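A sketch of how the attention step changes under linear start: the softmax on the internal attention is simply skipped during the initial linear phase. The flag and its switching logic are assumptions.

```python
import numpy as np

def attention(m, u, linear_phase):
    """Raw inner products during linear start, ordinary softmax afterwards."""
    scores = m @ u
    if linear_phase:
        return scores                       # no softmax on the internal attention
    e = np.exp(scores - scores.max())
    return e / e.sum()

# A training loop would start with linear_phase=True and switch it to False
# (reinserting the softmax) the first time the validation loss stops decreasing.
```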
-
Results
- Best MemN2N models are close to supervised models in performance.
- Position Encoding improves over bag-of-words approach.
- Linear Start helps to avoid local minima.
- Random Noise gives a small yet consistent boost in performance.
- More computational hops leads to improved performance.
- For the Language Modelling task, some hops concentrate on recent words while other hops have a broader attention span over all memory locations.