- Memory Networks combine inference components with a long-term memory component.
- Used in the context of Question Answering (QA), with the memory component acting as a (dynamic) knowledge base.
- [Link to the paper](https://arxiv.org/abs/1410.3916).
- Knowledge Base approach
  - Use information retrieval techniques to build a knowledge base and perform inference over it.
  - In contrast, Memory Networks store memories as raw text or embedded vectors and perform extraction on the fly.
- Associative Memory Networks distribute memory across the whole network, while memory-based learning compartmentalizes the memory. In contrast, Memory Networks combine compartmentalized memory with neural network modules that learn how to read from and write to the memory.
- Neural Turing Machines (NTMs) perform sequence prediction using a read-writeable "large, addressable memory" and carry out sorting, copy and recall operations on it. However, they consider a very small memory (128 slots) compared to Memory Networks (14M sentences).
- RNNs and LSTMs can perform QA tasks, but they can remember only short-term dependencies.
- Memory m: an array of objects indexed by m_i.
- Four learnable components (a minimal sketch of how they fit together follows this list):
  - I - Input feature map
    - Converts the input into an internal feature representation using standard preprocessing techniques.
  - G - Generalization
    - Updates old memories given the new input x.
    - m_i = G(m_i, I(x), m) for all i.
    - The simplest form is to store I(x) in a slot of the memory, i.e. m_H(x) = I(x), where H(x) selects the slot.
  - O - Output feature map
    - Produces a new output in feature space, given the input x and the memory m.
    - Output features o = O(I(x), m).
  - R - Response
    - Converts the output into the desired response format.
    - r = R(o)
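A minimal sketch (not the authors' implementation) of how the four components fit together, assuming the simplest "store everything" G; `featurizer`, `score_fn` and `responder` are user-supplied placeholders, not names from the paper:

```python
class MemoryNetwork:
    """Sketch of the I/G/O/R pipeline with injected, user-supplied callables."""

    def __init__(self, featurizer, score_fn, responder):
        self.featurize = featurizer   # I: raw input -> internal feature representation
        self.score = score_fn         # scores a query (list of features) against one memory
        self.respond = responder      # R: output features -> final response
        self.memory = []              # m: array of stored objects m_i

    def I(self, x):
        return self.featurize(x)

    def G(self, ix):
        # Simplest generalization: write I(x) to the next free slot, m_H(x) = I(x).
        self.memory.append(ix)

    def O(self, ix, k=1):
        # Find k supporting memories; each hop conditions on what was already retrieved.
        query = [ix]
        for _ in range(k):
            best = max(range(len(self.memory)),
                       key=lambda i: self.score(query, self.memory[i]))
            query.append(self.memory[best])
        return query  # o = [I(x), m_o1, ..., m_ok]

    def R(self, o):
        return self.respond(o)

    def __call__(self, x, k=1):
        ix = self.I(x)
        self.G(ix)
        return self.R(self.O(ix, k=k))
```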
- The input x to I is a sentence, and it is stored in the next available memory slot.
- The O module produces output features by finding k supporting memories for the given input.
- For k = 1, the highest scoring supporting memory is o_1 = O_1(x, m) = argmax_{i = 1..N} s_O(x, m_i).
- For k = 2, the second supporting memory is o_2 = O_2(x, m) = argmax_{i = 1..N} s_O([x, m_o1], m_i).
- The final output of the O module (and input to R) is o = [x, m_o1, m_o2].
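A hedged sketch of the two-hop lookup above; `s_O` is assumed to accept a list of query features and a single memory vector (names are illustrative):

```python
def find_supporting_memories(s_O, x_feat, memories):
    """Two-hop lookup: o1 = argmax_i s_O([x], m_i); o2 = argmax_i s_O([x, m_o1], m_i)."""
    o1 = max(range(len(memories)), key=lambda i: s_O([x_feat], memories[i]))
    o2 = max(range(len(memories)), key=lambda i: s_O([x_feat, memories[o1]], memories[i]))
    return [x_feat, memories[o1], memories[o2]]  # o = [x, m_o1, m_o2], the input to R
```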
- R either returns m_ok or uses an RNN to perform true sentence generation.
- Alternatively, rank all the words seen by the model and return the highest scoring word.
- r = argmax_{w ∈ W} s_R([x, m_o1, m_o2], w), where W is the set of all words in the dictionary.
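A one-line sketch of the single-word response, assuming `s_R` scores the output features o against one candidate word (illustrative names):

```python
def single_word_response(s_R, o, vocabulary):
    """r = argmax over w in W of s_R([x, m_o1, m_o2], w)."""
    return max(vocabulary, key=lambda w: s_R(o, w))
```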
- Both scoring functions s_O and s_R take the form of an embedding model:
  - s(x, y) = φ_x(x)^T U^T U φ_y(y)
  - U: an n x D matrix
  - n: embedding dimension
  - D: number of features
  - φ_x(x), φ_y(y): map text to the D-dimensional feature space.
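A sketch of this bilinear scoring model in numpy; `phi_x` and `phi_y` stand in for the D-dimensional feature vectors (e.g. bags of words), and the matrix values below are random placeholders:

```python
import numpy as np

def make_scorer(U):
    """s(x, y) = phi_x(x)^T U^T U phi_y(y), with U an n x D embedding matrix."""
    def score(phi_x, phi_y):
        return float(np.dot(U @ phi_x, U @ phi_y))
    return score

# Toy usage: D features, n-dimensional embeddings, random data.
n, D = 32, 1000
rng = np.random.default_rng(0)
s = make_scorer(rng.normal(scale=0.1, size=(n, D)))
print(s(rng.random(D), rng.random(D)))
```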
- Training
  - Fully supervised: the best supporting memories are labelled in the training data.
  - Margin ranking loss over sampled negatives (a sketch of one ranking term follows after this list).
  - Stochastic Gradient Descent (SGD).
  - For the full loss function, refer to the paper.
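The note defers the exact objective to the paper; the snippet below is only a sketch of a single margin ranking term of the kind such an objective is built from (the default margin and the way negatives are gathered are assumptions here):

```python
import numpy as np

def margin_ranking_term(score_positive, scores_negative, gamma=0.1):
    """Sum over sampled negatives of max(0, gamma - s(positive) + s(negative))."""
    scores_negative = np.asarray(scores_negative, dtype=float)
    return float(np.sum(np.maximum(0.0, gamma - score_positive + scores_negative)))
```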
- If the input arrives as a stream of words (not already segmented into sentences), learn a segmentation function to find breakpoints:
  - seg(c) = W_seg^T U_S φ_seg(c)
  - W_seg: parameters of a linear classifier in the embedding space.
  - c: sequence of input words, represented as a bag of words using a separate dictionary.
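A sketch of the segmenter as a linear classifier in embedding space; declaring a breakpoint when the score clears a margin is one plausible use of the score, stated here as an assumption:

```python
import numpy as np

def seg_score(W_seg, U_S, phi_seg_c):
    """seg(c) = W_seg^T U_S phi_seg(c)."""
    return float(W_seg @ (U_S @ phi_seg_c))

def is_breakpoint(W_seg, U_S, phi_seg_c, margin=0.1):
    # Treat a sufficiently high segmenter score as the end of a segment.
    return seg_score(W_seg, U_S, phi_seg_c) > margin
```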
- To keep lookups tractable for very large memories, hash the input I(x) into buckets and only score memories in the same bucket(s).
  - Hashing via clustering word embeddings performs better than hashing individual words.
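A rough sketch of the cluster-based bucketing; `word_cluster` (word to cluster id, e.g. obtained by running k-means over the columns of U) and the set-based bucket layout are assumptions, not the paper's implementation:

```python
from collections import defaultdict

def build_buckets(memories, word_cluster):
    """Bucket memory indices by the embedding-cluster id of each word they contain."""
    buckets = defaultdict(set)
    for i, words in enumerate(memories):
        for w in words:
            if w in word_cluster:
                buckets[word_cluster[w]].add(i)
    return buckets

def candidate_memories(query_words, buckets, word_cluster):
    # Only memories sharing at least one cluster bucket with the input get scored.
    candidates = set()
    for w in query_words:
        if w in word_cluster:
            candidates |= buckets[word_cluster[w]]
    return candidates
```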
- Extend the model with features that encode when each memory was written, so that questions involving temporal context can be answered.
- For a previously unseen word, use the neighboring words to predict its embedding.
  - Use a "dropout"-like approach: d% of the time, pretend a known word is new and derive its embedding from its context.
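A sketch of this unseen-word "dropout" trick at training time; predicting the embedding as a plain average of the neighboring words' embeddings is a simplifying assumption:

```python
import numpy as np

def embed_with_unseen_word_dropout(words, embedding, d=0.3, window=2, seed=None):
    """Per-word vectors; with probability d a word is treated as unseen and its
    vector is predicted from its neighbors (here: their average)."""
    rng = np.random.default_rng(seed)
    dim = len(next(iter(embedding.values())))
    out = []
    for i, w in enumerate(words):
        if rng.random() < d:
            context = words[max(0, i - window):i] + words[i + 1:i + 1 + window]
            vecs = [embedding[c] for c in context if c in embedding]
            out.append(np.mean(vecs, axis=0) if vecs else np.zeros(dim))
        else:
            out.append(embedding[w])
    return out
```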
- Embedding models cannot efficiently exploit exact word matches because of the low embedding dimension n.
  - One fix: add "bag of words" matching scores to the learned embedding scores.
  - Alternatively, stay in the n-dimensional embedding space but extend the feature representation D with matching features that indicate whether a word occurs in both x and y.
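Sketches of the two fixes, assuming bag-of-words vectors `bow_x` and `bow_y`; the exact feature layout and the mixing weight are assumptions:

```python
import numpy as np

def score_with_bow_match(embedding_score, bow_x, bow_y, lam=1.0):
    """Fix 1: combined score = embedding score + lambda * exact-match (BoW overlap) score."""
    return embedding_score + lam * float(np.dot(bow_x, bow_y))

def phi_y_with_match_features(bow_y, bow_x):
    """Fix 2: extend phi_y with per-word indicators of words occurring in both x and y."""
    match = ((bow_y > 0) & (bow_x > 0)).astype(bow_y.dtype)
    return np.concatenate([bow_y, match])
```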
- RNNs perform poorly on tasks that require encoding long-term memory.
- LSTMs are an improvement over RNNs but are still inadequate.
- Memory Networks outperform both LSTMs and RNNs at long-term memory encoding.
- Time features and unseen-word modelling further enhance performance.
- Memory Networks are difficult to train via backpropagation (the hard, argmax-based memory lookups are not trained end-to-end).
- They require full supervision (including labelled supporting memories), which limits their applicability to new domains.