Notes for "Memory Networks" paper

Memory Networks

Introduction

  • Memory Networks combine inference components with a long-term memory component.
  • Used in the context of Question Answering (QA) with memory component acting as a (dynamic) knowledge base.
  • Link to the paper: https://arxiv.org/abs/1410.3916

Related Work

  • Knowledge Base approach

    • Use information retrieval techniques to build a knowledge base and perform inference over it.
    • In contrast, Memory Networks store memory as raw text or embedded vectors and perform extraction on the fly.
  • Associative memory networks distribute memory across the whole network, while memory-based learning compartmentalizes it. In contrast, Memory Networks combine compartmentalized memory with neural network modules that learn how to read from and write to the memory.

  • The Neural Turing Machine (NTM) performs sequence prediction using a read-writeable "large, addressable memory" on which it performs sorting, copying and recall operations. However, it uses a very small memory (128 slots) compared to Memory Networks (14M sentences).

  • RNNs and LSTMs can perform QA tasks, but they can only remember short-term dependencies.

Memory Networks

  • Memory m (an array of objects indexed by m_i).
  • Four learnable components (a structural sketch in code follows this list):
    • I - Input feature map
      • Converts input to internal feature representation using standard processing techniques.
    • G - Generalization
      • Updates old memories given new input x
      • m_i = G(m_i, I(x), m) for all i.
      • The simplest form is to store I(x) in a free slot of the memory, i.e. m_{H(x)} = I(x), where H(x) selects the slot.
    • O - Output feature
      • Produces new output features, given the input x and the memory m
      • Output features: o = O(I(x), m)
    • R - Response
      • Converts the output features into the desired response format
      • r = R(o)
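
The I/G/O/R decomposition above can be summarized structurally. Below is a minimal Python sketch (not from the paper): `featurize`, `score` and `respond` are hypothetical placeholders for the learned models, and only the k = 1 case of the output module is shown.

```python
# Minimal structural sketch of the I/G/O/R decomposition (k = 1 case).
# featurize, score and respond are hypothetical placeholders for learned
# models; this is not the paper's implementation.

class MemoryNetwork:
    def __init__(self, featurize, score, respond):
        self.memory = []            # m: array of stored objects m_i
        self.featurize = featurize  # I: input feature map
        self.score = score          # scoring function used by O
        self.respond = respond      # R: maps output features to a response

    def G(self, x_feat):
        # Simplest generalization: write I(x) into the next free slot,
        # i.e. m_{H(x)} = I(x) with H(x) = next available index.
        self.memory.append(x_feat)

    def O(self, x_feat):
        # Output module (k = 1): return the highest scoring supporting memory.
        return max(self.memory, key=lambda m_i: self.score(x_feat, m_i))

    def answer(self, x):
        x_feat = self.featurize(x)       # I
        self.G(x_feat)                   # G
        o = self.O(x_feat)               # O
        return self.respond(x_feat, o)   # R
```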

Memory Neural Networks (MemNN)

Basic Model

  • Input (x) to I is a sentence and is stored in the next available memory slot.

  • The O module produces output features by finding k supporting memories for the given input.

  • For k = 1, the highest scoring supporting memory is:

    o_1 = O_1(x, m) = argmax_i s_O(x, m_i)

  • For k = 2, the second supporting memory is scored with respect to both the original input and the first supporting memory:

    o_2 = O_2(x, m) = argmax_i s_O([x, m_{o_1}], m_i)

  • The final output of the O module (and the input to R) is o = [x, m_{o_1}, m_{o_2}].

  • R either returns m_{o_k} directly or uses an RNN to perform true sentence generation.

  • Alternatively, rank all the words seen by the model and return the highest-scoring word:

    r = argmax_{w in W} s_R([x, m_{o_1}, m_{o_2}], w), where W is the set of all words in the dictionary.

  • Both scoring functions s_O and s_R take the form of an embedding model (see the sketch after this list):

    s(x, y) = Φ_x(x)^T U^T U Φ_y(y)

    • U : n x D matrix
    • n : embedding dimension
    • D : number of features
    • Φ_x, Φ_y : map the original text to the D-dimensional feature space.
  • Training

    • Fully supervised.
    • Margin ranking loss.
    • Stochastic Gradient Descent.
    • For the exact loss function, refer to the paper.
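
A NumPy sketch of the embedding score s(x, y) and the k = 2 supporting-memory lookup described above. `phi` stands in for the bag-of-words feature maps Φ_x / Φ_y (assumed here to share one dictionary), and featurizing the pair [x, m_{o_1}] by summing bag-of-words vectors is an assumption rather than the paper's exact construction.

```python
import numpy as np

def score(U, phi_x, phi_y):
    # s(x, y) = Phi_x(x)^T U^T U Phi_y(y) = (U Phi_x(x)) . (U Phi_y(y))
    return (U @ phi_x) @ (U @ phi_y)

def find_supporting_memories(U, phi, x, memories):
    phi_mems = [phi(m) for m in memories]
    phi_x = phi(x)
    # k = 1: best supporting memory given the input x alone.
    o1 = max(range(len(memories)), key=lambda i: score(U, phi_x, phi_mems[i]))
    # k = 2: best supporting memory given [x, m_{o_1}]; the pair is
    # featurized by summing the bag-of-words vectors (an assumption).
    phi_pair = phi_x + phi_mems[o1]
    o2 = max(range(len(memories)), key=lambda i: score(U, phi_pair, phi_mems[i]))
    return o1, o2
```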

Word Sequences As Inputs

  • When the input arrives as a stream of words rather than pre-segmented sentences, learn a segmentation function to find the breakpoints (a sketch follows this list).
  • seg(c) = W_seg^T U_S Φ_seg(c)
    • W_seg : parameters of a linear classifier in the embedding space.
    • c : the sequence of input words, represented as a bag of words using a separate dictionary.
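
A sketch of applying the segmentation score to a word stream. The greedy thresholding loop and the zero margin are assumptions made for illustration, not the paper's exact procedure.

```python
import numpy as np

def seg_score(W_seg, U_S, phi_seg_c):
    # seg(c) = W_seg^T U_S Phi_seg(c); W_seg is an n-vector, U_S is n x D,
    # phi_seg_c is the bag-of-words vector of the current word buffer c.
    return W_seg @ (U_S @ phi_seg_c)

def segment_stream(words, phi_seg, W_seg, U_S, margin=0.0):
    # Greedily cut the stream whenever the segmenter fires (an assumption).
    segments, buffer = [], []
    for w in words:
        buffer.append(w)
        if seg_score(W_seg, U_S, phi_seg(buffer)) > margin:
            segments.append(" ".join(buffer))
            buffer = []
    if buffer:
        segments.append(" ".join(buffer))
    return segments
```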

Efficient Memory via Hashing

  • Hash the input I(x) into buckets and only score memories in the same bucket.
  • Hashing via clustering word embeddings performs better than simply hashing individual words (see the sketch below).
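
A sketch of memory hashing via word-embedding clusters, where only memories that share a bucket with the input are scored. The use of scikit-learn's KMeans, the tokenizer, and the exact bucket layout are assumptions, not the paper's setup.

```python
from collections import defaultdict

import numpy as np
from sklearn.cluster import KMeans

def build_buckets(word_vectors, memories, tokenize, n_clusters=100):
    # Cluster the word embeddings; each cluster index is one hash bucket
    # (the number of clusters here is arbitrary).
    words = list(word_vectors)
    km = KMeans(n_clusters=n_clusters, n_init=10)
    labels = km.fit_predict(np.stack([word_vectors[w] for w in words]))
    word_to_bucket = dict(zip(words, labels))
    # A memory goes into every bucket that one of its words belongs to.
    buckets = defaultdict(set)
    for idx, mem in enumerate(memories):
        for w in tokenize(mem):
            if w in word_to_bucket:
                buckets[word_to_bucket[w]].add(idx)
    return word_to_bucket, buckets

def candidate_memories(query, tokenize, word_to_bucket, buckets):
    # At query time, only memories sharing a bucket with the input are scored.
    candidates = set()
    for w in tokenize(query):
        if w in word_to_bucket:
            candidates |= buckets[word_to_bucket[w]]
    return candidates
```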

Modelling Write Time

  • Extend the model to take into account when each memory was written, so that questions involving temporal context can be answered.

Modelling Previously Unseen Words

  • In the case of a new word, use the neighboring words to predict the embedding of the new word.
  • Use a dropout-like approach: d% of the time, pretend the word is new and derive its representation from its context alone (see the sketch below).
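
A sketch of the dropout-like trick: with probability d, a known word is treated as unseen and represented only by its neighbouring words. The window size and the feature representation here are assumptions for illustration.

```python
import random

def context_words(words, i, window=2):
    # Neighbouring words that stand in for an unknown word at position i.
    return words[max(0, i - window):i] + words[i + 1:i + 1 + window]

def features_with_new_word_dropout(words, d=0.1):
    # With probability d, pretend the word is new and represent it only by
    # its context, so the model learns to handle genuinely unseen words.
    features = []
    for i, w in enumerate(words):
        if random.random() < d:
            features.append(("CONTEXT", context_words(words, i)))
        else:
            features.append(("WORD", w))
    return features
```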

Exact Matches And Unseen Words

  • Embedding models cannot efficiently exploit exact word matches because of their low dimensionality.
  • Add "bag of words" matching scores to the learned embedding scores (see the sketch after this list).
  • Alternatively, stay in the n-dimensional embedding space but extend the feature representation D with matching features that indicate whether a word occurs in both x and y.
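
A sketch of the first option above: mixing the learned embedding score with a raw bag-of-words match count. The mixing weight `lam` is an assumed hyperparameter, not a value from the paper.

```python
import numpy as np

def combined_score(U, phi_x, phi_y, lam=1.0):
    # Learned embedding score: Phi_x(x)^T U^T U Phi_y(y).
    embedding_score = (U @ phi_x) @ (U @ phi_y)
    # Raw "bag of words" matching score: counts words shared by x and y.
    exact_match_score = phi_x @ phi_y
    return embedding_score + lam * exact_match_score
```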

Observations

  • RNNs perform poorly on tasks that require encoding long term memory.
  • LSTMs are an improvement over RNNs but are still inadequate.
  • Memory Networks outperform both LSTMs and RNNs in long-term memory encoding.
  • The write-time feature and unseen-word modelling further enhance performance.

Limitations

  • Difficult to train end-to-end via backpropagation, since memories are selected with a hard argmax.
  • They require full supervision (the supporting memories are labelled during training), which limits their applicability to new domains.