- Knowledge Bases (KBs) are effective tools for Question Answering (QA) but are often too restrictive (due to their fixed schemas) and too sparse (due to the limitations of Information Extraction (IE) systems).
- The paper proposes Key-Value Memory Networks, a neural network architecture based on Memory Networks that can leverage both KBs and raw data for QA.
- The paper also introduces MOVIEQA, a new QA dataset that can be answered using a perfect KB, using Wikipedia pages, or using an imperfect KB obtained via IE techniques, thereby allowing a comparison between systems using any of the three sources.
- [Link to the paper](https://arxiv.org/abs/1606.03126).
- TREC QA and WikiQA are two benchmarks where, given a question and a list of candidate sentences, the system must select the sentence containing the correct answer.
- These datasets are small and make it difficult to compare systems that use different knowledge sources.
- The best results on these benchmarks are reported by CNNs and RNNs with attention mechanisms.
- Extension of Memory Networks Model.
- Generalises the way context is stored in memory.
- Comprises a memory made of slots in the form of vector pairs (k1, v1), ..., (km, vm) that encode long-term and short-term context.
- Key Hashing - The question x is used to preselect a small subset of memories (kh1, vh1), ..., (khN, vhN) whose keys share at least one word with x, ignoring words with frequency above 1000 (i.e., stop words).
- Key Addressing - Each candidate memory is assigned a relevance probability:
- p_hi = softmax(A φX(x) · A φK(k_hi))
- φ is a feature map of dimension D and A is a d×D matrix. (A runnable sketch of the hashing/addressing/reading loop appears after this model description.)
- Value Reading - The values of the selected memories are read by taking their weighted sum, using the addressing probabilities, and a vector o is returned:
- o = Σ_i p_hi · A φV(v_hi)
- The memory access process is conducted by a "controller" neural network using q = A φX(x) as the query.
- The query is updated between memory accesses using:
- q2 = R1(q + o), where R1 is a d×d matrix.
- The addressing and reading steps are repeated, using a new matrix Ri at each hop, to retrieve more pertinent information in subsequent accesses.
- After a fixed number of hops H, the resulting state of the controller is used to compute a final prediction:
- a = argmax_i softmax(q_{H+1}^T B φY(yi)), where the yi are the possible candidate outputs and B is a d×D matrix.
- The network is trained end-to-end with a cross-entropy loss, using backpropagation and stochastic gradient descent.
- End-to-End Memory Networks can be viewed as a special case of Key-Value Memory Networks by setting key and value to be the same for all the memories.
- φX and φY, the feature maps corresponding to the query and the answer, are fixed as bag-of-words representations.
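A minimal numpy sketch of one forward pass, tying the steps above together (key hashing, H hops of addressing and reading, final prediction). The toy vocabulary, the random parameters, and all helper names are my own illustrative assumptions, not the paper's code:

```python
# Illustrative Key-Value Memory Network forward pass; parameters A, B
# and the R matrices are random here, but are learned end-to-end in
# the real model.
import numpy as np

rng = np.random.default_rng(0)

VOCAB = ["who", "directed", "blade", "runner", "directed_by",
         "starred", "ridley", "scott", "harrison", "ford"]
D = len(VOCAB)  # bag-of-words feature dimension
d = 8           # embedding dimension
H = 2           # number of hops

A = rng.normal(scale=0.1, size=(d, D))                       # d x D
B = rng.normal(scale=0.1, size=(d, D))                       # d x D
R = [rng.normal(scale=0.1, size=(d, d)) for _ in range(H)]   # d x d per hop

def phi(words):
    """Bag-of-words feature map of dimension D."""
    v = np.zeros(D)
    for w in words:
        v[VOCAB.index(w)] += 1
    return v

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Memory slots as (key words, value words) pairs, e.g. from KB triples.
memory = [
    (["blade", "runner", "directed_by"], ["ridley", "scott"]),
    (["blade", "runner", "starred"], ["harrison", "ford"]),
]

question = ["who", "directed", "blade", "runner"]
candidates = [["ridley", "scott"], ["harrison", "ford"]]

# Key hashing: keep only memories whose key shares a word with the
# question (the paper additionally drops words with frequency > 1000).
selected = [(k, v) for k, v in memory if set(k) & set(question)]

q = A @ phi(question)  # controller state, q = A phiX(x)
for j in range(H):
    # Key addressing: relevance probability for each selected memory.
    p = softmax(np.array([q @ (A @ phi(k)) for k, _ in selected]))
    # Value reading: weighted sum of the value embeddings.
    o = sum(p[i] * (A @ phi(v)) for i, (_, v) in enumerate(selected))
    q = R[j] @ (q + o)  # query update, q_{j+1} = R_j(q_j + o)

# Final prediction over the candidate answers.
scores = softmax(np.array([q @ (B @ phi(y)) for y in candidates]))
print("predicted answer:", candidates[int(np.argmax(scores))])
```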
- Triplets of the form "subject relation object" can be represented in Key-Value Memory Networks with the subject and relation as the key and the object as the value.
- In standard Memory Networks, the whole triplet would have to be embedded in the same memory slot.
- The reversed relations ("object is_related_to subject") can also be stored, so the relation can be queried from either direction (see the sketch below).
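As a small illustration of this layout (the triple_to_slots helper and the "!" prefix marking reversed relations are assumed conventions for the example, not necessarily the paper's):

```python
# Hypothetical helper: one KB triple becomes two key-value slots,
# "subject relation" -> object, plus the reversed direction.
def triple_to_slots(subject, relation, obj):
    fwd = (subject.split() + [relation], obj.split())
    rev = (obj.split() + ["!" + relation], subject.split())
    return [fwd, rev]

print(triple_to_slots("Blade Runner", "directed_by", "Ridley Scott"))
# [(['Blade', 'Runner', 'directed_by'], ['Ridley', 'Scott']),
#  (['Ridley', 'Scott', '!directed_by'], ['Blade', 'Runner'])]
```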
- A document can be split into sentences, with each sentence encoded as a bag-of-words in both the key and the value of a memory slot.
- Split the document into windows of W words and represent each window as a bag-of-words.
- The window becomes the key and the central word becomes the value.
- Instead of mixing the window centre with the rest of the words, the size of the dictionary can be doubled, encoding the centre of the window and the value with the second dictionary.
- Since the title of a document can contain useful information, the word window can also be encoded as the key and the document title as the value.
- The key can be augmented with features like "window" or "title" to distinguish between the different cases (see the sketch below).
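A sketch of the sentence-level and window-level encodings above; the helper names, the "_c" second-dictionary suffix, and the "_window_"/"_title_" marker tokens are my own stand-ins:

```python
def sentences_to_slots(sentences):
    # Sentence level: key and value are the same bag-of-words, which
    # makes the model equivalent to a standard Memory Network.
    return [(s.split(), s.split()) for s in sentences]

def document_to_slots(words, title, W=5, centre_encoding=True,
                      title_as_value=False):
    # Window level: each window of W words is a key; the centre word
    # is the value. With centre encoding, the centre word (in both the
    # key and the value) is drawn from a doubled dictionary ("_c").
    slots, half = [], W // 2
    for i in range(half, len(words) - half):
        key = words[i - half:i + half + 1]
        centre = key[half]
        if centre_encoding:
            centre = centre + "_c"
            key = key[:half] + [centre] + key[half + 1:]
        if title_as_value:
            slots.append((key + ["_title_"], title))
        else:
            slots.append((key + ["_window_"], [centre]))
    return slots

doc = "blade runner is a film directed by ridley scott".split()
for key, value in document_to_slots(doc, title=["blade", "runner"], W=3):
    print(key, "->", value)
```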
- Doc - Raw documents (from Wikipedia) related to movies.
- KB - Graph based KB made of entities and relations.
- IE - A KB obtained by performing Information Extraction on Wikipedia.
- The QA pairs should be answerable from both the raw documents and the KB, so that the three approaches can be compared and the gap between the three solutions can be closed.
- The dataset has more than 100,000 QA pairs, making it much larger than most existing datasets.
- Methods compared on MOVIEQA:
- Bordes et al.'s QA system
- Supervised Embeddings (without a KB)
- Memory Networks
- Key-Value Memory Networks
- Key-Value Memory Networks outperform all other methods on all data sources.
- Performance by knowledge source: KB > Doc > IE.
- The best memory representation for directly reading documents uses "Window Level + Centre Encoding + Title".
- Given KB triplets, construct synthetic "Wikipedia" articles using templates, conjunctions, and coreferences to determine the causes of the performance gap between using the KB and using documents (a toy example follows this list).
- The loss on one-template sentences is due to the difficulty of extracting the subject, relation, and object from the artificial documents.
- Using multiple templates does not deteriorate performance much, but conjunctions and coreferences cause a dip in performance.
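A toy version of the synthetic-article construction (the templates and the crude pronoun substitution are invented for illustration; the paper's actual templates are not reproduced here):

```python
import random

random.seed(0)

# Invented paraphrase templates for a fact (subject, relation, object).
TEMPLATES = ["{s} {r} {o}", "The {r} of {s} is {o}"]

def render(triples, multi_template=False, conjunction=False, coref=False):
    facts = []
    for i, (s, r, o) in enumerate(triples):
        if coref and i > 0 and s == triples[0][0]:
            s = "It"  # crude coreference: later subject mentions -> pronoun
        tmpl = random.choice(TEMPLATES) if multi_template else TEMPLATES[0]
        facts.append(tmpl.format(s=s, r=r.replace("_", " "), o=o))
    if conjunction:
        return ", and ".join(facts) + "."  # several facts in one sentence
    return ". ".join(facts) + "."

triples = [("Blade Runner", "directed_by", "Ridley Scott"),
           ("Blade Runner", "release_year", "1982")]
print(render(triples))                                # single template
print(render(triples, multi_template=True))           # template paraphrases
print(render(triples, conjunction=True, coref=True))  # harder settings
```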
- On WikiQA, given a question, the task is to select the sentence (from a Wikipedia document) that best answers it.
- Key-Value Memory Networks outperform all other solutions, though they are only marginally better than LDC (Sentence Similarity Learning by Lexical Decomposition and Composition) and attentive models based on CNNs and RNNs.
At the sentence level, what plays the role of the key and what plays the role of the value? In that case, what is the difference between Key-Value Memory Networks and traditional Memory Networks?