- Machine Comprehension (MC) - given natural language sentences (a story or passage), answer a natural language question about them.
- End-To-End MC - cannot use linguistic resources like dependency parsers. The only supervision during training is the correct answer.
- Query Regression Network (QRN) - Variant of Recurrent Neural Network (RNN).
- Link to the paper
- Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) are popular choices for modelling sequential data but perform poorly on end-to-end MC because they struggle with long-term dependencies.
- Attention Models with shared external memory focus on a single sentence in each layer but tend to be insensitive to the time step of the sentence being accessed.
- Memory Networks (and MemN2N)
- Add time-dependent variable to the sentence representation.
- Summarize the memory in each layer to control attention in the next layer.
- Dynamic Memory Networks (and DMN+)
- Combine RNN and attention mechanism to incorporate time dependency.
- Uses 2 GRUs:
- time-axis GRU - Summarizes the memory in each layer.
- layer-axis GRU - Controls the attention in each layer.
- QRN is a much simpler model without any memory summarization node.
- Single recurrent unit that updates its internal state through time and layers.
- Inputs
- q_t - local query vector
- x_t - sentence vector
- Outputs
- h_t - reduced query vector
- x_t - the sentence vector, passed on unmodified (it becomes an input to the next layer)
- Equations
- z_t = α(x_t, q_t)
- α is the update gate function that measures the relevance between the input sentence and the local query.
- h̃_t = γ(x_t, q_t)
- γ is the regression function that transforms the local query into the regressed (candidate) query.
- h_t = z_t * h̃_t + (1 - z_t) * h_{t-1} (a minimal sketch of this unit follows below)
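A minimal NumPy sketch of a single QRN unit with a scalar update gate. The exact parameterizations of α and γ below (a sigmoid and a tanh over the concatenated sentence and query vectors) are assumptions for illustration; the paper defines the precise forms.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

class QRNUnit:
    """Sketch of a single QRN unit (scalar update gate).

    Assumed parameterization (illustrative, not necessarily the paper's exact one):
        z_t  = sigmoid(w_z . [x_t; q_t] + b_z)      # update gate: sentence/query relevance
        h~_t = tanh(W_h [x_t; q_t] + b_h)           # regressed (candidate) query
        h_t  = z_t * h~_t + (1 - z_t) * h_{t-1}     # reduced query
    """

    def __init__(self, d, rng):
        self.w_z = rng.normal(scale=0.1, size=2 * d)
        self.b_z = 0.0
        self.W_h = rng.normal(scale=0.1, size=(d, 2 * d))
        self.b_h = np.zeros(d)

    def step(self, x_t, q_t, h_prev):
        xq = np.concatenate([x_t, q_t])
        z_t = sigmoid(self.w_z @ xq + self.b_z)      # relevance of this sentence to the query
        h_tilde = np.tanh(self.W_h @ xq + self.b_h)  # candidate reduced query
        return z_t * h_tilde + (1.0 - z_t) * h_prev  # convex combination with previous state

# Usage: run the unit over a toy story of T sentence vectors with a single question
rng = np.random.default_rng(0)
d, T = 8, 5
unit = QRNUnit(d, rng)
X = rng.normal(size=(T, d))                 # sentence vectors
Q = np.tile(rng.normal(size=d), (T, 1))     # in the first layer, q_t is the question for every t
h = np.zeros(d)
for t in range(T):
    h = unit.step(X[t], Q[t], h)
print(h.shape)  # (8,)
```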
- To create a multi-layer model, the output of the current layer becomes the input to the next layer.
- A reset gate function (r_t), inspired by the GRU, can reset or nullify the regressed query h̃_t.
- The update then becomes h_t = z_t * r_t * h̃_t + (1 - z_t) * h_{t-1} (see the snippet below).
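Continuing the sketch above (reusing `numpy` and `sigmoid`), the reset gate only changes the state update; assuming r_t is computed like z_t with its own weights:

```python
def step_with_reset(w_z, b_z, w_r, b_r, W_h, b_h, x_t, q_t, h_prev):
    xq = np.concatenate([x_t, q_t])
    z_t = sigmoid(w_z @ xq + b_z)            # update gate
    r_t = sigmoid(w_r @ xq + b_r)            # reset gate: can nullify the regressed query
    h_tilde = np.tanh(W_h @ xq + b_h)        # regressed query
    return z_t * r_t * h_tilde + (1.0 - z_t) * h_prev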
- Vector gates - update and reset gate functions can produce vectors instead of scalar values (for finer control).
- Bidirectional - QRN can look at both past and future sentences while regressing the queries.
- q_t^{k+1} = h_t^{k, forward} + h_t^{k, backward}
- The parameters of the update and regression functions are shared between the two directions (see the stacking sketch below).
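A sketch of bidirectional stacking under the assumptions of the unit above: within layer k the unit runs forward and backward over time (with shared parameters, per the point above), and the summed reduced queries become the query input of layer k+1. `qrn_layer` and `stack_bidirectional` are illustrative names.

```python
def qrn_layer(unit, X, Q, reverse=False):
    """Run one QRN unit over all time steps; returns the reduced queries H (T x d)."""
    T, d = X.shape
    order = range(T - 1, -1, -1) if reverse else range(T)
    H = np.zeros((T, d))
    h = np.zeros(d)
    for t in order:
        h = unit.step(X[t], Q[t], h)
        H[t] = h
    return H

def stack_bidirectional(units, X, question, n_layers=2):
    """Stack layers with q_t^{k+1} = h_t^{k, forward} + h_t^{k, backward}."""
    Q = np.tile(question, (X.shape[0], 1))                # layer-1 queries are the question itself
    for k in range(n_layers):
        H_fwd = qrn_layer(units[k], X, Q, reverse=False)
        H_bwd = qrn_layer(units[k], X, Q, reverse=True)   # same unit: shared parameters
        Q = H_fwd + H_bwd                                 # query input of the next layer
    return Q[-1]   # one plausible readout: final reduced query at the last time step
```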
- Unlike most RNN-based models, the recurrent updates in QRN can be computed in parallel across time, because z_t and h̃_t depend only on x_t and q_t and not on the previous hidden state.
- For details and equations, refer to the paper; a small check of the idea follows below.
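Concretely, with h_0 = 0 the recurrence h_t = z_t * h̃_t + (1 - z_t) * h_{t-1} unrolls to h_t = Σ_{j ≤ t} z_j * h̃_j * Π_{j < i ≤ t} (1 - z_i), so every h_t can be obtained from products of the gates rather than a sequential scan over hidden states. A small numerical check of that identity (illustrative, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 4
z = rng.uniform(size=T)             # scalar update gates z_1..z_T (depend only on inputs)
h_tilde = rng.normal(size=(T, d))   # regressed queries (depend only on inputs)

# Sequential recurrence
h_seq, h = np.zeros((T, d)), np.zeros(d)
for t in range(T):
    h = z[t] * h_tilde[t] + (1.0 - z[t]) * h
    h_seq[t] = h

# Unrolled form: h_t = sum_{j<=t} z_j * h~_j * prod_{j<i<=t} (1 - z_i)
decay = np.zeros((T, T))
for t in range(T):
    for j in range(t + 1):
        decay[t, j] = np.prod(1.0 - z[j + 1:t + 1])
h_par = decay @ (z[:, None] * h_tilde)

assert np.allclose(h_seq, h_par)
print("unrolled form matches the sequential recurrence")
```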
- A trainable embedding matrix A is used to encode the one-hot vector of each word in the input sentence into a d-dimensional vector.
- A Position Encoder is used to obtain the sentence representation from these d-dimensional word vectors.
- Question vectors are obtained in the same manner (see the encoder sketch below).
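A sketch of this input module, assuming the position encoding of MemN2N (Sukhbaatar et al.), l_{kj} = (1 - j/J) - (k/d)(1 - 2j/J); the word ids, vocabulary size, and variable names are illustrative.

```python
import numpy as np

def position_encoding(J, d):
    """l[k, j] = (1 - j/J) - (k/d) * (1 - 2j/J), with 1-based j (word) and k (dimension)."""
    k = np.arange(1, d + 1)[:, None]   # (d, 1)
    j = np.arange(1, J + 1)[None, :]   # (1, J)
    return (1 - j / J) - (k / d) * (1 - 2 * j / J)   # (d, J)

def encode_sentence(word_ids, A):
    """Sum of position-weighted word embeddings; A is the trainable (d x V) embedding matrix."""
    d, V = A.shape
    E = A[:, word_ids]                            # (d, J): one-hot lookup of each word
    L = position_encoding(len(word_ids), d)
    return (L * E).sum(axis=1)                    # (d,) sentence (or question) vector

# Usage: encode a toy sentence and a question with the same module
rng = np.random.default_rng(0)
V, d = 30, 8
A = rng.normal(scale=0.1, size=(d, V))
x_t = encode_sentence([3, 17, 5, 9], A)   # sentence vector
q = encode_sentence([12, 4, 2], A)        # question vector, obtained the same way
print(x_t.shape, q.shape)                 # (8,) (8,)
```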
- A V-way single-layer softmax classifier is used to map the predicted answer vector y to a V-dimensional vector v of scores over the vocabulary.
- The natural language answer is the argmax word in v (see the sketch below).
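A sketch of this answer module with toy shapes; the weight matrix and vocabulary here are illustrative.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def predict_answer(y, W, vocab):
    """V-way single-layer softmax classifier: v = softmax(W y); the answer is the argmax word."""
    v = softmax(W @ y)                       # (V,) scores over the vocabulary
    return vocab[int(np.argmax(v))], v

# Usage
rng = np.random.default_rng(0)
d, V = 8, 5
vocab = ["garden", "kitchen", "office", "bathroom", "hallway"]
W = rng.normal(scale=0.1, size=(V, d))       # classifier weights
y = rng.normal(size=d)                       # predicted answer vector from the last QRN layer
answer, v = predict_answer(y, W, vocab)
print(answer, v.round(3))
```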
- Experiments use the bAbI QA dataset (1K and 10K training settings).
- QRN with the '2rb' model (2 layers + reset gate + bidirectional) on the 1K dataset and the '2rvb' model (2 layers + reset gate + vector gate + bidirectional) on the 10K dataset outperforms the corresponding MemN2N 1K and 10K models.
- Though DMN+ outperforms QRN by a small margin, QRN is simpler and faster to train (the paper comments on training speed without reporting the training times of the two models).
- With very few layers, the model lacks reasoning ability, while with too many layers it becomes difficult to train.
- Using vector gates helps on large datasets but hurts on small ones.
- Unidirectional models perform poorly.
- The intermediate query updates can be interpreted in natural language to understand the flow of information in the network.