- Machine Comprehension (MC) - given natural language sentences (a story or passage), answer a natural language question about them.
- End-To-End MC - cannot use linguistic resources like dependency parsers. The only supervision during training is the correct answer.
- Query Regression Network (QRN) - Variant of Recurrent Neural Network (RNN).
- Link to the paper
- Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) are popular choices for modelling sequential data but perform poorly on end-to-end MC because they struggle with long-term dependencies.
- Attention Models with shared external memory focus on a single sentence in each layer but tend to be insensitive to the time step of the sentence being accessed.
- Memory Networks (and MemN2N)
- Add time-dependent variable to the sentence representation.
- Summarize the memory in each layer to control attention in the next layer.
- Dynamic Memory Networks (and DMN+)
- Combine RNN and attention mechanism to incorporate time dependency.
- Uses 2 GRUs:
- time-axis GRU - Summarizes the memory in each layer.
- layer-axis GRU - Controls the attention in each layer.
- QRN is a much simpler model without any memory summarization node.
- Single recurrent unit that updates its internal state through time and layers.
- Inputs
- q_t - local query vector
- x_t - sentence vector
- Outputs
- h_t - reduced query vector
- x_t - the sentence vector, passed on unmodified (it becomes an input to the next layer)
- Equations
- z_t = α(x_t, q_t)
- α is the update gate function that measures the relevance between the input sentence and the local query.
- h̃_t = γ(x_t, q_t)
- γ is the regression function that transforms the local query into the regressed (candidate) query.
- h_t = z_t * h̃_t + (1 - z_t) * h_{t-1} (a minimal sketch of this unit follows below)
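A minimal NumPy sketch of a single QRN unit with a scalar update gate. The exact parameterizations of α and γ below (a sigmoid and a tanh over the concatenated sentence and query vectors) are assumptions for illustration; the paper defines the precise forms.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

class QRNUnit:
    """Sketch of a single QRN unit (scalar update gate).

    Assumed parameterization (illustrative, not necessarily the paper's exact one):
        z_t  = sigmoid(w_z . [x_t; q_t] + b_z)      # update gate: sentence/query relevance
        h~_t = tanh(W_h [x_t; q_t] + b_h)           # regressed (candidate) query
        h_t  = z_t * h~_t + (1 - z_t) * h_{t-1}     # reduced query
    """

    def __init__(self, d, rng):
        self.w_z = rng.normal(scale=0.1, size=2 * d)
        self.b_z = 0.0
        self.W_h = rng.normal(scale=0.1, size=(d, 2 * d))
        self.b_h = np.zeros(d)

    def step(self, x_t, q_t, h_prev):
        xq = np.concatenate([x_t, q_t])
        z_t = sigmoid(self.w_z @ xq + self.b_z)      # relevance of this sentence to the query
        h_tilde = np.tanh(self.W_h @ xq + self.b_h)  # candidate reduced query
        return z_t * h_tilde + (1.0 - z_t) * h_prev  # convex combination with previous state

# Usage: run the unit over a toy story of T sentence vectors with a single question
rng = np.random.default_rng(0)
d, T = 8, 5
unit = QRNUnit(d, rng)
X = rng.normal(size=(T, d))                 # sentence vectors
Q = np.tile(rng.normal(size=d), (T, 1))     # in the first layer, q_t is the question for every t
h = np.zeros(d)
for t in range(T):
    h = unit.step(X[t], Q[t], h)
print(h.shape)  # (8,)
```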
- To create a multi-layer model, the output of the current layer becomes the input to the next layer.
- A reset gate function (r_t), inspired by the GRU, can reset or nullify the regressed query h̃_t.
- The update then becomes h_t = z_t * r_t * h̃_t + (1 - z_t) * h_{t-1} (see the snippet below).
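Continuing the sketch above (reusing `numpy` and `sigmoid`), the reset gate only changes the state update; assuming r_t is computed like z_t with its own weights:

```python
def step_with_reset(w_z, b_z, w_r, b_r, W_h, b_h, x_t, q_t, h_prev):
    xq = np.concatenate([x_t, q_t])
    z_t = sigmoid(w_z @ xq + b_z)            # update gate
    r_t = sigmoid(w_r @ xq + b_r)            # reset gate: can nullify the regressed query
    h_tilde = np.tanh(W_h @ xq + b_h)        # regressed query
    return z_t * r_t * h_tilde + (1.0 - z_t) * h_prev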
- Vector gates - update and reset gate functions can produce vectors instead of scalar values (for finer control).
- Bidirectional - QRN can look at both past and future sentences while regressing the queries.
- q_t^{k+1} = h_t^{k, forward} + h_t^{k, backward}
- The parameters of the update and regression functions are shared between the two directions (see the stacking sketch below).
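A sketch of bidirectional stacking under the assumptions of the unit above: within layer k the unit runs forward and backward over time (with shared parameters, per the point above), and the summed reduced queries become the query input of layer k+1. `qrn_layer` and `stack_bidirectional` are illustrative names.

```python
def qrn_layer(unit, X, Q, reverse=False):
    """Run one QRN unit over all time steps; returns the reduced queries H (T x d)."""
    T, d = X.shape
    order = range(T - 1, -1, -1) if reverse else range(T)
    H = np.zeros((T, d))
    h = np.zeros(d)
    for t in order:
        h = unit.step(X[t], Q[t], h)
        H[t] = h
    return H

def stack_bidirectional(units, X, question, n_layers=2):
    """Stack layers with q_t^{k+1} = h_t^{k, forward} + h_t^{k, backward}."""
    Q = np.tile(question, (X.shape[0], 1))                # layer-1 queries are the question itself
    for k in range(n_layers):
        H_fwd = qrn_layer(units[k], X, Q, reverse=False)
        H_bwd = qrn_layer(units[k], X, Q, reverse=True)   # same unit: shared parameters
        Q = H_fwd + H_bwd                                 # query input of the next layer
    return Q[-1]   # one plausible readout: final reduced query at the last time step
```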
- Unlike most RNN-based models, the recurrent updates in QRN can be computed in parallel across time, because z_t and h̃_t depend only on x_t and q_t and not on the previous hidden state.
- For details and equations, refer to the paper; a small check of the idea follows below.
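Concretely, with h_0 = 0 the recurrence h_t = z_t * h̃_t + (1 - z_t) * h_{t-1} unrolls to h_t = Σ_{j ≤ t} z_j * h̃_j * Π_{j < i ≤ t} (1 - z_i), so every h_t can be obtained from products of the gates rather than a sequential scan over hidden states. A small numerical check of that identity (illustrative, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 4
z = rng.uniform(size=T)             # scalar update gates z_1..z_T (depend only on inputs)
h_tilde = rng.normal(size=(T, d))   # regressed queries (depend only on inputs)

# Sequential recurrence
h_seq, h = np.zeros((T, d)), np.zeros(d)
for t in range(T):
    h = z[t] * h_tilde[t] + (1.0 - z[t]) * h
    h_seq[t] = h

# Unrolled form: h_t = sum_{j<=t} z_j * h~_j * prod_{j<i<=t} (1 - z_i)
decay = np.zeros((T, T))
for t in range(T):
    for j in range(t + 1):
        decay[t, j] = np.prod(1.0 - z[j + 1:t + 1])
h_par = decay @ (z[:, None] * h_tilde)

assert np.allclose(h_seq, h_par)
print("unrolled form matches the sequential recurrence")
```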
- A trainable embedding matrix A is used to encode the one-hot vector of each word in the input sentence into a d-dimensional vector.
- A Position Encoder is used to obtain the sentence representation from these d-dimensional word vectors.
- Question vectors are obtained in the same manner (see the encoder sketch below).
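A sketch of this input module, assuming the position encoding of MemN2N (Sukhbaatar et al.), l_{kj} = (1 - j/J) - (k/d)(1 - 2j/J); the word ids, vocabulary size, and variable names are illustrative.

```python
import numpy as np

def position_encoding(J, d):
    """l[k, j] = (1 - j/J) - (k/d) * (1 - 2j/J), with 1-based j (word) and k (dimension)."""
    k = np.arange(1, d + 1)[:, None]   # (d, 1)
    j = np.arange(1, J + 1)[None, :]   # (1, J)
    return (1 - j / J) - (k / d) * (1 - 2 * j / J)   # (d, J)

def encode_sentence(word_ids, A):
    """Sum of position-weighted word embeddings; A is the trainable (d x V) embedding matrix."""
    d, V = A.shape
    E = A[:, word_ids]                            # (d, J): one-hot lookup of each word
    L = position_encoding(len(word_ids), d)
    return (L * E).sum(axis=1)                    # (d,) sentence (or question) vector

# Usage: encode a toy sentence and a question with the same module
rng = np.random.default_rng(0)
V, d = 30, 8
A = rng.normal(scale=0.1, size=(d, V))
x_t = encode_sentence([3, 17, 5, 9], A)   # sentence vector
q = encode_sentence([12, 4, 2], A)        # question vector, obtained the same way
print(x_t.shape, q.shape)                 # (8,) (8,)
```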
- A V-way single-layer softmax classifier is used to map the predicted answer vector y to a V-dimensional vector v of scores over the vocabulary.
- The natural language answer is the argmax word in v (see the sketch below).
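A sketch of this answer module with toy shapes; the weight matrix and vocabulary here are illustrative.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def predict_answer(y, W, vocab):
    """V-way single-layer softmax classifier: v = softmax(W y); the answer is the argmax word."""
    v = softmax(W @ y)                       # (V,) scores over the vocabulary
    return vocab[int(np.argmax(v))], v

# Usage
rng = np.random.default_rng(0)
d, V = 8, 5
vocab = ["garden", "kitchen", "office", "bathroom", "hallway"]
W = rng.normal(scale=0.1, size=(V, d))       # classifier weights
y = rng.normal(size=d)                       # predicted answer vector from the last QRN layer
answer, v = predict_answer(y, W, vocab)
print(answer, v.round(3))
```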
- Experiments use the bAbI QA dataset (1K and 10K training settings).
- QRN with the '2rb' model (2 layers + reset gate + bidirectional) on the 1K dataset and the '2rvb' model (2 layers + reset gate + vector gate + bidirectional) on the 10K dataset outperforms the corresponding MemN2N 1K and 10K models.
- Though DMN+ outperforms QRN by a small margin, QRN is simpler and faster to train (the paper comments on training speed without reporting the training times of the two models).
- With very few layers, the model lacks reasoning ability, while with too many layers it becomes difficult to train.
- Using vector gates helps on large datasets but hurts on small ones.
- Unidirectional models perform poorly.
- The intermediate query updates can be interpreted in natural language to understand the flow of information in the network.