Bahdanau Attention is often called Additive Attention because of the mathematical formulation used to compute the attention scores. In contrast to Dot-Product (Multiplicative) Attention, Bahdanau Attention relies on addition and a non-linear activation function.
Let's go through the math step-by-step:
- ( h_i ): Hidden state of the encoder for the (i)-th time step in the source sequence.
- ( s_t ): Hidden state of the decoder for the (t)-th time step in the target sequence.
- ( W_1 ) and ( W_2 ): Weight matrices.
- ( b ): Bias term.
- ( v ): Learned weight vector that maps the combined, transformed hidden states to a scalar score.
- ( c_t ): Context vector for the (t)-th decoder step.
- ( \text{score}(h_i, s_t) ): Attention score for (h_i) and (s_t).
First, both ( h_i ) and ( s_t ) are linearly transformed using the weight matrices ( W_1 ) and ( W_2 ), giving the projected vectors ( W_1 h_i ) and ( W_2 s_t ).
The attention score between ( h_i ) and ( s_t ) is calculated from these projected vectors:

( \text{score}(h_i, s_t) = v^\top \tanh(W_1 h_i + W_2 s_t + b) )

Here, ( \tanh ) is the non-linear activation, and ( v ) projects the result down to a single scalar score.
The scores are then normalized using a softmax function to get the attention weights ( \alpha ):

( \alpha_{t,i} = \dfrac{\exp(\text{score}(h_i, s_t))}{\sum_{j} \exp(\text{score}(h_j, s_t))} )
The context vector ( c_t ) is computed as the weighted sum of the encoder hidden states:

( c_t = \sum_{i} \alpha_{t,i} h_i )
This context vector is then used in the decoder's calculations.
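To make the steps concrete, here is a minimal NumPy sketch of the computation for a single decoder step. The variable names, dimensions, and random initialization are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array of scores.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def additive_attention(H, s_t, W1, W2, b, v):
    """Bahdanau (additive) attention for one decoder step.

    H   : (T_src, d_h)  encoder hidden states h_i
    s_t : (d_s,)        current decoder hidden state
    W1  : (d_a, d_h)    projects the encoder states
    W2  : (d_a, d_s)    projects the decoder state
    b   : (d_a,)        bias term
    v   : (d_a,)        maps the tanh output to a scalar score
    """
    # score(h_i, s_t) = v^T tanh(W1 h_i + W2 s_t + b), computed for all i at once
    scores = np.tanh(H @ W1.T + W2 @ s_t + b) @ v  # (T_src,)
    alpha = softmax(scores)                        # attention weights over source positions
    c_t = alpha @ H                                # context vector: sum_i alpha_i * h_i
    return c_t, alpha

# Toy example with assumed sizes
rng = np.random.default_rng(0)
T_src, d_h, d_s, d_a = 5, 8, 8, 16
H   = rng.normal(size=(T_src, d_h))
s_t = rng.normal(size=d_s)
W1  = rng.normal(size=(d_a, d_h))
W2  = rng.normal(size=(d_a, d_s))
b   = np.zeros(d_a)
v   = rng.normal(size=d_a)

c_t, alpha = additive_attention(H, s_t, W1, W2, b, v)
print(alpha.sum())  # ~1.0, the weights form a distribution over source positions
print(c_t.shape)    # (8,)
```

In practice ( W_1 ), ( W_2 ), ( b ), and ( v ) are learned jointly with the rest of the model; the random values above only stand in for trained parameters.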
To summarize, the key "additive" part is the scoring mechanism, where the transformed hidden states of the encoder and decoder are added together before the non-linearity. This differs from Multiplicative (Dot-Product) Attention, where the score is the dot product of the two hidden states.
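As a side-by-side illustration of that difference, the short sketch below computes both kinds of scores for the same pair of vectors. It assumes ( h_i ) and ( s_t ) share the same dimensionality so that the plain dot product is defined; the sizes and random values are again placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
h_i, s_t = rng.normal(size=d), rng.normal(size=d)
W1, W2 = rng.normal(size=(16, d)), rng.normal(size=(16, d))
b, v = np.zeros(16), rng.normal(size=16)

# Additive (Bahdanau): add the projected states, apply tanh, project to a scalar
additive_score = v @ np.tanh(W1 @ h_i + W2 @ s_t + b)

# Multiplicative (dot-product): take the dot product of the states directly
dot_product_score = h_i @ s_t

print(additive_score, dot_product_score)
```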