Bahdanau Attention is often called Additive Attention because of the mathematical formulation used to compute the attention scores. In contrast to Dot-Product (Multiplicative) Attention, Bahdanau Attention relies on addition and a non-linear activation function.
Let's go through the math step-by-step:
- ( h_i ): Hidden state of the encoder for the (i)-th time step in the source sequence.
- ( s_t ): Hidden state of the decoder for the (t)-th time step in the target sequence.
- ( W_1 ) and ( W_2 ): Weight matrices.
- ( b ): Bias term.
- ( v ): Learned weight vector that maps the combined, transformed hidden states to a scalar score.
- ( c_t ): Context vector for the (t)-th decoder step.
- ( \text{score}(h_i, s_t) ): Attention score for (h_i) and (s_t).
First, both ( h_i ) and ( s_t ) are linearly transformed using the weight matrices ( W_1 ) and ( W_2 ), giving the projected vectors ( W_1 h_i ) and ( W_2 s_t ).
The attention score between ( h_i ) and ( s_t ) is calculated from these projected vectors:

( \text{score}(h_i, s_t) = v^\top \tanh(W_1 h_i + W_2 s_t + b) )

Here, ( \tanh ) is the non-linear activation, and ( v ) projects the result down to a single scalar score.
The scores are then normalized using a softmax function to get the attention weights ( \alpha ):

( \alpha_{t,i} = \dfrac{\exp(\text{score}(h_i, s_t))}{\sum_{j} \exp(\text{score}(h_j, s_t))} )
The context vector ( c_t ) is computed as the weighted sum of the encoder hidden states:

( c_t = \sum_{i} \alpha_{t,i} h_i )
This context vector is then used in the decoder's calculations.
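To make the steps concrete, here is a minimal NumPy sketch of the computation for a single decoder step. The variable names, dimensions, and random initialization are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array of scores.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def additive_attention(H, s_t, W1, W2, b, v):
    """Bahdanau (additive) attention for one decoder step.

    H   : (T_src, d_h)  encoder hidden states h_i
    s_t : (d_s,)        current decoder hidden state
    W1  : (d_a, d_h)    projects the encoder states
    W2  : (d_a, d_s)    projects the decoder state
    b   : (d_a,)        bias term
    v   : (d_a,)        maps the tanh output to a scalar score
    """
    # score(h_i, s_t) = v^T tanh(W1 h_i + W2 s_t + b), computed for all i at once
    scores = np.tanh(H @ W1.T + W2 @ s_t + b) @ v  # (T_src,)
    alpha = softmax(scores)                        # attention weights over source positions
    c_t = alpha @ H                                # context vector: sum_i alpha_i * h_i
    return c_t, alpha

# Toy example with assumed sizes
rng = np.random.default_rng(0)
T_src, d_h, d_s, d_a = 5, 8, 8, 16
H   = rng.normal(size=(T_src, d_h))
s_t = rng.normal(size=d_s)
W1  = rng.normal(size=(d_a, d_h))
W2  = rng.normal(size=(d_a, d_s))
b   = np.zeros(d_a)
v   = rng.normal(size=d_a)

c_t, alpha = additive_attention(H, s_t, W1, W2, b, v)
print(alpha.sum())  # ~1.0, the weights form a distribution over source positions
print(c_t.shape)    # (8,)
```

In practice ( W_1 ), ( W_2 ), ( b ), and ( v ) are learned jointly with the rest of the model; the random values above only stand in for trained parameters.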
To summarize, the key "additive" part is the scoring mechanism, where the transformed hidden states of the encoder and decoder are added together before the non-linearity. This differs from Multiplicative (Dot-Product) Attention, where the score is the dot product of the two hidden states.
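As a side-by-side illustration of that difference, the short sketch below computes both kinds of scores for the same pair of vectors. It assumes ( h_i ) and ( s_t ) share the same dimensionality so that the plain dot product is defined; the sizes and random values are again placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
h_i, s_t = rng.normal(size=d), rng.normal(size=d)
W1, W2 = rng.normal(size=(16, d)), rng.normal(size=(16, d))
b, v = np.zeros(16), rng.normal(size=16)

# Additive (Bahdanau): add the projected states, apply tanh, project to a scalar
additive_score = v @ np.tanh(W1 @ h_i + W2 @ s_t + b)

# Multiplicative (dot-product): take the dot product of the states directly
dot_product_score = h_i @ s_t

print(additive_score, dot_product_score)
```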