Bahdanau Attention is often called Additive Attention because of how its attention scores are computed: projections of the encoder and decoder hidden states are added together and passed through a non-linear activation. Dot-Product (Multiplicative) Attention, in contrast, scores each pair of states with a simple inner product.
Let's go through the math step-by-step:
- \( h_i \): Hidden state of the encoder for the \( i \)-th time step in the source sequence.
- \( s_t \): Hidden state of the decoder for the \( t \)-th time step in the target sequence.
- \( W_1 \) and \( W_2 \): Weight matrices.
- \( b \): Bias term.
- \( v \): Learned weight vector that maps the activated sum to a scalar attention score. (The context vector, by contrast, is the attention-weighted sum of the encoder states, computed below.)
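Putting these symbols together, the additive score for decoder step \( t \) and source position \( i \) takes the form introduced by Bahdanau et al. (assuming the common convention that \( W_1 \) projects the decoder state and \( W_2 \) the encoder state):

$$
\text{score}(s_t, h_i) = v^\top \tanh\left(W_1 s_t + W_2 h_i + b\right)
$$

The scores are normalized with a softmax over the source positions to give the attention weights, and the context vector for step \( t \) is the weighted sum of the encoder states (here \( n \) is the source sequence length):

$$
\alpha_{t,i} = \frac{\exp\left(\text{score}(s_t, h_i)\right)}{\sum_{j=1}^{n} \exp\left(\text{score}(s_t, h_j)\right)},
\qquad
c_t = \sum_{i=1}^{n} \alpha_{t,i}\, h_i
$$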
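To make the shapes concrete, here is a minimal NumPy sketch of a single attention step. The dimensions, variable names, and random initialization are illustrative assumptions, not part of the original formulation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not from the text).
enc_dim, dec_dim, attn_dim = 8, 6, 5   # encoder/decoder hidden sizes, attention size
src_len = 4                            # number of source time steps

# Learned parameters of the alignment model (randomly initialized here).
W1 = rng.normal(size=(attn_dim, dec_dim))  # projects the decoder state s_t
W2 = rng.normal(size=(attn_dim, enc_dim))  # projects each encoder state h_i
b = rng.normal(size=(attn_dim,))           # bias term
v = rng.normal(size=(attn_dim,))           # weight vector mapping to a scalar score

# Toy hidden states.
H = rng.normal(size=(src_len, enc_dim))    # encoder states h_1 ... h_n, one per row
s_t = rng.normal(size=(dec_dim,))          # current decoder state

def additive_attention(s_t, H):
    """Return attention weights and the context vector for one decoder step."""
    # score(s_t, h_i) = v^T tanh(W1 s_t + W2 h_i + b), computed for all i at once.
    scores = np.tanh(W1 @ s_t + H @ W2.T + b) @ v   # shape: (src_len,)
    # Softmax over the source positions gives the attention weights alpha_{t,i}.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Context vector: attention-weighted sum of the encoder states.
    context = weights @ H                           # shape: (enc_dim,)
    return weights, context

weights, context = additive_attention(s_t, H)
print("attention weights:", np.round(weights, 3))   # sums to 1 over source steps
print("context vector shape:", context.shape)
```

One design consequence worth noting: because the decoder and encoder states each get their own projection (\( W_1 \) and \( W_2 \)), additive attention works even when the two hidden sizes differ, which a plain dot product would not allow.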