Comparison of 2014 Attention RNN vs. 2017 Transformer
2014: Attention with RNNs (Bahdanau et al.)
Architecture: RNN Encoder + Attention + RNN Decoder
Input sequence → [ RNN Encoder ] → Hidden states
        ↓
  [ Attention Layer ] → Context vector
        ↓
Output sequence ← [ RNN Decoder ]
Encoder: A bidirectional RNN (e.g., a Bi-LSTM) processes the input sequence and produces a sequence of hidden states.
Attention: At each decoder step, a weighted sum over the encoder hidden states is computed, producing the context vector.
Decoder: A unidirectional RNN generates output tokens one at a time, conditioned on the context vector.
Additive attention (Bahdanau attention)
Uses learned weight matrices and a tanh non-linearity to score each encoder state against the current decoder state (see the sketch below).
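To make that scoring concrete, here is a minimal NumPy sketch of additive attention for a single decoder step. The matrix names (W_a, U_a, v_a) and the toy dimensions are illustrative assumptions, not the exact parameterization of the 2014 model.

```python
# Minimal sketch of additive (Bahdanau) attention for one decoder step.
# Sizes and parameter names below are assumed for illustration.
import numpy as np

rng = np.random.default_rng(0)

T, enc_dim, dec_dim, attn_dim = 6, 8, 8, 16   # assumed toy sizes
H = rng.normal(size=(T, enc_dim))             # encoder hidden states h_1..h_T
s_prev = rng.normal(size=(dec_dim,))          # previous decoder state s_{t-1}

W_a = rng.normal(size=(attn_dim, dec_dim))    # projects the decoder state
U_a = rng.normal(size=(attn_dim, enc_dim))    # projects each encoder state
v_a = rng.normal(size=(attn_dim,))            # maps the tanh output to a scalar score

# Additive score: e_i = v_a^T tanh(W_a s_{t-1} + U_a h_i)
scores = np.tanh(s_prev @ W_a.T + H @ U_a.T) @ v_a   # shape (T,)

# Softmax over source positions gives the attention weights alpha_i
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()

# Context vector: weighted sum of encoder hidden states, fed to the decoder RNN
context = alpha @ H                                    # shape (enc_dim,)
print(alpha.round(3), context.shape)
```

The decoder then combines this context vector with its own state and the previously generated token to predict the next output token.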
2017: Transformer (Vaswani et al.)
Architecture: Self-Attention Everywhere
Input sequence
      ↓
[ Multi-Head Self-Attention ]
      ↓
[ Feedforward Network ]
      ↓
(Repeated N times)
      ↓
Encoded output → [ Decoder Block with Self-Attention + Cross-Attention ]
      ↓
Output sequence
No RNNs or CNNs; the model is built entirely from attention and position-wise feedforward layers.
Self-attention allows each token to attend to all others.
Cross-attention in the decoder accesses encoder outputs.
Positional encodings provide order information.
Fully parallelizable across sequence positions during training (no recurrence).
Scaled dot-product attention (see the sketch after this list).
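As a rough illustration of the ingredients listed above, here is a minimal NumPy sketch of single-head scaled dot-product self-attention together with sinusoidal positional encodings. The toy sizes and the single-head, unmasked setup are simplifying assumptions; the actual Transformer uses multiple heads with per-head projections, decoder masking, residual connections, and layer normalization.

```python
# Minimal sketch of scaled dot-product self-attention plus sinusoidal
# positional encodings. Toy sizes below are assumed for illustration.
import numpy as np

rng = np.random.default_rng(1)

T, d_model = 5, 16                      # assumed toy sequence length / model width
X = rng.normal(size=(T, d_model))       # token representations for one sequence

# Learned projections to queries, keys, values (self-attention: all come from X)
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v

# Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
scores = Q @ K.T / np.sqrt(d_model)     # (T, T): every token attends to every token
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ V                       # (T, d_model)

# Sinusoidal positional encodings supply the order information that attention
# itself ignores; in the Transformer they are added to the token embeddings.
pos = np.arange(T)[:, None]
i = np.arange(d_model // 2)[None, :]
angles = pos / (10000 ** (2 * i / d_model))
pe = np.zeros((T, d_model))
pe[:, 0::2] = np.sin(angles)            # even dimensions: sine
pe[:, 1::2] = np.cos(angles)            # odd dimensions: cosine
X_with_positions = X + pe
print(out.shape, X_with_positions.shape)
```

Because the softmax weights for all positions are computed from one matrix product, every token's output can be computed at once, which is what makes training parallelizable compared with stepping an RNN decoder token by token.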
Summary Comparison
| Feature              | 2014 Attention RNN               | 2017 Transformer                    |
|----------------------|----------------------------------|-------------------------------------|
| Core Architecture    | RNN encoder-decoder              | Fully attention-based (no RNNs)     |
| Attention Use        | On top of encoder hidden states  | Used throughout (self + cross)      |
| Computation          | Sequential (decoder is slow)     | Fully parallelizable                |
| Performance          | Good on small datasets/tasks     | State-of-the-art for large models   |
| Attention Mechanism  | Additive                         | Scaled dot-product                  |
| Positional Awareness | Implicit (via RNN order)         | Explicit (via positional encodings) |