Comparison of 2014 Attention RNN vs. 2017 Transformer

πŸ” 2014: Attention with RNNs (Bahdanau et al.)

🧠 Architecture: RNN Encoder + Attention + RNN Decoder

Input sequence → [ RNN Encoder ] → Hidden states
                                        ↘
                               [ Attention Layer ] ⇨ Context vector
                                                    ⇓
                        Output sequence ← [ RNN Decoder ]

🧩 Key Features:

  • Encoder: A bidirectional RNN (e.g., Bi-LSTM) processes the input sequence and produces a sequence of hidden states.
  • Attention: At each decoder step, a weighted sum over encoder hidden states is computed, yielding the context vector.
  • Decoder: A unidirectional RNN generates output tokens one at a time.

🔧 Attention Type:

  • Additive attention (Bahdanau attention)
  • Uses learned weights and non-linear activations
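
For concreteness, here is a minimal NumPy sketch of one decoder step of additive attention. The parameter names `W_s`, `W_h`, and `v` stand for the learned weights of the scoring MLP; the shapes, names, and random initialization are illustrative assumptions, not the configuration from the original paper.

```python
# Hedged sketch of additive (Bahdanau) attention for a single decoder step.
# W_s, W_h, v are stand-ins for the learned scoring parameters (illustrative only).
import numpy as np

def additive_attention(decoder_state, encoder_states, W_s, W_h, v):
    """decoder_state: (d,), encoder_states: (T, d) -> context (d,), weights (T,)."""
    # Score each encoder hidden state against the current decoder state
    scores = np.tanh(decoder_state @ W_s + encoder_states @ W_h) @ v  # (T,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()               # softmax over source positions
    context = weights @ encoder_states     # weighted sum = the context vector
    return context, weights

rng = np.random.default_rng(0)
d, T = 8, 5                                # toy dimensions
context, weights = additive_attention(
    rng.normal(size=d),                    # previous decoder hidden state
    rng.normal(size=(T, d)),               # encoder hidden states h_1..h_T
    rng.normal(size=(d, d)),               # W_s
    rng.normal(size=(d, d)),               # W_h
    rng.normal(size=d),                    # v
)
```

The softmax turns the scores into weights that sum to 1, and the context vector is exactly the weighted sum described in the bullets above; the decoder then conditions its next output token on it.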

πŸ” 2017: Transformer (Vaswani et al.)

🧠 Architecture: Self-Attention Everywhere

Input sequence
      ⇓
[ Multi-Head Self-Attention ]
      ⇓
[ Feedforward Network ]
      ⇓
(Repeated N times)
      ⇓
Encoded output → [ Decoder Block with Self-Attention + Cross-Attention ]
                                       ⇓
                                Output sequence

🧩 Key Features:

  • No RNNs or CNNs: entirely based on attention.
  • Self-attention allows each token to attend to all others.
  • Cross-attention in the decoder accesses encoder outputs.
  • Positional encodings provide order information (see the sketch after this list).
  • Fully parallelizable.
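
As noted in the list above, here is a short sketch of the sinusoidal positional encodings from the Transformer paper; the function and argument names are illustrative choices, not taken from any particular library.

```python
# Sketch of sinusoidal positional encodings (sin on even dims, cos on odd dims).
# Function and argument names are illustrative; d_model is assumed even.
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) matrix that is added to the token embeddings."""
    positions = np.arange(seq_len)[:, None]                    # (seq_len, 1)
    dims = np.arange(d_model // 2)[None, :]                    # (1, d_model/2)
    angles = positions / np.power(10000, 2 * dims / d_model)   # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                               # even indices: sine
    pe[:, 1::2] = np.cos(angles)                               # odd indices: cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=10, d_model=16)
```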

🔧 Attention Type:

  • Scaled dot-product attention
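
A minimal NumPy sketch of single-head scaled dot-product attention; passing the same matrix as queries, keys, and values gives self-attention. Shapes and variable names are illustrative.

```python
# Sketch of scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (T_q, d_k), K: (T_k, d_k), V: (T_k, d_v) -> (T_q, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V                                 # weighted sum of values

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 16))                           # 6 tokens, d_model = 16
out = scaled_dot_product_attention(x, x, x)            # self-attention: Q = K = V
```

Multi-head attention runs several of these in parallel on learned linear projections of Q, K, and V and concatenates the results.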

⚖️ Summary Comparison

| Feature | 2014 Attention RNN | 2017 Transformer |
| --- | --- | --- |
| Core Architecture | RNN encoder-decoder | Fully attention-based (no RNNs) |
| Attention Use | On top of encoder hidden states | Used throughout (self + cross) |
| Computation | Sequential (decoder is slow) | Fully parallelizable |
| Performance | Good on small datasets/tasks | State-of-the-art for large models |
| Attention Mechanism | Additive | Scaled dot-product |
| Positional Awareness | Implicit (via RNN order) | Explicit (via positional encodings) |