Comparison of 2014 Attention RNN vs. 2017 Transformer

πŸ” 2014: Attention with RNNs (Bahdanau et al.)

🧠 Architecture: RNN Encoder + Attention + RNN Decoder

Input sequence → [ RNN Encoder ] → Hidden states
                                        ↘
                               [ Attention Layer ] ⇨ Context vector
                                                    ⇓
                        Output sequence ← [ RNN Decoder ]

🧩 Key Features:

  • Encoder: A bidirectional RNN (e.g., Bi-LSTM) processes the input sequence and produces a sequence of hidden states.
  • Attention: At each decoder step, a weighted sum over encoder hidden states is computed, yielding the context vector.
  • Decoder: A unidirectional RNN generates output tokens one at a time.

🔧 Attention Type:

  • Additive attention (Bahdanau attention)
  • Uses learned weights and non-linear activations
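
For concreteness, here is a minimal NumPy sketch of one decoder step of additive attention. The parameter names `W_s`, `W_h`, and `v` stand for the learned weights of the scoring MLP; the shapes, names, and random initialization are illustrative assumptions, not the configuration from the original paper.

```python
# Hedged sketch of additive (Bahdanau) attention for a single decoder step.
# W_s, W_h, v are stand-ins for the learned scoring parameters (illustrative only).
import numpy as np

def additive_attention(decoder_state, encoder_states, W_s, W_h, v):
    """decoder_state: (d,), encoder_states: (T, d) -> context (d,), weights (T,)."""
    # Score each encoder hidden state against the current decoder state
    scores = np.tanh(decoder_state @ W_s + encoder_states @ W_h) @ v  # (T,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()               # softmax over source positions
    context = weights @ encoder_states     # weighted sum = the context vector
    return context, weights

rng = np.random.default_rng(0)
d, T = 8, 5                                # toy dimensions
context, weights = additive_attention(
    rng.normal(size=d),                    # previous decoder hidden state
    rng.normal(size=(T, d)),               # encoder hidden states h_1..h_T
    rng.normal(size=(d, d)),               # W_s
    rng.normal(size=(d, d)),               # W_h
    rng.normal(size=d),                    # v
)
```

The softmax turns the scores into weights that sum to 1, and the context vector is exactly the weighted sum described in the bullets above; the decoder then conditions its next output token on it.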

πŸ” 2017: Transformer (Vaswani et al.)

🧠 Architecture: Self-Attention Everywhere

Input sequence
      ⇓
[ Multi-Head Self-Attention ]
      ⇓
[ Feedforward Network ]
      ⇓
(Repeated N times)
      ⇓
Encoded output → [ Decoder Block with Self-Attention + Cross-Attention ]
                                       ⇓
                                Output sequence

🧩 Key Features:

  • No RNNs or CNNs: entirely based on attention.
  • Self-attention allows each token to attend to all others.
  • Cross-attention in the decoder accesses encoder outputs.
  • Positional encodings provide order information (see the sketch after this list).
  • Fully parallelizable.
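
As noted in the list above, here is a short sketch of the sinusoidal positional encodings from the Transformer paper; the function and argument names are illustrative choices, not taken from any particular library.

```python
# Sketch of sinusoidal positional encodings (sin on even dims, cos on odd dims).
# Function and argument names are illustrative; d_model is assumed even.
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) matrix that is added to the token embeddings."""
    positions = np.arange(seq_len)[:, None]                    # (seq_len, 1)
    dims = np.arange(d_model // 2)[None, :]                    # (1, d_model/2)
    angles = positions / np.power(10000, 2 * dims / d_model)   # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                               # even indices: sine
    pe[:, 1::2] = np.cos(angles)                               # odd indices: cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=10, d_model=16)
```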

🔧 Attention Type:

  • Scaled dot-product attention
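
A minimal NumPy sketch of single-head scaled dot-product attention; passing the same matrix as queries, keys, and values gives self-attention. Shapes and variable names are illustrative.

```python
# Sketch of scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (T_q, d_k), K: (T_k, d_k), V: (T_k, d_v) -> (T_q, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V                                 # weighted sum of values

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 16))                           # 6 tokens, d_model = 16
out = scaled_dot_product_attention(x, x, x)            # self-attention: Q = K = V
```

Multi-head attention runs several of these in parallel on learned linear projections of Q, K, and V and concatenates the results.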

⚖️ Summary Comparison

| Feature | 2014 Attention RNN | 2017 Transformer |
| --- | --- | --- |
| Core Architecture | RNN encoder-decoder | Fully attention-based (no RNNs) |
| Attention Use | On top of encoder hidden states | Used throughout (self + cross) |
| Computation | Sequential (decoder is slow) | Fully parallelizable |
| Performance | Good on small datasets/tasks | State-of-the-art for large models |
| Attention Mechanism | Additive | Scaled dot-product |
| Positional Awareness | Implicit (via RNN order) | Explicit (via positional encodings) |