
🌳 Evolution of NLP: From Word2Vec to GPT-5

An overview of how natural-language modeling evolved from sparse statistical representations and static word embeddings to modern large language models.


🕰️ Historical Overview

Early NLP (Pre-2017)
│
├── Bag-of-Words & TF-IDF (1980s–2000s)
│     └── Represent words as sparse frequency vectors.
│
├── Latent Semantic Analysis (LSA, 1990)
│     └── Matrix factorization of the term-document matrix → captures latent semantics.
│
└── Statistical / Shallow Embedding Era (2013–2016)
      │
      ├── Word2Vec (Mikolov et al., 2013)
      │     ├── Predictive model using shallow neural nets (CBOW / Skip-Gram)
      │     ├── Learns embeddings from local context windows
      │     └── “king − man + woman ≈ queen” style analogies (see the code sketch after this tree)
      │
      ├── GloVe (Pennington et al., 2014)
      │     ├── Global Vectors for Word Representation
      │     ├── Uses a global co-occurrence matrix (count-based)
      │     ├── Blend of Word2Vec (predictive) + LSA (matrix factorization)
      │     └── Static, context-independent embeddings
      │
      └── FastText (Facebook, 2016)
            ├── Extends Word2Vec using character n-grams
            └── Handles morphology and rare words better
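
To make the static-embedding idea concrete, here is a minimal Word2Vec sketch using the gensim library (an illustrative choice; no tooling is named above). The toy corpus and hyperparameters are purely illustrative, and the king/man/woman analogy only emerges reliably on large corpora; the point is just to show the API shape of context-independent embeddings.

```python
# Minimal Word2Vec sketch with gensim (toy corpus and hyperparameters are illustrative).
from gensim.models import Word2Vec

corpus = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["a", "man", "and", "a", "woman", "walk", "home"],
]

# sg=1 selects Skip-Gram; sg=0 would select CBOW.
model = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1, sg=1)

# Every word maps to one fixed dense vector, regardless of context.
print(model.wv["king"].shape)  # (50,)

# Analogy-style vector arithmetic: king - man + woman ≈ queen
# (only works convincingly when trained on large corpora).
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```

GloVe and FastText expose similar word-to-vector lookups; the shared property of this era is that each word gets exactly one vector no matter which sentence it appears in.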

───────────────────────────────────────────────
Contextual Embedding Era (2017–2018)
│
├── ELMo (Peters et al., 2018)
│     ├── Deep BiLSTM model
│     ├── Context-dependent embeddings
│     └── Marks the shift to dynamic meaning
│
├── ULMFiT (Howard & Ruder, 2018)
│     ├── Transfer learning for NLP
│     ├── Pretrain a language model → fine-tune on downstream tasks
│     └── Early inspiration for modern fine-tuning
│
└── Transformer Architecture (Vaswani et al., 2017)
      ├── “Attention Is All You Need”
      ├── Introduces self-attention and positional encoding (see the attention sketch after this tree)
      └── Foundation for all modern language models
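
Since self-attention is the hinge of this whole timeline, here is a small NumPy sketch of single-head scaled dot-product attention, the core operation from “Attention Is All You Need”. Shapes, random inputs, and the helper name are illustrative only; real Transformers add multiple heads, masking, and positional encodings.

```python
# Single-head scaled dot-product self-attention, sketched in NumPy.
# Shapes, random inputs, and the function name are illustrative.
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_k) learned projections."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v             # queries, keys, values
    scores = q @ k.T / np.sqrt(k.shape[-1])         # how strongly each token attends to every other
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability for softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the key dimension
    return weights @ v                              # context-mixed representation per token

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8
x = rng.normal(size=(seq_len, d_model))             # 4 token embeddings
w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)       # (4, 8)
```

Because every token attends to every other token in a single step, the computation parallelizes across the sequence, which is a large part of why this architecture displaced the BiLSTMs used by ELMo.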

───────────────────────────────────────────────
Transformer / Large Language Model Era (2018–Present)
│
├── BERT Family (Encoder-based)
│     ├── BERT (Devlin et al., 2018)
│     │     ├── Bidirectional Transformer encoder
│     │     ├── Masked Language Modeling + Next Sentence Prediction (see the objective sketch after this tree)
│     │     └── Strong at understanding, weak at generation
│     │
│     ├── RoBERTa (2019) – better training recipe, more data
│     ├── ALBERT (2019) – parameter sharing
│     └── DeBERTa (2021) – disentangled attention
│
├── GPT Family (Decoder-based)
│     ├── GPT (OpenAI, 2018)
│     │     ├── Unidirectional Transformer (decoder-only)
│     │     ├── Next-token prediction objective
│     │     └── First true generative Transformer
│     │
│     ├── GPT-2 (2019)
│     │     ├── 1.5B parameters
│     │     ├── Zero-shot task transfer
│     │     └── Major leap in generation quality
│     │
│     ├── GPT-3 (2020)
│     │     ├── 175B parameters
│     │     ├── General-purpose few-shot model
│     │     └── Foundation for many modern AI apps
│     │
│     ├── GPT-3.5 (2022)
│     │     ├── Instruction-tuned (InstructGPT)
│     │     └── Optimized for conversation and reasoning (ChatGPT)
│     │
│     ├── GPT-4 (2023)
│     │     ├── Multimodal: text + image understanding
│     │     ├── Stronger reasoning and improved factual consistency
│     │     └── Introduces system-level prompt control
│     │
│     └── GPT-5 (2025)
│           ├── Unified multimodal reasoning (text, image, audio)
│           ├── Long-context understanding and tool use
│           └── Generalized reasoning and planning capabilities
│
└── Other Transformer Families
      ├── T5 (2019) – Text-to-Text Transfer Transformer
      ├── XLNet (2019) – Permutation language modeling
      ├── BART (2019) – Denoising autoencoder for text
      ├── LLaMA (Meta, 2023) – Open-weight GPT alternative
      ├── Claude (Anthropic, 2023) – Constitutional AI safety
      ├── Gemini (Google DeepMind, 2023) – Multimodal reasoning
      ├── Mistral (2023) – Lightweight open-weight models
      └── Phi (Microsoft, 2023) – Small, efficient LLMs
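
The encoder and decoder branches above differ mainly in their training objective: BERT-style models fill in masked tokens using context on both sides, while GPT-style models predict the next token strictly left to right. Here is a minimal sketch of that contrast, assuming the Hugging Face transformers library and the bert-base-uncased / gpt2 checkpoints (none of which are named above).

```python
# Contrast of the two Transformer objectives via Hugging Face pipelines.
# The library and checkpoint names are assumptions, chosen only for illustration.
from transformers import pipeline

# Encoder-style (BERT family): masked language modeling fills a blank
# using context from both sides of the [MASK] token.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("The Transformer relies on the [MASK] mechanism.")[0]["token_str"])

# Decoder-style (GPT family): next-token prediction extends the text
# strictly left to right, one token at a time.
generate = pipeline("text-generation", model="gpt2")
print(generate("Word embeddings map each word to", max_new_tokens=12)[0]["generated_text"])
```

Both calls download pretrained checkpoints on first run; the encoder returns a filled-in token, the decoder returns a continuation of the prompt.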

───────────────────────────────────────────────

🧠 Summary of the Evolution

  1. Pre-Deep Learning – Count-based representations (TF-IDF, LSA)
  2. Shallow Neural Models – Word2Vec, GloVe, FastText (static embeddings)
  3. Deep Contextual Models – ELMo, ULMFiT (context-aware embeddings)
  4. Transformer Revolution – Attention mechanism changes everything
  5. LLM Explosion – GPT, BERT, T5, Claude, Gemini, GPT-5 and beyond