An overview of how natural-language modeling evolved from statistical embeddings to modern large language models.
Early NLP (Pre-2010)
│
├── Bag-of-Words & TF-IDF (1980s–2000s)
│   └── Represent words as sparse frequency vectors.
│
└── Latent Semantic Analysis (LSA; Deerwester et al., 1990)
    └── Matrix factorization of the term-document matrix → captures latent semantics (sketch below).
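To make the two entries above concrete, here is a minimal sketch of count-based vectors and LSA. It assumes scikit-learn and a toy three-document corpus, neither of which comes from the outline itself: TF-IDF builds the sparse term-document matrix, and a truncated SVD of that matrix yields low-dimensional "latent semantic" coordinates.

```python
# Sketch only: TF-IDF term-document matrix + LSA via truncated SVD (assumes scikit-learn).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

corpus = [                              # toy corpus, purely illustrative
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)    # sparse (n_docs, n_terms) frequency-weighted matrix

lsa = TruncatedSVD(n_components=2, random_state=0)
doc_topics = lsa.fit_transform(X)       # dense (n_docs, 2) latent-topic coordinates
print(doc_topics)
```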
─────────────────────────────────────────────
Statistical / Shallow Embedding Era (2013–2016)
│
├── Word2Vec (Mikolov et al., 2013)
│   ├── Predictive model using shallow neural nets (CBOW / Skip-Gram)
│   ├── Learns embeddings from local context windows
│   └── “king − man + woman ≈ queen” style analogies (sketch below)
│
├── GloVe (Pennington et al., 2014)
│   ├── Global Vectors for Word Representation
│   ├── Uses global co-occurrence matrix (count-based)
│   ├── Blend of Word2Vec (predictive) + LSA (matrix factorization)
│   └── Static, context-independent embeddings
│
└── FastText (Facebook, 2016)
    ├── Extends Word2Vec using character n-grams
    └── Handles morphology and rare words better
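The three families above share a workflow: train shallow vectors on raw text, then do arithmetic in the embedding space. Below is a minimal sketch with the gensim library (my choice of toolkit, not one the outline names) on a toy corpus; with this little data the analogy result is meaningless, but it shows the API, and the FastText lookup shows character n-grams producing a vector for an unseen word.

```python
# Sketch only: static embeddings with gensim (assumed library) on a toy corpus.
from gensim.models import FastText, Word2Vec

sentences = [
    ["king", "rules", "the", "kingdom"],
    ["queen", "rules", "the", "kingdom"],
    ["man", "walks", "in", "the", "city"],
    ["woman", "walks", "in", "the", "city"],
]

# Skip-Gram Word2Vec: predict surrounding context words from the centre word.
w2v = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

# "king - man + woman ≈ queen" style query (only meaningful on a large corpus).
print(w2v.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# FastText: each word is a bag of character n-grams, so even unseen
# morphological variants ("kingdoms") still receive a vector.
ft = FastText(sentences, vector_size=50, window=2, min_count=1, epochs=50)
print(ft.wv["kingdoms"][:5])
```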
─────────────────────────────────────────────
Contextual Embedding Era (2017–2018)
│
├── ELMo (Peters et al., 2018)
│   ├── Deep BiLSTM model
│   ├── Context-dependent embeddings
│   └── Marks the shift to dynamic meaning
│
├── ULMFiT (Howard & Ruder, 2018)
│   ├── Transfer learning for NLP
│   ├── Pretrain a language model → fine-tune on tasks
│   └── Early inspiration for modern fine-tuning
│
└── Transformer Architecture (Vaswani et al., 2017)
    ├── “Attention Is All You Need”
    ├── Introduces self-attention and positional encoding (sketch below)
    └── Foundation for all modern language models
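The self-attention and positional-encoding bullets above reduce to a few lines of linear algebra. Here is a minimal NumPy sketch of scaled dot-product self-attention plus sinusoidal positional encoding: one head, no learned Q/K/V projections, no masking, so it illustrates the mechanism rather than the full architecture.

```python
# Sketch only: scaled dot-product self-attention and sinusoidal positions (NumPy).
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    """X: (seq_len, d_model). Every position attends to every position."""
    d_k = X.shape[-1]
    scores = X @ X.T / np.sqrt(d_k)      # pairwise token similarities
    weights = softmax(scores, axis=-1)   # attention distribution per query token
    return weights @ X                   # context-mixed representations

def positional_encoding(seq_len, d_model):
    """Sinusoidal positions added to embeddings so attention can see word order."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

X = np.random.randn(5, 16) + positional_encoding(5, 16)   # 5 tokens, 16-dim embeddings
print(self_attention(X).shape)                            # -> (5, 16)
```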
─────────────────────────────────────────────
Transformer / Large Language Model Era (2018–Present)
│
├── BERT Family (Encoder-based)
│   ├── BERT (Devlin et al., 2018)
│   │   ├── Bidirectional Transformer encoder
│   │   ├── Masked Language Modeling + Next Sentence Prediction
│   │   └── Strong at understanding, weak at generation
│   │
│   ├── RoBERTa (2019) – better training, more data
│   ├── ALBERT (2019) – parameter sharing
│   └── DeBERTa (2021) – disentangled attention
│
├── GPT Family (Decoder-based)
│   ├── GPT (OpenAI, 2018)
│   │   ├── Unidirectional Transformer (decoder-only)
│   │   ├── Next-token prediction objective (sketch after this tree)
│   │   └── First generative pre-trained Transformer (origin of the "GPT" name)
│   │
│   ├── GPT-2 (2019)
│   │   ├── 1.5B parameters
│   │   ├── Zero-shot task transfer without fine-tuning
│   │   └── Major leap in generation quality
│   │
│   ├── GPT-3 (2020)
│   │   ├── 175B parameters
│   │   ├── General-purpose few-shot model
│   │   └── Foundation for many modern AI apps
│   │
│   ├── GPT-3.5 (2022)
│   │   ├── Instruction-tuned (InstructGPT)
│   │   └── Optimized for conversation and reasoning (ChatGPT)
│   │
│   ├── GPT-4 (2023)
│   │   ├── Multimodal: text + image understanding
│   │   ├── Strong reasoning and factual consistency
│   │   └── Introduces system-level prompt control
│   │
│   └── GPT-5 (2025)
│       ├── Unified multimodal reasoning (text, image, audio)
│       ├── Long-context understanding and tool use
│       └── Generalized reasoning and planning capabilities
│
└── Other Transformer Families
    ├── T5 (2019) – Text-to-Text Transfer Transformer
    ├── XLNet (2019) – Permutation language modeling
    ├── BART (2019) – Denoising autoencoder for text
    ├── LLaMA (Meta, 2023) – Open-weight GPT-style alternative
    ├── Claude (Anthropic, 2023) – Constitutional AI safety
    ├── Gemini (Google DeepMind, 2023) – Multimodal reasoning
    ├── Mistral (2023) – Lightweight open models
    └── Phi (Microsoft, 2024) – Small, efficient LLMs
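The split in the tree above between encoder-style masked-token prediction (BERT) and decoder-style next-token prediction (GPT) is easy to see in practice. A short sketch using Hugging Face transformers pipelines and the public bert-base-uncased and gpt2 checkpoints (my choices; the outline itself only names the model families):

```python
# Sketch only: masked-token filling (BERT) vs. next-token generation (GPT-2),
# via Hugging Face transformers pipelines (assumed dependency).
from transformers import pipeline

# Encoder objective: predict a masked token from bidirectional context.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill_mask("The Transformer relies on [MASK] attention.")[:3]:
    print("BERT:", pred["token_str"], round(pred["score"], 3))

# Decoder objective: continue the prompt left-to-right, one token at a time.
generate = pipeline("text-generation", model="gpt2")
print("GPT-2:", generate("The Transformer architecture", max_new_tokens=20)[0]["generated_text"])
```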
─────────────────────────────────────────────
- Pre-Deep Learning → Count-based representations (TF-IDF, LSA)
- Shallow Neural Models → Word2Vec, GloVe, FastText (static embeddings)
- Deep Contextual Models → ELMo, ULMFiT (context-aware embeddings)
- Transformer Revolution → Attention mechanism changes everything
- LLM Explosion → GPT, BERT, T5, Claude, Gemini, GPT-5 and beyond