Created
May 21, 2026 14:27
-
-
Save bjpcjp/bbb63199c1a0cf16bcb2ec6674d00cdd to your computer and use it in GitHub Desktop.
math textbook library - glossary
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| # Mathematics & Data Science Glossary | |
| Sourced from a library of 27+ textbooks spanning linear algebra through modern deep learning. | |
| Terms are organized by topic area; cross-references use **bold**. | |
| --- | |
| ## Contents | |
| 1. [Linear Algebra & Matrix Computations](#1-linear-algebra--matrix-computations) | |
| 2. [Probability & Statistics](#2-probability--statistics) | |
| 3. [Information Theory](#3-information-theory) | |
| 4. [Optimization](#4-optimization) | |
| 5. [Classical Machine Learning](#5-classical-machine-learning) | |
| 6. [Probabilistic Graphical Models](#6-probabilistic-graphical-models) | |
| 7. [Neural Networks & Deep Learning](#7-neural-networks--deep-learning) | |
| 8. [Transformers & Large Language Models](#8-transformers--large-language-models) | |
| 9. [Generative Models](#9-generative-models) | |
| 10. [Computer Vision](#10-computer-vision) | |
| 11. [Natural Language Processing](#11-natural-language-processing) | |
| 12. [Reinforcement Learning](#12-reinforcement-learning) | |
| 13. [Algorithms & Data Structures](#13-algorithms--data-structures) | |
| 14. [Signal Processing & Transforms](#14-signal-processing--transforms) | |
| 15. [Bayesian & Probabilistic Methods](#15-bayesian--probabilistic-methods) | |
| --- | |
| ## 1. Linear Algebra & Matrix Computations | |
| **Basis** — A minimal set of linearly independent vectors that spans a vector space. Every vector in the space can be expressed as a unique linear combination of basis vectors. *(Sources: ACPS, DMA, IALA, ITA, LA4, LADR)* | |
| **Bilinear Form** — A function B(u, v) that is linear in each argument separately. The inner product is a symmetric bilinear form; quadratic forms arise when u = v. *(Sources: LADR)* | |
| **Change of Basis** — A transformation expressing coordinates in one basis in terms of another. Achieved by multiplying by the inverse of the new basis matrix. *(Sources: LA4, LADR)* | |
| **Characteristic Polynomial** — The polynomial det(A − λI) whose roots are the **eigenvalues** of matrix A. Degree equals the matrix dimension. *(Sources: LA4, LADR)* | |
| **Cholesky Decomposition** — Factorization A = LLᵀ for a symmetric positive definite matrix A, where L is lower triangular. Used in linear system solving, sampling from multivariate Gaussians, and numerical stability. *(Sources: AFO, GPML, MLP)* | |
| **Column Space (Range)** — The set of all vectors expressible as Ax for some x. Its dimension is the **rank** of A. *(Sources: DMA, LA4)* | |
| **Condition Number** — The ratio of the largest to smallest singular value of a matrix. A large condition number indicates near-singularity and sensitivity of linear system solutions to perturbations. *(Sources: MC, NFA)* | |
| **Determinant** — A scalar det(A) encoding the signed volume scaling factor of the linear transformation A. det(A) = 0 iff A is singular; det(AB) = det(A)det(B). *(Sources: ACPS, ITA, LA4, LADR, MC, OPA)* | |
| **Diagonalization** — Writing A = PDP⁻¹ where D is diagonal and the columns of P are **eigenvectors**. Possible iff A has a full set of linearly independent eigenvectors. *(Sources: ILA)* | |
| **Dot Product (Inner Product)** — The scalar ⟨u, v⟩ = uᵀv = Σᵢ uᵢvᵢ. Encodes angle (cos θ = ⟨u,v⟩ / ‖u‖‖v‖) and **projection**. *(Sources: ACPS, GPML, LA4, LADR, PDL)* | |
| **Eigenvalue** — Scalar λ such that Av = λv for nonzero vector v. Eigenvalues characterize scaling along invariant directions and appear in stability analysis, **PCA**, and matrix functions. *(Sources: GPML, LADR, OPA)* | |
| **Eigenvector** — Nonzero vector v satisfying Av = λv for scalar **eigenvalue** λ. Eigenvectors of a symmetric matrix are orthogonal. *(Sources: LADR, OPA)* | |
| **Frobenius Norm** — ‖A‖_F = √(Σᵢⱼ aᵢⱼ²) = √(tr(AᵀA)). Analogous to the L2 norm for vectors; equals the square root of the sum of squared **singular values**. *(Sources: LADR, MLP)* | |
| **Gram-Schmidt Process** — An algorithm converting a set of linearly independent vectors into an **orthonormal** basis by successive projection and normalization. Basis for **QR decomposition**. *(Sources: LA4)* | |
| **Hadamard Product** — Element-wise matrix product (A ⊙ B)ᵢⱼ = aᵢⱼbᵢⱼ. Distinct from standard matrix multiplication; used in gating mechanisms in neural networks. *(Sources: MLP)* | |
| **Hermitian Matrix** — Complex generalization of symmetric matrices: A = Aᴴ (conjugate transpose). Has real eigenvalues and orthonormal eigenvectors. *(Sources: MC)* | |
| **Householder Reflection** — An orthogonal transformation that reflects vectors across a hyperplane. Used as a numerically stable building block for **QR decomposition**. | |
| **Invertible Matrix** — A square matrix A for which A⁻¹ exists, satisfying AA⁻¹ = I. Equivalent conditions: det(A) ≠ 0, full rank, trivial null space. *(Sources: ACPS, ITA, LADR)* | |
| **Jordan Normal Form** — A near-diagonal representation A = PJP⁻¹ for matrices that are not fully diagonalizable. Jordan blocks appear when eigenvalues have algebraic multiplicity > geometric multiplicity. | |
| **Kernel (Null Space)** — The set of all x such that Ax = 0. Its dimension is the nullity; rank + nullity = number of columns (rank-nullity theorem). *(Sources: AFO, GPML, LADR, MLP, OCV, OPA)* | |
| **Kronecker Product** — A ⊗ B produces a block matrix where each element of A is multiplied by the entire matrix B. Used in multi-dimensional signal processing and multi-task learning. *(Sources: MC, MLP)* | |
| **LU Decomposition** — Factorization A = LU (with optional permutation P) where L is lower triangular and U is upper triangular. Efficient algorithm for solving linear systems. *(Sources: MC)* | |
| **Linear Combination** — A sum Σᵢ cᵢvᵢ of vectors scaled by scalars cᵢ. The set of all linear combinations of a set of vectors is their **span**. *(Sources: ACPS, AFO, DMA, IALA, LA4, LADR)* | |
| **Linear Independence** — Vectors v₁,…,vₙ are linearly independent if Σᵢ cᵢvᵢ = 0 implies all cᵢ = 0. Independence is necessary for forming a **basis**. *(Sources: DMA, IALA, ILA, ITA, LA4, LADR, OPA, PSM)* | |
| **Linear Map (Linear Transformation)** — A function T: V → W satisfying T(au + bv) = aT(u) + bT(v). Represented by a matrix once bases are chosen. *(Sources: LA4, LADR)* | |
| **Matrix Factorization** — Expressing a matrix as a product of structured factors (e.g., **SVD**, **LU decomposition**, **QR decomposition**, **Cholesky decomposition**). Core technique for dimensionality reduction, recommendation systems, and numerical solvers. *(Sources: AFO, MLP)* | |
| **Matrix Norm** — A norm on the space of matrices. The spectral norm (largest singular value), **Frobenius norm**, and nuclear norm are most common; nuclear norm = sum of singular values. *(Sources: AFO)* | |
| **Moore-Penrose Pseudoinverse** — Generalization A⁺ of matrix inverse to non-square or singular matrices. Gives the minimum-norm least-squares solution x = A⁺b. *(Sources: MC)* | |
| **Norm** — A function ‖·‖ measuring vector magnitude, satisfying positivity, homogeneity, and the triangle inequality. Common: L1 (sum of absolutes), L2 (Euclidean), L∞ (max absolute), Lp. *(Sources: AFO, GPML, IALA, LADR, OPA)* | |
| **Orthogonal Complement** — The subspace V⊥ of all vectors orthogonal to every vector in V. R^n = V ⊕ V⊥; used in **projection** and **least squares**. *(Sources: AFO, IALA, LA4, LADR, MC, OPA)* | |
| **Orthogonal Matrix** — A square matrix Q with QᵀQ = I. Preserves lengths and angles; its inverse equals its transpose. Columns (and rows) form an **orthonormal** basis. *(Sources: AFO, LA4)* | |
| **Orthogonality** — Two vectors u, v are orthogonal if ⟨u,v⟩ = 0. Orthogonal sets simplify computation and underlie **projection** and **Gram-Schmidt**. *(Sources: ATD, ILA, PSDS)* | |
| **Orthonormality** — A set of vectors that are mutually orthogonal and each has unit norm. The standard basis is orthonormal; **QR decomposition** produces orthonormal columns. *(Sources: IALA, ITA, LA4, LADR)* | |
| **Outer Product** — For vectors u ∈ ℝᵐ and v ∈ ℝⁿ, the m×n matrix uvᵀ. Rank-1 matrices; used in gradient computation and **SVD** decomposition. *(Sources: IALA, ITA)* | |
| **Positive Definite Matrix** — A symmetric matrix A such that xᵀAx > 0 for all nonzero x. All eigenvalues positive; admits **Cholesky decomposition**. Positive semi-definite allows equality. *(Sources: GPML)* | |
| **Projection** — The component of vector b onto subspace V: proj_V(b) = A(AᵀA)⁻¹Aᵀb where columns of A span V. Minimizes distance from b to V; core of **least squares**. *(Sources: DRL, LA4, LADR, MLP, PCVP)* | |
| **QR Decomposition** — Factorization A = QR where Q has orthonormal columns and R is upper triangular. Used in **least squares**, eigenvalue algorithms, and stable linear solvers. *(Sources: MLP)* | |
| **Rank** — The dimension of the **column space** of a matrix, equaling the number of linearly independent rows or columns. rank(A) = rank(Aᵀ) = number of nonzero **singular values**. *(Sources: ITA, LA4, MC, OPA)* | |
| **Rank-Nullity Theorem** — For a linear map T: V → W, dim(V) = rank(T) + nullity(T). Fundamental relationship between the image and kernel of a linear map. | |
| **Schur Decomposition** — A = QUQᴴ where Q is unitary and U is upper triangular. Exists for every square matrix; eigenvalues appear on the diagonal of U. *(Sources: OPA)* | |
| **Singular Value Decomposition (SVD)** — Factorization A = UΣVᵀ where U, V are orthogonal and Σ is diagonal with nonneg entries (singular values). The most important matrix decomposition; basis for **PCA**, **pseudoinverse**, low-rank approximation, and **matrix norm** computation. *(Sources: MLP)* | |
| **Singular Values** — The diagonal entries σᵢ of Σ in **SVD**, equal to √λᵢ(AᵀA). The largest singular value is the spectral norm; their sum is the nuclear norm. *(Sources: AFO, LADR, MLP, OPA)* | |
| **Span** — The set of all **linear combinations** of a given set of vectors. The span of the columns of A equals the **column space** of A. *(Sources: DMA, LA4, LADR)* | |
| **Spectral Theorem** — A symmetric (or Hermitian) matrix can be diagonalized by an **orthogonal** matrix: A = QΛQᵀ. Eigenvalues are real; eigenvectors are orthogonal. *(Sources: LADR)* | |
| **Symmetric Matrix** — A matrix satisfying A = Aᵀ. Has real eigenvalues, orthogonal eigenvectors, and a complete spectral decomposition (see **Spectral Theorem**). *(Sources: LA4, LADR)* | |
| **Tensor** — A multidimensional array generalizing scalars (0-d), vectors (1-d), and matrices (2-d). Tensor operations (contraction, outer product) are fundamental in deep learning frameworks. | |
| **Trace** — The sum of diagonal entries tr(A) = Σᵢ aᵢᵢ = sum of **eigenvalues** = sum of squared **singular values** (for AᵀA). Cyclic: tr(ABC) = tr(CAB). *(Sources: AFO, LA4, LADR, MC, MLP)* | |
| **Triangle Inequality** — ‖u + v‖ ≤ ‖u‖ + ‖v‖ for any norm. Generalizes to metric spaces and is a defining property of norms. *(Sources: ACPS, AFO, DVCTG, ITA, LA4, LADR, MLP, PSS)* | |
| **Vector Space** — A set V with addition and scalar multiplication satisfying eight axioms (closure, associativity, distributivity, etc.). Examples: ℝⁿ, spaces of functions, polynomial spaces. *(Sources: DMA, LA4, LADR)* | |
| --- | |
| ## 2. Probability & Statistics | |
| **Bayes' Theorem** — P(A|B) = P(B|A)P(A)/P(B). Inverts conditional probabilities; foundational for Bayesian inference, relating prior belief P(A) to posterior P(A|B) after observing B. *(Sources: GPML)* | |
| **Bayesian Inference** — Updating a prior distribution P(θ) with observed data via **Bayes' Theorem** to obtain a posterior P(θ|data) ∝ P(data|θ)P(θ). Provides full uncertainty quantification. *(Sources: CASI, DMMD, MLP, PSC)* | |
| **Beta Distribution** — A conjugate prior for Bernoulli/binomial likelihoods with support [0,1]. Parameterized by shape parameters α, β; mean = α/(α+β). *(Sources: MLP)* | |
| **Binomial Distribution** — The number of successes in n independent Bernoulli trials with success probability p. Mean = np, variance = np(1−p). *(Sources: ITA, MLP)* | |
| **Central Limit Theorem** — The normalized sum of n iid random variables with finite variance converges in distribution to a standard normal as n → ∞. Justifies Gaussian approximations throughout statistics. *(Sources: DSP, IP, MLCS, MLP, PSDS)* | |
| **Chebyshev's Inequality** — P(|X − μ| ≥ kσ) ≤ 1/k². Distribution-free bound on tail probabilities; tighter bounds (Hoeffding, Chernoff) available for specific distributions. *(Sources: ACPS)* | |
| **Chi-Squared Distribution** — Distribution of a sum of squared independent standard normals. Used in goodness-of-fit tests, independence tests, and confidence intervals for variance. *(Sources: MLP)* | |
| **Conditional Independence** — X ⊥ Y | Z means X and Y are independent given Z. Central concept in **probabilistic graphical models** and Bayesian networks. *(Sources: ADFM, MLCS, MLP)* | |
| **Conditional Probability** — P(A|B) = P(A∩B)/P(B). The probability of A given that B has occurred; the foundation of **Bayes' Theorem** and graphical models. *(Sources: DLG, IP, ITA, MLP, PSDS, TB)* | |
| **Confidence Interval** — A range computed from sample data that contains the true parameter with a specified probability (e.g., 95%) under repeated sampling. Not a Bayesian credible interval. *(Sources: MLP)* | |
| **Conjugate Prior** — A prior distribution whose posterior (after likelihood update) is in the same family. Enables closed-form Bayesian updates; e.g., Beta–Bernoulli, Gaussian–Gaussian. *(Sources: MLP, PSDS)* | |
| **Covariance** — Cov(X,Y) = E[(X−μₓ)(Y−μᵧ)]. Measures linear dependence between two random variables; the diagonal of the **covariance matrix** contains variances. *(Sources: GPML, MLP, PSDS, PSM)* | |
| **Covariance Matrix** — Σ = E[(X−μ)(X−μ)ᵀ] for a random vector X. Symmetric positive semi-definite; encodes variances and pairwise covariances; used in **PCA** and **multivariate Gaussian**. *(Sources: GPML, MLP)* | |
| **Dirichlet Distribution** — Multivariate generalization of the **Beta distribution** over the probability simplex. Conjugate prior for multinomial likelihoods; used in topic models (LDA). *(Sources: AFO, MC, MLCS, MLP)* | |
| **Empirical Risk Minimization (ERM)** — Learning by minimizing the average loss on training data. The foundation of supervised learning; generalization theory studies when ERM solutions generalize. *(Sources: MLCS, MLP)* | |
| **Entropy** — H(X) = −Σₓ p(x) log p(x). Measures the average uncertainty (or information content) of a random variable. Maximum for uniform distributions; zero for deterministic ones. *(Sources: ADFM, GPML, MLCS, MLP, NLPP-nlp-python-nltk-oreilly)* | |
| **Exchangeability** — A sequence of random variables is exchangeable if its joint distribution is invariant to permutation. De Finetti's theorem: any exchangeable sequence is conditionally iid. *(Sources: DMMD)* | |
| **Expectation** — E[X] = Σₓ x·p(x) (discrete) or ∫ x·f(x)dx (continuous). The probability-weighted average; linear: E[aX+bY] = aE[X]+bE[Y]. *(Sources: AT, ITA, PSC, PSDS)* | |
| **Exponential Distribution** — Models time between Poisson process events; density f(x) = λe^{−λx}. Memoryless property: P(X>s+t|X>s) = P(X>t). *(Sources: AT, MLP)* | |
| **Exponential Family** — A parametric family of distributions with density f(x|η) = h(x)exp(ηᵀT(x) − A(η)). Encompasses Gaussian, Bernoulli, Poisson, Gamma; enables unified inference via natural parameters. *(Sources: MLCS, MLP, PSC)* | |
| **Fisher Information** — I(θ) = E[(∂/∂θ log p(x|θ))²]. Measures how much data informs about the parameter θ; bounds the variance of unbiased estimators via the **Cramér-Rao bound**. *(Sources: GPML, MLP)* | |
| **Gamma Distribution** — Generalizes the exponential; models positive continuous quantities. Shape k and rate β; Gamma(1, β) = Exponential(β); Gamma(n/2, 1/2) = χ²(n). *(Sources: GPML, MLP)* | |
| **Gaussian (Normal) Distribution** — N(μ, σ²) with density (2πσ²)^{−1/2} exp(−(x−μ)²/2σ²). Ubiquitous due to the **Central Limit Theorem**; fully characterized by mean and variance. *(Sources: MLCS, MLP)* | |
| **Gaussian Process** — A distribution over functions such that any finite collection of function values is jointly Gaussian. Defined by a mean function and **kernel (covariance) function**; used for non-parametric regression and Bayesian optimization. *(Sources: AFO, GPML, MLP, PSDS)* | |
| **Hypothesis Testing** — A statistical framework for deciding between a null hypothesis H₀ and alternative H₁ using a test statistic and significance level α. Controls Type I (false positive) and Type II (false negative) error rates. *(Sources: CASI, PSC, PSDS, PSM, TB)* | |
| **Jensen's Inequality** — For a convex function f and random variable X: E[f(X)] ≥ f(E[X]). Equality iff X is degenerate or f is linear; underlies the **ELBO** and many information-theoretic bounds. *(Sources: ITA, MLP)* | |
| **Joint Distribution** — The probability distribution of two or more random variables simultaneously. Marginalization recovers individual distributions; independence implies factorization. *(Sources: MLP)* | |
| **KL Divergence** — D_KL(P‖Q) = Σₓ P(x) log(P(x)/Q(x)). Measures the information lost using Q to approximate P. Non-negative (**Gibbs' inequality**), asymmetric; used in variational inference and information theory. *(Sources: MLCS, MLP)* | |
| **Law of Large Numbers** — The sample mean converges to the population mean E[X] as n → ∞ (weak: in probability; strong: almost surely). Foundational justification for empirical estimation. *(Sources: IP, PSDS)* | |
| **Likelihood** — L(θ; x) = p(x|θ) viewed as a function of θ for fixed data x. **MLE** maximizes likelihood; **MAP** combines likelihood with a prior. *(Sources: DMMD, GPML, MLCS, MLP, OPA, TB)* | |
| **Marginal Distribution** — The distribution of a subset of variables obtained by summing (integrating) out the others from the **joint distribution**. *(Sources: AFO, MLP)* | |
| **Maximum A Posteriori (MAP)** — θ_MAP = argmax P(θ|data) = argmax P(data|θ)P(θ). Bayesian point estimate that reduces to **MLE** when the prior is uniform; equivalent to regularized MLE. *(Sources: GPML, MLP)* | |
| **Maximum Likelihood Estimation (MLE)** — θ_MLE = argmax P(data|θ). Finds parameters that make observed data most probable. Consistent and asymptotically efficient for regular models. *(Sources: DLG, MLCS, MLP)* | |
| **Moment Generating Function** — M_X(t) = E[e^{tX}]. Uniquely characterizes a distribution; the k-th moment equals the k-th derivative at t=0. Useful for proving the **Central Limit Theorem**. *(Sources: PSM)* | |
| **Monte Carlo Methods** — Algorithms that use random sampling to estimate quantities (integrals, expectations, probabilities). Accuracy scales as O(1/√n) regardless of dimension. *(Sources: DLG, RL)* | |
| **Multivariate Gaussian** — N(μ, Σ) over ℝᵈ with density (2π)^{−d/2}|Σ|^{−1/2} exp(−½(x−μ)ᵀΣ⁻¹(x−μ)). Conditionals and marginals are Gaussian; used pervasively in ML. *(Sources: MLCS, MLP)* | |
| **P-value** — The probability, under H₀, of observing a test statistic at least as extreme as the observed value. Not the probability that H₀ is true; misinterpretation is common. *(Sources: MLP)* | |
| **Poisson Distribution** — Models counts of rare events; P(X=k) = λᵏe^{−λ}/k!. Arises as the limit of binomial with large n, small p, fixed λ=np. | |
| **Probability Density Function (PDF)** — Function f such that P(a ≤ X ≤ b) = ∫ₐᵇ f(x)dx. Must integrate to 1 and be non-negative. *(Sources: DRL, ITA, MLP)* | |
| **Rejection Sampling** — A Monte Carlo method for sampling from a target distribution p by drawing from a proposal q and accepting with probability p(x)/(Mq(x)). Exact but inefficient in high dimensions. *(Sources: MLP, PSC, PSDS)* | |
| **Sufficient Statistic** — A statistic T(x) that captures all information in the data about parameter θ: P(x|T(x),θ) = P(x|T(x)). The **exponential family** has natural sufficient statistics. *(Sources: OPA)* | |
| **t-Distribution** — Heavy-tailed distribution arising as the ratio of a standard normal to the square root of a chi-squared variable. Approaches Gaussian as degrees of freedom → ∞; used for small-sample inference. *(Sources: MLCS)* | |
| **Variance** — Var(X) = E[(X−E[X])²] = E[X²] − (E[X])². Measures spread; Var(aX+b) = a²Var(X). The square root is the standard deviation. *(Sources: AFO, DRL, DSP, IP, ITA, MLP, PSC, PSM)* | |
| --- | |
| ## 3. Information Theory | |
| **Channel Capacity** — The maximum mutual information between input and output of a noisy channel, optimized over input distributions. Shannon's coding theorem: reliable communication is possible iff the rate < capacity. | |
| **Cross-Entropy** — H(P,Q) = −Σₓ P(x) log Q(x) = H(P) + D_KL(P‖Q). Commonly used as a loss function for classification; minimizing cross-entropy is equivalent to minimizing KL divergence. *(Sources: AFO, DRL, MLP)* | |
| **Differential Entropy** — h(X) = −∫ f(x) log f(x) dx. Continuous analog of **entropy**; can be negative and is not invariant to reparameterization. *(Sources: OPA)* | |
| **Entropy** — H(X) = −Σₓ p(x) log p(x). Measures the average bits needed to encode a random variable optimally. Maximum entropy subject to moment constraints gives the **exponential family**. *(Sources: ADFM, GPML, MLCS, MLP, NLPP-nlp-python-nltk-oreilly)* | |
| **Joint Entropy** — H(X,Y) = −Σ_{x,y} p(x,y) log p(x,y). Satisfies H(X,Y) ≤ H(X) + H(Y) with equality iff X, Y are independent. | |
| **KL Divergence** — (See Probability & Statistics section.) D_KL(P‖Q) ≥ 0 with equality iff P = Q; not symmetric. The natural divergence for variational inference. *(Sources: MLCS, MLP)* | |
| **Mutual Information** — I(X;Y) = H(X) + H(Y) − H(X,Y) = D_KL(P_{XY} ‖ P_X P_Y). Measures dependence between X and Y; zero iff independent. Used in feature selection, ICA, and representation learning. *(Sources: MLCS, MLP, PAI)* | |
| **Rate-Distortion Theory** — Characterizes the minimum encoding rate to represent a source within a given distortion level. Dual to channel capacity; relevant to lossy compression and **VAE** objectives. | |
| --- | |
| ## 4. Optimization | |
| **Accelerated Gradient Descent (Nesterov Momentum)** — Gradient descent using momentum computed at a lookahead point: θ_{t+1} = y_t − α∇f(y_t), y_{t+1} = θ_{t+1} + β(θ_{t+1}−θ_t). Achieves optimal O(1/t²) convergence for smooth convex functions. *(Sources: AFO)* | |
| **Adaptive Learning Rate Methods** — Optimizers that adjust per-parameter learning rates using gradient history. Includes **AdaGrad**, **RMSProp**, and **Adam**; generally more robust than fixed-rate SGD. *(Sources: MLP)* | |
| **AdaGrad** — Divides the learning rate by the square root of the sum of squared past gradients. Effective for sparse gradients but learning rate decays to zero over time. *(Sources: AFO, MLP)* | |
| **Adam** — Combines **momentum** (first moment) and **RMSProp** (second moment) with bias correction. Default optimizer for most deep learning; converges well in practice but may not find the globally optimal solution. *(Sources: AFO)* | |
| **Augmented Lagrangian** — Adds a quadratic penalty term to the **Lagrangian**: L_ρ(x,λ) = f(x) + λᵀg(x) + (ρ/2)‖g(x)‖². Improves convergence over pure dual methods; basis for **ADMM**. *(Sources: OPA)* | |
| **Backtracking Line Search** — Iteratively reduces the step size until the Armijo sufficient decrease condition is satisfied. Ensures descent at each step without requiring exact line search. *(Sources: AFO)* | |
| **BFGS / L-BFGS** — Quasi-Newton methods that approximate the inverse Hessian using gradient differences. L-BFGS stores only the last m vector pairs, making it memory-efficient for large-scale optimization. *(Sources: AFO, MLCS, MLP, OPA)* | |
| **Conjugate Gradient** — An iterative solver for linear systems Ax = b (symmetric positive definite A) and a gradient-based optimizer. Converges in at most n steps exactly; used in large-scale linear systems. *(Sources: AFO)* | |
| **Constrained Optimization** — Minimizing f(x) subject to equality g(x) = 0 and/or inequality h(x) ≤ 0 constraints. Solved via **Lagrange multipliers**, **KKT conditions**, or interior-point methods. *(Sources: AFO, DLG, OPA)* | |
| **Convex Function** — A function f where f(λx + (1−λ)y) ≤ λf(x) + (1−λ)f(y) for all λ ∈ [0,1]. Local minima are global minima; gradient descent converges to global optimum. *(Sources: AFO, AT, ITA)* | |
| **Convex Set** — A set C where any line segment between two points in C lies in C. Intersection of convex sets is convex; convex optimization searches over convex sets. *(Sources: AFO, LA4)* | |
| **Duality** — Every optimization problem (primal) has a dual problem whose optimal value bounds the primal. Strong duality (primal = dual) holds under **Slater's condition** for convex problems. *(Sources: AFO, CG, CO, DSP, ITA, OPA)* | |
| **Frank-Wolfe Algorithm** — Conditional gradient method that replaces the gradient step with a linear minimization over the feasible set. Memory-efficient for structured constraint sets (e.g., nuclear norm ball). | |
| **Gradient Descent** — Iterative update θ ← θ − α∇f(θ). Converges to local (global for convex) minima. Step size (learning rate) α must be chosen carefully; **line search** or schedules are used. *(Sources: AFO, CG, DLP, FDL, MARL, MLCS, MLP, PDL)* | |
| **Interior-Point Method** — Solves constrained optimization by traversing the interior of the feasible region using a barrier function. Polynomial-time complexity for convex problems; the basis of modern LP/QP/SDP solvers. *(Sources: ITA)* | |
| **KKT Conditions** — Karush-Kuhn-Tucker: first-order necessary (and sufficient for convex problems) conditions for constrained optima. Stationarity + primal feasibility + dual feasibility + complementary slackness. *(Sources: AFO)* | |
| **Lagrange Multipliers** — A method for finding extrema of f(x) subject to g(x) = 0 by solving ∇f = λ∇g. Extends to inequality constraints via **KKT conditions**. *(Sources: AFO, IALA, MLP)* | |
| **Lagrangian** — L(x, λ, μ) = f(x) + λᵀg(x) + μᵀh(x). Encodes the constrained problem as an unconstrained saddle-point problem; dual function is g(λ,μ) = inf_x L(x,λ,μ). *(Sources: AFO, MLP, OPA)* | |
| **Learning Rate** — The step size α in gradient-based optimization. Too large → divergence; too small → slow convergence. Schedules (cosine annealing, warmup, step decay) are common. *(Sources: AFO, MARL, MLP, PDL)* | |
| **Lipschitz Continuity** — ‖f(x)−f(y)‖ ≤ L‖x−y‖. L-smooth functions (Lipschitz gradient) admit gradient descent with step size 1/L. Required for convergence guarantees. *(Sources: DRL)* | |
| **Momentum** — Adds a fraction of the previous update to the current gradient step. Dampens oscillations and accelerates convergence in ravine-shaped loss surfaces. *(Sources: AFO, MARL, MLP, PDL)* | |
| **Newton's Method** — θ ← θ − H⁻¹∇f where H is the Hessian. Quadratic convergence near a minimum but expensive O(d³) Hessian inversion; approximated by quasi-Newton methods. *(Sources: AFO, CG, CO, GPML, MLCS, MLP, OPA)* | |
| **Proximal Gradient Method** — For objectives f(x) + g(x) where g is non-smooth, alternates between a gradient step on f and a proximal step (soft-thresholding for L1). Basis of ISTA/FISTA. | |
| **RMSProp** — Divides the gradient by the running average of its recent magnitudes. Fixes AdaGrad's diminishing learning rate; used in **Adam** as the second-moment estimate. *(Sources: AFO)* | |
| **Saddle Point** — A critical point that is a minimum in some directions and maximum in others. Common in non-convex deep learning loss surfaces; gradient descent often escapes them via noise. *(Sources: MLP)* | |
| **Simulated Annealing** — A probabilistic global optimization heuristic that accepts worse solutions with decreasing probability over time (analogous to physical annealing). Avoids local minima at the cost of slow convergence. *(Sources: AFO, CI, MLP, OPA)* | |
| **Slater's Condition** — If a strictly feasible point exists for a convex program (h(x) < 0), strong duality holds. Ensures the **duality gap** is zero and **KKT conditions** are sufficient. *(Sources: AFO)* | |
| **Stochastic Gradient Descent (SGD)** — Gradient descent using a single (or mini-batch) data point to estimate the gradient. Noisy but cheap per step; the workhorse of large-scale ML and deep learning. *(Sources: ADNN, AFO, DLG, DRL, MARL, MLCS, MLP, PDL)* | |
| **Strong Convexity** — f is m-strongly convex if f(x) − m/2‖x‖² is convex. Implies a unique global minimum; guarantees exponential convergence of gradient descent. *(Sources: AJE)* | |
| **Subgradient** — A generalization of gradient to non-differentiable convex functions: g is a subgradient at x if f(y) ≥ f(x) + gᵀ(y−x) for all y. Used in **lasso** and **SVM** training. *(Sources: AT, MLP)* | |
| **Warm Restarts (Cosine Annealing)** — Learning rate schedule that periodically resets the learning rate to a high value and decays cosinely. Encourages exploration of multiple loss basins; improves ensemble diversity. | |
| --- | |
| ## 5. Classical Machine Learning | |
| **AdaBoost** — Ensemble method that sequentially trains weak classifiers, reweighting training examples to focus on previously misclassified ones. Final prediction is a weighted vote; connections to exponential loss minimization. *(Sources: MLCS, MLP)* | |
| **Bias-Variance Tradeoff** — Generalization error = Bias² + Variance + Noise. High-capacity models overfit (high variance); low-capacity models underfit (high bias). Regularization and ensembling manage this tradeoff. *(Sources: CG)* | |
| **Bootstrapping** — Resampling with replacement from the dataset to estimate sampling distributions or create ensemble members (as in **Random Forests**). The bootstrap distribution approximates the true sampling distribution. *(Sources: DRL, MARL)* | |
| **Cross-Validation** — Splitting data into folds to estimate generalization error without a separate test set. k-fold CV averages performance across k train/validation splits; leave-one-out CV is the extreme case. *(Sources: CASI, CI, DSCL, GPML, ISL, NLPP-nlp-python-nltk-oreilly, PDSH)* | |
| **DBSCAN** — Density-based clustering that finds arbitrarily shaped clusters and labels outliers as noise. Core points have at least minPts neighbors within radius ε; no need to specify number of clusters. | |
| **Decision Tree** — A hierarchical classifier/regressor that recursively partitions the feature space using axis-aligned splits. Splits chosen to maximize information gain or minimize impurity (Gini, entropy). *(Sources: ITA)* | |
| **Dimensionality Reduction** — Mapping high-dimensional data to a lower-dimensional representation preserving relevant structure. Linear methods: **PCA**, **LDA**; nonlinear: **t-SNE**, **UMAP**, **autoencoders**. *(Sources: CG, CVFA, DMA, DSCL, MLP, PDSH)* | |
| **Elastic Net** — Regularization combining L1 (**lasso**) and L2 (**ridge**) penalties: α‖β‖₁ + (1−α)‖β‖₂². Performs variable selection like lasso while handling correlated features like ridge. *(Sources: MLP)* | |
| **Expectation-Maximization (EM) Algorithm** — Iterates between E-step (compute posterior over latent variables) and M-step (maximize expected log-likelihood). Converges to a local maximum of the likelihood; used in **GMMs**, HMMs, and missing-data problems. *(Sources: PDSH)* | |
| **Feature Engineering** — Transforming raw inputs into representations that improve model performance. Includes normalization, binning, interaction features, embeddings, and domain-specific transformations. *(Sources: CG, DLP, IALA, PDL, PDSH)* | |
| **Gaussian Mixture Model (GMM)** — A probabilistic model of the form p(x) = Σₖ πₖ N(x|μₖ,Σₖ). Fit by **EM**; generalizes k-means by assigning soft cluster memberships. *(Sources: MLP)* | |
| **Gradient Boosting** — Builds an additive ensemble of weak learners (typically trees) by fitting each new learner to the residuals (negative gradient of the loss). XGBoost, LightGBM, and CatBoost are leading implementations. *(Sources: CASI, MLP)* | |
| **Independent Component Analysis (ICA)** — Decomposes a multivariate signal into maximally statistically independent components. Assumes non-Gaussian latent sources; used in blind source separation and neuroscience. *(Sources: DLG, MLCS, MLP)* | |
| **K-Means Clustering** — Assigns n points to k clusters by alternating between assigning points to nearest centroids and updating centroids as cluster means. Minimizes within-cluster sum of squared distances; sensitive to initialization. *(Sources: CG, CI, ISL, PCVP, PDSH, PSM)* | |
| **K-Nearest Neighbors (KNN)** — Non-parametric method that classifies a point by majority vote (or averaging) among its k nearest training points. No training phase; prediction cost O(n). *(Sources: CI, ISL, PCVP)* | |
| **Kernel Methods** — Algorithms that implicitly operate in a high-dimensional feature space via a kernel function k(x,y) = ⟨φ(x),φ(y)⟩. The kernel trick avoids explicit feature map computation; used in **SVMs** and **Gaussian processes**. *(Sources: DMA, PDL)* | |
| **Lasso (L1 Regression)** — Linear regression with L1 penalty λΣ|βⱼ|. Produces sparse solutions (many zero coefficients) via soft-thresholding; equivalent to a Laplace prior in Bayesian terms. *(Sources: AFO, MLP)* | |
| **Latent Dirichlet Allocation (LDA)** — A probabilistic topic model where documents are mixtures of topics and topics are distributions over words. Inferred by variational EM or MCMC; the **Dirichlet** prior encourages sparsity. *(Sources: MLP)* | |
| **Linear Discriminant Analysis (LDA)** — Finds a linear projection maximizing between-class variance relative to within-class variance. Both a dimensionality reduction and a Bayes-optimal classifier for equal-covariance Gaussians. *(Sources: MLP)* | |
| **Linear Regression** — Models E[y|x] = xᵀβ. Fit by **OLS** (minimizing squared errors) or **ridge**/**lasso** (regularized). The **normal equations** give a closed-form solution. *(Sources: ADFM, ADNN, AFO, GPML, ISL, MC, MLCS, MLP, PAI, PDSH, PSC, PSDS, PSM)* | |
| **Logistic Regression** — Models P(y=1|x) = σ(xᵀβ) where σ is the sigmoid. Despite its name, it is a classifier; fit by maximizing log-likelihood (equivalent to minimizing **cross-entropy**). *(Sources: CASI, CG, GPML, ISL, MLCS, MLP, PDL)* | |
| **Naive Bayes** — A generative classifier assuming all features are conditionally independent given the class. Efficient and effective despite the (usually violated) independence assumption; strong for text classification. *(Sources: ISL)* | |
| **Non-Negative Matrix Factorization (NMF)** — Approximates a matrix V ≈ WH with W, H ≥ 0. Produces interpretable parts-based decompositions; used in topic modeling and image decomposition. *(Sources: CI, MLP)* | |
| **Ordinary Least Squares (OLS)** — Minimizes ‖y − Xβ‖². Solution β = (XᵀX)⁻¹Xᵀy (normal equations). BLUE estimator (Gauss-Markov); numerically solved via **QR decomposition**. *(Sources: CG, MLP)* | |
| **Overfitting** — When a model fits training data too well but fails to generalize. Diagnosed by a gap between training and validation loss; addressed by **regularization**, **dropout**, or more data. *(Sources: MLCS, MLP, PCVP, PSDS)* | |
| **Principal Component Analysis (PCA)** — Projects data onto directions of maximum variance (principal components), obtained from the eigenvectors of the covariance matrix or by **SVD**. Reduces dimensionality while preserving variance. *(Sources: PDSH)* | |
| **Random Forest** — An ensemble of decision trees trained on bootstrap samples with random feature subsets at each split. Reduces variance via averaging; robust to overfitting and competitive on tabular data. *(Sources: PDSH)* | |
| **Receiver Operating Characteristic (ROC) Curve** — Plots true positive rate vs. false positive rate across all classification thresholds. Area under the ROC curve (AUC) is a threshold-independent performance measure. *(Sources: MLP)* | |
| **Regularization** — Adding a penalty to the loss function to constrain model complexity: L1 (sparsity), L2 (weight shrinkage), or early stopping. Reduces overfitting by biasing toward simpler solutions. *(Sources: AFO, DLP, GPML, MLP, PDL, PDSH)* | |
| **Ridge Regression (L2 Regression)** — Adds L2 penalty λ‖β‖² to least squares. Closed-form solution β = (XᵀX + λI)⁻¹Xᵀy; shrinks coefficients continuously; equivalent to Gaussian prior. *(Sources: AFO, ATD, CASI, CG, GPML, ISL, MLP, PDSH)* | |
| **Support Vector Machine (SVM)** — Finds the maximum-margin hyperplane separating classes. The **kernel trick** extends it to nonlinear boundaries; the dual formulation involves only dot products. *(Sources: GPML, ISL, MLP)* | |
| **t-SNE** — Non-linear dimensionality reduction that preserves local neighborhood structure by minimizing KL divergence between pairwise similarity distributions in high and low dimensions. Used for visualization; not for general embedding. | |
| **UMAP** — Uniform Manifold Approximation and Projection: faster than **t-SNE**, preserves more global structure, and can be used for out-of-sample embedding. Based on Riemannian geometry and fuzzy topology. *(Sources: DVCTG)* | |
| --- | |
| ## 6. Probabilistic Graphical Models | |
| **Belief Propagation** — An exact inference algorithm on trees (and approximate on graphs with cycles, as **loopy BP**) that passes messages encoding marginal beliefs. Sum-product for marginals; max-product for MAP. *(Sources: ADFM, MLP)* | |
| **Conditional Random Field (CRF)** — A discriminative undirected graphical model over output sequences given inputs. Used in structured prediction (NLP, CV); avoids the label bias problem of HMMs. *(Sources: MLP)* | |
| **Hidden Markov Model (HMM)** — A sequential latent variable model with Markov transitions between hidden states and conditionally independent observations. Trained by **EM (Baum-Welch)**; decoded by Viterbi. *(Sources: GPML, MLCS, MLP)* | |
| **Markov Random Field (MRF)** — An undirected graphical model factoring as a product of potential functions over cliques. Encodes spatial dependencies; used in image segmentation and CRFs. *(Sources: GPML, MLP)* | |
| **Variational Inference** — Approximates an intractable posterior p(z|x) by optimizing a tractable family q(z) to minimize **KL divergence**. Turns inference into optimization; scalable with stochastic gradients (**ELBO** maximization). *(Sources: MLCS, MLP, PAI)* | |
| --- | |
| ## 7. Neural Networks & Deep Learning | |
| **Activation Function** — A nonlinear function applied element-wise after each linear layer. ReLU (max(0,x)) is the default; sigmoid and tanh saturate; GELU and SiLU are used in transformers. *(Sources: MARL, OPA, PDL)* | |
| **Attention Mechanism** — Computes a weighted sum of values, where weights come from a compatibility function between queries and keys: Attention(Q,K,V) = softmax(QKᵀ/√dₖ)V. Allows dynamic, content-based feature selection. *(Sources: AIE)* | |
| **Backpropagation** — Efficient computation of gradients in a neural network via the chain rule, propagating error signals from output to input. Runs in O(forward pass) time; requires storing activations. *(Sources: DLP, ISL, MARL, MLP, PDL)* | |
| **Batch Normalization** — Normalizes pre-activation values within a mini-batch, then scales and shifts with learned parameters. Reduces internal covariate shift, enables higher learning rates, and acts as a regularizer. *(Sources: ADNN, DLP, PDL)* | |
| **Capsule Network** — Architecture that represents entities as capsules (vectors encoding pose and existence probability) with dynamic routing. Aims to model part-whole relationships better than CNNs. | |
| **Convolutional Neural Network (CNN)** — Uses convolutional layers with shared weights and local receptive fields to capture spatial hierarchies. Pooling provides translation invariance; highly successful in computer vision. *(Sources: MLP)* | |
| **Decoder** — The generative half of an encoder-decoder architecture that maps a latent representation back to the output space. In transformers, the decoder uses masked self-attention and cross-attention to the encoder. | |
| **Dropout** — Randomly zeroes activations with probability p during training. Reduces co-adaptation and acts as an ensemble of thinned networks; at test time, activations are scaled by (1−p). *(Sources: DLG, DLP, PDL)* | |
| **Early Stopping** — Halting training when validation loss stops improving. A form of regularization that prevents overfitting; requires a held-out validation set. *(Sources: DLG, MLP, PDL)* | |
| **Embedding** — A dense vector representation of a discrete symbol (word, user, item) learned end-to-end. Captures semantic similarity via geometric proximity; e.g., word2vec, GloVe. *(Sources: AIE, DVCTG, MLP)* | |
| **Encoder** — Maps an input to a latent representation. In **autoencoders**, the encoder compresses; in transformers, the encoder builds contextual representations via stacked self-attention and feed-forward layers. *(Sources: PDL)* | |
| **Feed-Forward Network (FFN)** — A neural network with no cycles; information flows in one direction from input to output. The simplest architecture; universal approximation holds for sufficiently wide single hidden layers. *(Sources: FDL)* | |
| **Fine-Tuning** — Adapting a pre-trained model to a new task by continuing training, usually with a lower learning rate and task-specific head. Exploits learned representations; requires far less data than training from scratch. *(Sources: DLP, PDL)* | |
| **GRU (Gated Recurrent Unit)** — A simplified RNN with reset and update gates. Fewer parameters than **LSTM**; comparable performance on many sequence tasks. *(Sources: PDL)* | |
| **Knowledge Distillation** — Trains a smaller "student" model to match the output distribution (soft labels) of a larger "teacher" model. Transfers generalization capability; can also distill ensembles. *(Sources: MARL)* | |
| **Label Smoothing** — Replaces hard one-hot targets with (1−ε)·one_hot + ε/K. Reduces overconfidence and improves calibration; a form of output regularization. *(Sources: DLP)* | |
| **Layer Normalization** — Normalizes across the feature dimension (not batch). Preferred in transformers; stable for variable-length sequences and small batch sizes. | |
| **LSTM (Long Short-Term Memory)** — RNN variant with input, forget, and output gates controlling information flow through a cell state. Mitigates the vanishing gradient problem; dominant sequence model before transformers. *(Sources: PDL)* | |
| **Multilayer Perceptron (MLP)** — A fully connected feed-forward network with one or more hidden layers. Universal approximator for continuous functions; used as a building block in most architectures. *(Sources: MLP)* | |
| **Perceptron** — The simplest neural model: a single linear threshold unit. Perceptron learning converges if data is linearly separable; precursor to modern neural networks. *(Sources: MLP)* | |
| **Pre-training** — Training a model on a large dataset (often unsupervised) before **fine-tuning** on a downstream task. Enables strong performance with limited labeled data; foundational to transformer-based models. *(Sources: AIE)* | |
| **ReLU (Rectified Linear Unit)** — f(x) = max(0, x). Computationally simple, avoids vanishing gradients (vs. sigmoid/tanh), and enables sparse activations. Variants: Leaky ReLU, PReLU, ELU, GELU. *(Sources: MARL)* | |
| **Residual Connection (Skip Connection)** — Adds the input of a layer directly to its output: y = F(x) + x. Enables training of very deep networks by providing gradient highways; introduced in **ResNet**. *(Sources: ITA)* | |
| **ResNet (Residual Network)** — Deep CNN architecture using **residual connections** to enable training of networks with 50–1000+ layers. Dominated ImageNet classification; residual connections are now ubiquitous. *(Sources: DLP)* | |
| **Recurrent Neural Network (RNN)** — Processes sequences by maintaining a hidden state updated at each time step: h_t = f(h_{t-1}, x_t). Suffers from vanishing/exploding gradients for long sequences; superseded by **LSTM**, **GRU**, and **transformers**. *(Sources: DLP, MLP)* | |
| **Transfer Learning** — Using a model trained on a source task/domain as a starting point for a target task. Most effective when source and target domains are related; **fine-tuning** and feature extraction are the main approaches. *(Sources: ADNN, DLP, MLP)* | |
| **Universal Approximation Theorem** — A sufficiently wide (or deep) neural network can approximate any continuous function on a compact set to arbitrary accuracy. Establishes representational capacity; does not address learnability or generalization. *(Sources: MARL)* | |
| **Vanishing / Exploding Gradients** — Gradients that become exponentially small or large as they propagate through deep networks. Addressed by **residual connections**, **layer normalization**, careful initialization, and gradient clipping. *(Sources: MLP)* | |
| **Weight Initialization** — Strategy for setting initial network weights. Xavier/Glorot initialization scales by 1/√n_in; He initialization scales by 2/√n_in for ReLU. Proper initialization prevents vanishing/exploding gradients. *(Sources: ADNN)* | |
| --- | |
| ## 8. Transformers & Large Language Models | |
| **BERT** — Bidirectional Encoder Representations from Transformers: pre-trained by masked language modeling (MLM) and next sentence prediction. Produces deep bidirectional representations; fine-tuned for classification, NER, QA. | |
| **Cross-Attention** — Attention where queries come from one sequence (e.g., decoder) and keys/values from another (e.g., encoder output). Enables the decoder to attend to input representations. | |
| **Flash Attention** — An IO-aware exact attention algorithm that reorders computations to minimize memory reads/writes between HBM and SRAM. Reduces memory from O(n²) to O(n) and dramatically speeds up transformer training. | |
| **GPT (Generative Pre-trained Transformer)** — A decoder-only transformer pre-trained by causal language modeling (next-token prediction). Scales to billions of parameters; few-shot learner via in-context prompting. *(Sources: PDL)* | |
| **Instruction Tuning** — **Fine-tuning** a language model on datasets of instruction-response pairs. Improves instruction-following capability; often followed by **RLHF** alignment. *(Sources: AIE)* | |
| **Multi-Head Attention** — Runs h parallel **attention** heads with different linear projections, then concatenates outputs. Allows the model to attend to different aspects of the input simultaneously. *(Sources: PDL)* | |
| **Positional Encoding** — Injects sequence order information into transformer inputs (which have no inherent order). Sinusoidal encodings (original Transformer), learned encodings, or relative encodings (RoPE, ALiBi) are common. *(Sources: PDL)* | |
| **RLHF (Reinforcement Learning from Human Feedback)** — Fine-tunes language models using a reward model trained on human preference data, optimized via **PPO**. Aligns model outputs with human values; used in InstructGPT, ChatGPT. *(Sources: AIE)* | |
| **Scaled Dot-Product Attention** — Attention(Q,K,V) = softmax(QKᵀ/√dₖ)V. The √dₖ scaling prevents softmax saturation in high dimensions. Core operation in all transformer architectures. | |
| **Self-Attention** — **Attention** where queries, keys, and values all come from the same sequence. Enables each token to directly attend to all other tokens; O(n²) complexity in sequence length. *(Sources: PDL)* | |
| **Tokenization** — Converting raw text into a sequence of discrete tokens (subword units). BPE (Byte Pair Encoding), WordPiece, and SentencePiece are common algorithms; vocabulary size balances coverage and embedding table size. *(Sources: ADNN, NLPP-nlp-python-nltk-oreilly, PDL)* | |
| **Transformer** — Architecture based entirely on **multi-head attention** and feed-forward layers, without recurrence. Introduced in "Attention Is All You Need"; now dominant in NLP, vision, and multimodal settings. *(Sources: PDL)* | |
| --- | |
| ## 9. Generative Models | |
| **Autoregressive Model** — Factorizes the joint distribution as p(x) = Πᵢ p(xᵢ|x₁,…,xᵢ₋₁). Exact likelihood; sequential sampling; used in PixelCNN, WaveNet, and **GPT**-style LLMs. *(Sources: AIE)* | |
| **Conditional GAN** — A **GAN** conditioned on class labels or other auxiliary information. Enables class-conditional generation; the discriminator also receives the conditioning information. *(Sources: GPML)* | |
| **Denoising Diffusion Probabilistic Model (DDPM)** — Learns to reverse a gradual noising process. Forward: adds Gaussian noise over T steps. Reverse: learns p(x_{t-1}|x_t) with a neural network. Produces high-quality samples; slower than **GANs** but more stable. *(Sources: MLP)* | |
| **Diffusion Model** — A generative model that learns to reverse a gradual noising (diffusion) process. State-of-the-art for image, audio, and video generation; equivalent to score matching. *(Sources: DVCTG)* | |
| **ELBO (Evidence Lower Bound)** — log p(x) ≥ E_q[log p(x,z)] − E_q[log q(z)] = E_q[log p(x|z)] − D_KL(q(z‖x)‖p(z)). The objective maximized in **VAEs** and **variational inference**; a lower bound on the log marginal likelihood. *(Sources: PAI)* | |
| **Flow Matching** — Trains a vector field whose flow transforms a noise distribution to a data distribution. More stable than score matching; continuous-time generalization of normalizing flows. | |
| **GAN (Generative Adversarial Network)** — A generator G and discriminator D trained adversarially: G minimizes and D maximizes log D(x) + log(1−D(G(z))). Produces sharp samples; training is unstable; mode collapse is common. *(Sources: DLP)* | |
| **Latent Diffusion Model** — Applies **diffusion** in a compressed latent space (from a **VAE**), then decodes with the VAE decoder. Reduces computational cost while maintaining generation quality; basis of Stable Diffusion. | |
| **Normalizing Flow** — A sequence of invertible, differentiable transformations mapping a simple distribution to a complex one. Exact likelihood evaluation via change of variables; examples: RealNVP, Glow, NICE. *(Sources: NLPP-nlp-python-nltk-oreilly)* | |
| **Reparameterization Trick** — Writes a stochastic sample z ~ q(z|x) as z = g(x, ε) where ε ~ p(ε) is noise-free. Allows gradient flow through stochastic nodes; essential for training **VAEs**. | |
| **Score Function** — ∇_x log p(x). The gradient of the log density with respect to the input. Score matching learns this function directly; used in **diffusion models** and energy-based models. *(Sources: MLP)* | |
| **Score Matching** — Trains a model to match the **score function** of the data distribution. Avoids computing the partition function; equivalent to **diffusion models** via the denoising score matching formulation. *(Sources: MLP)* | |
| **VAE (Variational Autoencoder)** — An **autoencoder** with a probabilistic latent space: the encoder outputs q(z|x) ≈ N(μ,σ²) and the decoder p(x|z). Trained by maximizing the **ELBO**; enables smooth latent interpolation. *(Sources: MLP)* | |
| **Wasserstein GAN (WGAN)** — Replaces the GAN discriminator loss with the Wasserstein-1 distance, computed via a Lipschitz-constrained critic. More stable training, meaningful loss metric; weight clipping or gradient penalty enforces the Lipschitz constraint. *(Sources: DRL)* | |
| --- | |
| ## 10. Computer Vision | |
| **Anchor Boxes** — Pre-defined bounding boxes of various scales and aspect ratios at each spatial location. Object detection models like Faster R-CNN predict offsets from anchors; YOLO uses a similar approach. | |
| **Canny Edge Detector** — Multi-stage edge detection: Gaussian smoothing → gradient magnitude → non-maximum suppression → hysteresis thresholding. Produces thin, connected edge maps. | |
| **Feature Pyramid Network (FPN)** — A multi-scale feature extractor that combines semantically rich deep features with spatially precise shallow features via top-down pathways and lateral connections. Improves small-object detection. | |
| **HOG (Histogram of Oriented Gradients)** — Computes histograms of gradient orientations in local image patches. Robust to illumination changes; classic pedestrian detection feature before deep learning. *(Sources: PCVP, PDSH)* | |
| **Image Segmentation** — Assigns a class label to every pixel (semantic) or identifies individual object instances (instance). U-Net, Mask R-CNN, and DeepLab are dominant architectures. *(Sources: MLP, PCVP, PDL)* | |
| **Instance Segmentation** — Detects and delineates individual object instances, producing per-instance masks. Extends object detection with a mask prediction head; Mask R-CNN is the standard approach. *(Sources: DLP, PDL)* | |
| **IoU (Intersection over Union)** — Area(A∩B)/Area(A∪B). Standard metric for bounding box overlap in object detection; a prediction is a true positive if IoU > threshold (typically 0.5). *(Sources: ITA)* | |
| **Mean Average Precision (mAP)** — The mean of per-class average precision across IoU thresholds. Standard evaluation metric for object detection; accounts for both precision and recall. *(Sources: MLP)* | |
| **Non-Maximum Suppression (NMS)** — Post-processing step that removes redundant overlapping bounding boxes, keeping only the highest-confidence detection among overlapping predictions. | |
| **Object Detection** — Localizes and classifies objects in an image, outputting bounding boxes and class labels. Two-stage (Faster R-CNN) and single-stage (YOLO, SSD) detectors trade accuracy for speed. *(Sources: CVAA, DLP, MLP, PDL)* | |
| **Optical Flow** — The apparent motion of pixels between frames. Estimated by Lucas-Kanade (sparse) or Horn-Schunck (dense); used in video understanding, action recognition, and video generation. *(Sources: CVAA, PCVP)* | |
| **Pooling** — Reduces spatial dimensions by aggregating values (max, average) over local patches. Global average pooling collapses spatial dimensions entirely; provides translation invariance. *(Sources: DLG, DLP, MARL)* | |
| **SIFT (Scale-Invariant Feature Transform)** — Detects and describes local keypoints invariant to scale, rotation, and illumination changes. Produces 128-dimensional descriptors; foundational for feature matching and structure-from-motion. *(Sources: MLP, PCVP)* | |
| **Stride** — The step size by which a convolutional filter slides across the input. Stride > 1 downsamples the feature map; stride = 1 preserves spatial resolution. *(Sources: MARL)* | |
| **U-Net** — An encoder-decoder CNN with skip connections between encoder and decoder layers. Originally for biomedical segmentation; widely used for semantic segmentation and **diffusion model** denoising networks. | |
| --- | |
| ## 11. Natural Language Processing | |
| **Bag of Words (BoW)** — Represents a document as a vector of word counts, ignoring order. Simple and effective baseline; extended by **TF-IDF** weighting. *(Sources: MLCS, MLP)* | |
| **Byte Pair Encoding (BPE)** — A subword **tokenization** algorithm that iteratively merges the most frequent adjacent token pairs. Balances vocabulary size with coverage of rare words. *(Sources: DLP)* | |
| **CRF (Conditional Random Field)** — (See Probabilistic Graphical Models.) In NLP, linear-chain CRFs are used for sequence labeling (NER, POS tagging) by modeling label transition probabilities. *(Sources: MLP)* | |
| **FastText** — Extends word2vec by representing words as bags of character n-grams. Handles out-of-vocabulary words and morphologically rich languages better than word-level embeddings. | |
| **GloVe (Global Vectors)** — Trains word embeddings by factorizing the log word-word co-occurrence matrix. Captures global corpus statistics; competitive with word2vec on analogy tasks. | |
| **Language Model** — A probability distribution over sequences of tokens: p(x₁,…,xₙ). Evaluated by **perplexity**; modern large language models (**GPT**, **BERT**) pre-train on massive corpora. *(Sources: MLP)* | |
| **Named Entity Recognition (NER)** — Identifies and classifies named entities (persons, organizations, locations) in text. Sequence labeling task; solved with **CRFs**, **LSTMs**, or fine-tuned **transformers**. *(Sources: NLPP-nlp-python-nltk-oreilly)* | |
| **N-gram** — A contiguous sequence of n tokens. Unigrams (n=1), bigrams (n=2), trigrams (n=3). Language models based on n-gram counts suffer from data sparsity; largely replaced by neural models. *(Sources: MLP)* | |
| **Perplexity** — 2^{H(P,Q)} = exp(−(1/N)Σ log p(xᵢ)). Measures how well a language model predicts a test corpus; lower is better. Geometric mean of inverse probabilities. *(Sources: MLP)* | |
| **Sequence-to-Sequence (Seq2Seq)** — An encoder-decoder architecture for mapping one sequence to another (e.g., translation, summarization). The encoder reads the input; the decoder generates the output autoregressively. *(Sources: PDL)* | |
| **TF-IDF** — Term Frequency × Inverse Document Frequency: weights words by how often they appear in a document relative to how common they are across documents. Standard sparse text representation for retrieval and classification. *(Sources: MLP)* | |
| **Word2Vec** — Trains word embeddings by predicting context words (skip-gram) or the center word (CBOW) from local windows. Captures semantic relationships via vector arithmetic (king − man + woman ≈ queen). | |
| --- | |
| ## 12. Reinforcement Learning | |
| **Actor-Critic** — Combines a policy (**actor**) and a value function (**critic**) to reduce gradient variance. The critic provides a baseline; A2C and A3C are synchronous/asynchronous variants. *(Sources: ADFM, MARL)* | |
| **Bellman Equation** — Recursive relationship for value functions: V(s) = Σₐ π(a|s)[R(s,a) + γΣₛ' P(s'|s,a)V(s')]. The foundation of dynamic programming and TD learning. *(Sources: DRL, MARL, OPA)* | |
| **Discount Factor (γ)** — Weights future rewards: return = Σₜ γᵗrₜ. γ ∈ [0,1]; γ=0 is myopic, γ→1 values future rewards equally. Necessary for convergence of infinite-horizon problems. *(Sources: DRL, MARL)* | |
| **Dynamic Programming (DP)** — Solves MDPs exactly given a model by iterating the **Bellman equation**: policy evaluation (V → Vπ) and policy improvement, or value iteration. *(Sources: ADFM, AFO, AJE, DRL, ITA, MARL, MLP, NLPP-nlp-python-nltk-oreilly, RL)* | |
| **Exploration-Exploitation Tradeoff** — The dilemma between exploiting known high-reward actions and exploring uncertain ones. Addressed by ε-greedy, UCB, Thompson sampling, or intrinsic motivation. *(Sources: AFO, PAI)* | |
| **Markov Decision Process (MDP)** — A tuple (S, A, P, R, γ): states, actions, transition probabilities, reward function, discount factor. The standard formalism for sequential decision-making under uncertainty. *(Sources: DRL, MARL, MLP)* | |
| **Monte Carlo Tree Search (MCTS)** — Plans by building a search tree via simulation. Selection (UCT), expansion, rollout, backpropagation phases. Used in AlphaGo and game-playing agents. *(Sources: ADFM, DRL, MARL, RL)* | |
| **Policy** — A mapping π: S → Δ(A) from states to distributions over actions. The agent's behavior; **RL** aims to find the policy maximizing expected discounted return. *(Sources: DRL, FDL, MARL, MLP)* | |
| **Policy Gradient** — Optimizes the policy directly by estimating ∇E[return] = E[∇log π(a|s)·G]. High variance but applicable to continuous and stochastic action spaces; REINFORCE is the simplest variant. *(Sources: MARL)* | |
| **PPO (Proximal Policy Optimization)** — A policy gradient method that clips the probability ratio between new and old policies to prevent large updates. Stable, sample-efficient; default RL algorithm for LLM fine-tuning (**RLHF**). *(Sources: MARL)* | |
| **Q-Function (Action-Value Function)** — Q(s,a) = E[Σₜ γᵗrₜ | s₀=s, a₀=a, π]. The expected return starting from state s, taking action a, then following policy π. **Q-learning** estimates Q directly. *(Sources: MARL)* | |
| **Q-Learning** — Off-policy TD algorithm: Q(s,a) ← Q(s,a) + α[r + γ max_{a'} Q(s',a') − Q(s,a)]. Converges to optimal Q* under mild conditions; extended to deep RL by **DQN**. *(Sources: ADFM, DRL, MARL)* | |
| **Deep Q-Network (DQN)** — Combines Q-learning with a deep neural network to approximate Q*. Uses experience replay and a target network for stability; achieved human-level Atari performance. *(Sources: FDL)* | |
| **Reward Shaping** — Adding auxiliary reward signals to speed up learning. Must be done carefully to avoid changing the optimal policy; potential-based shaping is policy-invariant. *(Sources: ADFM)* | |
| **SAC (Soft Actor-Critic)** — Off-policy actor-critic maximizing a maximum-entropy objective: E[return] + αH[π]. Entropy regularization encourages exploration; state-of-the-art for continuous control. | |
| **SARSA** — On-policy TD algorithm: Q(s,a) ← Q(s,a) + α[r + γQ(s',a') − Q(s,a)]. Converges to the optimal policy for the behavior policy; safer but less efficient than **Q-learning**. *(Sources: ADFM, MARL)* | |
| **Temporal Difference (TD) Learning** — Updates value estimates using bootstrapped targets: V(s) ← V(s) + α[r + γV(s') − V(s)]. Combines the online nature of Monte Carlo with bootstrapping; **Q-learning** and **SARSA** are special cases. | |
| **Value Function** — V(s) = E[Σₜ γᵗrₜ | s₀=s, π]. The expected discounted return from state s following policy π. Estimated by TD learning, Monte Carlo, or dynamic programming. *(Sources: DRL)* | |
| --- | |
| ## 13. Algorithms & Data Structures | |
| **A* Search** — Heuristic graph search minimizing f(n) = g(n) + h(n), where g is cost-so-far and h is an admissible heuristic. Optimal and complete when the heuristic is admissible. | |
| **Amortized Analysis** — Assigns costs to operations such that the average cost per operation over a sequence is bounded, even if individual operations are expensive. Used to analyze dynamic arrays, splay trees, and union-find. *(Sources: ITA)* | |
| **B-tree** — A self-balancing tree with high branching factor (many children per node). Minimizes disk I/O; standard in database indexes and file systems. *(Sources: ITA)* | |
| **BFS (Breadth-First Search)** — Explores a graph level-by-level. Finds shortest paths (unweighted); time O(V+E); requires a queue. *(Sources: ITA)* | |
| **Bloom Filter** — A probabilistic data structure for set membership queries. Space-efficient; allows false positives but not false negatives; used in databases and network systems. | |
| **DFS (Depth-First Search)** — Explores a graph by going deep before backtracking. Used for topological sort, cycle detection, and strongly connected components; time O(V+E). *(Sources: ITA)* | |
| **Dijkstra's Algorithm** — Single-source shortest paths for non-negative edge weights. Uses a priority queue; O((V+E) log V) with a binary heap. *(Sources: ITA, OPA)* | |
| **Dynamic Programming** — Solves optimization problems by breaking them into overlapping subproblems and storing solutions (memoization or tabulation). Requires optimal substructure and overlapping subproblems. *(Sources: ADFM, AFO, AJE, DRL, ITA, MARL, MLP, NLPP-nlp-python-nltk-oreilly, RL)* | |
| **Hash Table** — A data structure mapping keys to values via a hash function. Expected O(1) lookup, insert, delete; collisions handled by chaining or open addressing. *(Sources: ITA)* | |
| **Heap (Priority Queue)** — A complete binary tree satisfying the heap property (parent ≤ or ≥ children). Supports O(log n) insert and extract-min/max; used in Dijkstra, heapsort, and scheduling. *(Sources: ITA)* | |
| **Minimum Spanning Tree (MST)** — A spanning tree of a weighted graph with minimum total edge weight. Found by Kruskal's (sort + union-find) or Prim's (greedy + priority queue) algorithms. *(Sources: ITA, MLP)* | |
| **Quicksort** — Divide-and-conquer sort that partitions around a pivot. Average O(n log n); worst-case O(n²) (avoided by randomized pivot); in-place with small constant factors. *(Sources: AJE, ITA)* | |
| **Red-Black Tree** — A self-balancing BST maintaining height O(log n) via coloring rules. Guarantees O(log n) search, insert, delete; used in language standard library maps and sets. *(Sources: ITA)* | |
| **Strongly Connected Components (SCC)** — Maximal subgraphs where every vertex reaches every other. Found in O(V+E) by Tarjan's or Kosaraju's algorithm. *(Sources: ITA)* | |
| **Topological Sort** — A linear ordering of DAG vertices such that each edge points forward. Found by DFS finishing-time order or Kahn's BFS algorithm; used in dependency resolution. *(Sources: AJE, ITA)* | |
| **Union-Find (Disjoint Set Union)** — Maintains a partition of elements under union and find operations. With path compression and union by rank, nearly O(1) per operation (inverse Ackermann). *(Sources: ACPS, ITA)* | |
| --- | |
| ## 14. Signal Processing & Transforms | |
| **Convolution** — (f * g)(t) = ∫ f(τ)g(t−τ)dτ. Multiplication in the frequency domain; used for filtering, smoothing, and feature extraction in CNNs. *(Sources: DSP, IALA, MARL, PDL, PSM)* | |
| **Discrete Fourier Transform (DFT)** — Transforms a finite sequence xₙ into frequency components Xₖ = Σₙ xₙ e^{−2πink/N}. Reveals periodicity; naively O(N²), efficiently computed as **FFT**. *(Sources: DSP)* | |
| **Fast Fourier Transform (FFT)** — O(N log N) algorithm for computing the **DFT** via divide-and-conquer (Cooley-Tukey). The most important numerical algorithm; enables fast convolution, spectral analysis, and audio processing. *(Sources: DSP, ITA)* | |
| **Fourier Transform** — Decomposes a function into sinusoidal components: F̂(ω) = ∫ f(t)e^{−2πiωt}dt. Converts between time and frequency domains; invertible with the inverse Fourier transform. *(Sources: DSP, GPML, ITA, PDL)* | |
| **Impulse Response** — The output of a linear time-invariant system to a unit impulse (Dirac delta). Fully characterizes the system; the output to any input is the **convolution** of the input with the impulse response. *(Sources: DSP)* | |
| **Kalman Filter** — An optimal recursive linear estimator for linear Gaussian state-space models. Alternates prediction (using system dynamics) and update (incorporating observations). Extended Kalman Filter handles nonlinear systems. *(Sources: MLP, OPA)* | |
| **Laplace Transform** — L{f(t)}(s) = ∫₀^∞ f(t)e^{−st}dt. Extends the Fourier transform to a complex domain; converts differential equations to algebraic ones; used for system analysis and control. *(Sources: DSP)* | |
| **Nyquist Sampling Theorem** — A bandlimited signal with maximum frequency fₘₐₓ is perfectly reconstructed from samples taken at rate ≥ 2fₘₐₓ (Nyquist rate). Sampling below this rate causes aliasing. | |
| **Power Spectral Density (PSD)** — The distribution of signal power over frequency. Estimated by periodogram or Welch's method; used in stationarity assessment and filter design. | |
| **Short-Time Fourier Transform (STFT)** — Applies the Fourier transform to windowed segments of a signal. Produces a time-frequency spectrogram; temporal resolution vs. frequency resolution tradeoff (uncertainty principle). | |
| **Wavelet Transform** — Decomposes a signal using basis functions (wavelets) that are localized in both time and frequency. Provides multi-resolution analysis; used in compression (JPEG2000), denoising, and medical imaging. *(Sources: MLP)* | |
| **Wiener Filter** — Optimal linear filter minimizing the mean squared error between the filtered signal and a desired signal. Requires knowledge of signal and noise spectra; basis for many denoising methods. *(Sources: DSP)* | |
| **Z-Transform** — Z{xₙ}(z) = Σₙ xₙ z^{−n}. The discrete-time analog of the **Laplace transform**; used to analyze digital filters and discrete-time systems. *(Sources: DSP)* | |
| --- | |
| ## 15. Bayesian & Probabilistic Methods | |
| **Bayesian Network** — A directed acyclic graph encoding conditional independence relationships between variables. Joint distribution factorizes as p(X) = Πᵢ p(Xᵢ | parents(Xᵢ)); used for reasoning under uncertainty. *(Sources: ADFM)* | |
| **Bayesian Optimization** — Uses a surrogate model (typically a **Gaussian process**) and an acquisition function to optimize expensive black-box functions. Efficient for hyperparameter tuning; balances exploration and exploitation. *(Sources: PAI)* | |
| **Empirical Bayes** — Estimates the prior hyperparameters from data (by marginalizing over parameters) rather than specifying them a priori. A compromise between full Bayes and MLE; used in hierarchical models. *(Sources: CASI, MLCS, MLP)* | |
| **Gibbs Sampling** — A Markov Chain Monte Carlo method that samples each variable conditioned on the current values of all others. Converges to the joint distribution; easy to implement when conditionals are tractable. *(Sources: ADFM, DLG, MLCS, MLP)* | |
| **Hierarchical Model** — A model with multiple levels of latent variables, where lower-level parameters are drawn from higher-level distributions. Enables partial pooling of information across groups. *(Sources: PDSH)* | |
| **Markov Chain Monte Carlo (MCMC)** — A class of algorithms for sampling from a target distribution by constructing a Markov chain with that distribution as its stationary distribution. Includes Metropolis-Hastings, **Gibbs sampling**, and HMC. *(Sources: GPML, MLCS, MLP)* | |
| **Metropolis-Hastings** — An MCMC algorithm that proposes moves from a proposal distribution and accepts or rejects them with a ratio that ensures detailed balance. General-purpose; acceptance rate tuning is important. *(Sources: AFO)* | |
| **Hamiltonian Monte Carlo (HMC)** — MCMC using Hamiltonian dynamics to make large, informed proposals. Suppresses random-walk behavior; more efficient than Metropolis-Hastings in high dimensions; used in Stan. *(Sources: ITA)* | |
| **Prior Distribution** — The distribution P(θ) encoding beliefs about parameters before observing data. Informative priors encode domain knowledge; uninformative (flat, Jeffreys) priors minimize prior influence. *(Sources: DMMD)* | |
| **Posterior Distribution** — P(θ|data) ∝ P(data|θ)P(θ). The updated belief about parameters after observing data. The central object of Bayesian inference; summarized by MAP, posterior mean, or credible intervals. *(Sources: AFO, DMMD)* | |
| --- | |
| *Generated from index and TOC extractions across 27 textbooks: linear algebra (LADR, LA4, FLAO, IALA, LAY, MC), probability (IP, MLP), optimization (AFO, OPA, CO), classical ML (ISL, MLCS, SKL), deep learning (DLG, DLP, PDL, FDL, ADNN), reinforcement learning (RL, DRL, MARL), computer vision (OCV, CVAA, PCVP), NLP (NLPP, PDL), algorithms (ITA, DSA, AJE, AFO), and signal processing (DSP).* |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment