Optimization Algorithms Overview

Applications Overview


1. Gradient Descent (GD) & Variants (SGD, Adam)

  • Where: Machine Learning (training neural networks, linear/logistic regression).
  • Why:
    • Simplicity: Easy to implement for differentiable objectives.
    • Scalability: Stochastic variants (SGD) handle large datasets.
    • Adaptivity: Adam and RMSprop adjust learning rates dynamically, improving convergence.
  • Example: Training deep learning models (e.g., ResNet, GPT).

2. Genetic Algorithms (GA)

  • Where: Engineering design, combinatorial optimization (e.g., aerospace, logistics).
  • Why:
    • Global Search: Avoids local optima in non-convex/non-differentiable problems.
    • Flexibility: Works with discrete/continuous variables.
  • Example: Optimizing aircraft wing shapes or factory layouts.

3. Simplex Method (Linear Programming)

  • Where: Operations Research (supply chain, resource allocation).
  • Why:
    • Efficiency: Solves linear problems with thousands of variables.
    • Robustness: Well-understood for real-world LP problems.
  • Example: Maximizing profit in manufacturing resource allocation.

4. Particle Swarm Optimization (PSO)

  • Where: Control systems, robotics, antenna design.
  • Why:
    • Swarm Intelligence: Balances exploration and exploitation without gradients.
    • Parallelism: Easy to parallelize for distributed systems.
  • Example: Tuning PID controllers in industrial automation.

5. Interior-Point Methods

  • Where: Convex optimization (finance, engineering).
  • Why:
    • Speed: Polynomial-time convergence for large convex problems.
    • Accuracy: Handles ill-conditioned matrices better than Simplex.
  • Example: Portfolio optimization (Markowitz model).

6. Bayesian Optimization

  • Where: Hyperparameter tuning (ML/AI), drug discovery.
  • Why:
    • Sample Efficiency: Minimizes expensive evaluations (e.g., training large models).
    • No Gradients: Ideal for black-box functions.
  • Example: Optimizing learning rates/batch sizes for deep learning.

7. Simulated Annealing (SA)

  • Where: VLSI design, scheduling, materials science.
  • Why:
    • Escapes Local Optima: Probabilistic acceptance of worse solutions.
    • Simplicity: Easy to implement for combinatorial problems.
  • Example: Optimizing chip layouts to minimize heat dissipation.

8. L-BFGS (Limited-Memory BFGS)

  • Where: Medium-scale ML models (e.g., logistic regression, SVMs).
  • Why:
    • Memory Efficiency: Approximates Hessian without storing large matrices.
    • Speed: Faster convergence than vanilla GD for smooth functions.
  • Example: Training medium-sized neural networks.

9. NSGA-II (Multi-Objective)

  • Where: Engineering design, economics, environmental planning.
  • Why:
    • Pareto Optimality: Maintains diverse solutions on the trade-off front.
    • Scalability: Handles multiple conflicting objectives.
  • Example: Designing cars for cost, safety, and fuel efficiency.

10. Alternating Direction Method of Multipliers (ADMM)

  • Where: Distributed systems, image processing, finance.
  • Why:
    • Decentralization: Splits problems into subproblems for parallel solving.
    • Convergence: Provably convergent for convex problems; often effective as a heuristic on non-smooth or non-convex ones.
  • Example: Distributed training of federated learning models.

Why These Algorithms Dominate

  1. Problem-Specific Strengths:

    • Gradient-based methods (GD, Adam) dominate ML due to automatic differentiation frameworks (PyTorch, TensorFlow).
    • Metaheuristics (GA, PSO) excel in non-differentiable/combinatorial spaces.
    • Interior-point methods solve large convex problems faster than Simplex.
  2. Scalability:

    • SGD and Adam handle big data via mini-batches.
    • Bayesian Optimization minimizes costly evaluations (e.g., lab experiments).
  3. Ease of Use:

    • Libraries like scipy.optimize and Optuna abstract complexity.
  4. Theoretical Guarantees:

    • Convex methods (Interior-Point, ADMM) have convergence proofs.

Emerging Trends

  • Quantum Optimization: QAOA for finance/chemistry (still experimental).
  • AutoML: Algorithm selection via meta-learning (e.g., Google’s Vizier).
  • Differentiable Optimization: Integrating solvers into neural networks (e.g., CVXPYLayer).

Visualization


1. Gradient Descent

graph TD
  A[Initialize Parameters] --> B[Compute Gradient]
  B --> C[Update Parameters]
  C --> D{Converged?}
  D -->|No| B
  D -->|Yes| E[Return Parameters]

2. Genetic Algorithm

graph TD
  A[Initialize Population] --> B[Evaluate Fitness]
  B --> C[Select Parents]
  C --> D[Crossover]
  D --> E[Mutation]
  E --> F[Replace Population]
  F --> G{Max Generations?}
  G -->|No| B
  G -->|Yes| H[Return Best]

3. Simplex Method (Linear Programming)

graph TD
  A[Initialize Tableau] --> B[Select Pivot Column]
  B --> C[Select Pivot Row]
  C --> D[Perform Pivot]
  D --> E{Optimal?}
  E -->|No| B
  E -->|Yes| F[Return Solution]

4. Particle Swarm Optimization (PSO)

graph TD
  A[Initialize Particles] --> B[Update Velocities]
  B --> C[Update Positions]
  C --> D[Evaluate Fitness]
  D --> E[Update Best]
  E --> F{Stopping Condition?}
  F -->|No| B
  F -->|Yes| G[Return Best]

5. Adam Optimizer

graph TD
  A[Initialize m, v] --> B[Compute Gradient]
  B --> C[Update m]
  C --> D[Update v]
  D --> E[Bias Correction]
  E --> F[Update Parameters]
  F --> G{Converged?}
  G -->|No| B
  G -->|Yes| H[Return Parameters]

6. Simulated Annealing

graph TD
  A[Initialize T, Solution] --> B[Perturb Solution]
  B --> C[Calculate ΔE]
  C --> D{Accept?}
  D -->|Yes| E[Update Solution]
  D -->|No| F[Reject]
  E --> G[Cool T]
  F --> G
  G --> H{T > T_min?}
  H -->|Yes| B
  H -->|No| I[Return Best]

7. Ant Colony Optimization (ACO)

graph TD
  A[Initialize Pheromones] --> B[Ants Build Paths]
  B --> C[Update Pheromones]
  C --> D{Stopping Condition?}
  D -->|No| B
  D -->|Yes| E[Return Best Path]

8. Bayesian Optimization

graph TD
  A[Initialize Model] --> B[Select New Point]
  B --> C[Evaluate Function]
  C --> D[Update Model]
  D --> E{Max Iterations?}
  E -->|No| B
  E -->|Yes| F[Return Best]

9. NSGA-II (Multi-Objective)

graph TD
  A[Initialize Population] --> B[Non-Dominated Sort]
  B --> C[Crowding Distance]
  C --> D[Select Top]
  D --> E[Crossover/Mutation]
  E --> F{Generations < Max?}
  F -->|Yes| B
  F -->|No| G[Return Pareto Front]

10. ADMM

graph TD
  A[Initialize x, z, y] --> B[Update x]
  B --> C[Update z]
  C --> D[Update y]
  D --> E{Converged?}
  E -->|No| B
  E -->|Yes| F[Return x, z]

11. Levenberg-Marquardt

graph TD
  A[Initialize] --> B[Compute Residual]
  B --> C[Solve Equation]
  C --> D[Update Parameters]
  D --> E{Error Reduced?}
  E -->|Yes| F[Decrease λ]
  E -->|No| G[Increase λ]
  F --> H{Converged?}
  G --> H
  H -->|No| B
  H -->|Yes| I[Return Parameters]

12. Nelder-Mead (Simplex)

graph TD
  A[Initialize Simplex] --> B[Sort Points]
  B --> C[Compute Centroid]
  C --> D[Reflect]
  D --> E{Improved?}
  E -->|Yes| F[Expand]
  E -->|No| G[Contract]
  F --> H[Replace]
  G --> H
  H --> I{Converged?}
  I -->|No| B
  I -->|Yes| J[Return Best]

13. QAOA (Quantum Optimization)

graph TD
  A[Initialize Parameters] --> B[Prepare State]
  B --> C[Measure Energy]
  C --> D[Optimize Parameters]
  D --> E{Converged?}
  E -->|No| B
  E -->|Yes| F[Return Parameters]

14. Branch and Bound

graph TD
  A[Initialize Queue] --> B{Queue Empty?}
  B -->|Yes| C[Return Best]
  B -->|No| D[Dequeue Node]
  D --> E{Node Promising?}
  E -->|Yes| F[Branch]
  E -->|No| G[Prune]
  F --> H[Enqueue Children]
  H --> B
  G --> B

15. Differential Evolution

graph TD
  A[Initialize Population] --> B[For Each Vector]
  B --> C[Select Parents]
  C --> D[Generate Mutant]
  D --> E[Crossover]
  E --> F{Better?}
  F -->|Yes| G[Replace]
  F -->|No| H[Keep]
  G --> I{All Done?}
  H --> I
  I -->|No| B
  I -->|Yes| J{Converged?}
  J -->|No| B
  J -->|Yes| K[Return Best]

Theory and Implementation


1. Classical Optimization Algorithms

Gradient-Based Methods

  1. Gradient Descent

    • Theory: Iteratively updates parameters in the direction of the negative gradient to minimize a function.
    • Example: Training linear/logistic regression models.
    • Pseudo-Code (a runnable Go sketch follows at the end of this subsection):
      Initialize θ, learning rate α
      while not converged:
          Compute gradient ∇J(θ)
          θ = θ - α*∇J(θ)
  2. Conjugate Gradient

    • Theory: Solves linear systems by generating conjugate (orthogonal) search directions to accelerate convergence.
    • Example: Solving large sparse systems (e.g., finite element analysis).
    • Pseudo-Code:
      Initialize θ, residual r = b - Aθ, direction d = r
      while ||r|| > tolerance:  
          α = (rᵀr) / (dᵀAd)  
          θ = θ + α*d  
          r_new = r - α*A*d  
          β = (r_newᵀr_new) / (rᵀr)  
          d = r_new + β*d  
          r = r_new  
  3. Newton’s Method

    • Theory: Uses second derivatives (Hessian) for faster convergence near minima.
    • Example: Portfolio optimization in finance.
    • Pseudo-Code:
      Initialize θ  
      while not converged:  
          Compute gradient ∇J(θ), Hessian H
          θ = θ - H⁻¹∇J(θ)  
  4. Quasi-Newton (BFGS, L-BFGS)

    • Theory: Approximates the Hessian using gradient updates to avoid direct computation.
    • Example: Training support vector machines (SVMs).
    • Pseudo-Code (BFGS):
      Initialize θ, B = I (identity matrix)
      while not converged:
          Compute direction d = -B⁻¹∇J(θ)
          Perform line search for step size α
          s = α*d
          y = ∇J(θ + s) - ∇J(θ)
          B = B + (y yᵀ)/(yᵀs) - (B s sᵀ B)/(sᵀ B s)
          θ = θ + s
  5. Levenberg-Marquardt

    • Theory: Hybrid of gradient descent and Gauss-Newton for nonlinear least squares.
    • Example: Calibrating camera parameters in computer vision.
    • Pseudo-Code:
      Initialize θ, λ = 0.001  
      while not converged:  
          Compute residual r, Jacobian J  
          Δθ = (JᵀJ + λI)⁻¹ Jᵀr  
          if ||r(θ + Δθ)||² < ||r(θ)||²:
              θ = θ + Δθ  
              λ = λ/10  
          else:  
              λ = λ*10  
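
To ground the recipes above, here is a minimal Go sketch of plain gradient descent (item 1). The quadratic objective, learning rate, and gradient-norm stopping rule are illustrative choices, not part of any particular library.

package main

import (
	"fmt"
	"math"
)

// gradientDescent minimizes an objective given its gradient grad,
// starting from theta, with step size alpha.
func gradientDescent(grad func([]float64) []float64, theta []float64, alpha, tol float64, maxIter int) []float64 {
	for iter := 0; iter < maxIter; iter++ {
		g := grad(theta)
		norm := 0.0
		for i := range theta {
			theta[i] -= alpha * g[i] // θ = θ - α*∇J(θ)
			norm += g[i] * g[i]
		}
		if math.Sqrt(norm) < tol { // converged: gradient is (near) zero
			break
		}
	}
	return theta
}

func main() {
	// Toy objective J(θ) = (θ₀-3)² + (θ₁+1)², gradient (2(θ₀-3), 2(θ₁+1))
	grad := func(t []float64) []float64 {
		return []float64{2 * (t[0] - 3), 2 * (t[1] + 1)}
	}
	fmt.Println(gradientDescent(grad, []float64{0, 0}, 0.1, 1e-8, 1000)) // ≈ [3 -1]
}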

Derivative-Free Methods

  1. Nelder-Mead (Simplex)

    • Theory: Adjusts a simplex through reflection, expansion, and contraction.
    • Example: Tuning hyperparameters for industrial simulations.
    • Pseudo-Code:
      Initialize simplex with n+1 points  
      while not converged:  
          Sort points by f(x)  
          Compute centroid (excluding worst)  
          Reflect worst point over centroid  
          if f(reflected) < f(best): Expand
          elif f(reflected) < f(second worst): Replace worst
          else: Contract or shrink simplex  
  2. Pattern Search

    • Theory: Explores points in a predefined geometric pattern (e.g., grid).
    • Example: Optimizing aircraft wing shapes.
    • Pseudo-Code:
      Initialize θ, step size Δ  
      while Δ > tolerance:  
          Evaluate f(θ ± Δ·eᵢ) along each coordinate direction
          if improvement found:  
              Move θ to better point  
              Δ = 1.5*Δ  # Expand step  
          else:  
              Δ = 0.5*Δ  # Shrink step  
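
Pattern search (the last item above) is short enough to implement directly. The sketch below performs a compass search over coordinate directions, with the 1.5×/0.5× expand/shrink factors from the pseudo-code; the objective f is supplied by the caller.

package main

// patternSearch performs a coordinate (compass) search around theta,
// expanding the step on success and shrinking it on failure.
func patternSearch(f func([]float64) float64, theta []float64, delta, tol float64) []float64 {
	for delta > tol {
		improved := false
		base := f(theta)
		for i := range theta {
			for _, dir := range []float64{delta, -delta} {
				theta[i] += dir // try a move along coordinate i
				if f(theta) < base {
					improved = true
					break // keep the improving point
				}
				theta[i] -= dir // undo the move
			}
			if improved {
				break
			}
		}
		if improved {
			delta *= 1.5 // expand step
		} else {
			delta *= 0.5 // shrink step
		}
	}
	return theta
}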

Linear Programming

  1. Simplex Method

    • Theory: Moves along vertices of the feasible region to find the optimal solution.
    • Example: Maximizing profit in manufacturing resource allocation.
    • Pseudo-Code:
      Convert to standard form (Ax = b, x ≥ 0)
      while pivot exists:  
          Select entering variable (most negative coefficient)  
          Select leaving variable (min ratio test)  
          Pivot to update tableau  
  2. Interior-Point Methods

    • Theory: Traverses the interior of the feasible region using barrier functions.
    • Example: Solving large-scale logistics optimization.
    • Pseudo-Code:
      Initialize x > 0, μ > 0, tolerance  
      while duality gap > tolerance:  
          Solve Newton step for barrier problem  
          Update x, μ = σμ (σ ∈ (0,1))  

Nonlinear Programming

  1. Trust Region Methods

    • Theory: Approximates the objective locally with a model (e.g., quadratic) within a trusted region.
    • Example: Fitting nonlinear curves to sensor data.
    • Pseudo-Code:
      Initialize θ, trust region radius Δ  
      while not converged:  
          Solve subproblem: min model m_k(θ + s) within ||s|| ≤ Δ
          Compute actual/predicted reduction ratio ρ  
          if ρ > 0.75: Δ = 2Δ  # Expand region  
          elif ρ < 0.25: Δ = 0.5Δ  # Shrink region  
          Update θ  
  2. Sequential Quadratic Programming (SQP)

    • Theory: Solves constrained problems via a sequence of quadratic subproblems.
    • Example: Trajectory optimization for rockets.
    • Pseudo-Code:
      Initialize θ, Lagrange multipliers λ  
      while not converged:  
          Compute Hessian H, gradient ∇f, Jacobian J
          Solve QP: min 0.5sᵀHs + ∇fᵀs s.t. Js + c = 0
          Update θ = θ + s, λ = new multipliers  

2. Metaheuristics (Global Optimization)

Evolutionary Algorithms

  1. Genetic Algorithm (GA)

    • Theory: Evolves solutions using selection, crossover, and mutation.
    • Example: Optimizing wind farm layouts.
    • Pseudo-Code:
      Initialize population  
      while generations < max_gen:  
          Evaluate fitness  
          Select parents via tournament/roulette  
          Crossover parents to create offspring  
          Mutate offspring  
          Replace worst individuals  
  2. Differential Evolution (DE)

    • Theory: Generates mutants by adding scaled differences between population vectors.
    • Example: Optimizing drug discovery pipelines.
    • Pseudo-Code (a Go sketch follows at the end of this subsection):
      Initialize population
      for each individual x_i:
          Select distinct x_a, x_b, x_c ≠ x_i
          mutant = x_a + F*(x_b - x_c)
          trial = crossover(x_i, mutant)
          if f(trial) < f(x_i): replace x_i
  3. Evolution Strategies (ES)

    • Theory: Adapts mutation strength based on success history.
    • Example: Training robotic control policies.
    • Pseudo-Code:
      Initialize θ, σ (mutation strength)  
      while not converged:  
          Generate offspring by θ + σ*N(0, I)  
          Evaluate fitness  
          Update θ and σ based on top offspring  
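
As promised under Differential Evolution above, here is a minimal DE/rand/1/bin sketch in Go. The values F = 0.8, CR = 0.9, the [-5, 5] initialization box, and the sphere-style caller-supplied objective are common textbook defaults, used here purely for illustration; it assumes popSize ≥ 4.

package main

import "math/rand"

// differentialEvolution implements DE/rand/1/bin for minimizing f.
func differentialEvolution(f func([]float64) float64, dim, popSize, generations int) []float64 {
	const F, CR = 0.8, 0.9 // differential weight and crossover rate
	pop := make([][]float64, popSize)
	for i := range pop {
		pop[i] = make([]float64, dim)
		for d := range pop[i] {
			pop[i][d] = rand.Float64()*10 - 5
		}
	}
	for g := 0; g < generations; g++ {
		for i := range pop {
			// Select three distinct vectors, all different from x_i
			a, b, c := rand.Intn(popSize), rand.Intn(popSize), rand.Intn(popSize)
			for a == i {
				a = rand.Intn(popSize)
			}
			for b == i || b == a {
				b = rand.Intn(popSize)
			}
			for c == i || c == a || c == b {
				c = rand.Intn(popSize)
			}
			// Mutation and binomial crossover
			trial := append([]float64{}, pop[i]...)
			jRand := rand.Intn(dim) // guarantee at least one mutated coordinate
			for d := 0; d < dim; d++ {
				if rand.Float64() < CR || d == jRand {
					trial[d] = pop[a][d] + F*(pop[b][d]-pop[c][d])
				}
			}
			if f(trial) < f(pop[i]) {
				pop[i] = trial // greedy replacement
			}
		}
	}
	best := pop[0]
	for _, x := range pop {
		if f(x) < f(best) {
			best = x
		}
	}
	return best
}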

Swarm Intelligence

  1. Particle Swarm Optimization (PSO)

    • Theory: Particles adjust trajectories based on personal and global best positions.
    • Example: Optimizing antenna array designs.
    • Pseudo-Code:
      Initialize particles with positions and velocities  
      while not converged:  
          for each particle:  
              v = w*v + c1*rand()*(pbest - x) + c2*rand()*(gbest - x)  
              x = x + v  
              Update pbest and gbest  
  2. Ant Colony Optimization (ACO)

    • Theory: Ants probabilistically build paths using pheromone trails.
    • Example: Optimizing traffic light cycles.
    • Pseudo-Code:
      Initialize pheromone trails τ  
      while not converged:  
          for each ant:  
              Build path via τ probabilities  
          Update τ (evaporate + deposit based on path quality)  
  3. Artificial Bee Colony (ABC)

    • Theory: Simulates bees (employed, onlookers, scouts) foraging for optimal solutions.
    • Example: Task scheduling in cloud computing.
    • Pseudo-Code:
      Initialize food sources (solutions)  
      while cycles < max_cycles:  
          Employed bees: Explore nearby solutions  
          Onlooker bees: Select solutions probabilistically  
          Scout bees: Replace abandoned solutions  

Physics/Chemistry-Inspired

  1. Simulated Annealing (SA)

    • Theory: Accepts worse solutions early to escape local optima, mimicking metal cooling.
    • Example: VLSI chip placement.
    • Pseudo-Code:
      Initialize T (temperature), current_solution  
      while T > T_min:  
          new_solution = perturb(current_solution)  
          ΔE = f(new) - f(current)  
          if ΔE < 0 or rand() < exp(-ΔE/T):  
              current_solution = new_solution  
          T = cool(T)  
  2. Harmony Search

    • Theory: Mimics musicians improvising to find harmonies (solutions).
    • Example: Optimizing water distribution networks.
    • Pseudo-Code:
      Initialize harmony memory (HM)  
      while not converged:  
          Improvise new harmony via HM or randomness  
          if new harmony better than worst in HM:  
              Replace worst in HM  
  3. Gravitational Search

    • Theory: Masses (solutions) attract each other based on gravitational force.
    • Example: Image segmentation.
    • Pseudo-Code:
      Initialize masses with positions  
      while not converged:  
          Calculate gravitational forces between masses  
          Update velocities and positions  
          Evaluate fitness and update masses  

Hybrids

  1. Memetic Algorithms

    • Theory: Combines GA with local search for refinement.
    • Example: Optimizing cancer radiotherapy plans.
    • Pseudo-Code:
      Initialize population  
      while generations < max_gen:  
          Perform GA operations (crossover/mutation)  
          Apply local search to top individuals  
  2. Hyper-Heuristics

    • Theory: Selects or generates heuristics dynamically.
    • Example: Automated algorithm selection for SAT solvers.
    • Pseudo-Code:
      Initialize set of low-level heuristics  
      while not converged:  
          Select heuristic based on performance history  
          Apply heuristic to current solution  

3. Stochastic Optimization

  1. Stochastic Gradient Descent (SGD)

    • Theory: Uses random mini-batches to approximate gradients.
    • Example: Training deep neural networks.
    • Pseudo-Code (a Go sketch follows at the end of this subsection):
      Initialize θ, α
      for epoch in 1 to max_epochs:
          Shuffle data
          for each mini-batch:
              θ = θ - α*∇J(θ, mini-batch)
  2. Randomized Algorithms

    • Theory: Uses randomness to solve deterministic problems efficiently.
    • Example: Randomized SVD for large matrices.
    • Pseudo-Code:
      Generate random matrix Ω  
      Compute sketch Y = AΩ
      Orthonormalize Y to get Q  
      Compute B = QᵀA  
      Perform SVD on B  
  3. Stochastic Approximation

    • Theory: Iteratively estimates solutions with noisy observations.
    • Example: Reinforcement learning (Q-learning).
    • Pseudo-Code:
      Initialize θ  
      for t = 1, 2, ...:  
          Observe noisy gradient g_t  
          θ = θ - α_t * g_t  
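
Returning to the SGD recipe above (item 1), the following Go sketch fits a linear model y ≈ w·x + b by stochastic gradient descent on squared error. The learning rate and the batch size of one are illustrative simplifications.

package main

import "math/rand"

// sgdLinearRegression fits y ≈ w·x + b by SGD on (1/2)(pred - y)².
func sgdLinearRegression(xs [][]float64, ys []float64, alpha float64, epochs int) ([]float64, float64) {
	dim := len(xs[0])
	w := make([]float64, dim)
	b := 0.0
	idx := make([]int, len(xs))
	for i := range idx {
		idx[i] = i
	}
	for epoch := 0; epoch < epochs; epoch++ {
		// Shuffle data each epoch
		rand.Shuffle(len(idx), func(i, j int) { idx[i], idx[j] = idx[j], idx[i] })
		for _, i := range idx { // mini-batch of size 1 for simplicity
			pred := b
			for d, x := range xs[i] {
				pred += w[d] * x
			}
			err := pred - ys[i]
			// Gradient step on each weight and the bias
			for d, x := range xs[i] {
				w[d] -= alpha * err * x
			}
			b -= alpha * err
		}
	}
	return w, b
}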

4. Convex Optimization

  1. Interior-Point Methods

    • Theory: Uses barrier functions to handle constraints.
    • Example: Portfolio optimization with risk constraints.
    • Pseudo-Code:
      Initialize x, μ > 0  
      while μ > tolerance:  
          Solve min f(x) - μ·Σᵢ log(c_i(x))
          μ = σμ (σ ∈ (0,1))  
  2. Subgradient Methods

    • Theory: Generalizes gradient descent for non-differentiable functions.
    • Example: Training SVMs with hinge loss.
    • Pseudo-Code:
      Initialize θ, α  
      while not converged:  
          Compute subgradient g  
          θ = θ - α*g  
  3. Proximal Algorithms (ADMM)

    • Theory: Splits problems into subproblems with proximal operators.
    • Example: Distributed matrix factorization.
    • Pseudo-Code:
      Initialize x, z, y  
      while not converged:  
          x = argmin_x [f(x) + (ρ/2)||x - z + y||²]  
          z = argmin_z [g(z) + (ρ/2)||x - z + y||²]  
          y = y + (x - z)  
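
The ADMM splitting above takes a fully closed form for a toy lasso problem, min (1/2)||x - a||² + λ||z||₁ subject to x = z (identity design, scaled dual u). The sketch below is illustrative; ρ, λ, and the fixed iteration count are arbitrary choices.

package main

// softThreshold is the proximal operator of k·|·| (soft thresholding).
func softThreshold(v, k float64) float64 {
	switch {
	case v > k:
		return v - k
	case v < -k:
		return v + k
	default:
		return 0
	}
}

// admmLasso solves min (1/2)||x-a||² + λ||z||₁ s.t. x = z (scaled dual u).
func admmLasso(a []float64, lambda, rho float64, iterations int) []float64 {
	n := len(a)
	x, z, u := make([]float64, n), make([]float64, n), make([]float64, n)
	for iter := 0; iter < iterations; iter++ {
		for i := 0; i < n; i++ {
			// x-update: argmin (1/2)(x-aᵢ)² + (ρ/2)(x - zᵢ + uᵢ)²
			x[i] = (a[i] + rho*(z[i]-u[i])) / (1 + rho)
			// z-update: proximal step on the ℓ1 term
			z[i] = softThreshold(x[i]+u[i], lambda/rho)
			// dual ascent on the consensus constraint x = z
			u[i] += x[i] - z[i]
		}
	}
	return z
}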

5. Constrained Optimization

  1. Penalty Methods

    • Theory: Adds penalty terms to the objective for constraint violations.
    • Example: Designing structures with material limits.
    • Pseudo-Code:
      Initialize θ, μ = 1  
      while not converged:  
          Solve min f(θ) + μ*penalty(c(θ))  
          μ = 10*μ  
  2. Barrier Methods

    • Theory: Uses logarithmic barriers to enforce constraints.
    • Example: Optimal control in robotics.
    • Pseudo-Code:
      Initialize θ, μ = 1  
      while μ > tolerance:  
          Solve min f(θ) - μ·Σᵢ log(-c_i(θ))
          μ = 0.1*μ  
  3. Augmented Lagrangian

    • Theory: Combines Lagrangian multipliers with penalty terms.
    • Example: Economic dispatch in power systems.
    • Pseudo-Code:
      Initialize λ, μ  
      while not converged:  
          Solve min f(x) + λᵀc(x) + (μ/2)||c(x)||²  
          λ = λ + μ*c(x)  
          μ = increase_if_needed  
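
A scalar example makes the augmented Lagrangian loop above concrete: minimize f(x) = (x - 2)² subject to c(x) = x - 1 = 0, whose inner minimization has a closed form. The doubling rule for μ is one common choice among many.

package main

import (
	"fmt"
	"math"
)

// augmentedLagrangian minimizes f(x) = (x-2)² subject to c(x) = x-1 = 0.
func augmentedLagrangian() float64 {
	lambda, mu := 0.0, 1.0
	x := 0.0
	for iter := 0; iter < 50; iter++ {
		// Inner solve: min (x-2)² + λ(x-1) + (μ/2)(x-1)², i.e.
		// 2(x-2) + λ + μ(x-1) = 0, which is linear in x
		x = (4 - lambda + mu) / (2 + mu)
		c := x - 1
		lambda += mu * c // multiplier update
		if math.Abs(c) > 1e-3 {
			mu *= 2 // increase penalty only while the violation persists
		}
	}
	return x // → 1, with λ → 2
}

func main() {
	fmt.Println(augmentedLagrangian())
}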

6. Combinatorial Optimization

  1. Branch and Bound

    • Theory: Explores solution space via tree traversal with pruning.
    • Example: Solving the knapsack problem.
    • Pseudo-Code:
      Initialize queue with root node, best = ∞
      while queue not empty:
          node = dequeue()
          if node.cost >= best: prune
          elif node is feasible: update best
          else: branch into child nodes
  2. Dynamic Programming

    • Theory: Reuses solutions to overlapping subproblems.
    • Example: RNA secondary structure prediction.
    • Pseudo-Code:
      Initialize memo table  
      for i in 1 to n:  
          for j in 1 to m:  
              dp[i][j] = min(dp[i-1][j], dp[i][j-1], ...)  
  3. Greedy Algorithms

    • Theory: Makes locally optimal choices at each step.
    • Example: Huffman coding for data compression.
    • Pseudo-Code:
      Sort items by some criterion  
      while solution not complete:  
          Select next best item  
          if feasible: add to solution  
  4. Tabu Search

    • Theory: Avoids revisiting solutions using a tabu list.
    • Example: Vehicle routing problems.
    • Pseudo-Code:
      Initialize current_solution, tabu_list  
      while not converged:  
          Generate neighbors not in tabu_list  
          Select best neighbor  
          Update current_solution and tabu_list  
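
The tabu-search loop above maps directly to code. The Go sketch below searches over bit vectors with single-bit-flip moves, a fixed tabu tenure, and the standard aspiration criterion (a tabu move is allowed if it beats the best solution found so far); the objective is supplied by the caller.

package main

import "math/rand"

// tabuSearch minimizes objective over bit vectors using single-bit-flip
// moves. A flipped index is tabu for tenure iterations unless the move
// improves the best solution found so far (aspiration criterion).
func tabuSearch(objective func([]bool) float64, n, iterations, tenure int) []bool {
	current := make([]bool, n)
	for i := range current {
		current[i] = rand.Intn(2) == 1
	}
	best := append([]bool{}, current...)
	bestVal := objective(best)
	tabuUntil := make([]int, n) // iteration until which flipping bit i is tabu

	for iter := 0; iter < iterations; iter++ {
		bestMove, bestMoveVal := -1, 0.0
		for i := 0; i < n; i++ {
			current[i] = !current[i] // try flipping bit i
			val := objective(current)
			current[i] = !current[i] // undo
			tabu := iter < tabuUntil[i]
			aspiration := val < bestVal
			if (!tabu || aspiration) && (bestMove == -1 || val < bestMoveVal) {
				bestMove, bestMoveVal = i, val
			}
		}
		if bestMove == -1 {
			continue // every move is tabu this iteration
		}
		current[bestMove] = !current[bestMove]
		tabuUntil[bestMove] = iter + tenure
		if bestMoveVal < bestVal {
			bestVal = bestMoveVal
			best = append([]bool{}, current...)
		}
	}
	return best
}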

7. Multi-Objective Optimization

  1. NSGA-II

    • Theory: Uses non-dominated sorting and crowding distance for Pareto fronts.
    • Example: Designing eco-friendly buildings.
    • Pseudo-Code:
      Initialize population  
      while generations < max_gen:  
          Create offspring via crossover/mutation  
          Combine parents + offspring  
          Rank by non-domination and crowding  
          Select top N individuals  
  2. MOEA/D

    • Theory: Decomposes multi-objective problems into scalar subproblems.
    • Example: Optimizing turbine efficiency and cost.
    • Pseudo-Code:
      Initialize weight vectors  
      for each subproblem:  
          Solve using neighboring subproblems  
          Update solutions and neighborhoods  
  3. Pareto Front Methods

    • Theory: Directly approximates the Pareto optimal set.
    • Example: Aerospace design trade-offs.
    • Pseudo-Code:
      Initialize population  
      while not converged:  
          Update solutions using dominance criteria  
          Maintain archive of non-dominated solutions  
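
All three multi-objective methods above rest on the same primitive: the Pareto dominance test. Below is a minimal Go sketch of the test and of extracting the non-dominated set, assuming every objective is minimized.

package main

// dominates reports whether a dominates b: no worse in every objective
// and strictly better in at least one (all objectives minimized).
func dominates(a, b []float64) bool {
	strictlyBetter := false
	for i := range a {
		if a[i] > b[i] {
			return false
		}
		if a[i] < b[i] {
			strictlyBetter = true
		}
	}
	return strictlyBetter
}

// paretoFront returns the non-dominated subset of points.
func paretoFront(points [][]float64) [][]float64 {
	var front [][]float64
	for i, p := range points {
		dominated := false
		for j, q := range points {
			if i != j && dominates(q, p) {
				dominated = true
				break
			}
		}
		if !dominated {
			front = append(front, p)
		}
	}
	return front
}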

8. Derivative-Free Optimization

  1. DIRECT

    • Theory: Divides search space into hyper-rectangles for global optimization.
    • Example: Tuning hyperparameters for climate models.
    • Pseudo-Code:
      Initialize hyper-rectangles  
      while budget not exhausted:  
          Identify potentially optimal rectangles  
          Split and sample new points  
  2. Bayesian Optimization

    • Theory: Uses probabilistic models (e.g., Gaussian processes) to guide search.
    • Example: Optimizing machine learning hyperparameters.
    • Pseudo-Code:
      Initialize surrogate model  
      for i in 1 to max_iter:  
          Select x* maximizing acquisition function  
          Evaluate f(x*)  
          Update surrogate model  
  3. Response Surface Methods

    • Theory: Fits a surrogate model (e.g., polynomial) to approximate the objective.
    • Example: Optimizing chemical reaction conditions.
    • Pseudo-Code:
      Design initial experiments (e.g., Latin hypercube)  
      while not converged:  
          Fit surrogate model to data  
          Propose new points via model predictions  

9. Machine Learning-Specific

  1. Adam

    • Theory: Combines momentum and adaptive learning rates.
    • Example: Training transformers for NLP.
    • Pseudo-Code (a runnable Go sketch follows at the end of this subsection):
      Initialize m=0, v=0, β1=0.9, β2=0.999
      for t in 1 to max_iter:
          Compute gradient g
          m = β1*m + (1-β1)*g
          v = β2*v + (1-β2)*g²
          m_hat = m / (1-β1^t)
          v_hat = v / (1-β2^t)
          θ = θ - α*m_hat / (sqrt(v_hat) + ε)
  2. RMSprop

    • Theory: Adapts learning rates using moving average of squared gradients.
    • Example: Training recurrent neural networks (RNNs).
    • Pseudo-Code:
      Initialize cache=0, decay_rate=0.9  
      while not converged:  
          cache = decay_rate*cache + (1-decay_rate)*g²  
          θ = θ - α*g / (sqrt(cache) + ε)  
  3. Adagrad

    • Theory: Adapts learning rates per-parameter based on historical gradients.
    • Example: Sparse data problems (e.g., click-through rate prediction).
    • Pseudo-Code:
      Initialize G=0  
      while not converged:  
          G = G + g²  
          θ = θ - α*g / (sqrt(G) + ε)  
  4. AdaDelta

    • Theory: Improves Adagrad by limiting historical gradient accumulation.
    • Example: Training deep autoencoders.
    • Pseudo-Code:
      Initialize E[g²]=0, E[Δθ²]=0  
      while not converged:  
          E[g²] = ρ*E[g²] + (1-ρ)*g²  
          Δθ = - (sqrt(E[Δθ²] + ε) / sqrt(E[g²] + ε)) * g  
          θ = θ + Δθ  
          E[Δθ²] = ρ*E[Δθ²] + (1-ρ)*Δθ²  
  5. Nesterov Accelerated Gradient

    • Theory: "Looks ahead" to adjust momentum for faster convergence.
    • Example: Training large-scale convolutional networks.
    • Pseudo-Code:
      Initialize v=0, α, μ
      while not converged:
          v = μ*v - α*∇J(θ + μ*v)
          θ = θ + v
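
As noted under Adam above, the update rule translates directly into code. This Go sketch applies one Adam step with the default hyperparameters from the pseudo-code; the AdamState struct and constructor are conveniences for this example, not a standard API.

package main

import "math"

// AdamState holds the first/second moment estimates and step counter.
type AdamState struct {
	M, V []float64
	T    int
}

func newAdamState(n int) *AdamState {
	return &AdamState{M: make([]float64, n), V: make([]float64, n)}
}

// adamStep applies one Adam update to theta given gradient g.
func adamStep(s *AdamState, theta, g []float64, alpha float64) {
	const beta1, beta2, eps = 0.9, 0.999, 1e-8
	s.T++
	for i := range theta {
		s.M[i] = beta1*s.M[i] + (1-beta1)*g[i]
		s.V[i] = beta2*s.V[i] + (1-beta2)*g[i]*g[i]
		mHat := s.M[i] / (1 - math.Pow(beta1, float64(s.T))) // bias correction
		vHat := s.V[i] / (1 - math.Pow(beta2, float64(s.T)))
		theta[i] -= alpha * mHat / (math.Sqrt(vHat) + eps)
	}
}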

10. Quantum Optimization

  1. Quantum Annealing

    • Theory: Exploits quantum tunneling to find global minima of Ising models.
    • Example: Solving protein folding problems.
    • Pseudo-Code (Conceptual):
      Encode problem into Hamiltonian H_P
      Prepare initial state |ψ⟩ (ground state of driver H₀)
      Adiabatically evolve from H₀ to H_P
      Measure final state for solution
  2. QAOA

    • Theory: Approximates solutions via parameterized quantum circuits.
    • Example: Portfolio optimization.
    • Pseudo-Code:
      Initialize parameters γ, β
      for iterations:
          Prepare |ψ(γ, β)⟩ via quantum circuit
          Measure ⟨H⟩
          Classically optimize γ, β to minimize ⟨H⟩

11. Specialized Methods

  1. Quadratic Programming

    • Theory: Optimizes quadratic objectives with linear constraints.
    • Example: Portfolio optimization (Markowitz model).
    • Pseudo-Code:
      Convert to standard form: min 0.5xᵀQx + cᵀx s.t. Ax ≤ b
      Solve via active-set or interior-point methods  
  2. Semidefinite Programming

    • Theory: Optimizes over symmetric positive semidefinite matrices.
    • Example: Sensor network localization.
    • Pseudo-Code:
      Convert to standard form: min ⟨C, X⟩ s.t. ⟨A_i, X⟩ = b_i, X ⪰ 0
      Solve via interior-point methods  
  3. Interval Methods

    • Theory: Uses interval arithmetic to bound solutions rigorously.
    • Example: Verified global optimization in robotics.
    • Pseudo-Code:
      Initialize interval X  
      while width(X) > tolerance:  
          Subdivide X into subintervals  
          Prune subintervals violating constraints  

Five Most Important Algorithms

Below is a breakdown of five of the most widely used optimization algorithms, including theoretical foundations, real-world examples, and concurrent Go implementations. The focus is on clarity, correctness, and parallelism.


1. Genetic Algorithm (GA)

Theory

  • Inspired by natural selection and genetics
  • Key Steps:
    1. Population: Initialize candidate solutions
    2. Fitness: Evaluate solution quality
    3. Selection: Choose parents (e.g., tournament selection)
    4. Crossover: Combine parent genes
    5. Mutation: Randomly alter genes
    6. Replacement: Create new generation

Examples

  1. Logistics Routing: Optimize delivery routes for fuel efficiency
  2. Neural Architecture Search: Evolve optimal neural network structures
  3. Drug Discovery: Find molecular configurations with desired properties

Concurrent Go Implementation

package main

import (
	"math/rand"
	"sync"
	"time"
)

type Individual struct {
	Genes   []float64
	Fitness float64
}

func evaluate(individual *Individual, wg *sync.WaitGroup) {
	defer wg.Done()
	// Example objective: minimize sum of squares
	sum := 0.0
	for _, gene := range individual.Genes {
		sum += gene * gene
	}
	individual.Fitness = sum
}

// evaluateAll runs fitness evaluations concurrently, one goroutine per individual.
func evaluateAll(population []Individual) {
	var wg sync.WaitGroup
	for i := range population {
		wg.Add(1)
		go evaluate(&population[i], &wg)
	}
	wg.Wait()
}

// tournamentSelection returns the fitter of two randomly chosen individuals.
func tournamentSelection(population []Individual) Individual {
	a := population[rand.Intn(len(population))]
	b := population[rand.Intn(len(population))]
	if a.Fitness < b.Fitness {
		return a
	}
	return b
}

// crossover performs single-point crossover, producing two children.
func crossover(p1, p2 Individual) (Individual, Individual) {
	point := rand.Intn(len(p1.Genes))
	c1 := append(append([]float64{}, p1.Genes[:point]...), p2.Genes[point:]...)
	c2 := append(append([]float64{}, p2.Genes[:point]...), p1.Genes[point:]...)
	return Individual{Genes: c1}, Individual{Genes: c2}
}

// mutate adds Gaussian noise to each gene with small probability.
func mutate(ind Individual) Individual {
	for i := range ind.Genes {
		if rand.Float64() < 0.05 {
			ind.Genes[i] += rand.NormFloat64() * 0.1
		}
	}
	return ind
}

// geneticAlgorithm assumes popSize is even.
func geneticAlgorithm(popSize, dimensions, generations int) *Individual {
	rand.Seed(time.Now().UnixNano())
	population := make([]Individual, popSize)

	// Initialize population
	for i := range population {
		genes := make([]float64, dimensions)
		for j := range genes {
			genes[j] = rand.NormFloat64()
		}
		population[i] = Individual{Genes: genes}
	}

	for gen := 0; gen < generations; gen++ {
		// Concurrent fitness evaluation
		evaluateAll(population)

		// Selection, crossover, mutation (simplified)
		newPopulation := make([]Individual, popSize)
		for i := 0; i < popSize; i += 2 {
			parent1 := tournamentSelection(population)
			parent2 := tournamentSelection(population)
			child1, child2 := crossover(parent1, parent2)
			newPopulation[i] = mutate(child1)
			newPopulation[i+1] = mutate(child2)
		}
		population = newPopulation
	}

	// Evaluate the final generation before selecting the best individual
	evaluateAll(population)
	best := &population[0]
	for i := range population {
		if population[i].Fitness < best.Fitness {
			best = &population[i]
		}
	}
	return best
}

2. Ant Colony Optimization (ACO)

Theory

  • Mimics ant foraging behavior using pheromones
  • Key Components:
    1. Pheromone Matrix: Chemical trail intensity between nodes
    2. Path Construction: Probabilistic node selection
    3. Pheromone Update: Evaporation + reinforcement

Examples

  1. Network Routing: Optimize data packet paths in telecommunications
  2. Task Scheduling: Schedule jobs on parallel machines
  3. Traveling Salesman: Find shortest round-trip routes

Concurrent Go Implementation

package main

import (
	"math"
	"math/rand"
	"sync"
)

type Ant struct {
	Tour     []int
	Visited  []bool
	Distance float64
}

// initializePheromones starts every trail at 1.0.
func initializePheromones(n int) [][]float64 {
	p := make([][]float64, n)
	for i := range p {
		p[i] = make([]float64, n)
		for j := range p[i] {
			p[i][j] = 1.0
		}
	}
	return p
}

func acoParallel(graph [][]float64, ants, iterations int) []int {
	pheromones := initializePheromones(len(graph))
	bestTour := make([]int, len(graph))
	bestLength := math.MaxFloat64
	var mutex sync.Mutex

	for iter := 0; iter < iterations; iter++ {
		antColony := make([]Ant, ants)
		var wg sync.WaitGroup
		// Concurrent ant exploration
		for i := 0; i < ants; i++ {
			wg.Add(1)
			go func(antID int) {
				defer wg.Done()
				ant := constructTour(graph, pheromones)
				antColony[antID] = ant // each goroutine writes only its own slot
				mutex.Lock()
				if ant.Distance < bestLength {
					bestLength = ant.Distance
					copy(bestTour, ant.Tour)
				}
				mutex.Unlock()
			}(i)
		}
		wg.Wait()
		updatePheromones(pheromones, antColony)
	}
	return bestTour
}

// constructTour builds a tour with a randomized greedy rule favoring edges with strong pheromone and short distance.
func constructTour(graph, pheromones [][]float64) Ant {
	n := len(graph)
	ant := Ant{Tour: make([]int, 0, n), Visited: make([]bool, n)}
	current := rand.Intn(n)
	ant.Tour = append(ant.Tour, current)
	ant.Visited[current] = true
	for len(ant.Tour) < n {
		next, bestScore := -1, -1.0
		for j := 0; j < n; j++ {
			if ant.Visited[j] {
				continue
			}
			score := rand.Float64() * pheromones[current][j] / (graph[current][j] + 1e-9)
			if score > bestScore {
				next, bestScore = j, score
			}
		}
		ant.Distance += graph[current][next]
		current = next
		ant.Tour = append(ant.Tour, current)
		ant.Visited[current] = true
	}
	ant.Distance += graph[current][ant.Tour[0]] // close the tour
	return ant
}

// updatePheromones evaporates all trails, then deposits pheromone on each ant's tour in proportion to its quality.
func updatePheromones(pheromones [][]float64, colony []Ant) {
	const evaporation = 0.5
	for i := range pheromones {
		for j := range pheromones[i] {
			pheromones[i][j] *= 1 - evaporation
		}
	}
	for _, ant := range colony {
		deposit := 1.0 / (ant.Distance + 1e-9)
		for k := 0; k+1 < len(ant.Tour); k++ {
			pheromones[ant.Tour[k]][ant.Tour[k+1]] += deposit
		}
	}
}

3. Particle Swarm Optimization (PSO)

Theory

  • Social behavior of bird flocks/fish schools
  • Key Equations:
    • Velocity: v = w*v + c1*rand()*(pbest - x) + c2*rand()*(gbest - x)
    • Position: x = x + v

Examples

  1. Drone Swarm Control: Coordinate UAV formations
  2. Hyperparameter Tuning: Optimize ML model parameters
  3. Antenna Design: Maximize radiation efficiency

Concurrent Go Implementation

package main

import (
	"math"
	"math/rand"
	"sync"
)

type Particle struct {
	Position []float64
	Velocity []float64
	PBest    []float64
	PBestFit float64
}

// fitness is the objective to minimize; the sphere function is an example.
func fitness(x []float64) float64 {
	sum := 0.0
	for _, v := range x {
		sum += v * v
	}
	return sum
}

func psoConcurrent(dimensions, particles, epochs int) []float64 {
	const w, c1, c2 = 0.729, 1.494, 1.494 // standard constriction values
	swarm := make([]Particle, particles)
	gbest := make([]float64, dimensions)
	gbestFit := math.MaxFloat64
	var mutex sync.Mutex

	// Initialize swarm with random positions and velocities
	for i := range swarm {
		swarm[i].Position = make([]float64, dimensions)
		swarm[i].Velocity = make([]float64, dimensions)
		for d := 0; d < dimensions; d++ {
			swarm[i].Position[d] = rand.Float64()*10 - 5
			swarm[i].Velocity[d] = rand.Float64()*2 - 1
		}
		swarm[i].PBest = append([]float64{}, swarm[i].Position...)
		swarm[i].PBestFit = fitness(swarm[i].Position)
		if swarm[i].PBestFit < gbestFit {
			gbestFit = swarm[i].PBestFit
			copy(gbest, swarm[i].Position)
		}
	}

	for epoch := 0; epoch < epochs; epoch++ {
		// Snapshot gbest so goroutines read a consistent value without racing
		mutex.Lock()
		g := append([]float64{}, gbest...)
		mutex.Unlock()

		var wg sync.WaitGroup
		// Concurrent particle updates
		for i := range swarm {
			wg.Add(1)
			go func(i int) {
				defer wg.Done()
				p := &swarm[i]
				for d := range p.Position {
					p.Velocity[d] = w*p.Velocity[d] +
						c1*rand.Float64()*(p.PBest[d]-p.Position[d]) +
						c2*rand.Float64()*(g[d]-p.Position[d])
					p.Position[d] += p.Velocity[d]
				}
				currentFit := fitness(p.Position)
				if currentFit < p.PBestFit {
					p.PBest = append([]float64{}, p.Position...)
					p.PBestFit = currentFit
					mutex.Lock()
					if currentFit < gbestFit {
						gbestFit = currentFit
						gbest = append([]float64{}, p.Position...)
					}
					mutex.Unlock()
				}
			}(i)
		}
		wg.Wait()
	}
	return gbest
}

4. Bayesian Optimization (BO)

Theory

  1. Surrogate Model: Gaussian Process (GP) approximates objective
  2. Acquisition Function: Balances exploration/exploitation (e.g., EI, UCB)
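
For intuition on the acquisition function, Expected Improvement (EI) has a closed form when the surrogate's posterior at a candidate point is Gaussian. A small Go sketch for minimization, where μ and σ would come from the GP posterior:

package main

import "math"

// expectedImprovement computes EI for minimization at a point whose GP
// posterior is N(mu, sigma²), given the best objective value seen so far.
func expectedImprovement(mu, sigma, best float64) float64 {
	if sigma <= 0 {
		return math.Max(best-mu, 0)
	}
	z := (best - mu) / sigma
	cdf := 0.5 * (1 + math.Erf(z/math.Sqrt2)) // standard normal CDF Φ(z)
	pdf := math.Exp(-z*z/2) / math.Sqrt(2*math.Pi) // standard normal PDF φ(z)
	return (best-mu)*cdf + sigma*pdf
}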

Examples

  1. Hyperparameter Tuning: Optimize learning rates for DL models
  2. Material Science: Discover alloys with desired properties
  3. Robotics: Optimize controller parameters

Concurrent Go Implementation

package main

import "sync"

// GaussianProcess is a placeholder for a GP surrogate model.
type GaussianProcess struct {
	// Mean/covariance functions and fitted data omitted
}

// generateCandidates and evaluateObjective are problem-specific;
// trivial stand-ins are used here so the sketch compiles.
func generateCandidates() [][]float64 { return [][]float64{{0.1}, {0.5}, {0.9}} }

func evaluateObjective(point []float64) float64 { return point[0] * point[0] }

func bayesianOptimizationConcurrent() {
	// Parallel evaluation of candidate points
	var wg sync.WaitGroup
	candidates := generateCandidates()
	results := make(chan float64, len(candidates))

	for _, x := range candidates {
		wg.Add(1)
		go func(point []float64) {
			defer wg.Done()
			results <- evaluateObjective(point)
		}(x)
	}

	wg.Wait()
	close(results)

	// Update surrogate model with new data
	// Select next point via acquisition function
}

5. Simulated Annealing (SA)

Theory

  1. Temperature Schedule: Controls acceptance probability
  2. Metropolis Criterion: P(accept) = exp(-ΔE/T)

Examples

  1. VLSI Chip Design: Optimize component placement
  2. Portfolio Optimization: Balance risk/return
  3. Protein Folding: Find minimum-energy configurations

Concurrent Go Implementation

package main

import (
	"math"
	"math/rand"
	"sync"
)

// energy is the objective to minimize; the sphere function is an example.
func energy(x []float64) float64 {
	sum := 0.0
	for _, v := range x {
		sum += v * v
	}
	return sum
}

// generateNeighbors perturbs the current solution with Gaussian noise.
func generateNeighbors(current []float64) [][]float64 {
	neighbors := make([][]float64, 8)
	for i := range neighbors {
		n := append([]float64{}, current...)
		for d := range n {
			n[d] += rand.NormFloat64() * 0.1
		}
		neighbors[i] = n
	}
	return neighbors
}

// accept applies the Metropolis criterion: P(accept) = exp(-ΔE/T).
func accept(candidate, current []float64, T float64) bool {
	dE := energy(candidate) - energy(current)
	return dE < 0 || rand.Float64() < math.Exp(-dE/T)
}

func simulatedAnnealingConcurrent(initialSolution []float64, maxIter int) []float64 {
	current := initialSolution
	best := append([]float64{}, current...)
	T := 1000.0
	cooling := 0.95

	for iter := 0; iter < maxIter; iter++ {
		var wg sync.WaitGroup
		neighbors := generateNeighbors(current)
		results := make(chan []float64, len(neighbors))

		// Concurrent neighbor evaluation; current is read-only in this phase
		for _, neighbor := range neighbors {
			wg.Add(1)
			go func(n []float64) {
				defer wg.Done()
				if accept(n, current, T) {
					results <- n
				}
			}(neighbor)
		}

		wg.Wait()
		close(results)

		// Process accepted neighbors; this runs after wg.Wait, so no mutex is needed
		for newSol := range results {
			if energy(newSol) < energy(best) {
				best = append([]float64{}, newSol...)
			}
			current = newSol
		}

		T *= cooling
	}
	return best
}

Key Implementation Notes

  1. Concurrency Patterns:

    • Goroutines for parallel fitness evaluations
    • sync.WaitGroup for task synchronization
    • sync.Mutex for shared resource protection
  2. Parameter Tuning:

    • GA: Mutation rate (0.01-0.1), crossover rate (0.7-0.9)
    • PSO: ω = 0.729, c1 = c2 = 1.494
    • SA: Cooling rate (0.95-0.99)
  3. Benchmark Functions:

    • Sphere: f(x) = Σ xᵢ²
    • Rastrigin: f(x) = 10n + Σ (xᵢ² - 10cos(2πxᵢ))
    • Ackley: Multimodal function with many local minima
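
For reference, the three benchmarks in their standard textbook forms, as Go functions (each has minimum 0 at the origin):

package main

import "math"

// Sphere: unimodal, minimum 0 at the origin.
func sphere(x []float64) float64 {
	sum := 0.0
	for _, v := range x {
		sum += v * v
	}
	return sum
}

// Rastrigin: highly multimodal, minimum 0 at the origin.
func rastrigin(x []float64) float64 {
	sum := 10.0 * float64(len(x))
	for _, v := range x {
		sum += v*v - 10*math.Cos(2*math.Pi*v)
	}
	return sum
}

// Ackley: multimodal with many local minima, minimum 0 at the origin.
func ackley(x []float64) float64 {
	n := float64(len(x))
	sumSq, sumCos := 0.0, 0.0
	for _, v := range x {
		sumSq += v * v
		sumCos += math.Cos(2 * math.Pi * v)
	}
	return -20*math.Exp(-0.2*math.Sqrt(sumSq/n)) - math.Exp(sumCos/n) + 20 + math.E
}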