Analysis: Swarm VRPTW Algorithms vs. Literature

Context

27 Claude Code agents collaboratively optimized the Vehicle Routing Problem with Time Windows (VRPTW) on 400-node Homberger (HG) instances over ~4 hours on 2026-04-18. Starting from a Solomon I1 baseline (score 9525), the swarm achieved 6791 — a 28.7% improvement. Each agent independently evolved Rust solvers within a 30-second-per-instance time limit, single-threaded.

Key Findings

1. The Swarm Independently Rediscovered 30 Years of VRP Literature

The agents converged on an architecture nearly identical to what the academic community considers state-of-the-art for VRPTW. The final top solutions are structurally similar to:

Pisinger & Ropke (2007) — ALNS (Adaptive Large Neighborhood Search)
Vidal et al. (2012, 2022) — HGS (Hybrid Genetic Search) operator set
Ropke & Pisinger (2006) — Shaw removal, regret insertion
Potvin & Rousseau (1995) — 2-opt*, or-opt for VRPTW

Every competitive agent (top 12) independently arrived at some combination of:

Solomon I1 or regret-2 construction
RVND (Randomized Variable Neighborhood Descent) with 6-8 operators
ILS/ALNS destroy-repair metaheuristic
Shaw/worst/random destroy + regret-2/greedy repair

This is striking because no agent was told to read the literature — they derived these techniques from first principles and by studying each other's code through the branching mechanism.

2. Progression Mirrors the Historical Development of VRPTW Solvers

Stage	Score Range	Equivalent Era	Key Unlock
Solomon baseline	9525	1987 (Solomon)	Construction heuristic only
+Local search	7600-8500	1990s (Potvin)	2-opt, relocate, exchange
+Destroy-repair	7000-7600	2006 (Ropke & Pisinger)	ALNS with random/worst/Shaw removal
+SA acceptance	6900-7000	2007 (Pisinger & Ropke)	Stochastic acceptance escapes local optima
+RVND + K-neighbors	6800-6900	2012-2022 (Vidal, HGS)	Neighbor-limited search, randomized operator order
+Route struct with lat arrays	6791	Modern (Vidal 2022)	O(1) feasibility → more iterations per second

3. What Separated the Top 5 from the Pack

The critical dividing line is at ~6930 (rank 8). Below that, agents share the same architecture. What pushed the top 5 to 6791-6883:

Technique	Top 5 Have It	Rank 8-12 Missing It
Stagnation-adaptive destroy %	Yes (scales 8%→30%+)	Fixed or simple scaling
Route elimination operator	All 5	Most have it but less aggressive
Multiple repair strategies	regret-2 + greedy + noisy	Usually regret-2 only
K=40 neighbor lists	All 5	Some use K=30 or none
Merge-to-fleet preprocessing	Top 1 (agile-fox)	Rely on construction alone

4. Most Interesting Individual Approaches

Agile-fox (#1, 6791) — Won with the simplest acceptance criterion (pure best-improvement, no SA). This is counterintuitive — the literature strongly favors SA or threshold acceptance. Agile-fox compensated with aggressive stagnation scaling (+2% destroy per non-improving iteration) and a merge-to-fleet preprocessing step that other agents lacked. Its nearest-neighbor construction (not Solomon) also suggests the initial solution matters less than the metaheuristic quality.

Noble-wolf (#13, 7089, 64KB) — The most ambitious codebase. Implemented the full HGS-style swap* operator (remove two customers, reinsert optimally), noisy greedy repair, and the most sophisticated Route struct with O(1) try_insert(). Despite having the richest toolbox, it placed only 13th. The lesson: implementation complexity doesn't correlate with performance. Noble-wolf likely spent too many iterations in expensive operators rather than rapid cycling. This mirrors a known tradeoff in the literature — Vidal (2022) specifically warns that operator richness must be balanced against iteration speed.

Primal-viper (#5, 6883, 50KB) — Largest code of the top 5. Notable for K=40 neighbor lists used across all operators, and alternating regret-2/greedy repair (every 4th iteration uses regret). This repair diversity is well-supported by Ropke & Pisinger's adaptive weight findings.

Silver-nova (#3, 6861) — The only top-3 agent with SA acceptance. Uses geometric cooling (T₀ = 1.2% of distance → T_end = 0.1%), which is notably more conservative than the ~5% starting temperatures common in the ALNS literature. Its lat (latest arrival time) arrays for O(1) insertion feasibility are the key technique from Vidal's HGS.

5. What's Missing vs. State-of-the-Art

The best published results on HG 400-node instances use:

Population-based methods (HGS uses a genetic algorithm with population management) — no agent attempted this
Education/crossover operators (PIX, OX, SREX) — completely absent
Penalty-based infeasible exploration — all agents rejected infeasible solutions immediately; the literature shows relaxing constraints during search and penalizing violations produces better results
Multi-objective fleet minimization — agents optimized distance only after ensuring feasibility, rather than treating fleet size as a soft constraint
Problem-specific distance caching — most agents recomputed distances rather than caching move deltas

The gap between the swarm's best (6791) and published state-of-the-art for HG 400-node (typically ~6200-6400 depending on instance mix) is roughly 6-10%. Given the 30-second time limit and single-threaded constraint, this is a strong result — published benchmarks often allow minutes.

6. The Swarm Dynamics Were Fascinating

From the message log and branching history:

Knowledge transfer worked: Agents explicitly studied each other's code. Silver-nova messaged "Studying inspiration from noble-wolf. Key insights: neighbor-based local search (O(nK) instead of O(n²)), node locator for O(1) position lookups" and then jumped from 7021 → 6861.
The coin-flip branching created diversity: The 50/50 mechanism (global best vs. random peer) prevented convergence to a monoculture. Agents working from weaker branches sometimes found novel techniques.
Failed hypotheses were valuable: 84 of 197 hypotheses failed, and the shared failure log prevented duplicate work. Cross-exchange (2,2) was repeatedly found to regress on HG instances — multiple agents reported this independently.
Diminishing returns set in quickly: The score dropped from 9525 → 7059 in the first hour (26% improvement) but only 7059 → 6791 in the remaining 3 hours (4% more). This matches the typical VRPTW optimization curve.

7. Architectural Taxonomy

The 19 scoring agents fell into 4 architectural families:

Family	Agents	Score Range	Description
ILS + RVND	agile-fox, primal-atlas, silver-nova, sharp-orbit, bright-raven, fierce-owl, noble-wolf	6791-7089	Randomized VND with ILS destroy-repair. Dominant family.
ALNS + deterministic LS	primal-viper, crystal-lynx, astral-fox, sonic-raven, azure-vector	6883-7467	ALNS metaheuristic with fixed-order local search.
SA + hybrid LS	vivid-beacon, cosmic-helix	6917-7652	Simulated annealing acceptance with mixed operators.
Construction only	keen-phoenix	9525	Solomon baseline, no optimization.

The ILS+RVND family dominated the top ranks, which aligns with recent literature favoring ILS over pure ALNS (see Accorsi & Vigo, 2021).

Summary

The experiment demonstrates that LLM agents can independently rediscover decades of combinatorial optimization research in hours. The progression from Solomon baseline to ALNS/ILS+RVND closely mirrors the historical development of VRPTW solvers (1987-2022). The main gap to state-of-the-art is the absence of population-based methods and infeasible-solution exploration — techniques that require more architectural sophistication than iterative refinement of a single solution.

The most surprising result: the winner (agile-fox) used no stochastic acceptance at all, defeating SA-equipped competitors through aggressive adaptive destruction and better preprocessing. This challenges the conventional wisdom that SA is essential for competitive VRPTW performance.

SteveDiamond/swarm-vrptw-analysis.md

Select an option

No results found