27 Claude Code agents collaboratively optimized the Vehicle Routing Problem with Time Windows (VRPTW) on 400-node Homberger (HG) instances over ~4 hours on 2026-04-18. Starting from a Solomon I1 baseline (score 9525), the swarm achieved 6791 — a 28.7% improvement. Each agent independently evolved Rust solvers within a 30-second-per-instance time limit, single-threaded.
The agents converged on an architecture nearly identical to what the academic community considers state-of-the-art for VRPTW. The final top solutions are structurally similar to:
- Pisinger & Ropke (2007) — ALNS (Adaptive Large Neighborhood Search)
- Vidal et al. (2012, 2022) — HGS (Hybrid Genetic Search) operator set
- Ropke & Pisinger (2006) — Shaw removal, regret insertion
- Potvin & Rousseau (1995) — 2-opt*, or-opt for VRPTW
Every competitive agent (top 12) independently arrived at some combination of:
- Solomon I1 or regret-2 construction
- RVND (Randomized Variable Neighborhood Descent) with 6-8 operators
- ILS/ALNS destroy-repair metaheuristic
- Shaw/worst/random destroy + regret-2/greedy repair
This is striking because no agent was told to read the literature — they derived these techniques from first principles and by studying each other's code through the branching mechanism.
| Stage | Score Range | Equivalent Era | Key Unlock |
|---|---|---|---|
| Solomon baseline | 9525 | 1987 (Solomon) | Construction heuristic only |
| +Local search | 7600-8500 | 1990s (Potvin) | 2-opt, relocate, exchange |
| +Destroy-repair | 7000-7600 | 2006 (Ropke & Pisinger) | ALNS with random/worst/Shaw removal |
| +SA acceptance | 6900-7000 | 2007 (Pisinger & Ropke) | Stochastic acceptance escapes local optima |
| +RVND + K-neighbors | 6800-6900 | 2012-2022 (Vidal, HGS) | Neighbor-limited search, randomized operator order |
| +Route struct with lat arrays | 6791 | Modern (Vidal 2022) | O(1) feasibility → more iterations per second |
The critical dividing line is at ~6930 (rank 8). Below that, agents share the same architecture. What pushed the top 5 to 6791-6883:
| Technique | Top 5 Have It | Rank 8-12 Missing It |
|---|---|---|
| Stagnation-adaptive destroy % | Yes (scales 8%→30%+) | Fixed or simple scaling |
| Route elimination operator | All 5 | Most have it but less aggressive |
| Multiple repair strategies | regret-2 + greedy + noisy | Usually regret-2 only |
| K=40 neighbor lists | All 5 | Some use K=30 or none |
| Merge-to-fleet preprocessing | Top 1 (agile-fox) | Rely on construction alone |
Agile-fox (#1, 6791) — Won with the simplest acceptance criterion (pure best-improvement, no SA). This is counterintuitive — the literature strongly favors SA or threshold acceptance. Agile-fox compensated with aggressive stagnation scaling (+2% destroy per non-improving iteration) and a merge-to-fleet preprocessing step that other agents lacked. Its nearest-neighbor construction (not Solomon) also suggests the initial solution matters less than the metaheuristic quality.
Noble-wolf (#13, 7089, 64KB) — The most ambitious codebase. Implemented the full HGS-style swap* operator (remove two customers, reinsert optimally), noisy greedy repair, and the most sophisticated Route struct with O(1) try_insert(). Despite having the richest toolbox, it placed only 13th. The lesson: implementation complexity doesn't correlate with performance. Noble-wolf likely spent too many iterations in expensive operators rather than rapid cycling. This mirrors a known tradeoff in the literature — Vidal (2022) specifically warns that operator richness must be balanced against iteration speed.
Primal-viper (#5, 6883, 50KB) — Largest code of the top 5. Notable for K=40 neighbor lists used across all operators, and alternating regret-2/greedy repair (every 4th iteration uses regret). This repair diversity is well-supported by Ropke & Pisinger's adaptive weight findings.
Silver-nova (#3, 6861) — The only top-3 agent with SA acceptance. Uses geometric cooling (T₀ = 1.2% of distance → T_end = 0.1%), which is notably more conservative than the ~5% starting temperatures common in the ALNS literature. Its lat (latest arrival time) arrays for O(1) insertion feasibility are the key technique from Vidal's HGS.
The best published results on HG 400-node instances use:
- Population-based methods (HGS uses a genetic algorithm with population management) — no agent attempted this
- Education/crossover operators (PIX, OX, SREX) — completely absent
- Penalty-based infeasible exploration — all agents rejected infeasible solutions immediately; the literature shows relaxing constraints during search and penalizing violations produces better results
- Multi-objective fleet minimization — agents optimized distance only after ensuring feasibility, rather than treating fleet size as a soft constraint
- Problem-specific distance caching — most agents recomputed distances rather than caching move deltas
The gap between the swarm's best (6791) and published state-of-the-art for HG 400-node (typically ~6200-6400 depending on instance mix) is roughly 6-10%. Given the 30-second time limit and single-threaded constraint, this is a strong result — published benchmarks often allow minutes.
From the message log and branching history:
- Knowledge transfer worked: Agents explicitly studied each other's code. Silver-nova messaged "Studying inspiration from noble-wolf. Key insights: neighbor-based local search (O(nK) instead of O(n²)), node locator for O(1) position lookups" and then jumped from 7021 → 6861.
- The coin-flip branching created diversity: The 50/50 mechanism (global best vs. random peer) prevented convergence to a monoculture. Agents working from weaker branches sometimes found novel techniques.
- Failed hypotheses were valuable: 84 of 197 hypotheses failed, and the shared failure log prevented duplicate work. Cross-exchange (2,2) was repeatedly found to regress on HG instances — multiple agents reported this independently.
- Diminishing returns set in quickly: The score dropped from 9525 → 7059 in the first hour (26% improvement) but only 7059 → 6791 in the remaining 3 hours (4% more). This matches the typical VRPTW optimization curve.
The 19 scoring agents fell into 4 architectural families:
| Family | Agents | Score Range | Description |
|---|---|---|---|
| ILS + RVND | agile-fox, primal-atlas, silver-nova, sharp-orbit, bright-raven, fierce-owl, noble-wolf | 6791-7089 | Randomized VND with ILS destroy-repair. Dominant family. |
| ALNS + deterministic LS | primal-viper, crystal-lynx, astral-fox, sonic-raven, azure-vector | 6883-7467 | ALNS metaheuristic with fixed-order local search. |
| SA + hybrid LS | vivid-beacon, cosmic-helix | 6917-7652 | Simulated annealing acceptance with mixed operators. |
| Construction only | keen-phoenix | 9525 | Solomon baseline, no optimization. |
The ILS+RVND family dominated the top ranks, which aligns with recent literature favoring ILS over pure ALNS (see Accorsi & Vigo, 2021).
The experiment demonstrates that LLM agents can independently rediscover decades of combinatorial optimization research in hours. The progression from Solomon baseline to ALNS/ILS+RVND closely mirrors the historical development of VRPTW solvers (1987-2022). The main gap to state-of-the-art is the absence of population-based methods and infeasible-solution exploration — techniques that require more architectural sophistication than iterative refinement of a single solution.
The most surprising result: the winner (agile-fox) used no stochastic acceptance at all, defeating SA-equipped competitors through aggressive adaptive destruction and better preprocessing. This challenges the conventional wisdom that SA is essential for competitive VRPTW performance.