@ruvnet
Created May 28, 2026 16:54
GAIA L1 iter 65 — Co-Sight DAG: 3 tool fixes + full 53Q validation (25/53, 47.2%)


Branch: feat/iter-64-cosight-dag | PR: #2218 | Issue: #2156

Score: 25/53 (47.2%), cost $1.16 (about 3.9× cheaper than the ~$4.5 single-agent runs)

Benchmark progression

| iter | Architecture | Score | Cost | Notes |
| --- | --- | --- | --- | --- |
| 53b | Single-agent + attachment tools | 30/53 (56.6%) | ~$4.5 | ceiling with tools |
| 56 | Single-agent CodeAgent | 30/53 (56.6%) | ~$4.5 | current single-agent ceiling |
| 63b | Convergence n=2 | 19/53 (35.8%) | ~$9 | double-agent but worse |
| 64 | DAG 5Q pilot | 3/5 (60%) | $0.11 | small sample |
| 65 | DAG full 53Q | 25/53 (47.2%) | $1.16 | this run |

3 fixes applied (iter 65)

Fix 1 — python_exec tool (gaia-tools/python_exec.ts, new file)

  • PythonExecTool added to actor tool catalogue
  • Executes Python 3 via stdin pipe (injection-safe, 30s timeout, 4000-char cap)
  • Confirmed working: f918266a (Python file execution) → CORRECT
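
The stdin-pipe approach in Fix 1 can be sketched roughly as follows. This is a minimal illustration, not the contents of python_exec.ts: the names `runPythonViaStdin`, `capOutput`, and `ExecResult` are hypothetical, and it assumes the 4000-char cap applies to the captured output. It uses Node's `spawnSync` with the `input` and `timeout` options.

```typescript
import { spawnSync } from "node:child_process";

// Limits taken from the figures quoted above; the constant names are made up.
const TIMEOUT_MS = 30000;
const MAX_CHARS = 4000;

export interface ExecResult {
  ok: boolean;
  output: string;
}

// Truncate long output so a runaway print loop cannot flood the actor's context.
export function capOutput(text: string, maxChars: number = MAX_CHARS): string {
  if (text.length <= maxChars) return text;
  return text.slice(0, maxChars) + "\n[truncated]";
}

// Pipe `code` to the interpreter's stdin ("python3 -" reads the program from stdin).
// The code never appears on a command line, so no shell quoting is involved.
export function runPythonViaStdin(code: string, interpreter: string = "python3"): ExecResult {
  const proc = spawnSync(interpreter, ["-"], {
    input: code,
    encoding: "utf8",
    timeout: TIMEOUT_MS,
  });
  // `error` is set when the process could not be spawned or hit the timeout.
  if (proc.error) {
    return { ok: false, output: capOutput(proc.error.message) };
  }
  const succeeded = proc.status === 0;
  const text = succeeded ? proc.stdout : proc.stdout + proc.stderr;
  return { ok: succeeded, output: capOutput(text) };
}
```

Calling `runPythonViaStdin("print(6 * 7)")` on a machine with `python3` on the PATH would return the interpreter's stdout in `output`; a missing interpreter or a timeout yields `ok: false` rather than an exception.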

Fix 2 — DOCX extraction (gaia-tools/file_read.ts)

  • extractDocx() using python-docx subprocess (same runPython() pattern as openpyxl)
  • Moves .docx from STUB_BINARY_TYPES to HANDLED_BINARY_TYPES
  • The file is now readable, but cffe0e32 is still answered incorrectly (a reasoning issue, not an extraction issue)

Fix 3 — hyphen/dash normalization (gaia-judge.ts)

  • normaliseAnswer() collapses U+002D, U+2012, U+2013, U+2014 → space
  • 6 regression tests added (hyphen-minus, en-dash, em-dash, no false positives on "80GSFC21M0002")
  • Confirmed: 46719c30 "Human-Oriented" → CORRECT (was WRONG in iter 64)
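
As a rough illustration of Fix 3, the dash collapsing could look like the sketch below. It is not the code in gaia-judge.ts: the function name `normaliseAnswerSketch` is hypothetical, and the lowercasing and whitespace collapsing are assumptions added so that two answers can be compared directly.

```typescript
// Hyphen-minus (U+002D), figure dash (U+2012), en dash (U+2013), em dash (U+2014).
const DASHES = /[\u002D\u2012\u2013\u2014]/g;

export function normaliseAnswerSketch(raw: string): string {
  return raw
    .replace(DASHES, " ") // every dash variant becomes a space
    .replace(/\s+/g, " ") // assumption: collapse runs of whitespace
    .trim()
    .toLowerCase(); // assumption: case-insensitive comparison
}

// "Human-Oriented" and "Human Oriented" now compare equal.
export function sameAnswer(a: string, b: string): boolean {
  return normaliseAnswerSketch(a) === normaliseAnswerSketch(b);
}
```

An identifier such as "80GSFC21M0002" contains no dash, so it passes through with only its case changed. One caveat of this sketch: a leading minus sign is also a hyphen-minus, so signed numeric answers would need a separate comparison path.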

Full 53Q metrics

Score:      25/53 (47.17%)
Cost:       $1.16 total ($0.022/Q avg)
Avg steps:  1.83 steps/Q
Wall time:  43.1 s/Q mean

Overlap analysis vs iter 56 (single-agent ceiling)

  • 19 questions correct in both (solid base)
  • 6 new DAG wins (questions the single agent got wrong): 3cef3a44, 42576abe, 5a0c1adf, 5cfb274c, 99c9cc74, c714ab3a
  • 11 regressions (single-agent correct, DAG wrong)
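
The three buckets above are a plain set comparison, and the counts reconcile: 19 + 6 = 25 (DAG total) and 19 + 11 = 30 (single-agent total). A minimal sketch of the comparison follows; the `overlap` helper and the `q1`..`q3` IDs are placeholders for illustration, not real question IDs or the project's tooling.

```typescript
export interface Overlap {
  both: string[];  // correct in both runs
  onlyA: string[]; // correct only in run A (regressions when A is the baseline)
  onlyB: string[]; // correct only in run B (new wins when B is the new run)
}

export function overlap(a: Iterable<string>, b: Iterable<string>): Overlap {
  const setA = new Set(a);
  const setB = new Set(b);
  return {
    both: Array.from(setA).filter((id) => setB.has(id)),
    onlyA: Array.from(setA).filter((id) => !setB.has(id)),
    onlyB: Array.from(setB).filter((id) => !setA.has(id)),
  };
}

// Toy input: q1..q3 are placeholder IDs, 3cef3a44 is one of the listed DAG wins.
const singleAgentCorrect = ["q1", "q2", "q3"];
const dagCorrect = ["q2", "q3", "3cef3a44"];
const toyOverlap = overlap(singleAgentCorrect, dagCorrect);
// toyOverlap.both  -> ["q2", "q3"]
// toyOverlap.onlyA -> ["q1"]        (regression)
// toyOverlap.onlyB -> ["3cef3a44"]  (new DAG win)
```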

Failure mode breakdown (28 wrong)

| Category | Count | Root cause |
| --- | --- | --- |
| Blocked tool steps | 12 | grounded_query can't access private URLs (ClinicalTrials.gov, specific JSTOR pages) |
| Genuinely wrong factual | ~11 | knowledge gaps in grounded search |
| Off-by-unit / format | 3 | 17000 vs 17, $89,706 vs 89706, comma spacing |
| Answer verbosity | 2 | full sentence instead of single word |
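
The off-by-unit / format row suggests two cheap checks: parse numbers after stripping currency symbols and thousands separators, and flag pairs that differ by a power of 1000. The sketch below is one possible shape for those checks; `parseNumeric` and `looksLikeScalingError` are hypothetical names, not functions in gaia-judge.ts, and the set of currency symbols is an assumption.

```typescript
// Parse "$89,706" or "89706" to a number; return null for non-numeric answers.
export function parseNumeric(raw: string): number | null {
  const cleaned = raw
    .trim()
    .replace(/^[$€£]/, "")              // assumption: one leading currency symbol
    .replace(/,(?=\d{3}(\D|$))/g, "");  // drop commas only when used as thousands separators
  if (!/^-?\d+(\.\d+)?$/.test(cleaned)) return null;
  return Number(cleaned);
}

// True when the two values differ by a factor of 1e3, 1e6 or 1e9 in either direction,
// which usually means "thousands" vs "units" rather than a wrong fact.
export function looksLikeScalingError(predicted: number, expected: number): boolean {
  if (predicted === 0 || expected === 0 || predicted === expected) return false;
  const ratio = Math.abs(predicted / expected);
  return [1e3, 1e6, 1e9, 1e-3, 1e-6, 1e-9].some((r) => Math.abs(ratio / r - 1) < 1e-9);
}
```

With these helpers, "$89,706" and "89706" parse to the same number, and the 17000 vs 17 pair is flagged as a likely scaling mismatch instead of being scored as an unrelated wrong answer.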

Why DAG is below single-agent ceiling

The DAG architecture gains 6 genuinely new wins through multi-step reasoning decomposition but loses 11 questions that the single agent answers correctly, a net change of −5 (30 → 25). Key insight: 12 of the 28 wrong answers have at least one blocked step. When grounded_query can't retrieve a dependency, the finalizer returns null/failure instead of reasoning from partial evidence.

The single-agent CodeAgent has richer fallbacks (visit_webpage, Python execution) and can redirect itself mid-run. The DAG is more rigid: a blocked step breaks every step that depends on it.
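
To make the rigidity concrete, the sketch below shows how one blocked step propagates through a dependency chain. The `Step` shape and `unusableSteps` function are hypothetical and only illustrate the failure mode; they are not the Co-Sight DAG implementation, and the sketch assumes steps are listed in topological order.

```typescript
export interface Step {
  id: string;
  dependsOn: string[];
  blocked: boolean; // true when the step's own tool call failed
}

// Returns the ids of steps that cannot produce a result: blocked themselves,
// or downstream of a blocked step. Assumes `steps` is in topological order.
export function unusableSteps(steps: Step[]): Set<string> {
  const unusable = new Set<string>();
  for (const step of steps) {
    if (step.blocked || step.dependsOn.some((dep) => unusable.has(dep))) {
      unusable.add(step.id);
    }
  }
  return unusable;
}

// Toy plan: one blocked lookup takes out the step that needs it and the final answer.
const toyPlan: Step[] = [
  { id: "lookup", dependsOn: [], blocked: true },
  { id: "extract", dependsOn: ["lookup"], blocked: false },
  { id: "side-fact", dependsOn: [], blocked: false },
  { id: "final", dependsOn: ["extract", "side-fact"], blocked: false },
];
const toyUnusable = unusableSteps(toyPlan);
// toyUnusable contains "lookup", "extract" and "final"; only "side-fact" survives.
```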

iter 66 priorities

  1. Graceful degradation: finalizer must produce best-effort answer from partial step results
  2. visit_webpage tool: add to actor catalogue so actors can follow specific URLs (not just search)
  3. Answer extraction tightening: strip verbose preamble ("X stands for Y" → "Y")
  4. Unit/format normalization: detect scaling errors, strip currency symbols
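
Priority 1 could take roughly the following shape: the finalizer keeps whatever evidence survived and marks the answer as partial instead of returning null as soon as one step is missing. This is a sketch under assumptions, not the planned implementation. `StepResult`, `FinalAnswer`, and `finalizeBestEffort` are hypothetical names, and `synthesize` stands in for whatever model call turns evidence into an answer.

```typescript
export interface StepResult {
  id: string;
  output: string | null; // null means the step was blocked
}

export interface FinalAnswer {
  answer: string | null;
  partial: boolean;  // true when at least one step contributed nothing
  missing: string[]; // ids of the blocked steps
}

export function finalizeBestEffort(
  results: StepResult[],
  synthesize: (evidence: string[]) => string,
): FinalAnswer {
  const evidence: string[] = [];
  const missing: string[] = [];
  for (const r of results) {
    if (r.output === null) missing.push(r.id);
    else evidence.push(r.output);
  }
  // Give up only when nothing at all was retrieved.
  if (evidence.length === 0) return { answer: null, partial: true, missing };
  return { answer: synthesize(evidence), partial: missing.length > 0, missing };
}
```

The `partial` flag lets the benchmark harness track how often a best-effort answer was used, so the effect of this change can be measured separately from the other three priorities.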

Files changed

  • v3/@claude-flow/cli/src/benchmarks/gaia-tools/python_exec.ts (new)
  • v3/@claude-flow/cli/src/benchmarks/gaia-tools/file_read.ts
  • v3/@claude-flow/cli/src/benchmarks/gaia-tools/index.ts
  • v3/@claude-flow/cli/src/benchmarks/gaia-judge.ts
  • docs/benchmarks/runs/gaia-l1-iter65-dag-full.json (benchmark artifact)

Commit: b6e1e4f68
