GAIA L1 iter 65 — Co-Sight DAG: 3 tool fixes + full 53Q validation

Branch: feat/iter-64-cosight-dag | PR: #2218 | Issue: #2156

Score: 25/53 (47.2%), cost $1.16 (about 3.9× cheaper than the ~$4.5 single-agent runs)

Benchmark progression

iter	Architecture	Score	Cost	Notes
53b	Single-agent + attachment tools	30/53 (56.6%)	~$4.5	ceiling with tools
56	Single-agent CodeAgent	30/53 (56.6%)	~$4.5	current single-agent ceiling
63b	Convergence n=2	19/53 (35.8%)	~$9	two agents, but a lower score
64	DAG 5Q pilot	3/5 (60%)	$0.11	small sample
65	DAG full 53Q	25/53 (47.2%)	$1.16	this run

3 fixes applied (iter 65)

Fix 1 — python_exec tool (gaia-tools/python_exec.ts, new file)

PythonExecTool added to actor tool catalogue
Executes Python 3 via stdin pipe (injection-safe, 30s timeout, 4000-char cap)
Confirmed working: f918266a (Python file execution) → CORRECT
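
A minimal sketch of the stdin-pipe pattern described above, not the actual PythonExecTool. The names runPythonViaStdin, capOutput, and PythonResult are hypothetical, and the 4000-char cap is assumed here to apply to the captured output. Because the source is passed as stdin rather than in argv or a shell string, there is no quoting for the model-written code to break out of.

```typescript
import { spawnSync } from "node:child_process";

// Limits taken from the notes above; everything else in this sketch is assumed.
const TIMEOUT_MS = 30000; // 30 s timeout
const MAX_CHARS = 4000;   // 4000-char cap (assumed to be on output)

/** Clip text to the cap and mark the cut so the caller can tell it was truncated. */
function capOutput(text: string, maxChars: number = MAX_CHARS): string {
  if (text.length <= maxChars) return text;
  return text.slice(0, maxChars) + "\n[truncated]";
}

interface PythonResult {
  ok: boolean;
  output: string;
}

/** Run Python source by piping it to the interpreter's stdin ("python3 -"). */
function runPythonViaStdin(code: string, interpreter: string = "python3"): PythonResult {
  const proc = spawnSync(interpreter, ["-"], {
    input: code,        // the program arrives on stdin, never via a shell
    encoding: "utf8",
    timeout: TIMEOUT_MS, // the child is killed if it runs longer than this
  });
  if (proc.error) {
    // Covers a missing interpreter (ENOENT) and timeouts (ETIMEDOUT).
    return { ok: false, output: capOutput(proc.error.message) };
  }
  const combined = (proc.stdout ?? "") + (proc.stderr ?? "");
  return { ok: proc.status === 0, output: capOutput(combined) };
}
```

With a Python 3 interpreter on the PATH, runPythonViaStdin("print(6 * 7)") would return the script's stdout in output. The synchronous call keeps the sketch short; an async spawn would suit a real tool loop better.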

Fix 2 — DOCX extraction (gaia-tools/file_read.ts)

extractDocx() using python-docx subprocess (same runPython() pattern as openpyxl)
Moves .docx from STUB_BINARY_TYPES to HANDLED_BINARY_TYPES
File readable but cffe0e32 still wrong (reasoning issue, not extraction)

Fix 3 — hyphen/dash normalization (gaia-judge.ts)

normaliseAnswer() collapses U+002D, U+2012, U+2013, U+2014 → space
6 regression tests added (hyphen-minus, en-dash, em-dash, no false positives on "80GSFC21M0002")
Confirmed: 46719c30 "Human-Oriented" → CORRECT (was WRONG in iter 64)
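
The dash collapse can be sketched as follows. This is an illustration under assumptions, not the code in gaia-judge.ts: the function name normaliseDashes is hypothetical, and the whitespace collapsing and trimming are added here so that the results compare cleanly.

```typescript
// The four code points listed above: hyphen-minus, figure dash, en dash, em dash.
const DASHES = /[\u002D\u2012\u2013\u2014]/g;

/** Replace every dash variant with a space, then collapse runs of whitespace. */
function normaliseDashes(answer: string): string {
  return answer
    .replace(DASHES, " ")
    .replace(/\s+/g, " ")
    .trim();
}

// Both spellings reduce to the same string, so a judge comparing
// normalised answers would treat them as equal.
const a = normaliseDashes("Human-Oriented");      // hyphen-minus
const b = normaliseDashes("Human\u2013Oriented"); // en dash
console.log(a === b);
```

An identifier with no dashes, such as "80GSFC21M0002", passes through unchanged, which matches the no-false-positive regression test mentioned above.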

Full 53Q metrics

Score:      25/53 (47.17%)
Cost:       $1.16 total ($0.022/Q avg)
Avg steps:  1.83 steps/Q
Wall time:  43.1 s/Q mean

Overlap analysis vs iter 56 (single-agent ceiling)

19 questions correct in both (solid base)
6 new DAG wins (questions the single-agent run got wrong): 3cef3a44, 42576abe, 5a0c1adf, 5cfb274c, 99c9cc74, c714ab3a
11 regressions (single-agent correct, DAG wrong)
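
The three counts above are a plain set comparison between the question IDs each run answered correctly: 19 + 6 gives the DAG's 25, and 19 + 11 gives the single-agent's 30. A minimal sketch, assuming each run's correct IDs are available as a Set (compareRuns and Overlap are hypothetical names, and the IDs in the usage lines are placeholders, not benchmark data):

```typescript
interface Overlap {
  both: string[];        // correct in both runs
  newWins: string[];     // correct only in the candidate run
  regressions: string[]; // correct only in the baseline run
}

/** Partition question IDs by which of the two runs answered them correctly. */
function compareRuns(baselineCorrect: Set<string>, candidateCorrect: Set<string>): Overlap {
  const candidate = Array.from(candidateCorrect);
  const baseline = Array.from(baselineCorrect);
  return {
    both: candidate.filter((id) => baselineCorrect.has(id)),
    newWins: candidate.filter((id) => !baselineCorrect.has(id)),
    regressions: baseline.filter((id) => !candidateCorrect.has(id)),
  };
}

// Placeholder IDs for illustration only.
const overlap = compareRuns(new Set(["q1", "q2", "q3"]), new Set(["q2", "q3", "q4"]));
console.log(overlap.both.length, overlap.newWins.length, overlap.regressions.length);
```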

Failure mode breakdown (28 wrong)

Category	Count	Root cause
Blocked tool steps	12	grounded_query can't retrieve specific or gated URLs (ClinicalTrials.gov, specific JSTOR pages)
Genuinely wrong factual	~11	knowledge gaps in grounded search
Off-by-unit / format	3	17000 vs 17, $89,706 vs 89706, comma spacing
Answer verbosity	2	full sentence instead of single word

Why DAG is below single-agent ceiling

The DAG architecture gets 6 genuine new wins through multi-step reasoning decomposition, but loses 11 that single-agent handles correctly, a net loss of 5 (25 vs 30). Key insight: 12/28 wrong answers have at least one blocked step. When grounded_query can't retrieve a dependency, the finalizer returns null/failure instead of reasoning from partial evidence.

Single-agent CodeAgent has richer fallback (visit_webpage, Python execution) and can self-redirect mid-stream. DAG is more rigid: a blocked step breaks its dependency chain.
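
One possible shape for a finalizer input that tolerates blocked steps is sketched below. It is an assumption about how partial results might be passed on, not the current implementation: StepResult, FinalizerInput, and buildFinalizerInput are hypothetical names. The idea is to give up only when no step produced evidence, and otherwise to flag the answer as best-effort.

```typescript
type StepResult =
  | { id: string; status: "ok"; output: string }
  | { id: string; status: "blocked"; reason: string };

interface FinalizerInput {
  question: string;
  evidence: string[];  // outputs of the steps that succeeded
  missing: string[];   // blocked steps, so the prompt can state what is unknown
  bestEffort: boolean; // true when at least one dependency was blocked
}

/** Collect whatever evidence exists instead of failing on the first blocked step. */
function buildFinalizerInput(question: string, steps: StepResult[]): FinalizerInput | null {
  const evidence: string[] = [];
  const missing: string[] = [];
  for (const step of steps) {
    if (step.status === "ok") {
      evidence.push(`${step.id}: ${step.output}`);
    } else {
      missing.push(`${step.id}: ${step.reason}`);
    }
  }
  if (evidence.length === 0) return null; // nothing at all to reason from
  return { question, evidence, missing, bestEffort: missing.length > 0 };
}
```

A finalizer prompt built from this input could list the missing items explicitly and still ask for a single best answer.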

iter 66 priorities

Graceful degradation: finalizer must produce best-effort answer from partial step results
visit_webpage tool: add to actor catalogue so actors can follow specific URLs (not just search)
Answer extraction tightening: strip verbose preamble ("X stands for Y" → "Y")
Unit/format normalization: detect scaling errors, strip currency symbols
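
The unit/format item could start from something like the sketch below, which covers the two numeric cases in the failure table ($89,706 vs 89706 and 17000 vs 17). The helper names parseNumeric and scalingMismatch, the currency symbols handled, and the factors checked are all assumptions for illustration.

```typescript
/** Strip a leading currency symbol and thousands separators; null if not a plain number. */
function parseNumeric(answer: string): number | null {
  const cleaned = answer.trim().replace(/^[$€£]/, "").replace(/,/g, "");
  if (!/^-?\d+(\.\d+)?$/.test(cleaned)) return null;
  return Number(cleaned);
}

/**
 * Return the factor by which predicted is off from expected (1000, 1e6, 1e9,
 * or their inverses), or null when the two are not related by such a factor.
 */
function scalingMismatch(predicted: string, expected: string): number | null {
  const p = parseNumeric(predicted);
  const e = parseNumeric(expected);
  if (p === null || e === null || p === 0 || e === 0) return null;
  const close = (x: number, y: number): boolean =>
    Math.abs(x - y) <= 1e-9 * Math.max(Math.abs(x), Math.abs(y));
  for (const factor of [1e3, 1e6, 1e9]) {
    if (close(p, e * factor)) return factor;     // predicted is too large
    if (close(p * factor, e)) return 1 / factor; // predicted is too small
  }
  return null;
}

console.log(parseNumeric("$89,706"));        // 89706
console.log(scalingMismatch("17000", "17")); // 1000
```

A detected factor is probably better logged as a diagnostic than used to auto-correct an answer, since rescaling silently could mask genuine errors.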

Files changed

v3/@claude-flow/cli/src/benchmarks/gaia-tools/python_exec.ts (new)
v3/@claude-flow/cli/src/benchmarks/gaia-tools/file_read.ts
v3/@claude-flow/cli/src/benchmarks/gaia-tools/index.ts
v3/@claude-flow/cli/src/benchmarks/gaia-judge.ts
docs/benchmarks/runs/gaia-l1-iter65-dag-full.json (benchmark artifact)

Commit: b6e1e4f68

