Branch: feat/iter-64-cosight-dag | PR: #2218 | Issue: #2156
| iter | Architecture | Score | Cost | Notes |
|---|---|---|---|---|---|
| 53b | Single-agent + attachment tools | 30/53 (56.6%) | ~$4.5 | ceiling with tools |
| 56 | Single-agent CodeAgent | 30/53 (56.6%) | ~$4.5 | current single-agent ceiling |
| 63b | Convergence n=2 | 19/53 (35.8%) | ~$9 | double-agent but worse |
| 64 | DAG 5Q pilot | 3/5 (60%) | $0.11 | small sample |
| 65 | DAG full 53Q | 25/53 (47.2%) | $1.16 | this run |
Fix 1 — python_exec tool (gaia-tools/python_exec.ts, new file)
- `PythonExecTool` added to actor tool catalogue
- Executes Python 3 via stdin pipe (injection-safe, 30s timeout, 4000-char cap)
- Confirmed working: f918266a (Python file execution) → CORRECT
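A minimal sketch of the stdin-pipe approach described in Fix 1, not the actual contents of `python_exec.ts`. The names `runPythonViaStdin` and `capOutput`, the `python3` binary name, and the truncation and timeout markers are assumptions for illustration. The point is that the source text never appears in a shell string or in argv, which is what makes the pipe injection-safe.

```typescript
import { spawn } from "node:child_process";

const TIMEOUT_MS = 30_000; // 30s timeout, as stated in Fix 1
const MAX_OUTPUT_CHARS = 4000; // 4000-char cap, as stated in Fix 1

/** Keep at most `max` characters of output and append a marker if anything was cut. */
function capOutput(text: string, max: number = MAX_OUTPUT_CHARS): string {
  if (text.length <= max) return text;
  return text.slice(0, max) + "\n[truncated]";
}

/** Hypothetical helper: run Python source by piping it to `python3 -` on stdin. */
function runPythonViaStdin(source: string, timeoutMs: number = TIMEOUT_MS): Promise<string> {
  return new Promise((resolve, reject) => {
    // No shell is involved and the code is not passed as an argument.
    const child = spawn("python3", ["-"], { stdio: ["pipe", "pipe", "pipe"] });
    let out = "";
    let timedOut = false;

    // Kill the interpreter if it runs past the deadline.
    const timer = setTimeout(() => {
      timedOut = true;
      child.kill("SIGKILL");
    }, timeoutMs);

    child.stdout.on("data", (chunk) => { out += String(chunk); });
    child.stderr.on("data", (chunk) => { out += String(chunk); });
    // Ignore pipe errors on stdin (for example when the interpreter exits early).
    child.stdin.on("error", () => {});
    child.on("error", (err) => { clearTimeout(timer); reject(err); });
    child.on("close", (code) => {
      clearTimeout(timer);
      if (timedOut) {
        resolve(capOutput(out) + `\n[timed out after ${timeoutMs} ms]`);
      } else if (code !== 0) {
        resolve(capOutput(out) + `\n[exit code ${code}]`);
      } else {
        resolve(capOutput(out));
      }
    });

    // Send the program over stdin and close the pipe so Python starts executing.
    child.stdin.end(source);
  });
}
```

Returning the error text instead of rejecting on a non-zero exit is a design choice in this sketch: it lets the calling agent read the traceback and retry.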
Fix 2 — DOCX extraction (gaia-tools/file_read.ts)
- `extractDocx()` using python-docx subprocess (same `runPython()` pattern as openpyxl)
- Moves `.docx` from `STUB_BINARY_TYPES` to `HANDLED_BINARY_TYPES`
- File readable but cffe0e32 still wrong (reasoning issue, not extraction)
Fix 3 — hyphen/dash normalization (gaia-judge.ts)
- `normaliseAnswer()` collapses U+002D, U+2012, U+2013, U+2014 → space
- 6 regression tests added (hyphen-minus, en-dash, em-dash, no false positives on "80GSFC21M0002")
- Confirmed: 46719c30 "Human-Oriented" → CORRECT (was WRONG in iter 64)
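The dash handling in Fix 3 can be illustrated with a small sketch. This is not the actual `normaliseAnswer()` from `gaia-judge.ts`: the function names are hypothetical, and the whitespace collapsing and lowercasing steps are assumptions added so the comparison is self-contained.

```typescript
// The four code points listed in Fix 3: hyphen-minus, figure dash, en dash, em dash.
const DASHES = /[\u002D\u2012\u2013\u2014]/g;

/** Hypothetical stand-in for the judge's normaliser. */
function normaliseAnswerSketch(raw: string): string {
  return raw
    .replace(DASHES, " ") // every dash variant becomes a space
    .replace(/\s+/g, " ") // assumption: collapse runs of whitespace
    .trim()
    .toLowerCase(); // assumption: case-insensitive comparison
}

/** Two answers match when their normalised forms are identical. */
function answersMatch(predicted: string, expected: string): boolean {
  return normaliseAnswerSketch(predicted) === normaliseAnswerSketch(expected);
}

console.log(answersMatch("Human-Oriented", "Human Oriented"));
console.log(answersMatch("80GSFC21M0002", "80 GSFC 21M0002"));
```

Because only dash characters are rewritten, an identifier without dashes such as "80GSFC21M0002" is not split, so it does not match a spaced variant.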
Score: 25/53 (47.17%)
Cost: $1.16 total ($0.022/Q avg)
Avg steps: 1.83 steps/Q
Wall time: 43.1 s/Q mean
- 19 questions correct in both the single-agent and DAG runs (solid base)
- 6 new DAG wins (questions single-agent couldn't): 3cef3a44, 42576abe, 5a0c1adf, 5cfb274c, 99c9cc74, c714ab3a
- 11 regressions (single-agent correct, DAG wrong)
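The overlap breakdown above is a set comparison between the question IDs each run answered correctly. A minimal sketch, assuming each run is available as a set of correct IDs; `compareRuns` is a hypothetical helper and the IDs in the usage example are placeholders, not real question IDs.

```typescript
interface RunComparison {
  both: string[]; // correct in baseline and candidate
  wins: string[]; // correct only in candidate
  regressions: string[]; // correct only in baseline
}

/** Hypothetical helper: split two sets of correct question IDs into overlap, wins, and regressions. */
function compareRuns(baseline: Set<string>, candidate: Set<string>): RunComparison {
  const base = Array.from(baseline);
  const cand = Array.from(candidate);
  return {
    both: cand.filter((id) => baseline.has(id)),
    wins: cand.filter((id) => !baseline.has(id)),
    regressions: base.filter((id) => !candidate.has(id)),
  };
}

// Placeholder IDs for illustration only.
const singleAgent = new Set(["q1", "q2", "q3"]);
const dag = new Set(["q2", "q3", "q4"]);
const diff = compareRuns(singleAgent, dag);
console.log(diff);
```

The reported counts are consistent with this decomposition: 19 + 6 = 25 correct for the DAG run and 19 + 11 = 30 correct for the single-agent run.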
| Category | Count | Root cause |
|---|---|---|
| Blocked tool steps | 12 | grounded_query can't access private URLs (ClinicalTrials.gov, specific JSTOR pages) |
| Genuinely wrong factual | ~11 | knowledge gaps in grounded search |
| Off-by-unit / format | 3 | 17000 vs 17, $89,706 vs 89706, comma spacing |
| Answer verbosity | 2 | full sentence instead of single word |
The DAG architecture gets 6 genuine new wins through multi-step reasoning decomposition, but loses 11 that single-agent handles correctly. Key insight: 12/28 wrong answers have at least one blocked step — when grounded_query can't retrieve a dependency, the finalizer returns null/failure instead of reasoning from partial evidence.
Single-agent CodeAgent has richer fallback (visit_webpage, Python execution) and can self-redirect mid-stream. DAG is more rigid: a blocked step breaks its dependency chain.
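The contrast between the rigid behaviour and a best-effort finalizer can be sketched as follows. All types and function names here are hypothetical, and `synthesize` stands in for whatever model call turns evidence into an answer; the real finalizer may be structured differently.

```typescript
type StepResult =
  | { id: string; status: "ok"; output: string }
  | { id: string; status: "blocked"; reason: string };

interface FinalAnswer {
  answer: string | null;
  basis: "complete" | "partial" | "none";
  blockedSteps: string[];
}

type Synthesize = (evidence: string[]) => string;

/** Separate usable step outputs from the IDs of blocked steps. */
function collectEvidence(steps: StepResult[]): { evidence: string[]; blocked: string[] } {
  const evidence: string[] = [];
  const blocked: string[] = [];
  for (const step of steps) {
    if (step.status === "ok") evidence.push(step.output);
    else blocked.push(step.id);
  }
  return { evidence, blocked };
}

/** Rigid behaviour: a single blocked step fails the whole answer. */
function finalizeStrict(steps: StepResult[], synthesize: Synthesize): string | null {
  const { evidence, blocked } = collectEvidence(steps);
  return blocked.length > 0 ? null : synthesize(evidence);
}

/** Best-effort behaviour: answer from whatever evidence exists and report what was missing. */
function finalizeBestEffort(steps: StepResult[], synthesize: Synthesize): FinalAnswer {
  const { evidence, blocked } = collectEvidence(steps);
  if (evidence.length === 0) {
    return { answer: null, basis: "none", blockedSteps: blocked };
  }
  return {
    answer: synthesize(evidence),
    basis: blocked.length === 0 ? "complete" : "partial",
    blockedSteps: blocked,
  };
}
```

Tagging the result as `partial` keeps the degraded answers distinguishable in later analysis, so their accuracy can be measured separately from fully grounded ones.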
- Graceful degradation: finalizer must produce best-effort answer from partial step results
- visit_webpage tool: add to actor catalogue so actors can follow specific URLs (not just search)
- Answer extraction tightening: strip verbose preamble ("X stands for Y" → "Y")
- Unit/format normalization: detect scaling errors, strip currency symbols
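As one illustration of the unit/format item, the sketch below strips currency symbols and thousands separators and flags answers that differ by a factor of 1000 or 1,000,000. Both function names, the set of currency symbols, and the choice of scale factors are assumptions, not part of the existing judge.

```typescript
/** Hypothetical helper: "$89,706" becomes "89706"; non-numeric strings are returned trimmed but otherwise unchanged. */
function stripNumericFormatting(raw: string): string {
  const trimmed = raw.trim();
  // Optional currency symbol, then either comma-grouped digits or plain digits, then optional decimals.
  const m = /^[$€£]?\s*(-?\d{1,3}(?:,\d{3})+|-?\d+)(\.\d+)?$/.exec(trimmed);
  if (!m) return trimmed;
  return m[1].replace(/,/g, "") + (m[2] ?? "");
}

/** Hypothetical helper: true when two numbers differ exactly by a factor of 1000 or 1,000,000. */
function isScalingMismatch(a: number, b: number): boolean {
  const hi = Math.max(Math.abs(a), Math.abs(b));
  const lo = Math.min(Math.abs(a), Math.abs(b));
  if (lo === 0) return false;
  return hi === lo * 1000 || hi === lo * 1_000_000;
}

console.log(stripNumericFormatting("$89,706"));
console.log(isScalingMismatch(17000, 17));
```

A scaling mismatch is only flagged here, not auto-corrected, since deciding which unit the question asked for needs the question text.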
- `v3/@claude-flow/cli/src/benchmarks/gaia-tools/python_exec.ts` (new)
- `v3/@claude-flow/cli/src/benchmarks/gaia-tools/file_read.ts`
- `v3/@claude-flow/cli/src/benchmarks/gaia-tools/index.ts`
- `v3/@claude-flow/cli/src/benchmarks/gaia-judge.ts`
- `docs/benchmarks/runs/gaia-l1-iter65-dag-full.json` (benchmark artifact)
Commit: b6e1e4f68