A working-practices doc, not a research claim. This is the workflow I actually use day-to-day for building software with multi-model agent loops. It's lived, not benchmarked. If something here is useful to you, take it; if it isn't, ignore it.
The diagram lives alongside this doc as build-workflow.drawio.
The Auditor is a context manager for the Manager, not just a quality gate.
Most agent loop designs treat the auditor role as pass/fail on worker output. In this workflow, the Auditor does that and something more important: it reads the full Worker output, code diff, and test results, and what flows up to the Manager is a compressed verdict plus a recommendation. The Manager never has to load raw Worker artifacts into its own context to decide what to do next.
That changes which resource you're optimizing for. Manager context is the scarcest thing in a long-running build session — it's holding the spec, the task graph, milestone state, and the history of decisions made so far. Every token of raw Worker output you keep out of it extends how long the Manager can run before context rot sets in.
This is why the Auditor needs to be a capable model, not a cheap one. If you run the Auditor on a small model, you get either bad audits or bad summaries, and either failure mode poisons the Manager's working state. In my fleet, Auditor and Worker draw from the same model tier (DS Flash V4, Qwen 3.6 35B MoE, with frontier runners-up): same capability, different role.
If you take one thing from this doc, take that.
Planner. Refines the idea into a spec. The spec is append-only, and it is the core truth of the project. Edits are appends, not rewrites, so you get a free audit trail of how the project's requirements evolved. Anything downstream that needs "the spec" gets the resolved view, not the raw append log.
Manager. Decomposes the spec into a phased task list with audit gating at minor and major milestones. Makes the final call on retask vs. halt. The judgment is between local failures (this task got it wrong, and the next task can absorb the fix) and structural failures (the spec or plan is wrong, and a human needs to weigh in). This judgment is the Manager's actual value; don't try to formalize it into a rule.
Worker. Executes one task at a time. Runs the automated tests for that task. Stops at designated checkpoints and reports out for verification. Scope discipline is what makes local models viable here: a 35B MoE doing one well-defined task with tests is a different problem from the same model trying to plan.
Auditor. Audits Worker output against the codebase and spec, then compresses that audit into a contracted output format (see below) for the Manager. Reports findings and a recommended action.
Exit Manager. Spec conformance gate. Tests the running program with full project context. Applies surgical edits within bounded scope. If a structural fix is needed, kicks back to the Manager with a diagnosis, which re-enters the audited build loop. On a pass, closes the session.
This role needs the largest available context window, not just the strongest reasoner. It's the one role that can't be scoped down, because validating against spec means validating against everything.
User. Manages physical handoffs between models. Can halt any window. Gets prompted by the Manager when a structural decision needs human judgment.
This is the load-bearing piece. The Auditor's value as a context filter only works if its output format is consistent.
verdict: pass | fail | partial
diagnosis: one line
rec_action: continue | retask | escalate
pointer: file / function / test
The Manager pulls more detail only on demand. A verbose Auditor that writes paragraphs of findings defeats the purpose. A terse one that just says "fail, retry" loses the information needed to decide local vs. structural.
This contract gets enforced in the Auditor's system prompt. It's one of the few places in the workflow where being prescriptive about output format pays off significantly.
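The system prompt is where the contract gets enforced; a check on the receiving side can catch the cases where the Auditor drifts anyway. Below is a minimal sketch, assuming the Auditor emits exactly the four fields above, one per line, in that order. The names `AuditReport` and `parse_audit` are hypothetical and not part of the workflow itself.

```python
from dataclasses import dataclass

# Allowed values, taken from the contract above.
VERDICTS = {"pass", "fail", "partial"}
ACTIONS = {"continue", "retask", "escalate"}
FIELDS = ("verdict", "diagnosis", "rec_action", "pointer")


@dataclass(frozen=True)
class AuditReport:
    verdict: str
    diagnosis: str
    rec_action: str
    pointer: str


def parse_audit(text: str) -> AuditReport:
    """Parse the four-line contract; raise ValueError on any deviation."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    # Extra lines mean the Auditor wrote paragraphs; missing lines mean
    # it was too terse. Both are contract violations.
    if len(lines) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} lines, got {len(lines)}")
    values = {}
    for expected, line in zip(FIELDS, lines):
        key, sep, value = line.partition(":")
        if not sep or key.strip() != expected:
            raise ValueError(f"expected field {expected!r}, got {line!r}")
        values[expected] = value.strip()
    if values["verdict"] not in VERDICTS:
        raise ValueError(f"bad verdict: {values['verdict']!r}")
    if values["rec_action"] not in ACTIONS:
        raise ValueError(f"bad rec_action: {values['rec_action']!r}")
    if not values["diagnosis"]:
        raise ValueError("diagnosis must not be empty")
    return AuditReport(**values)
```

A report that fails to parse can be bounced back to the Auditor instead of being loaded into the Manager's context, which keeps malformed output from costing Manager tokens.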
- Idea → Planner → spec append.
- Manager reads resolved spec, decomposes into task list.
- Manager hands a task to a Worker.
- Worker executes, runs tests, reports.
- Auditor reads full output + diff + tests. Emits contracted verdict to Manager.
- Manager decides: continue to next task, retask (same or different prompt), or halt for user input.
- At major milestones, Exit Manager validates whole-project conformance.
- Exit Manager passes → done. Structural fix needed → kicks back to Manager (audited path). Surgical fix in scope → applies and continues.
The user can interrupt any stage at any time. That's not a feature, it's a property of the workflow being human-in-the-loop rather than autonomous.
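The inner part of the flow above (Worker, Auditor, Manager decision) can be sketched as a small loop. This is an illustration under stated assumptions, not the actual orchestration: `run_build`, the three callables, and the `max_retries` cap are all hypothetical, and the Planner and Exit Manager steps are left out. The `decide` callable stands in for the Manager's judgment, which stays a judgment call, not a rule.

```python
from typing import Callable, Iterable, Optional, Tuple


def run_build(
    tasks: Iterable[str],
    work: Callable[[str], str],
    audit: Callable[[str, str], dict],
    decide: Callable[[str, dict], str],
    max_retries: int = 2,  # assumed safety cap, not part of the workflow
) -> Tuple[str, Optional[str]]:
    """Return ("done", None), or ("halted", task) when the user is needed."""
    for task in tasks:
        attempts = 0
        while True:
            raw_output = work(task)           # Worker: execute and run tests
            report = audit(task, raw_output)  # Auditor: reads everything
            # The Manager only ever sees the compressed report,
            # never raw_output.
            action = decide(task, report)
            if action == "continue":
                break
            if action == "retask" and attempts < max_retries:
                attempts += 1
                continue
            # "halt", an unknown action, or too many retries:
            # stop and hand control to the user.
            return ("halted", task)
    return ("done", None)
```

The point of the shape is that `raw_output` is passed to `audit` and nowhere else, so the Manager's side of the loop grows by one short report per task.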
Append-only matters more than it sounds. Two reasons:
- No drift. The Manager and Exit Manager are always validating against the same accumulated truth. There's no "version of the spec the Worker thought it was building against."
- Free history. Every requirement change is timestamped and contextual. When you look back at why a decision was made, the spec evolution shows you.
The thing to be careful about: hand downstream roles the resolved view of the spec, not the raw append log. Otherwise they'll evaluate against superseded requirements.
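One way to picture the difference between the raw append log and the resolved view is the sketch below. It assumes each requirement has a stable key and that a later append for the same key supersedes the earlier one; the doc does not say how the spec is actually stored or resolved, so `SpecLog`, `SpecEntry`, and `req_id` are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, List, Optional


@dataclass(frozen=True)
class SpecEntry:
    req_id: str            # hypothetical stable requirement key
    text: Optional[str]    # None marks the requirement as withdrawn
    note: str              # context for why the change was made
    at: datetime           # timestamp of the append


class SpecLog:
    """Append-only log; entries are never edited or removed."""

    def __init__(self) -> None:
        self._entries: List[SpecEntry] = []

    def append(self, req_id: str, text: Optional[str], note: str = "") -> None:
        self._entries.append(
            SpecEntry(req_id, text, note, datetime.now(timezone.utc))
        )

    def history(self, req_id: str) -> List[SpecEntry]:
        """Every append for one requirement, oldest first."""
        return [e for e in self._entries if e.req_id == req_id]

    def resolved(self) -> Dict[str, str]:
        """Current requirements only: the latest append per key wins."""
        view: Dict[str, str] = {}
        for entry in self._entries:
            if entry.text is None:
                view.pop(entry.req_id, None)
            else:
                view[entry.req_id] = entry.text
        return view
```

Downstream roles would be handed `resolved()`, while `history()` is what you read when you want to know why a requirement changed.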
Frontier (Planner / Manager / Exit Manager)
- Primary: Opus 4.6/4.7, GPT 5.4/5.5
- Runners-up: Gemini Pro 3.1, GLM 5.1, Kimi 2.6, Qwen 3.6 MAX
- Exit Manager: bias toward largest context window available.
Worker / Auditor (same tier — both jobs need real capability)
- Primary: DS Flash V4, Qwen 3.6 35B MoE (local)
- Runners-up: GLM 5.1, Kimi 2.6, Sonnet 4.6, Gemini Flash 3.0, Gemini Pro 3.1, GPT 5.3
Local models cover the Worker tier when latency or sovereignty matters. The MoE preference is real — small active parameter counts hit the tok/s threshold (30–60 tok/s minimum) that makes the loop usable rather than painful.
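A rough way to see why the threshold matters is to convert tok/s into wait time per Worker turn. The 1,500-token output size below is an assumed figure for illustration, not a measurement, and the calculation ignores prompt processing time.

```python
def generation_seconds(output_tokens: int, tok_per_s: float) -> float:
    """Time spent generating output, ignoring prompt processing."""
    return output_tokens / tok_per_s


# Assumed output size for one Worker turn (illustrative only).
OUTPUT_TOKENS = 1500

for rate in (10, 30, 60):
    seconds = generation_seconds(OUTPUT_TOKENS, rate)
    print(f"{rate} tok/s -> {seconds:.0f} s per turn")
```

Under that assumption, 10 tok/s means two and a half minutes per turn, while 30 to 60 tok/s brings it under a minute, and that wait is paid again on every retask.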
It isn't autonomous. It isn't benchmarked against alternatives. It isn't a framework you can pip install. It's how I work, written down. The framing might transfer; the specific model choices won't survive contact with the next six months.
The thing I'd most want someone else to take from it is the Auditor framing. The role boxes and arrows are scaffolding. The insight that the Auditor is doing context management — that's the part worth stealing.
Build workflow diagram: build-workflow.drawio