This is a multi-agent code review system built on Claude Code and claude -p (the non-interactive CLI). It reviews large PRs by breaking them into groups, examining them in dependency order, and producing a verified findings report.
The system uses three types of Claude sessions, each with a custom system prompt and different tool access:
- Tool-free reviewer agents (claude -p --tools "") — no file access, no code execution. They receive everything they need as text input: diffs, focus instructions, known facts, PR description. They can only read and reason. This constraint is intentional — it prevents reviewers from wandering through the codebase and forces all context to be curated upfront. Their system prompt defines what to flag (logic errors, leaks, races, broken API contracts), what to skip (style, missing tests), and critically — what to do when uncertain (emit a QUESTION finding with a specific verifiable question, rather than building on assumptions).
- The executor — a Claude Code session with a custom system prompt and restricted tools (Read, Write, Edit, Bash, Grep, Glob). It orchestrates the review: assembles reviewer inputs, dispatches agents, verifies findings against the codebase between turns, and manages the multi-turn flow. Its system prompt emphasizes stopping after each step to report results, not loading diffs into its own context, and monitoring its own context growth.
- The designer/coordinator — a Claude Code session that plans the review (grouping, exclusions, known-facts pre-verification), audits the plan's consistency, then stays active as the coordinator and auditor.
A synthesis agent (also tool-free, with its own system prompt) merges findings into the final report. Its prompt emphasizes grouping by narrative rather than severity, preserving evidence, and not making merge recommendations.
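As a rough sketch of how the executor might assemble a reviewer's curated, text-only input: the dataclass and section headings below are hypothetical conventions, and only the -p and --tools "" flags come from the description above (any system-prompt or model flags are deliberately omitted).

```python
from dataclasses import dataclass, field

@dataclass
class ReviewerInput:
    """Everything a tool-free reviewer receives, as plain text."""
    diffs: str
    focus: str
    known_facts: list[str] = field(default_factory=list)
    pr_description: str = ""

    def render(self) -> str:
        # Concatenate the curated context into a single prompt body;
        # the reviewer has no tools, so this text is all it ever sees.
        facts = "\n".join(f"- {f}" for f in self.known_facts)
        return (
            f"## Focus\n{self.focus}\n\n"
            f"## Known facts (pre-verified)\n{facts}\n\n"
            f"## PR description\n{self.pr_description}\n\n"
            f"## Diffs\n{self.diffs}\n"
        )

def reviewer_argv() -> list[str]:
    # Tool-free invocation: an empty --tools list means no file access
    # and no code execution. Other flags (system prompt, model) vary by
    # deployment and are left out here.
    return ["claude", "-p", "--tools", ""]
```

The curated payload would be piped to the command on stdin, one invocation per reviewer turn.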
The review runs in two concurrent Claude Code sessions: a design session and an execution session.
The design session takes a branch and produces a self-contained review plan. It maps the branch with git diff --numstat, classifies files by subsystem, groups them by review concern, identifies exclusions (binaries, already-reviewed code, transferred code), and pre-verifies technical facts that reviewers would otherwise waste turns asking about (e.g., "does this API clean up on overwrite?"). The output is a review directory with filtered diffs, context files, known facts, and a startup document specifying exactly how the review should proceed. This session then stays active as the coordinator and auditor.
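A minimal sketch of the mapping step: parsing git diff --numstat output (tab-separated added/deleted counts and path, with "-" for binary files) and classifying paths by subsystem. The subsystem prefix map here is invented for illustration; a real plan would derive it from the repository layout.

```python
def parse_numstat(numstat: str) -> list[tuple[int, int, str]]:
    """Parse `git diff --numstat` output into (added, deleted, path) rows."""
    rows = []
    for line in numstat.strip().splitlines():
        added, deleted, path = line.split("\t")
        # Binary files report "-" for both counts; mark them with -1
        # so the planner can exclude them from review.
        rows.append((
            -1 if added == "-" else int(added),
            -1 if deleted == "-" else int(deleted),
            path,
        ))
    return rows

def classify(path: str) -> str:
    """Map a file path to a subsystem (hypothetical prefixes)."""
    prefixes = {"src/gpu/": "gpu", "src/render/": "render", "assets/": "assets"}
    for prefix, subsystem in prefixes.items():
        if path.startswith(prefix):
            return subsystem
    return "other"
```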
The execution session reads the startup document and runs the plan. It dispatches the tool-free reviewer agents, manages the multi-turn conversation flow, and verifies findings against the codebase between turns. Each reviewer is a multi-turn claude -p session that receives the relevant diffs, focus instructions, known facts, and context. The reviewer examines the code, reports findings, and flags uncertainties as explicit QUESTIONs — rather than building analysis on unverified assumptions. Between turns, the executor verifies those questions by reading the actual codebase, then feeds answers back into the reviewer's next turn.
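The question/answer loop between executor and reviewer could look like the sketch below, assuming a convention (which the reviewer system prompt would have to enforce) that each question appears on its own line prefixed QUESTION:.

```python
import re

def extract_questions(report: str) -> list[str]:
    """Pull explicit QUESTION findings out of a reviewer's turn output."""
    return [m.group(1).strip()
            for m in re.finditer(r"^QUESTION:\s*(.+)$", report, re.M)]

def answers_block(answers: dict[str, str]) -> str:
    """Format executor-verified answers for the reviewer's next turn."""
    lines = ["## Answers to your questions"]
    for question, answer in answers.items():
        lines.append(f"- Q: {question}\n  A: {answer}")
    return "\n".join(lines)
```

The executor would resolve each extracted question by reading the codebase, then prepend the answers block to the next turn's input.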
The system dispatches one or more sequential reviewers depending on natural splitting points in the dependency graph. For PRs that span foundation and consumer code, this typically means:
- Reviewer 1 examines foundations (e.g., the GPU abstraction layer, import pipeline, entity system). It produces findings plus an API changes summary describing what changed at each interface.
- Reviewer 2 examines consumers (e.g., the rendering pipeline, shaders, application integration). It receives Reviewer 1's findings and API summary as input, so it can verify that consumer code correctly uses the foundation APIs.
Groups that need to be seen together belong in the same reviewer. Groups that can be reasonably split go to separate reviewers — the system isn't limited to two.
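The split into sequential reviewers can be sketched as a topological layering of the group dependency graph: groups with no ordering between them share a reviewer, and each later reviewer depends only on groups already reviewed. The group names in the test are illustrative.

```python
def reviewer_sequence(groups: dict[str, set[str]]) -> list[list[str]]:
    """Order review groups into sequential reviewers.

    groups maps each group name to the names of groups it depends on.
    Each returned layer is one reviewer's set of groups.
    """
    remaining = dict(groups)
    layers: list[list[str]] = []
    while remaining:
        done = {g for layer in layers for g in layer}
        # A layer is every group whose dependencies are already reviewed.
        layer = sorted(g for g, deps in remaining.items() if deps <= done)
        if not layer:
            raise ValueError("dependency cycle among review groups")
        layers.append(layer)
        for g in layer:
            del remaining[g]
    return layers
```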
Each reviewer goes through examination turns (one group per turn), then closing turns:
- Reflection — reconsider all findings with the full picture, catch cross-group patterns
- Summary — complete findings report
- Post-mortem — what was hard to assess, what context was missing, what would help next time
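The per-reviewer turn plan above reduces to a simple schedule: one examination turn per group, then the three fixed closing turns, as in this minimal sketch.

```python
def turn_schedule(groups: list[str]) -> list[str]:
    """One examination turn per group, then the fixed closing turns."""
    return [f"examine:{g}" for g in groups] + ["reflection", "summary", "post-mortem"]
```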
Findings are verified at two levels:
- Between turns — the executor resolves QUESTION findings by reading code, running tests, or checking the PR description. Answers feed back into the reviewer's next turn.
- Before synthesis — after all reviewers complete, the executor independently verifies all substantive findings against the codebase. Confirmed, dismissed, or deferred (to the user) — each finding is annotated before going to synthesis.
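The pre-synthesis annotation step could be modeled like this, with each finding carrying one of the three verdicts named above before it reaches the synthesis agent (the Finding fields are assumptions, not a documented schema).

```python
from dataclasses import dataclass

VERDICTS = {"confirmed", "dismissed", "deferred"}

@dataclass
class Finding:
    reviewer: str
    severity: str
    description: str
    location: str            # e.g. "src/gpu/texture.rs:88"
    verdict: str = "unverified"

def annotate(finding: Finding, verdict: str, note: str = "") -> Finding:
    """Record the executor's verification verdict on a finding."""
    if verdict not in VERDICTS:
        raise ValueError(f"unknown verdict: {verdict}")
    finding.verdict = verdict
    if note:
        finding.description += f" [verifier: {note}]"
    return finding
```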
The synthesis agent merges the verified findings from all reviewers into a final report. It groups findings by narrative (what's wrong and why), preserves file:line references as clickable GitHub links, identifies cross-cutting patterns, and lists verified-clean areas.
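Turning a file:line reference into a clickable GitHub link follows GitHub's standard blob-URL form; the repository and commit in the test are placeholders.

```python
def github_link(repo: str, sha: str, location: str) -> str:
    """Convert a "path:line" reference into a GitHub blob URL.

    Pinning to a commit SHA keeps the link valid as the branch moves.
    """
    path, line = location.rsplit(":", 1)
    return f"https://github.com/{repo}/blob/{sha}/{path}#L{line}"
```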
After the report is delivered, both the executor and coordinator sessions reflect on the process — what went well, what went wrong, cache behavior, cost observations, and suggestions for the next review. These post-mortems, combined with the reviewer post-mortems, feed into future review designs as accumulated knowledge about context gaps and process improvements.
The final report contains:
- Summary table of all findings with severity, type, and linked source locations
- Grouped findings with detailed descriptions, evidence, and suggested fixes
- Cross-cutting patterns — issues that appear across subsystems
- Clean areas — everything that was verified correct (often the most valuable section for the PR author)
- Limitations — what couldn't be verified and why, synthesized from reviewer post-mortems