Date: 2026-03-24
Source: Anthropic Engineering Blog (Prithvi Rajasekaran, 2026-03-24)
Supporting articles: Effective Harnesses, Context Engineering
Purpose: Extract actionable patterns from Anthropic's harness research and map them against the vault workflow suite to identify gaps and improvements.
The central breakthrough: separate the agent doing the work from the agent judging it. This addresses two failure modes:
- Context degradation: Models lose coherence as context fills. Some models exhibit "context anxiety" and prematurely wrap up work near perceived context limits.
- Self-evaluation bias: Agents confidently praise their own mediocre output. Tuning a standalone evaluator is far more tractable than making a generator critical of its own work.
The pattern scales from two agents (frontend design) to three agents (full-stack apps):
Planner → Generator → Evaluator

- Planner
  - Converts 1-4 sentence prompts into full specs
  - Prompted to be ambitious about scope
  - Focuses on product context, NOT granular implementation
- Generator
  - Works one feature at a time
  - Self-evaluates before QA handoff
  - Has git access for version control
- Evaluator
  - Uses Playwright MCP to interact like a user
  - Grades against explicit criteria with hard thresholds
  - Failure on ANY criterion fails the entire sprint
Before coding begins, generator and evaluator negotiate "done" criteria for each work chunk. The generator proposes what it will build and how success is verified. The evaluator reviews and iterates until agreement. This keeps work faithful to the spec without over-specifying implementation.
Agents communicate via files. One agent writes; another reads and responds within that file or creates new files. No shared memory, no message passing. Files are the interface.
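To make the division of labor and the file-based interface concrete, here is a minimal orchestration sketch. The `invoke()` helper, the prompts, the file names, and the retry loop are all illustrative assumptions, not details from the article:

```python
from pathlib import Path

def invoke(role: str, prompt: str, workdir: Path) -> None:
    """Hypothetical stand-in for one agent session (e.g. an agent-SDK call with
    file-system and browser tools); not an API from the article."""
    raise NotImplementedError

def run_sprint(workdir: Path, max_attempts: int = 3) -> bool:
    for _ in range(max_attempts):
        # Generator works one feature at a time and leaves its notes on disk.
        invoke("generator",
               "Implement the next feature from spec.md. Record what you built "
               "and how to verify it in notes/impl.md.", workdir)
        # Evaluator only sees the files and the running app, never the generator's
        # reasoning -- separating the two is what avoids self-evaluation bias.
        invoke("evaluator",
               "Read notes/impl.md, exercise the app like a user, and write "
               "PASS or FAIL for each criterion to notes/eval.md.", workdir)
        verdict = (workdir / "notes" / "eval.md").read_text()
        if "FAIL" not in verdict:  # failure on ANY criterion fails the attempt
            return True
    return False
```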
"Every component in a harness encodes an assumption about what the model can't do on its own, and those assumptions are worth stress testing, both because they may be incorrect, and because they can quickly go stale as models improve."
Practical approach: remove one component at a time, review impact. Radical cuts degrade performance. Methodical pruning reveals which pieces are load-bearing.
The evaluator is not a fixed yes-or-no decision. For tasks within the model's natural capability boundary, evaluator overhead is unnecessary. For tasks at the capability edge, evaluator provides real value. This is dynamic: model improvements shift the boundary.
| Era | Strategy | Why |
|---|---|---|
| Pre-Opus 4.5 | Context resets (clear window, start fresh with handoffs) | Sonnet exhibited "context anxiety," prematurely wrapping up |
| Opus 4.5 | Continuous sessions with SDK compaction | Eliminated context anxiety, reduced orchestration overhead |
| Opus 4.6 | Simplified harness (no sprint decomposition needed) | Model plans more carefully, sustains tasks longer |
"Out of the box, Claude is a poor QA agent."
The evaluator initially identifies real issues, then talks itself into deciding they're not a big deal. Solution: iterative prompt refinement driven by the divergence between evaluator judgment and human standards. It took several rounds before grading was reasonable, and even then, subtle bugs and unintuitive interactions slipped through.
| Approach | Duration | Cost | Quality |
|---|---|---|---|
| Solo agent | 20 min | $9 | Broken core feature, rigid layout, wasted space |
| Full harness | 6 hours | $200 | 16-feature spec across 10 sprints, functional physics, integrated AI |
| Simplified (Opus 4.6) DAW | 3h 50m | $124.70 | Working arrangement view, mixer, transport, agent-driven composition |
Long-running agents are like engineers working in shifts where each new engineer has no memory. Two failure modes:
- Over-ambition: agent tries to one-shot everything, runs out of context mid-implementation
- Premature completion: later agent sees partial progress and declares victory
Solution architecture:
- Initializer agent (first session): creates init.sh, progress file, feature list (200+ items in JSON), initial commit
- Coding agent (all subsequent): reads progress, picks one feature, implements, verifies, commits, updates progress
Key detail: feature list is JSON (not Markdown) because models are less likely to inappropriately modify JSON.
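For illustration, a feature-list entry might look like the following; the article does not publish its schema, so every field name here is a guess:

```json
{
  "features": [
    {
      "id": 17,
      "category": "mixer",
      "description": "Per-track volume fader with dB readout",
      "verify": "Drag the fader on track 2; the level meter and dB label update together",
      "status": "pending"
    }
  ]
}
```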
Code must always be "appropriate for merging to a main branch." No major bugs, well-ordered, well-documented. Each session leaves the codebase in a state where new work can begin without cleanup.
- Context rot: accuracy decreases as token count increases (n² pairwise relationships in transformers)
- Just-in-time retrieval: maintain lightweight identifiers (file paths, queries), dynamically retrieve via tools
- Compaction: summarize conversations nearing limits, preserve architectural decisions, discard redundant tool outputs
- Structured note-taking: agents write persistent notes outside context window, retrieve later
- Sub-agent architecture: focused agents explore extensively (tens of thousands of tokens) but return condensed summaries (1-2k tokens)
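A sketch of the sub-agent pattern, assuming a hypothetical `ask_model()` wrapper around a single model call; the token budgets echo the article's rough numbers, everything else is illustrative:

```python
def ask_model(prompt: str, max_output_tokens: int) -> str:
    """Hypothetical wrapper around one model call; illustrative only."""
    raise NotImplementedError

def explore_codebase(question: str, files: list[str]) -> str:
    # The sub-agent can spend tens of thousands of tokens reading files and
    # reasoning inside its own context window...
    findings = ask_model(
        "Question: " + question + "\nRead these files and take detailed notes:\n"
        + "\n".join(files),
        max_output_tokens=50_000,
    )
    # ...but only a condensed summary (~1-2k tokens) flows back to the caller,
    # so the orchestrator's context stays small.
    return ask_model(
        "Condense these notes to at most 2000 tokens, keeping only what answers "
        "the question:\n" + findings,
        max_output_tokens=2_000,
    )
```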
| Anthropic Concept | Vault Equivalent | Gap? |
|---|---|---|
| Planner Agent | `/write-a-prd` + `/prd-to-tasks` | No. Strong match. PRD interview + task decomposition mirrors the planner's spec generation. |
| Generator Agent | `/do-work` | Partial. RALPH model (one task per invocation) is more conservative than the continuous multi-sprint generator. |
| Evaluator Agent | Missing | Yes. No dedicated QA/evaluation step. `/grill-me` is pre-implementation stress testing, not post-implementation QA. |
| Sprint Contracts | Missing | Yes. No pre-work negotiation of "done" criteria between generator and evaluator. |
| File-Based Communication | `progress.md`, `tasks.md`, `prd.md` | Strong match. The vault's file-first approach is already this pattern. |
| Feature List (JSON) | `tasks.md` (Markdown) | Partial. Markdown is more editable but also more susceptible to inappropriate modification. |
| Context Resets | RALPH model | Intentional match. Fresh context per `/do-work` invocation. |
| Compaction | Claude SDK auto-compaction | Implicit. Not explicitly managed by vault skills. |
| Structured Note-Taking | `progress.md` | Strong match. "Next" field in progress entries is specifically designed for agent handoff. |
| Assumption Stress-Testing | Not practiced | Gap. No periodic review of whether skill complexity is still justified by model limitations. |
| Sozluk (shared language) | Vault-unique | Advantage. No equivalent in Anthropic's harness. Shared terminology reduces ambiguity. |
| Matrix-aligned tagging | Vault-unique | Advantage. Career-aware metadata not present in Anthropic's approach. |
The vault pipeline currently flows:
/grill-me → /write-an-rfc → /write-a-prd → /prd-to-tasks → /do-work (repeated per task) → (nothing)
There is no agent that:
- Interacts with the built artifact as a user would
- Grades implementation against acceptance criteria from tasks.md
- Identifies gaps between spec and reality
- Sends failed implementations back for revision
The /do-work skill self-evaluates (runs lint, typecheck, tests), but this is exactly the self-evaluation bias the article warns about: the generator marks its own work as done.
The vault uses RALPH (one task, one invocation, fresh context). Anthropic's finding: with Opus 4.5+, continuous sessions outperform context resets. Their generator "ran coherently for over two hours" without sprint decomposition.
This suggests a tension: RALPH was designed for an era of higher context degradation risk. With Opus 4.6's improvements ("plans more carefully, sustains agentic tasks for longer, can operate more reliably in larger codebases"), the one-task-per-invocation constraint may be overly conservative for some task types.
However, RALPH also serves an AUDHD-friendly purpose: it produces tangible, visible progress (one commit per invocation) and prevents runaway agents. The tradeoff is not purely technical.
Sozluk system: Anthropic's harness has no shared vocabulary mechanism. When their planner generates a spec, the generator interprets terminology independently. The vault's terms.md and conventions.md create a shared language layer that reduces ambiguity across all agent invocations. This is a genuine innovation.
Career-aware metadata: The matrix-aligned tagging system (#results/impact, #craft/architecture, etc.) is a meta-layer that doesn't exist in any published harness design. It converts development work into structured evidence for performance reviews. This is outside Anthropic's scope but uniquely valuable for individual contributors.
Backfill chains: The mandatory recursive backfill (quarterly → monthly → weekly → daily → session) ensures complete reporting hierarchies. Anthropic's progress.txt is flat; the vault's tiered rollup is more sophisticated.
Project discovery: The sozluk's repo-resolution mechanism (realpath matching, git remote fallback) enables cross-project skills without explicit configuration. Anthropic's harness is single-project.
What: A /qa skill that runs after /do-work completes a task. Domain-agnostic: one skill, multiple verification backends.
How it would work:
- Read the task's acceptance criteria from `tasks.md`
- Read the implementation diff (git diff of the task's commit)
- Detect the verification domain from the task description and changed files, then use the appropriate backend (see the sketch after this list):
- Mobile (iOS/Android): iOS simulator MCP. Tap through UI, inspect a11y tree, verify navigation and visual state.
- Web: Playwright MCP. Click through pages, check DOM, verify interactions.
- Backend/CLI: Run commands, hit API endpoints, verify output and database state.
- Pure code (refactors, type changes, config): Review diff against acceptance criteria, run tests, check types via LSP. No interactive verification needed.
- Grade the implementation against each acceptance criterion with hard pass/fail
- Output a report. Human decides next action (no auto-revert, no auto-fix in v1)
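A sketch of how the domain-detection step could work; the file-extension heuristics and domain names are illustrative, and the real logic would live in the skill prompt:

```python
from pathlib import Path

def detect_domain(changed_files: list[str], task_text: str) -> str:
    """Pick a verification backend from the changed files and the task description."""
    suffixes = {Path(f).suffix for f in changed_files}
    text = task_text.lower()
    if suffixes & {".swift", ".kt"} or "ios" in text or "android" in text:
        return "mobile"     # iOS simulator MCP: tap through UI, inspect the a11y tree
    if suffixes & {".tsx", ".jsx", ".vue", ".html", ".css"}:
        return "web"        # Playwright MCP: click through pages, check the DOM
    if any(f.startswith(("api/", "server/", "cli/")) for f in changed_files):
        return "backend"    # run commands, hit endpoints, check output and DB state
    return "pure-code"      # diff review + tests + LSP; no interactive verification

print(detect_domain(["src/ui/Mixer.tsx", "src/ui/mixer.css"], "Add mixer panel"))
# → "web"
```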
Key design decision: the evaluator must be a separate agent invocation, not a step within /do-work. This prevents the self-evaluation bias. The evaluator has no sunk-cost attachment to the implementation.
Calibration challenge: Per the article, Claude is a poor QA agent out of the box. The skill prompt will need iterative refinement. Start with strict criteria and relax over time, not the reverse. Plan for 3-4 iterations on real completed tasks before grades are reliable.
When to use: Not every task needs QA. Config changes, migrations, and trivial fixes don't benefit. UI features, complex logic, and anything with user-facing behavior do. The skill should accept an optional --skip-qa flag, but default to running.
What: Before /do-work begins implementation, a lightweight negotiation step where the agent proposes what "done" looks like and the human (or evaluator) confirms.
How to implement:
- Add a `## Contract` section to each task in `tasks.md` (optional, for complex tasks)
- `/do-work` reads the contract. If none exists, it drafts one and pauses for confirmation before proceeding
- The contract specifies: what files will change, what the user should see, how to verify
This is lighter than full sprint contracts (no separate evaluator agent at this stage) but captures the core value: explicit agreement on "done" before work begins.
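For example, a `## Contract` section under a task in tasks.md might read as follows (contents hypothetical):

```markdown
## Contract
- **Files that will change**: src/api/export.ts, src/ui/ExportButton.tsx
- **What the user should see**: an "Export MIDI" button in the transport bar that
  downloads a .mid file for the current arrangement
- **How to verify**: click the button on a project with two tracks; the downloaded
  file opens in a DAW with both tracks present
```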
What: An optional /do-sprint skill that picks 2-3 related tasks and implements them in a single session, relying on Opus 4.6's improved sustained task execution.
When: Tasks that are tightly coupled (e.g., "add API endpoint" + "add UI for endpoint" + "add tests for endpoint") benefit from shared context. RALPH forces context rebuilding between each, losing implementation-specific knowledge.
Safeguards:
- Cap at 3 tasks per sprint
- Commit after each task (not one mega-commit)
- Update progress.md after each task (not batched)
- If any task fails typecheck/lint, stop the sprint
Keep RALPH as default. Sprint mode is opt-in for power users who understand the tradeoff.
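As a sketch, these safeguards translate into a loop roughly like the following; `implement` and `update_progress` stand in for agent invocations that aren't specified here, and `make check` is a placeholder for the project's typecheck/lint command:

```python
import subprocess

def checks_pass() -> bool:
    """Stand-in for the project's typecheck + lint commands ('make check' is a placeholder)."""
    return subprocess.run(["make", "check"]).returncode == 0

def do_sprint(tasks: list[str], implement, update_progress) -> None:
    for task in tasks[:3]:                                    # cap at 3 tasks per sprint
        implement(task)                                       # one agent pass for this task
        if not checks_pass():                                 # any typecheck/lint failure stops the sprint
            print(f"Sprint stopped at: {task}")
            return
        subprocess.run(["git", "commit", "-am", f"task: {task}"])  # commit after each task
        update_progress(task)                                 # progress.md updated per task, not batched
```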
Why: Anthropic specifically chose JSON for feature lists because "models are less likely to inappropriately modify" structured data formats. The vault uses Markdown with checkbox syntax (- [ ]), which is human-friendly but agent-editable in unintended ways.
Counter-argument: The vault tasks are Obsidian-native. JSON breaks the Obsidian experience. Markdown checkboxes are the right UX for a human-in-the-loop workflow.
Compromise: Keep Markdown for human readability. Add a frontmatter schema_version field. Have /do-work validate task structure before operating (e.g., confirm checkboxes haven't been silently removed).
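A sketch of the pre-flight validation `/do-work` could run, assuming hypothetical `schema_version` and `task_count` frontmatter fields:

```python
import re
from pathlib import Path

CHECKBOX = re.compile(r"^- \[[ xX]\] ", re.MULTILINE)

def validate_tasks(path: Path, expected_schema: str = "1") -> None:
    text = path.read_text()
    # Frontmatter must declare the schema version this skill knows how to edit.
    m = re.search(r"^schema_version:\s*(\S+)", text, re.MULTILINE)
    if not m or m.group(1) != expected_schema:
        raise ValueError("tasks.md schema_version missing or unsupported")
    # Checkboxes may be ticked, but never silently removed: compare against the
    # count recorded in frontmatter (hypothetical task_count field).
    n = re.search(r"^task_count:\s*(\d+)", text, re.MULTILINE)
    if n and len(CHECKBOX.findall(text)) < int(n.group(1)):
        raise ValueError("tasks.md has fewer checkboxes than task_count declares")
```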
What: Apply the article's stress-testing principle to the vault skills themselves. Quarterly, review each skill and ask: "Is this complexity still justified, or has the model improved enough to simplify?"
Concrete candidates for Opus 4.6 era:
- Sozluk auto-loading: Currently every skill has copy-pasted sozluk-reading instructions. Could this be a single shared preamble or rule file instead?
- Backfill chains: If Opus 4.6 can generate higher-quality reports from raw data, do we need the mandatory session → daily → weekly chain? Or can we skip levels?
- Progress.md structure: The structured frontmatter (commit hash, key files, decisions, gotchas, "Next") was designed for less capable models. Can Opus 4.6 extract this context from git log alone?
- TDD gating: `/do-work` checks if a task is TDD-annotated. With better models, should TDD be the default rather than an annotation?
How to practice: Create a /harness-audit skill or just a quarterly checklist in the vault. For each skill component, temporarily disable it and evaluate output quality. One component at a time.
The vault already uses file-based communication (progress.md, tasks.md). Opportunities to strengthen:
Structured handoff files: Instead of free-text "Next" fields in progress.md, use a machine-readable format:
## Handoff
- **Status**: complete | partial | blocked
- **Blocked by**: (optional: issue, dependency, question)
- **Next action**: (one concrete sentence)
- **Context files**: (list of files the next agent should read first)
- **Warnings**: (anything surprising discovered during implementation)

Cross-feature communication: Currently features are isolated silos. If task 3 of feature A discovers something relevant to feature B, there's no mechanism to propagate that. A shared `vault/signals/` directory for cross-cutting observations could address this.
The article emphasizes that QA calibration requires multiple rounds. Build this into the workflow:
- After each `/qa` run, the human reviews whether they agree with the evaluator's judgment
- Disagreements (evaluator passed something the human would fail, or vice versa) are recorded
- Periodically, these disagreements are used to refine the evaluator prompt
- Track calibration accuracy over time (simple agree/disagree ratio)
This is the same loop Anthropic used ("several rounds of this development loop before the evaluator was grading in a way that I found reasonable") but made systematic.
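Tracking could be as simple as appending one JSON line per `/qa` review and computing the agree/disagree ratio; the file name and fields below are made up:

```python
import json
from pathlib import Path

LOG = Path("vault/qa-calibration.jsonl")   # hypothetical location

def record(task_id: str, evaluator_verdict: str, human_verdict: str) -> None:
    entry = {"task": task_id, "evaluator": evaluator_verdict, "human": human_verdict}
    with LOG.open("a") as f:
        f.write(json.dumps(entry) + "\n")

def agreement_ratio() -> float:
    rows = [json.loads(line) for line in LOG.read_text().splitlines() if line]
    agree = sum(r["evaluator"] == r["human"] for r in rows)
    return agree / len(rows) if rows else 0.0
```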
This principle from Anthropic's "Building Effective Agents" post applies directly. The vault skill suite has grown to 20+ skills. Each was justified at creation, but the aggregate complexity creates its own costs: maintenance burden, interaction ambiguity (which skill handles what?), and context overhead (skill descriptions consume tokens).
Action: periodically evaluate whether skills can be merged, simplified, or retired.
Every vault skill was written for a specific model capability level. As models improve, some skills become scaffolding around problems the model can now solve natively. The RALPH model, for instance, assumed context degradation. Sprint contracts assumed the model couldn't self-scope. The sozluk assumed the model couldn't maintain consistent terminology.
Action: for each skill, document what model limitation it compensates for. When that limitation is resolved, simplify.
The expanding possibility space means the vault pipeline can attempt more ambitious features over time. Features that would have required 20+ tasks with careful decomposition might be achievable in 5 tasks with a more capable model. The task decomposition granularity should scale with model capability.
The single strongest takeaway: never trust an agent's evaluation of its own work for subjective or complex tasks. This principle should be embedded deeply into the vault's design philosophy. Wherever a skill currently self-evaluates (marks its own output as complete, grades its own quality), consider whether a separate evaluation step would improve reliability.
Ordered by impact-to-effort ratio:
1. Create `/qa` skill (high impact, medium effort). The biggest gap. Domain-agnostic with multiple verification backends: iOS simulator MCP for mobile, Playwright MCP for web, command execution for backend/CLI, diff review + LSP + tests for pure code. Detects domain from task description and changed files. v1 scope: read criteria, verify via appropriate backend, grade pass/fail, output report. No auto-revert or auto-fix.
2. Add structured handoff format to progress.md (high impact, low effort). Replace the free-text "Next" field with machine-readable fields. Update `/do-work` to write this format.
3. Create `/harness-audit` skill (medium impact, medium effort). Quarterly audit skill that walks each SKILL.md, checks for `## Model Assumptions` sections, evaluates whether assumptions are obsolete given current model capabilities, identifies copy-pasted instructions that could be consolidated into shared rules, checks vault sessions for skill usage frequency, and produces a report with: skills to simplify, assumptions to retire, consolidation opportunities, and skills to potentially sunset. Report ends with ONE concrete next action (AUDHD-friendly). Saves to `vault/checklists/YYYY-MM-DD-harness-audit.md`.
4. Prototype `/do-sprint` (medium impact, medium effort). Try 2-3 task sessions on a real feature. Measure quality vs RALPH. Keep RALPH as default.
5. Consolidate sozluk loading (medium impact, medium effort). Move the copy-pasted sozluk instructions to a shared rules file. Reduces skill maintenance burden.
6. Cross-feature signals directory (low impact, low effort). Create `vault/signals/`. See if it gets used organically.
7. Evaluator calibration tracking (medium impact, high effort). Requires a feedback loop that doesn't exist yet. Defer until the `/qa` skill proves useful.
- Should the evaluator agent use a different model than the generator? The article doesn't discuss this, but model diversity might reduce shared blind spots.
- How does the QA skill interact with the AUDHD workflow? If `/do-work` finishes and the user runs `/qa` and it fails, the user now has a revert-or-fix decision. Does this create decision paralysis? Should `/qa` auto-fix small issues?
- Resolved: `/qa` uses multiple verification backends (simulator, Playwright, CLI, diff+LSP), selected by domain detection.
- Is the vault's two-tier reporting (local vault + Notion) still justified, or should one tier be retired? The article's simplification principle suggests maintaining both is a smell unless they serve genuinely different audiences.
- Sprint contracts add a human confirmation step. In the AUDHD context, is this helpful (forces clarity) or harmful (adds friction that prevents starting)?
- The article notes that improvement across evaluator iterations was "not always linear" and humans sometimes preferred middle iterations. For the vault, this means the QA skill shouldn't assume monotonic improvement. How do we handle the case where revision N is worse than revision N-1?