@usirin
Last active March 25, 2026 05:29
Harness Design for Long-Running Apps: Research & Vault Workflow Analysis

Date: 2026-03-24
Source: Anthropic Engineering Blog (Prithvi Rajasekaran, 2026-03-24)
Supporting articles: Effective Harnesses, Context Engineering
Purpose: Extract actionable patterns from Anthropic's harness research and map them against the vault workflow suite to identify gaps and improvements.


Part 1: Core Concepts from the Article Series

1.1 The GAN-Inspired Generator/Evaluator Pattern

The central breakthrough: separate the agent doing the work from the agent judging it. This addresses two failure modes:

  1. Context degradation: Models lose coherence as context fills. Some models exhibit "context anxiety" and prematurely wrap up work near perceived context limits.
  2. Self-evaluation bias: Agents confidently praise their own mediocre output. Tuning a standalone evaluator is far more tractable than making a generator critical of its own work.

The pattern scales from two agents (frontend design) to three agents (full-stack apps):

Planner  →  Generator  →  Evaluator
   │            │              │
   │            │              ├── Uses Playwright MCP to interact like a user
   │            │              ├── Grades against explicit criteria with hard thresholds
   │            │              └── Failure on ANY criterion fails the entire sprint
   │            │
   │            ├── Works one feature at a time
   │            ├── Self-evaluates before QA handoff
   │            └── Has git access for version control
   │
   ├── Converts 1-4 sentence prompts into full specs
   ├── Prompted to be ambitious about scope
   └── Focuses on product context, NOT granular implementation

1.2 Sprint Contracts

Before coding begins, generator and evaluator negotiate "done" criteria for each work chunk. The generator proposes what it will build and how success is verified. The evaluator reviews and iterates until agreement. This keeps work faithful to the spec without over-specifying implementation.

1.3 File-Based Communication Protocol

Agents communicate via files. One agent writes; another reads and responds within that file or creates new files. No shared memory, no message passing. Files are the interface.
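The protocol above can be sketched in a few lines. This is an illustrative sketch, not Anthropic's actual convention: the directory layout, file names, and the "## Response" section marker are all assumptions.

```python
# Sketch of the file-as-interface idea. The messages directory and the
# "## Response" section convention are illustrative assumptions.
import tempfile
from pathlib import Path

def write_message(workdir: Path, name: str, body: str) -> Path:
    """Generator 'sends' by writing a file; that is the whole protocol."""
    path = workdir / f"{name}.md"
    path.write_text(body)
    return path

def respond_in_place(path: Path, response: str) -> None:
    """Evaluator appends to the same file: no shared memory, no message
    bus -- the file itself is the interface."""
    path.write_text(path.read_text() + "\n\n## Response\n" + response)

workdir = Path(tempfile.mkdtemp())
msg = write_message(workdir, "sprint-1-contract",
                    "## Proposal\nBuild the mixer panel this sprint.")
respond_in_place(msg, "Approved, with one addition: verify keyboard navigation.")
```

The key property: either agent can crash and restart, and the full conversation state survives on disk.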

1.4 The Assumption Stress-Testing Principle

"Every component in a harness encodes an assumption about what the model can't do on its own, and those assumptions are worth stress testing, both because they may be incorrect, and because they can quickly go stale as models improve."

Practical approach: remove one component at a time and review the impact. Radical cuts degrade performance; methodical pruning reveals which pieces are load-bearing.

1.5 Evaluator Necessity as a Spectrum

The evaluator is not a fixed yes-or-no decision. For tasks well within the model's natural capability, evaluator overhead is unnecessary; for tasks at the capability edge, the evaluator provides real value. The boundary is dynamic: model improvements shift it.

1.6 Context Management Evolution

| Era | Strategy | Why |
|---|---|---|
| Pre-Opus 4.5 | Context resets (clear window, start fresh with handoffs) | Sonnet exhibited "context anxiety," prematurely wrapping up |
| Opus 4.5 | Continuous sessions with SDK compaction | Eliminated context anxiety, reduced orchestration overhead |
| Opus 4.6 | Simplified harness (no sprint decomposition needed) | Model plans more carefully, sustains tasks longer |

1.7 QA Agent Calibration

"Out of the box, Claude is a poor QA agent."

The evaluator initially identifies real issues, then talks itself into deciding they're not a big deal. The solution was iterative prompt refinement based on divergence between the evaluator's judgment and human standards. It took several rounds before the grading was reasonable, and even then, subtle bugs and unintuitive interactions slipped through.

1.8 Quantified Results

| Approach | Duration | Cost | Quality |
|---|---|---|---|
| Solo agent | 20 min | $9 | Broken core feature, rigid layout, wasted space |
| Full harness | 6 hours | $200 | 16-feature spec across 10 sprints, functional physics, integrated AI |
| Simplified harness (Opus 4.6), DAW | 3h 50m | $124.70 | Working arrangement view, mixer, transport, agent-driven composition |

Part 2: Foundational Patterns from Supporting Articles

2.1 The Shift Worker Problem (Effective Harnesses)

Long-running agents are like engineers working in shifts where each new engineer has no memory. Two failure modes:

  • Over-ambition: agent tries to one-shot everything, runs out of context mid-implementation
  • Premature completion: later agent sees partial progress and declares victory

Solution architecture:

  • Initializer agent (first session): creates init.sh, progress file, feature list (200+ items in JSON), initial commit
  • Coding agent (all subsequent): reads progress, picks one feature, implements, verifies, commits, updates progress

Key detail: feature list is JSON (not Markdown) because models are less likely to inappropriately modify JSON.

2.2 Clean State Emphasis

Code must always be "appropriate for merging to a main branch." No major bugs, well-ordered, well-documented. Each session leaves the codebase in a state where new work can begin without cleanup.

2.3 Context Engineering Principles

  • Context rot: accuracy decreases as token count increases (n² pairwise relationships in transformers)
  • Just-in-time retrieval: maintain lightweight identifiers (file paths, queries), dynamically retrieve via tools
  • Compaction: summarize conversations nearing limits, preserve architectural decisions, discard redundant tool outputs
  • Structured note-taking: agents write persistent notes outside context window, retrieve later
  • Sub-agent architecture: focused agents explore extensively (tens of thousands of tokens) but return condensed summaries (1-2k tokens)

Part 3: Mapping to the Vault Workflow

3.1 Structural Correspondence

| Anthropic Concept | Vault Equivalent | Gap? |
|---|---|---|
| Planner Agent | /write-a-prd + /prd-to-tasks | No. Strong match. PRD interview + task decomposition mirrors the planner's spec generation. |
| Generator Agent | /do-work | Partial. The RALPH model (one task per invocation) is more conservative than a continuous multi-sprint generator. |
| Evaluator Agent | Missing | Yes. No dedicated QA/evaluation step. /grill-me is pre-implementation stress testing, not post-implementation QA. |
| Sprint Contracts | Missing | Yes. No pre-work negotiation of "done" criteria between generator and evaluator. |
| File-Based Communication | progress.md, tasks.md, prd.md | Strong match. The vault's file-first approach is already this pattern. |
| Feature List (JSON) | tasks.md (Markdown) | Partial. Markdown is more editable but also more susceptible to inappropriate modification. |
| Context Resets | RALPH model | Intentional match. Fresh context per /do-work invocation. |
| Compaction | Claude SDK auto-compaction | Implicit. Not explicitly managed by vault skills. |
| Structured Note-Taking | progress.md | Strong match. The "Next" field in progress entries is specifically designed for agent handoff. |
| Assumption Stress-Testing | Not practiced | Gap. No periodic review of whether skill complexity is still justified by model limitations. |
| Sozluk (shared language) | Vault-unique | Advantage. No equivalent in Anthropic's harness. Shared terminology reduces ambiguity. |
| Matrix-aligned tagging | Vault-unique | Advantage. Career-aware metadata not present in Anthropic's approach. |

3.2 The Critical Gap: No Post-Implementation Evaluator

The vault pipeline currently flows:

/grill-me → /write-an-rfc → /write-a-prd → /prd-to-tasks → /do-work → (nothing)
                                                                ↑
                                                          repeat per task

There is no agent that:

  • Interacts with the built artifact as a user would
  • Grades implementation against acceptance criteria from tasks.md
  • Identifies gaps between spec and reality
  • Sends failed implementations back for revision

The /do-work skill self-evaluates (runs lint, typecheck, tests) but this is exactly the self-evaluation bias the article warns about. The generator marks its own work as done.

3.3 The RALPH Model vs. Continuous Sessions

The vault uses RALPH (one task, one invocation, fresh context). Anthropic's finding: with Opus 4.5+, continuous sessions outperform context resets. Their generator "ran coherently for over two hours" without sprint decomposition.

This suggests a tension: RALPH was designed for an era of higher context degradation risk. With Opus 4.6's improvements ("plans more carefully, sustains agentic tasks for longer, can operate more reliably in larger codebases"), the one-task-per-invocation constraint may be overly conservative for some task types.

However, RALPH also serves an AUDHD-friendly purpose: it produces tangible, visible progress (one commit per invocation) and prevents runaway agents. The tradeoff is not purely technical.

3.4 Where the Vault is Ahead

Sozluk system: Anthropic's harness has no shared vocabulary mechanism. When their planner generates a spec, the generator interprets terminology independently. The vault's terms.md and conventions.md create a shared language layer that reduces ambiguity across all agent invocations. This is a genuine innovation.

Career-aware metadata: The matrix-aligned tagging system (#results/impact, #craft/architecture, etc.) is a meta-layer that doesn't exist in any published harness design. It converts development work into structured evidence for performance reviews. This is outside Anthropic's scope but uniquely valuable for individual contributors.

Backfill chains: The mandatory recursive backfill (quarterly → monthly → weekly → daily → session) ensures complete reporting hierarchies. Anthropic's progress.txt is flat; the vault's tiered rollup is more sophisticated.

Project discovery: The sozluk's repo-resolution mechanism (realpath matching, git remote fallback) enables cross-project skills without explicit configuration. Anthropic's harness is single-project.


Part 4: Concrete Improvement Opportunities

4.1 Add a QA/Evaluator Skill

What: A /qa skill that runs after /do-work completes a task. Domain-agnostic: one skill, multiple verification backends.

How it would work:

  1. Read the task's acceptance criteria from tasks.md
  2. Read the implementation diff (git diff of the task's commit)
  3. Detect verification domain from task description and changed files, then use the appropriate backend:
    • Mobile (iOS/Android): iOS simulator MCP. Tap through UI, inspect a11y tree, verify navigation and visual state.
    • Web: Playwright MCP. Click through pages, check DOM, verify interactions.
    • Backend/CLI: Run commands, hit API endpoints, verify output and database state.
    • Pure code (refactors, type changes, config): Review diff against acceptance criteria, run tests, check types via LSP. No interactive verification needed.
  4. Grade the implementation against each acceptance criterion with hard pass/fail
  5. Output a report. Human decides next action (no auto-revert, no auto-fix in v1)
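Step 3's domain detection can be sketched as a simple dispatch table. The file-extension heuristics and backend names here are assumptions, not a spec; a real skill would also weigh the task description.

```python
# Hedged sketch of domain detection for /qa: map changed files to a
# verification backend. Extensions and backend names are assumptions.
from pathlib import PurePosixPath

BACKENDS = {
    "mobile": "ios-simulator-mcp",   # tap through UI, inspect a11y tree
    "web": "playwright-mcp",         # click through pages, check DOM
    "backend": "command-exec",       # run commands, hit API endpoints
    "code": "diff-review",           # diff + tests + LSP, no interaction
}

def detect_domain(changed_files: list[str]) -> str:
    exts = {PurePosixPath(f).suffix for f in changed_files}
    if exts & {".swift", ".kt"}:
        return "mobile"
    if exts & {".tsx", ".jsx", ".css", ".html"}:
        return "web"
    if any(f.startswith(("api/", "server/", "cmd/")) for f in changed_files):
        return "backend"
    return "code"  # refactors, type changes, config
```

The fallthrough to `"code"` matters: when no interactive surface is detected, the cheapest backend (diff review plus tests) is the default rather than an error.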

Key design decision: the evaluator must be a separate agent invocation, not a step within /do-work. This prevents the self-evaluation bias. The evaluator has no sunk-cost attachment to the implementation.

Calibration challenge: Per the article, Claude is a poor QA agent out of the box. The skill prompt will need iterative refinement. Start with strict criteria and relax over time, not the reverse. Plan for 3-4 iterations on real completed tasks before grades are reliable.

When to use: Not every task needs QA. Config changes, migrations, and trivial fixes don't benefit. UI features, complex logic, and anything with user-facing behavior do. The skill should accept an optional --skip-qa flag, but default to running.

4.2 Introduce Sprint Contracts for Complex Tasks

What: Before /do-work begins implementation, a lightweight negotiation step where the agent proposes what "done" looks like and the human (or evaluator) confirms.

How to implement:

  • Add a ## Contract section to each task in tasks.md (optional, for complex tasks)
  • /do-work reads the contract. If none exists, it drafts one and pauses for confirmation before proceeding
  • The contract specifies: what files will change, what the user should see, how to verify

This is lighter than full sprint contracts (no separate evaluator agent at this stage) but captures the core value: explicit agreement on "done" before work begins.
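The draft-and-pause step could look something like this sketch. The `## Contract` field names echo the list above; everything else (function names, the pause mechanism) is hypothetical.

```python
# Illustrative sketch of the draft-and-pause contract step in /do-work.
# Field names mirror the proposed ## Contract section; the rest is assumed.
def draft_contract(outcome: str, files: list[str], verify: str) -> str:
    return "\n".join([
        "## Contract",
        f"- **Files that will change**: {', '.join(files)}",
        f"- **What the user should see**: {outcome}",
        f"- **How to verify**: {verify}",
    ])

def ensure_contract(task_md: str, draft: str) -> tuple[str, bool]:
    """Return (task text, needs_confirmation). If a contract already
    exists, proceed; otherwise append a draft and pause for the human."""
    if "## Contract" in task_md:
        return task_md, False
    return task_md + "\n\n" + draft, True

task = "- [ ] Add keyboard shortcuts to transport controls"
draft = draft_contract("Space toggles play/pause", ["src/transport.ts"],
                       "Press Space in the arrangement view; playback starts")
updated, pause = ensure_contract(task, draft)
```

Because the check is idempotent, re-running /do-work after the human confirms the drafted contract proceeds straight to implementation.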

4.3 Consider Relaxing RALPH for Multi-Task Sprints

What: An optional /do-sprint skill that picks 2-3 related tasks and implements them in a single session, relying on Opus 4.6's improved sustained task execution.

When: Tasks that are tightly coupled (e.g., "add API endpoint" + "add UI for endpoint" + "add tests for endpoint") benefit from shared context. RALPH forces context rebuilding between each, losing implementation-specific knowledge.

Safeguards:

  • Cap at 3 tasks per sprint
  • Commit after each task (not one mega-commit)
  • Update progress.md after each task (not batched)
  • If any task fails typecheck/lint, stop the sprint
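The safeguards above amount to a small control loop. This is a sketch under the stated constraints; `run_task`, `checks_pass`, `commit`, and `update_progress` are hypothetical stand-ins for the real skill steps.

```python
# Sketch of the /do-sprint safeguards as a control loop. The callables
# are hypothetical stand-ins for the real skill's steps.
MAX_SPRINT_TASKS = 3  # cap at 3 tasks per sprint

def run_sprint(tasks, run_task, checks_pass, commit, update_progress):
    done = []
    for task in tasks[:MAX_SPRINT_TASKS]:
        run_task(task)
        if not checks_pass():          # typecheck/lint failure
            break                      # stop the sprint immediately
        commit(task)                   # one commit per task, no mega-commit
        update_progress(task)          # progress.md updated per task
        done.append(task)
    return done

# Toy run: the second task fails checks, so the sprint stops after one.
log = []
results = iter([True, False, True])
done = run_sprint(
    ["api endpoint", "ui for endpoint", "tests"],
    run_task=lambda t: log.append(("run", t)),
    checks_pass=lambda: next(results),
    commit=lambda t: log.append(("commit", t)),
    update_progress=lambda t: log.append(("progress", t)),
)
```

The ordering is deliberate: nothing is committed or recorded for a task until its checks pass, so a failed sprint never leaves half-recorded progress behind.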

Keep RALPH as default. Sprint mode is opt-in for power users who understand the tradeoff.

4.4 Switch tasks.md to JSON (or Hybrid)

Why: Anthropic specifically chose JSON for feature lists because "models are less likely to inappropriately modify" structured data formats. The vault uses Markdown with checkbox syntax (- [ ]), which is human-friendly but agent-editable in unintended ways.

Counter-argument: The vault tasks are Obsidian-native. JSON breaks the Obsidian experience. Markdown checkboxes are the right UX for a human-in-the-loop workflow.

Compromise: Keep Markdown for human readability. Add a frontmatter schema_version field. Have /do-work validate task structure before operating (e.g., confirm checkboxes haven't been silently removed).
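The validation half of this compromise can be sketched directly: count checkboxes and refuse to proceed if any have silently vanished since the last known state. The invariant chosen here (count may grow, never shrink) is an assumption about what "inappropriate modification" looks like.

```python
# Sketch of the pre-flight check /do-work could run on tasks.md:
# checkboxes may be checked off, but must never silently disappear.
# The count-never-shrinks invariant is an assumption.
import re

CHECKBOX = re.compile(r"^\s*- \[( |x)\]", re.MULTILINE)

def checkbox_count(tasks_md: str) -> int:
    return len(CHECKBOX.findall(tasks_md))

def validate(tasks_md: str, last_known_count: int) -> bool:
    """True if no checkboxes were removed since the last run."""
    return checkbox_count(tasks_md) >= last_known_count

before = "- [ ] add endpoint\n- [ ] add UI\n- [x] add tests\n"
after_ok = before.replace("- [ ] add endpoint", "- [x] add endpoint")
after_bad = "- [x] add endpoint\n"  # two tasks vanished
```

This keeps the Obsidian-native checkbox UX while catching the specific failure mode JSON was meant to prevent.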

4.5 Periodic Harness Simplification Audits

What: Apply the article's stress-testing principle to the vault skills themselves. Quarterly, review each skill and ask: "Is this complexity still justified, or has the model improved enough to simplify?"

Concrete candidates for Opus 4.6 era:

  • Sozluk auto-loading: Currently every skill has copy-pasted sozluk-reading instructions. Could this be a single shared preamble or rule file instead?
  • Backfill chains: If Opus 4.6 can generate higher-quality reports from raw data, do we need the mandatory session → daily → weekly chain? Or can we skip levels?
  • Progress.md structure: The structured frontmatter (commit hash, key files, decisions, gotchas, "Next") was designed for less capable models. Can Opus 4.6 extract this context from git log alone?
  • TDD gating: /do-work checks if a task is TDD-annotated. With better models, should TDD be the default rather than an annotation?

How to practice: Create a /harness-audit skill or just a quarterly checklist in the vault. For each skill component, temporarily disable it and evaluate output quality. One component at a time.

4.6 File-Based Agent Communication Improvements

The vault already uses file-based communication (progress.md, tasks.md). Opportunities to strengthen:

Structured handoff files: Instead of free-text "Next" fields in progress.md, use a machine-readable format:

## Handoff
- **Status**: complete | partial | blocked
- **Blocked by**: (optional: issue, dependency, question)
- **Next action**: (one concrete sentence)
- **Context files**: (list of files the next agent should read first)
- **Warnings**: (anything surprising discovered during implementation)
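Because the fields are machine-readable, the next agent can branch on them instead of interpreting prose. A minimal parser sketch (the regex and key normalization are assumptions about the format above):

```python
# Sketch of parsing the structured Handoff block into a dict a skill
# can branch on. Field names mirror the format above; parsing is assumed.
import re

def parse_handoff(md: str) -> dict[str, str]:
    fields = {}
    for m in re.finditer(r"- \*\*(.+?)\*\*: (.*)", md):
        fields[m.group(1).lower().replace(" ", "_")] = m.group(2).strip()
    return fields

handoff = """## Handoff
- **Status**: blocked
- **Blocked by**: upstream schema migration not merged
- **Next action**: rebase onto main once the migration lands
"""
fields = parse_handoff(handoff)
```

A /do-work invocation could then refuse to start when `fields["status"] == "blocked"`, rather than rediscovering the blocker mid-implementation.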

Cross-feature communication: Currently features are isolated silos. If task 3 of feature A discovers something relevant to feature B, there's no mechanism to propagate that. A shared vault/signals/ directory for cross-cutting observations could address this.

4.7 Evaluator Prompt Calibration Loop

The article emphasizes that QA calibration requires multiple rounds. Build this into the workflow:

  1. After each /qa run, the human reviews whether they agree with the evaluator's judgment
  2. Disagreements (evaluator passed something human would fail, or vice versa) are recorded
  3. Periodically, these disagreements are used to refine the evaluator prompt
  4. Track calibration accuracy over time (simple agree/disagree ratio)
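The tracking in step 4 needs almost no machinery. A sketch, with the record shape as an assumption:

```python
# Minimal sketch of calibration tracking: record agree/disagree per /qa
# run and compute the ratio over time. The record shape is an assumption.
def record(log: list[dict], task: str, evaluator: str, human: str) -> None:
    log.append({"task": task, "evaluator": evaluator, "human": human,
                "agree": evaluator == human})

def calibration(log: list[dict]) -> float:
    """Agree/total ratio -- the simple accuracy metric from step 4."""
    return sum(e["agree"] for e in log) / len(log) if log else 0.0

log: list[dict] = []
record(log, "mixer panel", evaluator="pass", human="pass")
record(log, "transport", evaluator="pass", human="fail")  # false positive
record(log, "shortcuts", evaluator="fail", human="fail")
```

Keeping the raw disagreement records (not just the ratio) matters: they are the examples fed back into the evaluator prompt in step 3.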

This is the same loop Anthropic used ("several rounds of this development loop before the evaluator was grading in a way that I found reasonable") but made systematic.


Part 5: Architectural Principles to Adopt

5.1 "Find the simplest solution possible, increase complexity only when needed"

This principle from Anthropic's "Building Effective Agents" post applies directly. The vault skill suite has grown to 20+ skills. Each was justified at creation, but the aggregate complexity creates its own costs: maintenance burden, interaction ambiguity (which skill handles what?), and context overhead (skill descriptions consume tokens).

Action: periodically evaluate whether skills can be merged, simplified, or retired.

5.2 "Harnesses encode assumptions about model limitations"

Every vault skill was written for a specific model capability level. As models improve, some skills become scaffolding around problems the model can now solve natively. The RALPH model, for instance, assumed context degradation. Sprint contracts assumed the model couldn't self-scope. The sozluk assumed the model couldn't maintain consistent terminology.

Action: for each skill, document what model limitation it compensates for. When that limitation is resolved, simplify.

5.3 "The better the models get, the more space there is for complex tasks"

The expanding possibility space means the vault pipeline can attempt more ambitious features over time. Features that would have required 20+ tasks with careful decomposition might be achievable in 5 tasks with a more capable model. The task decomposition granularity should scale with model capability.

5.4 Separation of Concerns Over Self-Evaluation

The single strongest takeaway: never trust an agent's evaluation of its own work for subjective or complex tasks. This principle should be embedded deeply into the vault's design philosophy. Wherever a skill currently self-evaluates (marks its own output as complete, grades its own quality), consider whether a separate evaluation step would improve reliability.


Part 6: Prioritized Action Items

Ordered by impact-to-effort ratio:

  1. Create /qa skill (high impact, medium effort). The biggest gap. Domain-agnostic with multiple verification backends: iOS simulator MCP for mobile, Playwright MCP for web, command execution for backend/CLI, diff review + LSP + tests for pure code. Detects domain from task description and changed files. v1 scope: read criteria, verify via appropriate backend, grade pass/fail, output report. No auto-revert or auto-fix.

  2. Add structured handoff format to progress.md (high impact, low effort). Replace free-text "Next" with machine-readable fields. Update /do-work to write this format.

  3. Create /harness-audit skill (medium impact, medium effort). Quarterly audit skill that walks each SKILL.md, checks for ## Model Assumptions sections, evaluates whether assumptions are obsolete given current model capabilities, identifies copy-pasted instructions that could be consolidated into shared rules, checks vault sessions for skill usage frequency, and produces a report with: skills to simplify, assumptions to retire, consolidation opportunities, and skills to potentially sunset. Report ends with ONE concrete next action (AUDHD-friendly). Saves to vault/checklists/YYYY-MM-DD-harness-audit.md.

  4. Prototype /do-sprint (medium impact, medium effort). Try 2-3 task sessions on a real feature. Measure quality vs RALPH. Keep RALPH as default.

  5. Consolidate sozluk loading (medium impact, medium effort). Move the copy-pasted sozluk instructions to a shared rules file. Reduces skill maintenance burden.

  6. Cross-feature signals directory (low impact, low effort). Create vault/signals/. See if it gets used organically.

  7. Evaluator calibration tracking (medium impact, high effort). Requires a feedback loop that doesn't exist yet. Defer until /qa skill proves useful.


Part 7: Open Questions

  1. Should the evaluator agent use a different model than the generator? The article doesn't discuss this, but model diversity might reduce shared blind spots.

  2. How does the QA skill interact with the AUDHD workflow? If /do-work finishes and the user runs /qa and it fails, the user now has a revert-or-fix decision. Does this create decision paralysis? Should /qa auto-fix small issues?

  3. Resolved: /qa uses multiple verification backends (simulator, Playwright, CLI, diff+LSP) selected by domain detection.

  4. Is the vault's two-tier reporting (local vault + Notion) still justified, or should one tier be retired? The article's simplification principle suggests maintaining both is a smell unless they serve genuinely different audiences.

  5. Sprint contracts add a human confirmation step. In the AUDHD context, is this helpful (forces clarity) or harmful (adds friction that prevents starting)?

  6. The article notes that improvement across evaluator iterations was "not always linear" and humans sometimes preferred middle iterations. For the vault, this means the QA skill shouldn't assume monotonic improvement. How do we handle the case where revision N is worse than revision N-1?
