@planetf1
Last active June 3, 2026 15:02
Mellea Concept to Implementation process / ideation — one-pager + strawman + research (team session 2026-06-03)

The breadth: what we already have

Summary — One laptop carries ~50 agent skills. The ~14 that market mellea are already shared in a git repo; the ~20 that build it are siloed on individual machines, copied nowhere. This page inventories both tiers and proposes a one-off batch to consolidate the engineering skills into the shared library (the how is on page 2).

The split is the whole argument:

Skills that market the project are already shared. Skills that build it are siloed on individual machines.

(Skills below are grouped as they'd sit after refactoring, ~50 on one laptop.)

flowchart LR
    subgraph SHARED["Already shared (git repo) ✓"]
    C["Content / dev-rel<br/>14 skills"]
    end
    subgraph SILOED["Siloed on laptops ✗"]
    E["Engineering<br/>~20 skills"]
    end
    C -.->|"the asymmetry"| E
    style SHARED fill:#dff0d8,stroke:#3c763d,color:#1b3a1b
    style SILOED fill:#f2dede,stroke:#a94442,color:#3a1b1b

The team repo today — and why it isn't enough yet

The shared content repo hosts 14 skills (blog / tweet / LinkedIn / YouTube drafting, release notes, research, link previews, snippet validation). Consumed by symlink: zero drift, your copy is the base. This proves the model. Two cracks show it isn't yet a team library:

  • Single-author — one person feeds it, others only read. A library with one contributor is a personal repo with an audience.
  • Stale — last commit ~7 weeks ago. No inflow, no refresh — the exact drift the learning loop (page 2) is built to kill.

Symlink-sharing works for a solo author. It does not scale to many people editing the same skill — that's what the tiered override / augment model (page 2) is for.

The engineering tier — siloed, never shared

Local directories on one laptop, shared with no one. Grouped by phase so adoption order is obvious. Right column = rough effort to generalise before a skill is team-ready (estimates — validate before quoting):

| Phase | Skills | Est. effort to share |
|---|---|---|
| Implement | fix-bug, writing-tests, stacked-pr, phased-pr-strategy | low (~20%) |
| Review | code-review, write-pr-body, respond-to-pr-comment | low (~15%) |
| Design | design-issue, design-proposal, prior-art-research, project-scoping, new-project, project-bootstrap, backlog-decomposition | medium |
| Cross-cutting | async-concurrency, llm-inference, opentelemetry-tracing, rust-async-background-tasks, testing-generative-ai, validate-snippets | medium–high |

Focus first: fix-bug, code-review, write-pr-body — highest frequency, lowest effort, so the loop gets exercised fastest.

The rest of the inventory

  • Project tier — repo-specific (status / standup, project logging, marker auditing, the skill-authoring meta-skill). Live in the project repo.
  • Personal tier — machine / other-project / taste (system cleanup, SSH helpers, GPU sizing, release tooling). Stay personal.

Full per-skill list with descriptions: page 1b — personal catalog.

Consolidating into a shared library — the mechanism

One laptop holds ~20 engineering skills. Multiply by the team and that tier is the real prize — but the contributions must be consolidated, validated, and hosted first. The bootstrap, reusing the tiered model from page 2:

  1. List first — everyone drops their skills into a shared table (name, purpose, owner, share? y/n). Surfaces overlaps cheaply — five people will each have a fix-bug before anyone writes a PR.
  2. Consolidate — per overlapping skill, pick one seed author; the rest contribute by review / augment, not competing PRs.
  3. Stand up the team-base repo (or widen the content repo with an eng set) as the Team base tier.
  4. One PR per skill; description: frontmatter is mandatory so selection works. A rotating gardener validates (generic, no per-laptop paths, cruft stripped), bumps the version date, merges.
  5. Add the repo to your skill path and pull. Local tweaks become overrides; improvements flow back as the next PR.
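
Steps 1 and 2 can be sketched in a few lines. This is a minimal illustration, assuming the shared table is exported as (name, owner, share) rows; the sample owners and the "first owner alphabetically seeds" rule are hypothetical placeholders, not an agreed policy.

```python
from collections import defaultdict

# Hypothetical rows from the shared table: (skill name, owner, willing to share?)
ROWS = [
    ("fix-bug", "alice", True),
    ("fix-bug", "bob", True),
    ("code-review", "alice", True),
    ("ssh-helpers", "carol", False),
]


def find_overlaps(rows):
    """Return {skill: sorted owners} for shareable skills held by more than one person."""
    owners = defaultdict(set)
    for name, owner, share in rows:
        if share:
            owners[name].add(owner)
    return {name: sorted(who) for name, who in owners.items() if len(who) > 1}


def pick_seed_author(owners):
    """Placeholder rule: the first owner seeds the skill, the rest review or augment."""
    return owners[0], owners[1:]


overlaps = find_overlaps(ROWS)
for skill, who in overlaps.items():
    seed, reviewers = pick_seed_author(who)
    print(f"{skill}: seed={seed}, reviewers={reviewers}")
```

Only fix-bug is flagged here: it is the one shareable skill with two owners, so it gets a single seed author instead of two competing PRs.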

A one-off batch to seed the base. After it lands, it's the steady-state loop from page 2 — not a migration project.
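
The gardener's check in step 4 could be partly automated. The sketch below is an assumption about what "validates" might mean in practice: it confirms a non-empty description: key in the frontmatter and flags paths that look machine-specific. The path patterns are guesses, and the line-based frontmatter split is deliberately simple rather than a full YAML parse.

```python
import re

# Assumed patterns for "per-laptop paths"; adjust to the team's machines.
LOCAL_PATH = re.compile(r"(/Users/|/home/|[A-Za-z]:\\Users\\)\S+")


def split_frontmatter(text):
    """Split a SKILL.md into (frontmatter lines, body); ([], text) if there is none."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return [], text
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            return lines[1:i], "\n".join(lines[i + 1:])
    return [], text


def validate_skill(text):
    """Return a list of problems; an empty list means the skill passes."""
    problems = []
    front, _body = split_frontmatter(text)
    has_description = any(
        line.startswith("description:") and line.split(":", 1)[1].strip()
        for line in front
    )
    if not has_description:
        problems.append("missing or empty description: in frontmatter")
    if LOCAL_PATH.search(text):
        problems.append("contains a per-laptop path")
    return problems


good = "---\nname: fix-bug\ndescription: Structured diagnose-and-fix.\n---\nSteps...\n"
bad = "---\nname: fix-bug\n---\nRun /Users/me/bin/fix.sh\n"
print(validate_skill(good))
print(validate_skill(bad))
```

A check like this only covers the mechanical part; whether a skill is generic enough for the team base remains a human call.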

Personal skill catalog (my laptop)

Summary — I have a lot of skills — ~50 on this laptop, ~40 of them mellea-relevant, all listed below with descriptions and a status column. The point isn't any single skill; it's the sheer volume, and that almost every engineering one is Personal — shared with no one.

This is my own inventory, not a proposal. A point-in-time snapshot of the skills on one machine that touch mellea / core development — reference for the breadth on page 1. Machine, other-project, and taste skills (firewall tooling, disk cleanup, SSH helpers, GPU sizing, etc.) are omitted.

Descriptions are condensed from each skill's description: frontmatter (the text the agent matches against to decide whether a skill fires). The Status column is the page-1 story made concrete — almost everything engineering is Personal, i.e. shared with no one.

Status legend

  • Team — unchanged — symlinked from the shared content repo; my copy is the base.
  • Project — unchanged — symlinked from mellea's committed .agents/skills/.
  • Personal — additional — real dir on my laptop only, in no shared repo. The consolidation candidates (page 1).
  • Overlaid — a local override of a shared skill. None today — symlinked shared skills can't diverge; this state only appears once the team-base repo exists.

Implement

| Skill | What it does | Status |
|---|---|---|
| fix-bug | Structured diagnose-and-fix: rebase discipline, minimal fix, regression test, related-issue check. | Personal — additional |
| writing-tests | Tests that catch real regressions, not coverage padding — what to test, at what level, mock + marker discipline. | Personal — additional |
| stacked-pr | Manage PRs built on unreleased changes: branch construction, rebase lifecycle, focused review of stacked diffs. | Personal — additional |
| phased-pr-strategy | Decide phased small PRs vs one monolith — and defend the choice when a reviewer pushes back. | Personal — additional |

Review

| Skill | What it does | Status |
|---|---|---|
| code-review | Multi-perspective review via 3 independent subagents synthesised into a consensus report. | Personal — additional |
| write-pr-body | Narrative-first PR body a cold reviewer grasps in 30s; epic-anchored "Where this fits" + testing block. | Personal — additional |
| respond-to-pr-comment | Triage and reply to review comments (yours or others'): claim verification, thread-resolution discipline. | Personal — additional |

Design & planning

| Skill | What it does | Status |
|---|---|---|
| design-issue | Draft / review / file a design or epic GitHub issue with correct project + iteration metadata. | Personal — additional |
| design-proposal | Long-form design doc / RFC for a cross-cutting change needing agreement before decomposition. | Personal — additional |
| prior-art-research | Competitive-landscape + best-practice scan before tech choices; includes "reasons not to build this". | Personal — additional |
| project-scoping | Turn a rough idea into testable UC / TR / IC requirements plus an explicit out-of-scope list. | Personal — additional |
| new-project | Orchestrate new-project bootstrap: scoping → prior-art → scaffold → backlog. | Personal — additional |
| project-bootstrap | Stand up an empty repo: workspace, CI/CD, pre-commit, docs, task tracking, green hello-world. | Personal — additional |
| backlog-decomposition | Break phases into agent-ready sub-tasks with dependencies and priorities (a bd tracker). | Personal — additional |
| iteration-planner | Plan or reassess an iteration from issues, board, and activity; prioritise by impact and narrative. | Personal — additional |

Cross-cutting engineering

| Skill | What it does | Status |
|---|---|---|
| async-concurrency | Strict async-safety for Python asyncio and Rust tokio: event-loop blocking, pool sizing, task spawning. | Personal — additional |
| rust-async-background-tasks | Safe low-CPU tokio background tasks: enabled-flag guard, watcher skip cache, semaphore-in-spawn. | Personal — additional |
| llm-inference | Resilient calls to local / remote inference servers: structured payloads, backoff, 429/5xx handling. | Personal — additional |
| opentelemetry-tracing | OTel spans + metrics across layers, focused on LLM events, token usage, and latency. | Personal — additional |
| testing-generative-ai | Test strategy for LLM apps: three-tier model, cassettes vs live, semantic assertions, framework-vs-app boundary. | Personal — additional |

Mellea-specific / docs

| Skill | What it does | Status |
|---|---|---|
| mellea-iteration-status | Meeting-ready status of my current-iteration items on the Mellea Deliverables board. | Project — unchanged |
| mellea-1on1-prep | Manager 1:1 brief: sprint board + journal context + optional YTD contribution narrative. | Personal — additional |
| mintlify-docs | Generate Mintlify Markdown / MDX from Python / Rust docstrings, plans, and app logic. | Personal — additional |

Workflow / meta

| Skill | What it does | Status |
|---|---|---|
| catchup | Catch up on repo activity since your last engagement; deep-dive the high-priority diffs. | Personal — additional |
| inbox-triage | Cross-repo overview of unread GitHub notifications; surface action items before diving in. | Personal — additional |
| find-skills | Discover and install agent skills when you ask "is there a skill for X". | Personal — additional |

Content / dev-rel

The shared content repo is where this all started — most of these are symlinked in, so my copy is the base. Note the two stragglers (blog-idea-scout, blog-review) that I have locally but never pushed back: the asymmetry exists even inside content.

| Skill | What it does | Status |
|---|---|---|
| write-technical-blog | Write a good technical post about a feature or release (Stripe / GitHub / Cloudflare patterns). | Team — unchanged |
| write-tweet | High-engagement technical tweets (swyx / simonw / Vercel patterns). | Team — unchanged |
| write-linkedin-post | High-engagement LinkedIn posts for OSS releases and dev tools. | Team — unchanged |
| write-youtube-script | Short-demo and long-walkthrough video scripts for a developer audience. | Team — unchanged |
| get-blog-candidates | Rank merged PRs by blog / demo-worthiness (feat label, LoC, docs/examples touched). | Team — unchanged |
| release-blog | Score a release's PRs and draft a narrative release post. | Team — unchanged |
| release-launch-plan | Orchestrate the release content workflow: readiness → checklist → draft → validate → promote. | Team — unchanged |
| research-project | Build context on a GitHub project (README, metadata, releases, stats) for content. | Team — unchanged |
| link-preview | Generate a shareable link-preview card with snippet + Open Graph / Twitter meta. | Team — unchanged |
| hn-scout | Scan HN front page for posts with mellea integration / demo potential. | Team — unchanged |
| validate-snippets | Extract, execute, and report on fenced code blocks in docs / blogs (Python, Go, JS/TS, shell). | Team — unchanged |
| blog-idea-scout | Rank blog ideas by cross-referencing features, ecosystem releases, trends, onboarding gaps. | Personal — additional |
| blog-review | Review a blog-post PR: test every snippet in a clean env, compare to blog standards. | Personal — additional |

Shared Skills: a tiered library that learns

Summary — Two ideas make this work. One — tiers. Skills live in four layers — org → team → project → personal — and the nearest copy wins, so a shared base can be overridden locally without forking. Two — an update loop. Fixes found in daily use flow back to everyone, so the library improves instead of rotting. Today only our content skills are shared; engineering skills sit on individual laptops. The rest of this page is how the tiers and the loop work.

How the layering works

A skill is just a directory with a SKILL.md — YAML frontmatter plus a markdown body that is the prompt. Each tier is a real directory on the agent's skill search path:

| Tier | Where it lives | For us (generative-computing / mellea) |
|---|---|---|
| Org ceiling | org-managed settings | generative-computing org policy — locked non-negotiables |
| Team base | a git clone on the skill path | generative-computing/agent-skills (new) + the content repo (exists) |
| Project | <repo>/.agents/skills/, committed | generative-computing/mellea: .agents/skills/ (exists) |
| Personal | ~/.claude/skills/, ~/.agents/skills/ | your laptop — niche skills + overrides |

  • Re-pull = git pull in the team clone.
  • Override = drop a same-named skill in a nearer tier.
  • Skills resolve most-specific-wins — the nearest copy shadows those below. The only exception is the locked tier nobody may weaken.
flowchart TB
    U["Personal — your laptop<br/>niche skills + personal overrides"]:::win
    P["Project — committed to the repo<br/>project skills + overrides of the base"]
    T["Team base — shared git repo<br/>engineering + content skills"]
    O["Org ceiling — managed policy<br/>the few skills nobody may weaken"]:::lock
    O --- T --- P --- U
    classDef win fill:#dff0d8,stroke:#3c763d,color:#1b3a1b
    classDef lock fill:#f2dede,stroke:#a94442,color:#3a1b1b

Personal shadows Project shadows Team base. The Org ceiling sits above all and cannot be overridden.

Precedence is two things, not one:

  • Which copy you get — hard, deterministic. Every tier is scanned into one registry; on a same-name collision the nearer tier wins, exactly like $PATH. We pick the order. This is what the diagram shows.
  • Whether a skill fires at all — soft, model-driven. The agent matches the task to each skill's description: frontmatter. Not in the diagram — but it's why good descriptions matter.

So shadowing decides which fix-bug you run; descriptions decide whether fix-bug runs at all.
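
The deterministic half (which copy you get) is small enough to sketch. This is an illustration under stated assumptions, not how any particular agent implements it: tiers are plain directories, the search path is ordered nearest-first, and the locked org tier is modelled by simply listing it first. The tier directories and the commit-policy skill name are hypothetical.

```python
import tempfile
from pathlib import Path


def build_registry(search_path):
    """Scan tiers in order; the first tier to define a skill name keeps it."""
    registry = {}
    for tier in search_path:
        if not tier.is_dir():
            continue
        for skill_dir in sorted(tier.iterdir()):
            if (skill_dir / "SKILL.md").is_file():
                registry.setdefault(skill_dir.name, skill_dir)
    return registry


def make_skill(tier, name):
    """Create a minimal skill directory for the demo."""
    skill_dir = tier / name
    skill_dir.mkdir(parents=True)
    (skill_dir / "SKILL.md").write_text(
        f"---\ndescription: {name} from {tier.name}\n---\n"
    )


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    org, personal, project, team = (
        root / n for n in ("org", "personal", "project", "team")
    )
    make_skill(team, "fix-bug")
    make_skill(team, "code-review")
    make_skill(project, "fix-bug")        # project override of the team base
    make_skill(org, "commit-policy")      # hypothetical locked skill
    make_skill(personal, "commit-policy") # attempted personal override
    # Assumption: the locked tier goes first so nothing can shadow it.
    registry = build_registry([org, personal, project, team])
    winners = {name: path.parent.name for name, path in registry.items()}
    print(winners)
```

In this demo the project copy of fix-bug shadows the team base, code-review falls through to the team base, and the personal commit-policy loses to the org copy.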

Two ways to specialise a skill:

  • Override — same-named skill in a nearer tier; replaces the base entirely. Free — just directory shadowing.
  • Augment — a thin delta that includes a shared fragment instead of copying it. A convention you author, not automatic concatenation — but nothing to sweep back later.

Where does a skill go? One question: who needs it?

  • Every project, not tied to mellea → team base (fix-bug, code-review)
  • Only meaningful inside mellea → project (mellea-iteration-status)
  • Only you, or your twist on a shared one → personal
  • Nobody in the org may weaken it → org ceiling

fix-bug shows the cascade end to end: generic version in team base; mellea commits a uv/pytest-aware override in its own repo; you can override again on your laptop. Nearest wins — and the only thing yet to be created is the team-base repo. The content repo, mellea's in-repo skills, and your home dir all exist today.

One format, every agent

Skills are authored once in .agents/skills/ using the open agentskills.io SKILL.md standard (our skill-author skill enforces it). Each tool is just a view onto that one store — no per-tool copies:

| Agent | Finds skills in | Config needed |
|---|---|---|
| Claude Code | ~/.claude/skills/ + project skillLocations | symlink ~/.claude/skills → ~/.agents/skills; "skillLocations": [".agents/skills"] in .claude/settings.json |
| IBM Bob | ~/.bob/skills/ | symlink ~/.bob/skills → ~/.agents/skills (and .bob/skills → .agents/skills per repo) |
| OpenCode | ~/.agents/skills/ | none — auto-discovered |
| Copilot / VS Code | project .agents/skills/ | none |

The canonical store at each tier is .agents/skills/; the per-tool dirs are symlinks or one config line pointing at it. This is why "sync" isn't "copying" — the agents share one source, they don't each hold a copy.
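
The symlink step can be scripted so it is safe to re-run. A minimal sketch, assuming the store and tool paths from the table above; the helper name link_tool_dir is hypothetical, the demo runs against a temporary stand-in for the home directory, and creating symlinks on Windows may need extra privileges.

```python
import os
import tempfile
from pathlib import Path


def link_tool_dir(tool_dir: Path, store: Path) -> str:
    """Point tool_dir at store via a symlink; never clobber a real directory."""
    store.mkdir(parents=True, exist_ok=True)
    if tool_dir.is_symlink():
        if tool_dir.resolve() == store.resolve():
            return "already linked"
        tool_dir.unlink()  # stale link pointing somewhere else
    elif tool_dir.exists():
        return "skipped: real directory exists, merge it by hand"
    tool_dir.parent.mkdir(parents=True, exist_ok=True)
    os.symlink(store, tool_dir, target_is_directory=True)
    return "linked"


with tempfile.TemporaryDirectory() as tmp:
    home = Path(tmp)  # stand-in for the real home directory
    store = home / ".agents" / "skills"
    first = link_tool_dir(home / ".claude" / "skills", store)
    second = link_tool_dir(home / ".claude" / "skills", store)
    (store / "fix-bug").mkdir()
    # A skill added to the store is visible through the tool's path at once.
    visible = (home / ".claude" / "skills" / "fix-bug").is_dir()
    print(first, second, visible)
```

Refusing to replace a real directory is a deliberate choice: an existing ~/.claude/skills may hold skills that have not been moved into the store yet.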

This is the layered-override cascade familiar from Kustomize and Claude Code's settings, applied to skills: an established config pattern, not a new invention.

How it incorporates learning

Improvements found in use flow back to everyone. Two nested loops at different speeds:

flowchart LR
    A[Use skill on real work] --> B{Friction?}
    B -- no --> A
    B -- yes --> C["improve-skill drafts a PR<br/>against the base (same session)"]
    C --> D[(Open skill PRs)]
    D --> E["Gardener merges<br/>one approval, bump version date"]
    E --> F[Everyone re-pulls]
    F --> A
  • Inner loop (continuous, individual) — hit friction using a skill, capture the fix as a PR there and then. The agent that just failed is the best-placed author.
  • Outer loop (per iteration, team) — a rotating gardener (a throughput role, not a taste authority) merges the open skill PRs and bumps the version. Everyone re-pulls.

Two health signals say the loop is alive:

  • Capture rate — did any friction become a PR this iteration? Zero means people are silently working around bad skills.
  • Staleness — median days a clone lags the base. A checkout left to drift weeks behind upstream is the exact failure this prevents.
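
Both signals are cheap to compute once the raw dates are collected. The sketch below assumes you can obtain, by whatever means, the date of the base repo's head commit, the date of the commit each clone sits on, and the dates skill PRs were opened; the names and sample dates are illustrative only.

```python
from datetime import date
from statistics import median


def staleness_days(base_head: date, clone_heads: dict[str, date]) -> float:
    """Median number of days the clones lag behind the base head."""
    lags = [max((base_head - head).days, 0) for head in clone_heads.values()]
    return median(lags)


def prs_in_iteration(opened: list[date], start: date, end: date) -> int:
    """Count skill PRs opened within the iteration (inclusive bounds)."""
    return sum(start <= d <= end for d in opened)


base = date(2026, 6, 3)
clones = {
    "alice": date(2026, 6, 1),
    "bob": date(2026, 5, 20),
    "carol": date(2026, 4, 15),
}
print(staleness_days(base, clones))
print(prs_in_iteration(
    [date(2026, 5, 22), date(2026, 6, 1)],
    start=date(2026, 5, 25),
    end=date(2026, 6, 5),
))
```

A PR count of zero for an iteration is the capture-rate warning described above; the median keeps one long-abandoned clone from dominating the staleness figure.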

The first step

One skill (fix-bug), two people, one full round-trip in one iteration: base → both adopt → both use on real work → first friction → PR → gardener merges → both re-pull. If that turns once, the model is proven and the rest is scaling.

The move to AI-led development

A teaser. This deserves its own meeting — please book one.

Summary — Shared skills are step one on a path from ad-hoc prompts to a process that largely runs itself: conventions that execute rather than sit in a wiki, knowledge that compounds across the team, and an autonomy slider set per task-type so agents propose and humans dispose. This page sketches the direction and asks for a follow-up; it does not try to settle it here.

Skills are our conventions made executable. Instead of a wiki page telling people to use uv, add the right commit trailer, and mark slow tests, the agent simply does it. The team's process stops depending on memory and starts running itself.

Shared skills are one early, concrete step on a longer path:

flowchart LR
    S1["Ad-hoc prompts<br/>tribal knowledge"] --> S2["Personal skills<br/>one laptop"]
    S2 --> S3["Shared library<br/>tiered + versioned + learning"]
    S3 --> S4["AI-led workflows<br/>the process runs itself"]
    style S3 fill:#fcf8e3,stroke:#8a6d3b,color:#3a2f1b
    style S4 fill:#dff0d8,stroke:#3c763d,color:#1b3a1b

We are between stages 2 and 3 today. The shared-skills proposal moves us to 3.

What "AI-led" starts to mean

  • Conventions execute, not just document. Standards are enforced by being run, not by being remembered.
  • Knowledge compounds. Every fix to a skill improves everyone's next task, not just the author's.
  • People direct; agents do the routine. We spend our judgement on what matters and hand the repetitive work to skills that already encode how we like it done.

What "AI-led" looks like, concretely — the autonomy matrix

The point isn't "automate everything." It's setting the autonomy slider per task-type: push specific, bounded work into "AI proposes, human disposes", gate it, and keep human judgement on the core we know cold. A taste of the fuller picture (detail in the linked docs):

★ target this iteration · ☆ stretch (later). Today every stage is human-led.

| Stage | Human-led | AI proposes, human disposes | Agent-led |
|---|---|---|---|
| 1 · Sense | | ★ weekly digest agent | |
| 2 · Frame | | ★ interview-the-author | |
| 3 · Triage | | ★ triage skill (comment-only) | ☆ auto-label trivia |
| 4 · Investigate | | ★ reproduce + locate skills | |
| 5 · Decompose | (1 person) | ★ shared skill + critique | |
| 6 · Implement | (novel core) | ★ bounded features | ☆ trivia, auto-accept |
| 7 · Review & verify | | ★ creator–verifier + evals | |
| 8 · Merge | | ★ distinct-approver + CI | ☆ Tier-1 auto-merge |

Shared skills are what make the middle column real: each one is a bounded task-type encoded so the agent can propose and a human can dispose.

Why a separate meeting

This page is deliberately thin. The bigger picture — how far AI-led development goes, what we automate next, where the human stays in the loop, and what it means for how we plan and review work — is a conversation, not a slide.

The ask: agree the shared-skills pilot now (see pages 1–2), and book a follow-up to talk through the bigger picture.

Full background: the deeper strawman and research notes sit in the same gist as this set — see the team-session gist.


Diagram note: these render from Mermaid as-is in GitHub, VS Code, and most slide tools. For a polished graphic, export from mermaid.live or hand the concept to a designer.

Concept → Implementation — the one-pager

Skim layer for Wed 2026-06-03. The argue-with layer is concept-to-impl-strawman.md; the evidence is concept-to-impl-research.md. Diagrams render in any Mermaid-aware viewer (GitHub, VS Code, Obsidian).


The idea in one breath

Today every stage of our lifecycle runs in one mode — a human drives, AI assists. The move is to push specific, bounded task-types into AI proposes / human disposes, gate them, and spend the freed attention on the work that needs a brain. Not "automate everything" — set the autonomy slider per task.

The sober caveat (METR RCT): experienced devs on familiar code went 19% slower with AI. So agents pay off on the periphery, not the core we know cold. A 415-developer survey (SPACE, Fast and Spurious) corroborates this — GenAI's speed gains get swallowed by review burden and verification load, so the gains "may be spurious." The whole strawman is about drawing that line on purpose.


The lifecycle and where each stage can move

flowchart LR
    S1[1 Sense] --> S2[2 Frame] --> S3[3 Triage] --> S4[4 Investigate]
    S4 --> S5[5 Decompose] --> S6[6 Implement] --> S7[7 Review & verify] --> S8[8 Merge & release]

    classDef human fill:#e8e8e8,stroke:#888,color:#000;
    classDef propose fill:#cfe8ff,stroke:#2b7,color:#000;
    class S5,S6,S7 human;
    class S1,S2,S3,S4,S8 propose;

Blue = stages where the strawman pushes a bounded task-type into AI-proposes this iteration. Grey = stays human-led (the off-distribution core).


The autonomy matrix

★ target this iteration · ☆ stretch (later). Today every stage is human-led.

| Stage | H · human-led | P · AI proposes, human disposes | A · agent-led |
|---|---|---|---|
| 1 · Sense | | ★ weekly digest agent | |
| 2 · Frame | | ★ interview-the-author | |
| 3 · Triage | | ★ triage skill (comment-only) | ☆ auto-label trivia |
| 4 · Investigate | | ★ reproduce + locate skills | |
| 5 · Decompose | (1 person) | ★ shared skill + critique | |
| 6 · Implement | (novel core) | ★ bounded features | ☆ trivia, auto-accept |
| 7 · Review & verify | | ★ creator–verifier + evals | |
| 8 · Merge | | ★ distinct-approver + CI | ☆ Tier-1 auto-merge |

Autonomy belongs to the task-type, not the stage: a docs typo is async, the sampling loop is synchronous — same "Implement" row.
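
One way to make "per task-type, not per stage" concrete is a lookup keyed by task-type. A minimal sketch: the three entries mirror examples from this page, while the key names and the conservative default for unknown task-types are assumptions.

```python
from enum import Enum


class Mode(Enum):
    H = "human-led"
    P = "AI proposes, human disposes"
    A = "agent-led"


# Hypothetical policy table keyed by task-type, not by lifecycle stage.
# All three entries belong to the same "Implement" row of the matrix.
POLICY = {
    "docs-typo": Mode.A,
    "bounded-feature": Mode.P,
    "sampling-loop-change": Mode.H,
}


def mode_for(task_type: str) -> Mode:
    """Unknown task-types default to human-led, the most conservative setting."""
    return POLICY.get(task_type, Mode.H)


for task in ("docs-typo", "bounded-feature", "sampling-loop-change", "unknown"):
    print(f"{task}: {mode_for(task).value}")
```

Defaulting to H means a new kind of work has to be argued into P or A explicitly rather than drifting there.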


The loop inside every stage (vendor-neutral: spec-driven development)

flowchart LR
    A[Specify<br/>what & why] -->|gate| B[Plan<br/>how]
    B -->|gate| C[Tasks<br/>small, ordered]
    C -->|gate| D[Implement<br/>focused diffs]
    D -.->|verify fails| C
    K[(AGENTS.md<br/>constitution)] --- A
    K --- B
    K --- C
    K --- D

Same shape across GitHub Spec Kit / AWS Kiro / Tessl. We already ship the Tasks engine (m decompose); "interview-the-author" is /specify + /clarify. Golden rule: minimum spec rigour that removes ambiguity.


The harness (Thoughtworks' frame: feedforward + feedback)

flowchart LR
    FF["FEEDFORWARD — aim it<br/>AGENTS.md · skills · m decompose · specs"] --> AG((agent runs))
    AG --> FB["FEEDBACK — catch it<br/>ruff · mypy · m eval · pytest tiers"]
    FB -.->|self-correct before human| AG
    FB --> H[human review]

Both halves already exist in Mellea. The work is wiring them into a loop.
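
The wiring itself is a short loop. The sketch below is an assumption about its shape, not Mellea code: generate stands in for an agent run and each check stands in for a feedback tool (ruff, mypy, m eval or pytest would sit behind such callables). Every function name here is hypothetical.

```python
def run_harness(generate, checks, max_rounds=3):
    """Loop generate -> checks until clean or out of rounds.

    Returns (candidate, rounds_used, remaining_failures). A human reviews
    the result either way; the loop only removes what tools can catch.
    """
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    failures = []
    candidate = None
    for round_no in range(1, max_rounds + 1):
        candidate = generate(failures)  # feedback from the last round aims the retry
        failures = [msg for check in checks for msg in check(candidate)]
        if not failures:
            break
    return candidate, round_no, failures


def fake_agent(failures):
    # Stand-in for an agent: fixes the style issue only once told about it.
    return "x = 1\n" if failures else "x=1\n"


def style_check(code):
    return [] if " = " in code else ["style: put spaces around '='"]


def nonempty_check(code):
    return [] if code.strip() else ["empty diff"]


candidate, rounds, failures = run_harness(fake_agent, [style_check, nonempty_check])
print(repr(candidate), rounds, failures)
```

The round cap matters: when the checks still fail after the last round, the failures are handed to the human along with the candidate instead of looping forever.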


Merge gate (the one-directional industry consensus)

flowchart TD
    PR[Agent opens PR] --> CI{CI gates pass?}
    CI -->|no| FIX[back to author]
    CI -->|yes| APV{Independent approver?<br/>author ≠ approver}
    APV -->|no| WAIT[blocked]
    APV -->|yes| DEL{Deletes code?}
    DEL -->|yes| EXTRA[extra scrutiny]
    DEL -->|no| MERGE[merge]
    EXTRA --> MERGE

All native GitHub branch-protection — free on a public repo. GitHub: "the developer who asks the agent to open a PR cannot be the one to approve it." MSR'26: 77.5% of agentic PRs were self-merged (the dominant risk). A peer LLM library, vLLM, goes further — its AGENTS.md states "a human submitter must understand and defend the change end-to-end" (pure code-agent PRs banned).
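
The flowchart above reduces to three checks in a fixed order. The sketch below restates it as a plain function for discussion; in practice branch protection enforces this, and the PullRequest fields are hypothetical names rather than GitHub API fields. "author" here means the developer who asked the agent to open the PR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PullRequest:
    author: str              # the developer who asked the agent to open the PR
    approver: Optional[str]  # None until someone approves
    ci_passed: bool
    deletes_code: bool


def merge_decision(pr: PullRequest) -> str:
    """Mirror the merge-gate flowchart: CI, then independent approver, then deletions."""
    if not pr.ci_passed:
        return "back to author"
    if pr.approver is None or pr.approver == pr.author:
        return "blocked"
    if pr.deletes_code:
        return "merge after extra scrutiny"
    return "merge"


print(merge_decision(PullRequest("alice", "alice", True, False)))
print(merge_decision(PullRequest("alice", "bob", True, True)))
```

The first call is blocked because the author approved their own PR; the second passes the gate but is routed through extra scrutiny because it deletes code.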


What we actually do next

flowchart LR
    subgraph NOW[Do now - no debate]
        N1[Promote 4 skills canonical]
        N2[Vendor dev-rel-skills pinned]
        N3[fewer-permission-prompts]
        N4[.bob/mcp.json]
        N5[Branch protection]
        N6[End-of-session skill habit]
        N7[Triage external bugs 24h]
        N8[Second model on the proxy]
    end
    subgraph NEXT[Next iteration]
        X1[Triage skill]
        X2[Decomposition skill]
        X3[Reproduce + locate]
        X4[Frame-issue interview]
        X5[Pre-merge-readiness]
        X6[Commit hygiene + linters]
        X7[Skill evals]
    end
    subgraph DECIDE[Decide in the meeting]
        D1[Periphery/core line]
        D2[Decomp: shared vs specialist]
        D3[Review norm + rota SPOF]
        D4[Smallest auto-mergeable unit]
        D5[Process metrics]
        D6[Publish as case study]
    end
    NOW --> NEXT --> DECIDE
# Action Stage Effort Owner Decide
1 Promote 4 skills to project-canonical (writing-tests, code-review, respond-to-pr-comment, write-pr-body) cross-cut afternoon skills owner now
2 Vendor / submodule dev-rel-skills, commit-pinned cross-cut ½ day skills owner now
3 fewer-permission-prompts → tracked settings.json cross-cut ½ hr settings owner now
4 Wire .bob/mcp.json (git, github-ibm) cross-cut 15 min Bob users now
5 Branch protection: independent approver + same CI + provenance line 8 15 min repo admin now
6 End-of-session "improve the skill / AGENTS.md" habit cross-cut free team norm now
7 External bug reports triaged ≤ 24h + count 3 decision team norm now
8 Try a second model on the proxy (deliberate Codex/Gemini second-opinion or long-context pass, via the same LiteLLM proxy we already run Claude through) 4/7 per-PR anyone now
9 Next iteration: triage (comment-only), decomposition skill, reproduce + locate, frame-issue interview, pre-merge-readiness, commit-hygiene + linters, skill evals 2–8 + cross-cut iteration assign in meeting now
10 Decide in meeting: periphery/core line (METR+SPACE), decomp shared-vs-specialist, review norm + reviewer rota (planetf1 SPOF), smallest auto-mergeable unit or nothing (vLLM vs Zig), process metrics, publish-as-case-study cross-cut meeting

Argue about these

| # | Tension | Evidence |
|---|---|---|
| 1 | Where's our periphery/core line — what do we actually trust to an agent here? | METR: −19% on familiar core |
| 2 | Decomposition: shared skill or specialist's craft? | 1 person + AI does ~all |
| 3 | Close the external-triage gap how? | #775/#885/#911 sit 30–60d |
| 4 | Change the review norm to kill the long tail? | 26-commit PRs = fatigue |
| 5 | Skills canonical or personal? AGENTS.md-first or vendor-skill-first? | 17 silent symlinks; AGENTS.md is the open standard |
| 6 | Smallest unit an agent may merge — or nothing? | Consensus: don't auto-merge unreviewed |
| 7 | Which process metrics? | Baselines: TTM median 1.3d, p90 8.4d |
| 8 | Publish how a 6-person gen-computing team does this? | We are the case study |

Do / Don't (the spirit of it)

| Do | Don't |
|---|---|
| Set the slider per task-type | Treat "more autonomy" as the goal |
| Delegate the periphery (backends, tests, docs) | Hand the off-distribution core to an agent |
| Ground shared conventions in AGENTS.md | Let everyone hand-roll their own prompts |
| Verify with our own evals | Import a web-app DOM/Playwright harness |
| Enforce author ≠ approver (free) | Build a bespoke governance engine |
| Adopt the free 90% | Reinvent the expensive 10% |

Mellea: Concept → Implementation — a strawman to argue with

For the team session, Wed 2026-06-03. This is a strawman, not a proposal I'm attached to. Read it to disagree with it. The goal is to leave with 2–3 things we actually change this iteration, and a shared map of where we're heading.


The one-sentence problem

We use AI heavily and well, but every stage of our lifecycle sits in the same mode — a human drives, AI assists, whoever picked up the work decides how. That's fine, and it's also the ceiling. The models are now good enough that some of this work doesn't need a human in the driving seat — and some of it needs us more, not less. We don't currently distinguish.

The thesis

Karpathy's 2026 name for where we're heading is agentic engineering: "vibe coding raised the floor; agentic engineering raises the ceiling" — using agents to go genuinely faster without dropping the quality bar ("you are still responsible for your software just as before"). The mechanism is an autonomy slider: "you are in charge of the autonomy slider, and depending on the complexity of the task at hand, you can tune the amount of autonomy that you're willing to give up." You keep the AI on a leash as tight as the task warrants (unconstrained, they get "lost in the woods") — "it's not useful to me to get a diff of 10,000 lines of code… I'm still the bottleneck." The goal is the Iron Man suit (augmentation with a fast human verification loop), not the Iron Man robot that runs off alone — "more building partial autonomy products… so that the generation-verification loop of the human is very, very fast."

This isn't one lab's idea — it's the convergent industry signal. Thoughtworks' April 2026 Tech Radar names the same discipline "harness engineering": putting agents on a leash with feedforward controls that aim them (skills, specs, shared instructions) and feedback controls that catch them before a human does (linters, type-checkers, evals).

Anthropic's own teams are one concrete instance — they split work into asynchronous ("auto-accept mode", the agent runs and you review the result) and synchronous ("detailed prompts… for core business logic", supervised turn by turn) — but we use them as an illustration, not the authority.

The sober part, up front. The autonomy story is real but uneven, and the best controlled evidence is a caution, not a cheer. METR's 2025 RCT put 16 experienced open-source developers on their own mature repos and measured them 19% slower with AI — while they believed they were ~20% faster. The slowdown tracked exactly the conditions we live in: high repo familiarity, large complex codebase, time spent verifying unreliable suggestions. The honest read isn't "AI makes us faster"; it's "AI pays off on the periphery and on unfamiliar code, and can cost us time on the core we know cold" — which is the whole reason the slider matters. This isn't just our anecdote: SPACE's measured "Fast and Spurious" survey (415 practitioners) found GenAI's speed and output gains are offset by a heavier code-review burden and the cognitive load of verifying output — "gains may be spurious." And there's a skill-atrophy signal to watch — Anthropic's RCT found AI-assisted engineers scored 50% vs 67% on a comprehension quiz (d=0.738). DORA 2025 says it more bluntly: AI is "an amplifier" — it magnifies the discipline (or the mess) we already have.

So picture our lifecycle as a grid: rows are the stages from "notice a need" to "release"; the columns are the slider — who drives:

  • H — Human-led, AI assists. You drive. AI is a power tool. Where we all are today, for everything.
  • P — AI proposes, human disposes. The agent produces a candidate — a triaged issue, a decomposition, a review — and a human approves it through a gate. The human's job shrinks to judgement.
  • A — Agent-led (async). The agent runs to completion on a tightly bounded class of work; the human reviews output, or a rule lets it through.

The move is not "automate everything". It's to push specific, bounded task-types rightward into P, a few trivia into A, and to spend the freed human attention on the things that genuinely need a brain: ambiguous framing, architectural decomposition, core-logic review, and the merge button.


The matrix

★ = strawman target · ☆ = stretch (later). Where we are today: column H, for every stage.

| Stage | H · human-led | P · AI proposes, human disposes | A · agent-led |
|---|---|---|---|
| 1 · Sense a need | | ★ weekly digest agent | |
| 2 · Frame the issue | | ★ Claude interviews the author | |
| 3 · Triage | | ★ triage skill (comment-only) | ☆ auto-label trivia |
| 4 · Investigate | | ★ reproduce + locate skills | |
| 5 · Decompose epics | (one person) | ★ shared skill + critique pass | |
| 6 · Implement | (novel core) | ★ bounded features | ☆ trivia, auto-accept |
| 7 · Review & verify | | ★ creator–verifier + agent-native verify | |
| 8 · Merge & release | | ★ distinct-approver + same CI gates | ☆ Tier-1 auto-merge |

The same picture as a flow — blue = stages where the strawman pushes a bounded task-type into AI-proposes this iteration; grey = stays human-led (the off-distribution core):

flowchart LR
    S1[1 Sense] --> S2[2 Frame] --> S3[3 Triage] --> S4[4 Investigate]
    S4 --> S5[5 Decompose] --> S6[6 Implement] --> S7[7 Review & verify] --> S8[8 Merge & release]

    classDef human fill:#e8e8e8,stroke:#888,color:#000;
    classDef propose fill:#cfe8ff,stroke:#2b7,color:#000;
    class S5,S6,S7 human;
    class S1,S2,S3,S4,S8 propose;

Two things to notice. First, nearly everything we do is in the leftmost column — that's the finding, not a criticism. Second, the autonomy level belongs to the task-type, not the stage. "Implement" isn't async or sync; a docs typo is async, a change to the sampling loop is sync. The grid is a decision aid for each piece of work, not a fixed rota.
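A minimal sketch of "the autonomy level belongs to the task-type, not the stage". The task-type names and the three sets are hypothetical placeholders for whatever list the team agrees on; only the H / P / A columns come from the grid above.

```python
from enum import Enum


class Autonomy(Enum):
    H = "human-led, AI assists"
    P = "AI proposes, human disposes"
    A = "agent-led (async)"


# Hypothetical task-type names, for illustration only.
CORE_TASKS = {"sampling-loop", "context-management"}
BOUNDED_TASKS = {"test-scaffolding", "example-file", "mirror-backend"}
TRIVIA_TASKS = {"docs-typo"}


def autonomy_for(task_type: str) -> Autonomy:
    """Pick the column from the task-type, never from the lifecycle stage."""
    if task_type in CORE_TASKS:
        return Autonomy.H
    if task_type in TRIVIA_TASKS:
        return Autonomy.A
    if task_type in BOUNDED_TASKS:
        return Autonomy.P
    # Anything not yet classified stays human-led by default.
    return Autonomy.H
```

The design choice worth arguing about is the default: an unclassified task falls back to H, so a task-type only moves right once someone has explicitly put it in a set.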


The loop inside every stage (the vendor-neutral "how")

The matrix says who drives each stage; it doesn't say how the work runs once an agent is involved. The whole industry has converged on one answer, and it's deliberately not a Claude-specific one — spec-driven development:

Specify → Plan → Tasks → Implement      (human review gate at every →)

Specify the what & why, plan the how, break it into small dependency-ordered tasks, implement them as focused diffs you can actually review — with a human gate at each boundary and a "constitution" of durable project rules (ours is AGENTS.md). GitHub Spec Kit (open-source, model-agnostic, 30+ agents), AWS Kiro and Tessl are all the same shape; Microsoft, Anthropic and Google converged on Spec Kit as the interoperable layer. The golden rule is "use the minimum specification rigour that removes ambiguity" — spec-first for most work, spec-anchored for long-lived core.

We already ship the engine for the Tasks phase (m decompose), and our "interview-the-author" framing (stage 2) is just /specify + /clarify. So adopting the loop is mostly wiring our own parts together, not importing a vendor workflow.
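The Specify → Plan → Tasks → Implement loop can be sketched as a short gated pipeline. This is an illustration under assumptions, not Spec Kit's implementation: `produce` and `approve` are hypothetical callables standing in for "the agent drafts the phase artefact" and "a human reviews it", and the sketch also gates the final phase to stand in for code review.

```python
from typing import Callable

PHASES = ["specify", "plan", "tasks", "implement"]


def run_loop(
    produce: Callable[[str, str], str],
    approve: Callable[[str, str], bool],
    request: str,
) -> dict[str, str]:
    """Run the phases in order and stop at the first gate a human rejects."""
    artefacts: dict[str, str] = {}
    current = request
    for phase in PHASES:
        candidate = produce(phase, current)
        if not approve(phase, candidate):
            # Rejected: nothing downstream runs on an unapproved artefact.
            break
        artefacts[phase] = candidate
        current = candidate
    return artefacts
```

The returned dictionary holds only approved artefacts, so a rejection at the plan gate leaves just the spec behind to revise.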


What being 6 people on an OSS LLM library rules out

A lot of the 2026 literature is written for large product teams with hosted services, paid agent fleets, and dedicated platform engineers. We are not that, and the strawman is deliberately scoped down to fit what we are:

  • No parallel agent fleets. Cursor runs "hundreds"; Factory warns they conflict and runs serially. At 6 people on a shared main, serial-with-good-worktrees is the right default; fleet orchestration is a non-goal.
  • No persistent watcher / runtime-authority agents. We have no prod telemetry to watch (we're a library), and a 6-person repo doesn't need a standing autonomous service with its own session-authority budget. Sensing is a periodic digest, not a daemon.
  • No bespoke governance engine. GitHub branch-protection gives us author≠approver and CI gates for free on a public repo; we don't build a runcycles-style authority broker — we cite it as the direction, adopt the free 90%.
  • Verification is evals, not a web-app harness. The Anthropic three-beat is framed for UIs (DOM contracts, Playwright). Our equivalent is our own eval stack (m eval, TestBasedEval, BenchDrift, pytest tiers) — and it happens to be exactly the thing Mellea exists to provide.

The bias everywhere: adopt the free 90%, cite the expensive 10% as direction.


Walking the stages

Each one: where we are → what I'd propose → and the evidence or precedent.

1 · Sense. Today — we over-serve this with personal scouting skills (hn-scout, blog-idea-scout, research-project) but nothing systematically feeds the backlog. Proposed (P) — one weekly digest agent watching the Granite ecosystem, our dependents, and the paper/HN feeds, producing "things Mellea might want to respond to" that seed iteration planning. Caveat for us — we're a library, not a hosted product, so we have no prod telemetry to watch; the Devin Auto-Triage "watch the error stream" shape doesn't map. Our signal is external (ecosystem releases, dependents, issues, papers), which is exactly what our scouting skills already read — so this is the cheapest stage to make periodic. Precedent — Devin Auto-Triage proposes-never-merges (May 2026); Cherny: "Claude is starting to come up with ideas."

2 · Frame. Today — issues are written free-hand; quality varies wildly. Proposed (P) — a flow where Claude interviews the author into a well-formed issue (problem-not-solution, reproducer, scope) before it's filed. Precedent — this is Anthropic's first beat: "the requirements are latent within you; Claude is better at extracting them than you are at stating them." Cheap, no infrastructure, high leverage.

3 · Triage. Today — internal issues get a same-day response; external bug reports sit 30–60 days (#775, #885, #911 among them). Proposed (P) — a triage skill that classifies (bug/feat/docs/area), proposes labels + a size + linked duplicates, and posts a comment for a human to approve — it does not silently mutate the tracker. Plus a norm: every external report triaged within 24h, and we track that number. Precedent — Devin Auto-Triage's read-only-first shape.
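The comment-only shape of the triage skill could look something like the sketch below. The field names and the comment wording are assumptions for illustration; the point it shows is that the skill's only output is text for a human to approve, with no call that changes the tracker.

```python
from dataclasses import dataclass, field


@dataclass
class TriageProposal:
    kind: str  # bug / feat / docs
    area: str
    size: str
    labels: list[str] = field(default_factory=list)
    duplicates: list[int] = field(default_factory=list)


def render_comment(p: TriageProposal) -> str:
    """Format the proposal as a comment body; nothing here mutates an issue."""
    lines = [
        "Proposed triage (awaiting human approval, nothing has been changed):",
        f"- kind: {p.kind}",
        f"- area: {p.area}",
        f"- size: {p.size}",
        f"- labels: {', '.join(p.labels) if p.labels else 'none'}",
    ]
    if p.duplicates:
        lines.append("- possible duplicates: " + ", ".join(f"#{n}" for n in p.duplicates))
    return "\n".join(lines)
```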

4 · Investigate. Today — our barest stage; only a partial fix-bug skill. Proposed (P) — reproduce-bug and locate (call-stack / file-finder) skills, paired with fix-bug. Precedent — Anthropic's API and Inference teams use Claude as the "first stop" to find which files a task touches: seconds, instead of a colleague round-trip.

5 · Decompose. Today — it works, but it's one person. planetf1 + AI decomposed #929 and #891 into clean phase/wave sub-issues, each in a single sitting. That's a single point of failure dressed up as a strength. Proposed (P) — generalise it into a team-shared decomposition skill with a built-in critique pass, so any reviewer can drive it. We already ship the engine for this: m decompose parses a prompt into dependency-ordered subtasks, extracts the constraints, tags each one "code" vs "llm"-judge validation, and emits a runnable m.instruct() script + JSON — so the shared skill wraps our own CLI and adds the critique pass on top, rather than reinventing decomposition. Precedent — Factory Droid's coordinator/critique separation. Also worth a look — our design-via-draft-PR ceremony is heavy (#1080 was 1,813 lines across 29 commits for a single-shot artefact). Anthropic's second beat suggests a denser, click-through artefact (even HTML) as a lighter feedback surface than a giant markdown PR.

6 · Implement. Today — strong, Mellea-specific, and already ~37% of our commits carry an AI trailer. Proposed — make the async-vs-sync call explicit per task: bounded/peripheral work (edge cases, tests, docs) goes to P or A (auto-accept loops, checkpoint from a clean git state — Anthropic's DS team: "treat it like a slot machine… let it run for 30 minutes, then either accept the result or start fresh"); core logic stays in H with real-time supervision. The boundary that matters — Karpathy's rule (Sequoia 2026): "you're either in the data distribution, on the rails of the RL circuits, and flying, or you're off-roading in the jungle with a machete." Agents are on-rails for code that recurs online and that labs train against (verifiable + commercially valuable); they go off-road on novel, precisely-arranged code — which is why he hand-wrote nanochat ("too far off the data distribution"). Much of Mellea's core is exactly that off-distribution code — the sampling loop, context management, the generative-function machinery, novel intrinsic/adapter wiring — so it stays H. Our P/A payoff is the periphery that does occur all over the internet: a new backend that mirrors an existing one (HF/OpenAI/Ollama/Watsonx/LiteLLM all share a shape), test scaffolding, docs, example files, telemetry-field plumbing. Precedent — Vim mode was "roughly 70%… Claude's autonomous work"; RL team: "try one-shot first… works about a third of the time."

7 · Review & verify. Today — strong review skills, but our long tail is review fatigue, not complexity: 26-commit PRs where the author makes fix-up commits in response to feedback (squash-merge hides the cycle count). This review burden is an industry finding, not just ours — SPACE measures it across 415 devs; and a peer LLM library, vLLM, goes further and bans pure-agent PRs ("a human must understand and defend the change end-to-end"). Proposed (P) — (a) a creator–verifier split: a fresh-context review subagent that sees only the diff + criteria, not the reasoning that produced it (Factory's two-pass pipeline; Anthropic's adversarial /code-review), with apply-mode — in practice as concrete as the standing prompt the X practitioner community converged on: "find the riskiest line; name the missing test"; (b) agent-native verification grounded in our own tooling — we're an LLM library, not a web app, so the "verify" surface isn't a DOM dashboard, it's evals. The same requirement runs off one definition three ways: m eval run / TestBasedEval (LLM-as-judge) for behaviour, BenchDrift for prompt/variation robustness, and the pytest tier suite (unit/integration/e2e/qualitative) headless in CI; (c) a commit-hygiene norm to kill the 20-commit tail; (d) extra scrutiny on any PR that deletes code (MSR'26). Precedent — Anthropic's third beat ("verify, not test"), translated to our domain. This is the most on-brand slice we have: Mellea is a requirements + verification framework — making an agent verify its own change with m eval is dogfooding our own thesis, not importing someone else's web-app pattern. 
And there's a published result behind it: RSTD (arXiv 2605.15425, May 2026 — IBM Research, built on Mellea) invokes the LLM only as narrowly-scoped judgment operators with schema-validated outputs, and on a validation failure issues a targeted repair prompt rather than re-running the whole task — exactly Mellea's Instruct-Validate-Repair loop — cutting retry cost 73.2% vs static decomposition and 51.7% vs monolithic at ~18% framework overhead. Our verify-and-repair pattern isn't just on-brand, it's measured.
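The repair-not-rerun idea can be sketched in plain Python. This is not Mellea's API or the RSTD implementation: `generate` is a hypothetical stand-in for any model call, and each validator is assumed to return a short failure message or None.

```python
from typing import Callable, Optional

# A validator returns a short failure message, or None when the output passes.
Validator = Callable[[str], Optional[str]]


def instruct_validate_repair(
    generate: Callable[[str], str],
    validators: list[Validator],
    task: str,
    max_repairs: int = 2,
) -> tuple[str, int]:
    """Return (output, repairs_used); raise if validation still fails."""
    output = generate(task)
    for repairs_used in range(max_repairs + 1):
        failures = [msg for check in validators if (msg := check(output)) is not None]
        if not failures:
            return output, repairs_used
        if repairs_used == max_repairs:
            break
        # Targeted repair: name only the failed requirements and keep the prior output.
        repair_prompt = (
            f"{task}\n\nPrevious output:\n{output}\n\nFix only: " + "; ".join(failures)
        )
        output = generate(repair_prompt)
    raise ValueError("still failing after repairs: " + "; ".join(failures))
```

The repair prompt carries the previous output plus only the failed requirements, which is the difference from re-running the whole task from scratch.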

8 · Merge & release. Today — entirely human. The industry consensus here is unusually one-directional: don't auto-merge unreviewed, and the author may not approve their own change. GitHub enforces it ("the developer who asks the agent to open a pull request cannot be the one to approve it"); MSR'26 found 77.5% of agentic PRs were merged by the submitter and flags it as the dominant risk; Devin and Factory keep human merge authority as a design invariant; the peer library vLLM makes it explicit — a human must understand and defend the change end-to-end. Proposed (P — cheap and consensus-backed) — a branch ruleset requiring an independent approver (requires_distinct_approver), the same CI gates for agent PRs as human ones, an agent-provenance field in the PR template, and extra scrutiny on code-deleting PRs. All of this is native GitHub branch-protection — free on a public repo, no infrastructure to build or run. Stretch (A — argue about it) — auto-merging a tiny class (doc typos, dependency bumps with green CI) is further than Devin, Factory and GitHub will go today; bring it as a tension, not a default. Precedent — runcycles.io tier model; GitHub Well-Architected "governing agents"; MSR'26 LGTM!.
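As a way to reason about the stage 8 rules before turning them into a ruleset, here is a small sketch of the checks. It is not GitHub's branch-protection API; the `PullRequest` fields are hypothetical, and treating any deletion as a flag rather than a blocker is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class PullRequest:
    author: str
    approvers: set[str] = field(default_factory=set)
    ci_green: bool = False
    lines_deleted: int = 0
    agent_authored: bool = False  # provenance only; deliberately unused by the checks


def merge_findings(pr: PullRequest) -> dict[str, list[str]]:
    """Apply the same gates to agent and human PRs; return blockers and flags."""
    blockers: list[str] = []
    flags: list[str] = []
    if not (pr.approvers - {pr.author}):
        blockers.append("needs an approver other than the author")
    if not pr.ci_green:
        blockers.append("CI gates are not green")
    if pr.lines_deleted > 0:
        flags.append(f"deletes {pr.lines_deleted} lines: give it extra scrutiny")
    return {"blockers": blockers, "flags": flags}
```

Note that `agent_authored` never appears in a condition: the provenance line is recorded, but the gates are identical either way.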


Application to us — what our GitHub actually shows

The matrix is generic; this part isn't. I pulled the last ~90 merged PRs (window from 2026-03-01) to ground the three challenges we named — who/when to review, how to split large work, better tool use — in our own data rather than industry anecdote.

Who/when to review — the merge button is loose, the reviewer is a SPOF. Good news first: only 1 of 90 PRs had no external review — we do review each other's work. But 76 of 90 (84%) were merged by the author themselves, and planetf1 reviewed essentially all 90 (jakelorocco 55, ajbozarth 53, psschwei 36 behind). So the review habit is healthy; the merge-authority habit is not (industry consensus: author≠approver), and review load sits on one person. The cheap fixes map straight onto stage 8: a branch ruleset for independent approver + a lightweight reviewer rota so planetf1 isn't the bottleneck for every change.

Splitting large work — a long tail we already feel. Median PR is +58 lines — small and reviewable. But the tail is heavy: p90 +1,220, max +2,436, and 14 PRs over +500 lines. Those are the 20-/26-commit fix-up marathons. This is the concrete case for the decomposition skill (stage 5) and the commit-hygiene norm (stage 7): the median shows we can ship small; the tail shows we don't always choose to. There's peer precedent for the discipline: Instructor's CLAUDE.md tells its agents to "keep PRs small and focused" and to "use stacked PRs for complex features."
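For anyone who wants to re-run these figures, the self-merge share and the size percentiles can be computed along these lines. The record fields (`author`, `merged_by`, `additions`) are assumed names for whatever the PR export provides, and the nearest-rank percentile is one common definition, not necessarily the one behind the numbers quoted here.

```python
import math
from statistics import median


def percentile(values: list[int], pct: int) -> int:
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def pr_summary(prs: list[dict]) -> dict:
    """Summarise merge authority and PR size for a list of merged PRs."""
    sizes = [p["additions"] for p in prs]
    self_merged = sum(1 for p in prs if p["author"] == p["merged_by"])
    return {
        "self_merged_share": self_merged / len(prs),
        "median_additions": median(sizes),
        "p90_additions": percentile(sizes, 90),
        "over_500": sum(1 for s in sizes if s > 500),
    }
```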

Better tool use — we own a frontier toolbox; are we using its breadth? The one signal I can measure is commit trailers: of AI-attributed commits, almost all credit Claude (Bob a distant second; Codex/GPT and Gemini barely appear). Big caveat: a trailer is an attribution marker, not a usage meter — Codex and Gemini work that simply isn't trailer-marked is invisible here, so this is not evidence those models are unused. (If we want the real answer, LiteLLM proxy logs would tell us; I'm not going to guess at proxy usage.) What the trailer data does suggest is a habit worth examining: when we mark AI help, it's overwhelmingly one model. The cheap experiment — no new tool, no new spend — is to deliberately reach for a second model where its shape fits: Codex/GPT for a second-opinion review pass, Gemini's long context for whole-subsystem investigation (stage 4), the batch-cluster models for offline eval/judge runs that don't need to be interactive.
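A sketch of how a trailer count like this could be reproduced from commit messages. The trailer keys (`Co-authored-by`, `Assisted-by`) and the keyword lists are assumptions about how we mark AI help, not a record of the script actually used; keyword matching is crude and would also count a human co-author whose name contains one of the keywords.

```python
import re
from collections import Counter

# Assumed trailer keys; adjust to whatever the team actually writes.
TRAILER = re.compile(
    r"^(?:Co-authored-by|Assisted-by):[ \t]*(.+?)[ \t]*(?:<[^>]*>)?[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)

# Hypothetical keyword map from trailer name to model family.
MODEL_KEYWORDS = {
    "Claude": ("claude",),
    "Bob": ("bob",),
    "Codex/GPT": ("codex", "gpt"),
    "Gemini": ("gemini",),
}


def count_ai_trailers(messages: list[str]) -> Counter:
    """Count trailers per model family; this measures marking, not usage."""
    counts: Counter = Counter()
    for message in messages:
        for name in TRAILER.findall(message):
            lowered = name.lower()
            for model, keywords in MODEL_KEYWORDS.items():
                if any(k in lowered for k in keywords):
                    counts[model] += 1
                    break
    return counts
```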

These three are the spine of the "tangible steps" half of the session — each one is a quick win with a number behind it.


The cross-cutting one: skills as a product

The thing the inventory surfaced that I didn't expect: our skills are drifting silently. 17 of mine are symlinks into a colleague's clone — a git pull in someone else's repo changes how my agent behaves, with no notice. Team skills vary person to person. Bob's MCP config is empty, so Bob users are flying with less.
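The silent-drift ring is easy to make visible. A minimal sketch, assuming skills live as entries in one directory (the path you pass in is yours to choose), that lists every skill which is a symlink resolving outside that directory:

```python
from pathlib import Path


def external_symlinks(skills_dir: Path) -> dict[str, Path]:
    """Return {skill name: resolved target} for symlinks that leave skills_dir."""
    root = skills_dir.resolve()
    found: dict[str, Path] = {}
    for entry in sorted(skills_dir.iterdir()):
        if entry.is_symlink():
            target = entry.resolve()
            # A target outside the skills directory can change under us on someone else's pull.
            if target != root and root not in target.parents:
                found[entry.name] = target
    return found
```

Running it against a personal skills directory gives the list of skills whose behaviour depends on someone else's checkout.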

This is exactly where the vendor-neutral evidence is strongest, so we don't have to lean on Anthropic for it. Thoughtworks' April 2026 radar puts "curated shared instructions for software teams" at ADOPT and calls the thing we're doing today — every developer hand-rolling their own prompts and skills — "an emerging anti-pattern"; the fix it recommends is anchoring AGENTS.md into a shared template. And the portable unit is a real open standard: AGENTS.md (introduced by OpenAI Aug 2025, donated to the Linux Foundation's Agentic AI Foundation Dec 2025 alongside MCP; 60K+ repos, 10+ native agents by Mar 2026). A 2,926-repo study (arXiv 2602.14690) found context files dominate and skills are mostly shallow, static instructions — and Vercel's own eval found repo-level AGENTS.md context beat tool-specific skills.

Proposal: treat shared AGENTS.md + skills + MCP config as a versioned product with evals, ground it in the portable AGENTS.md standard first (works across Claude, Bob, Codex, Gemini, Antigravity) with vendor skills as an optimisation layer on top, and adopt an end-of-session "refine the skill / AGENTS.md" habit (Thoughtworks calls this the feedback flywheel — a retrospective for the harness). Most of the quick wins below live here. (Karpathy's "install .md skills, not .sh scripts" is the same signal — the skill is the interface now, so it deserves the rigour we'd give any shipped artefact.)

We even have a first-party tool for the governed version of this: mellea-skills-compiler (generative-computing, Apr 2026) compiles a .md skill spec into a typed, instrumented Mellea pipeline (mellea-skills compile / /mellea-fy), then mellea-skills certify runs it through Granite Guardian + NIST AI RMF checks and emits a PolicyManifest + JSONL audit trail. That's "skills as a governed, evaluated, versioned product" already realised in our own ecosystem — the question for the meeting is how much of that rigour we adopt for the team's shared skills now, versus later.

The feedback loop is the habit that makes all of this compound. The single highest-leverage thing already in practice on the team is asking the agent, at the end of a session, "what did we learn that should be written down — a skill, an AGENTS.md rule, a gotcha?" and then actually committing the answer. That is the feedback flywheel (Thoughtworks, ADOPT) made concrete: every session that hits friction leaves the harness a little sharper, so the next agent doesn't repeat the mistake. It costs nothing and needs no infrastructure — it's a norm, not a tool. Three patterns worth standardising:

  • Capture-on-exit. End-of-session retro prompt → diff to a skill / AGENTS.md / CLAUDE.md. Make it a step in respond-to-pr-comment and the PR-wrap flow so it isn't reliant on one person remembering.
  • Recurring-review-comments → automation. Periodically ask an agent for the top-N review comments we keep making, then promote the mechanical ones into linters / a reviewer config so they stop costing human review cycles. (Federico Paolinelli does exactly this on MetalLB; it's the cleanest small-team instance of the flywheel feeding the feedback half of the harness — though note his talk is an idiomatic-Go cognitive-load talk, not an AI-agent one, cited here for the cognitive-load framing only.)
  • Agent memory — experimental, low-prescription. Persistent cross-session memory (project notes, decisions, branch/PR state) clearly helps but is hard to prescribe — it drifts, goes stale, and what's worth keeping is a judgement call. The honest team position: encourage individual experimentation, share what sticks, but only promote a pattern to a team norm once it's earned it. The durable stuff belongs in AGENTS.md/skills (reviewable, versioned); memory is the scratch layer for the things not yet stable enough to commit.
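For the recurring-review-comments pattern, a cruder stand-in for "ask an agent" is to group near-identical comments by normalised text and count them. This swaps in exact-match grouping for the agent's judgement, so it only catches comments repeated almost word for word; the function names and the sample comments are illustrative.

```python
import re
from collections import Counter


def normalise(comment: str) -> str:
    """Lower-case, drop punctuation and collapse whitespace so near-identical comments group."""
    text = re.sub(r"[^a-z0-9\s]", "", comment.lower())
    return " ".join(text.split())


def top_recurring(comments: list[str], n: int = 3, min_count: int = 2) -> list[tuple[str, int]]:
    """Return up to n normalised comments that occur at least min_count times."""
    counts = Counter(normalise(c) for c in comments)
    return [(text, k) for text, k in counts.most_common(n) if k >= min_count]
```

Anything that shows up here repeatedly and is mechanical to check is a candidate for a linter rule or reviewer config entry.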

The bigger 2026 signal sits underneath this: the codebase, not the model, is the bottleneck. Sourcegraph's CodeScaleBench shows agents degrade past ~400K LOC, and that wiring in code-intelligence/MCP retrieval gave a +0.26 reward delta while running 30% cheaper and 38% faster — "the difference between complete failure and near-perfect completion wasn't intelligence, it was efficient access to context." DORA 2025: AI is "an amplifier, magnifying an organization's existing strengths and weaknesses." So our highest-value paid bet is code-intelligence / MCP retrieval wired into Investigate; the rest is making Mellea legible to models — a tight CLAUDE.md (pointers + gotchas, <200 lines), per-package context, LSP symbol search.


What we could ship this iteration (≈ a day, total)

Not perfect, just progress:

  1. Promote 4 skills to project-canonical: writing-tests, code-review, respond-to-pr-comment and write-pr-body go into mellea/.agents/skills/. Now everyone gets the project's conventions on git pull. (an afternoon)
  2. Vendor / submodule dev-rel-skills into a Mellea-owned, commit-pinned location — kills the silent-drift ring entirely. (half a day)
  3. Run fewer-permission-prompts and fold the canonical permissions into a tracked settings.json. (half an hour)
  4. Wire .bob/mcp.json (git, github-ibm at minimum) so Bob users are on par. (15 min)
  5. Norm: external bug reports triaged within 24h, and start counting. (no tooling, just a decision)
  6. Adopt the end-of-session "improve the skill" habit. (free)
  7. Turn on branch protection: independent approver + same CI gates for agent PRs, and add an agent-provenance line to the PR template. (15 min — and it's the single highest-consensus practice in the 2026 industry.)
  8. Try a second model on the proxy. One deliberate Codex (GPT) or Gemini pass as a second-opinion reviewer / long-context investigator this iteration — same LiteLLM proxy we already run Claude through, just a model we reach for less. (per-PR, no setup)

What I want us to actually argue about

These are the tensions; I have evidence for each, not answers.

| # | Tension | The evidence |
|---|---|---|
| 1 | Is decomposition a shared skill or a specialist's craft? | One person + AI does ~all of it today |
| 2 | How do we close the external-triage gap? | #775, #885, #911 sit 30–60 days |
| 3 | Do we change our review norm to kill the long tail? | 26-commit PRs are fatigue, not complexity; SPACE measures review burden across 415 devs; Instructor mandates small, focused PRs |
| 4 | Do skills go canonical, or stay personal? | 17 silent symlinks; 2026 signal = skills-with-evals |
| 5 | What's the smallest thing an agent may merge alone, or is the answer nothing? | Consensus is "don't auto-merge; enforce author≠approver" (GitHub; MSR'26: 77.5% self-merged). The poles: vLLM (human must defend end-to-end) vs Zig (bans AI contributions outright: "invariably garbage", "contributor poker") |
| 6 | Which process metrics do we adopt? | We now have baselines: TTM median 1.3d, p90 8.4d |
| 7 | Do we publish how a 6-person team does this? | We're literally a generative-computing project |
| 8 | Parallel agent fleets, or serial? | Cursor runs hundreds; Factory runs serially ("agents conflict"); at 6 people, probably serial |
| 9 | Do we spread review load off one person, and use the breadth of models we already pay for? | planetf1 reviewed ~all 90 PRs; AI trailers credit Claude almost exclusively (a habit signal, not a usage meter) |

Conclusions & actions

Three things I take away from all of the above, then the master action list.

1. We are all in one column — that's the opportunity, but the controlled evidence says move carefully. Every stage runs human-led today; the win is pushing bounded, peripheral task-types into "AI proposes / human disposes" behind a gate. The two best controlled studies — METR's −19% on familiar code, SPACE's "spurious" gains swallowed by review burden — agree: the payoff is on the periphery and on unfamiliar code, not the off-distribution core we know cold. Draw that line on purpose.

2. The cheapest wins are governance and skills-as-a-product, and they're high-consensus. Author≠approver is free on a public repo and backed by everyone — GitHub, MSR'26, and peer libraries like vLLM that flatly ban pure-agent PRs. Our skill drift is real, measurable, and fixable this iteration. None of it needs new infrastructure — it needs a decision.

3. The deep bets are where our own tools already put us ahead. Decomposition-as-a-shared-skill (we ship m decompose), evals-as-verification (we are an eval framework; RSTD measured the repair-not-rerun payoff), and codebase legibility (the real bottleneck) are where Mellea is better positioned than most. The strategic move is to dogfood our own stack, not import a vendor's.

The master action list

The shape of it — three lanes, left to right:

flowchart LR
    subgraph NOW[Do now - no debate]
        N1[Promote 4 skills canonical]
        N2[Vendor dev-rel-skills pinned]
        N3[fewer-permission-prompts]
        N4[.bob/mcp.json]
        N5[Branch protection]
        N6[End-of-session skill habit]
        N7[Triage external bugs 24h]
        N8[Second model on the proxy]
    end
    subgraph NEXT[Next iteration]
        X1[Triage skill]
        X2[Decomposition skill]
        X3[Reproduce + locate]
        X4[Frame-issue interview]
        X5[Pre-merge-readiness]
        X6[Commit hygiene + linters]
        X7[Skill evals]
    end
    subgraph DECIDE[Decide in the meeting]
        D1[Periphery/core line]
        D2[Decomp: shared vs specialist]
        D3[Review norm + rota SPOF]
        D4[Smallest auto-mergeable unit]
        D5[Process metrics]
        D6[Publish as case study]
    end
    NOW --> NEXT --> DECIDE

Do now — no meeting needed, owners suggested:

| # | Action | Stage | Effort | Owner | Decide |
|---|---|---|---|---|---|
| 1 | Promote 4 skills (writing-tests, code-review, respond-to-pr-comment, write-pr-body) to project-canonical | cross-cut | afternoon | skills owner | now |
| 2 | Vendor/submodule dev-rel-skills, commit-pinned | cross-cut | ½ day | skills owner | now |
| 3 | Run fewer-permission-prompts → tracked settings.json | cross-cut | ½ hr | settings owner | now |
| 4 | Wire .bob/mcp.json (git, github-ibm) | cross-cut | 15 min | Bob users | now |
| 5 | Branch protection: independent approver + same CI for agent PRs + provenance line | 8 | 15 min | repo admin | now |
| 6 | End-of-session "improve the skill / AGENTS.md" habit | cross-cut | free | team norm | now |
| 7 | Norm: external bug reports triaged ≤24h + start counting | 3 | a decision | team norm | now |
| 8 | Try a second model on the proxy: one deliberate Codex (GPT) or Gemini pass as second-opinion reviewer / long-context investigator, via the same LiteLLM proxy we already run Claude Code through | 4/7 | per-PR | anyone | now |

Next iteration — decide to do them now, assign owners in the meeting:

| # | Action | Stage | Effort | Owner | Decide |
|---|---|---|---|---|---|
| 9 | Triage skill (comment-only) | 3 | | assign in meeting | now |
| 10 | Shared decomposition skill wrapping m decompose + critique pass | 5 | | assign in meeting | now |
| 11 | Reproduce-bug + locate skills | 4 | | assign in meeting | now |
| 12 | Frame-issue interview flow | 2 | | assign in meeting | now |
| 13 | Pre-merge-readiness skill | 8 | | assign in meeting | now |
| 14 | Commit-hygiene norm/hooks for the long tail | 7 | | assign in meeting | now |
| 15 | Recurring-review-comments → linters | 7 | | assign in meeting | now |
| 16 | Skill evals (skill-creator / mellea-skills certify) | cross-cut | | assign in meeting | now |

Decide in the meeting — genuinely open questions:

| # | Action | Stage | Effort | Owner | Decide |
|---|---|---|---|---|---|
| 17 | Where to draw the periphery/core line (METR + SPACE) | 5/6 | | team | meeting |
| 18 | Decomposition: shared skill vs specialist's craft | 5 | | team | meeting |
| 19 | Review-norm change + reviewer rota (planetf1 is the review SPOF; SPACE) | 7 | | team | meeting |
| 20 | Smallest auto-mergeable unit, or nothing (vLLM vs Zig poles) | 8 | | team | meeting |
| 21 | Which process metrics we adopt | cross-cut | | team | meeting |
| 22 | Publish-as-case-study | cross-cut | | team | meeting |

The one-pager (concept-to-impl-onepager.md) holds the visual version of this action set (now → next → decide) at a glance.


Appendix — the numbers behind the claims

Mellea, last 60 days (2026-04-01 → 06-01): 207 issues opened / 230 closed (net backlog +94). 161 PRs merged (Apr 97, May 64). Time-to-merge: median 1.3d, mean 3.5d, p90 8.4d, max 34.3d. Four people merged 67% of PRs. ~37% of commits carry an AI trailer (a floor — not everyone marks consistently).

Merged-PR analysis (≈90 PRs, window from 2026-03-01): 84% self-merged (76/90) yet only 1/90 had no external review — review habit healthy, merge-authority habit loose. Reviewer load concentrated: planetf1 reviewed ~all 90 (jakelorocco 55, ajbozarth 53, psschwei 36). PR size: median +58 lines, p90 +1,220, max +2,436, 14 PRs over +500. AI-commit trailers: Claude ≈222, Bob 27, Codex/GPT 1–2, Gemini 0 — but trailers measure attribution marking, not model usage, so this is a habit signal, not proof the other models are idle (proxy logs would settle it).

Vendor-neutral sources (the industry-led spine): METR — Measuring the Impact of Early-2025 AI on Experienced OSS Developer Productivity (RCT, 2025-07-10, −19%); Thoughtworks — Technology Radar Vol 34 (2026-04-15: harness engineering, curated-shared-instructions ADOPT, feedback flywheel, agent-instruction-bloat CAUTION); DORA — State of DevOps 2025 (AI-as-amplifier; 7-capability model); GitHub — Spec-driven development + Spec Kit (model-agnostic, 30+ agents); arXiv 2602.00180 (spec-driven development) + 2602.14690 (AGENTS.md as interoperable standard); AGENTS.md / Agentic AI Foundation (Linux Foundation); SPACE — Fast and Spurious (arXiv 2510.24265; 415 practitioners, review burden offsets speed gains); Anthropic — skill-formation RCT (arXiv 2601.20245; 50% vs 67% comprehension, d=0.738). Peer LLM libraries: vLLM — AGENTS.md ("Pure code-agent PRs are not allowed. A human submitter must understand and defend the change end-to-end.", github.com/vllm-project/vllm); Instructor — CLAUDE.md ("Use stacked PRs for complex features"; "Keep PRs small and focused", github.com/instructor-ai/instructor); Zig — Code of Conduct (bans AI contributions outright: "invariably garbage", "contributor poker", ziglang.org).

Key 2026 sources: Karpathy — Software Is Changing (Again) (YC, 2025-06-17), From Vibe Coding to Agentic Engineering (Sequoia AI Ascent 2026, 2026-04-29) + Dwarkesh interview (2025-10); Anthropic — How we Claude Code (Claude channel, 2026-05-23) and How Anthropic teams use Claude Code (report); runcycles.io — When Coding Agents Press Merge; Sourcegraph — The Coding Agent Is Dead + CodeScaleBench; GitHub — Well-Architected Governing agents; Cognition — Devin Auto-Triage; Factory — Droid review/creator-verifier; MSR'26 — LGTM!; Cherny/Wu on Lenny's Podcast. First-party / Mellea: m decompose (docs.mellea.ai); mellea-skills-compiler (github.com/generative-computing/mellea-skills-compiler); RSTD — Runtime-Structured Task Decomposition for Agentic Coding Systems (arXiv 2605.15425; Asthana et al., IBM Research, built on Mellea).

Verified quotes (checked against primary source, for the team-share):

  • Autonomy slider (Karpathy, "Software 3.0", 2025-06-17): "you are in charge of the autonomy slider, and depending on the complexity of the task at hand, you can tune the amount of autonomy that you're willing to give up"; "we have to keep the AI on the leash… it's not useful to me to get a diff of 10,000 lines of code… I'm still the bottleneck"; "less Iron Man robots and more Iron Man suits… partial autonomy products… so that the generation-verification loop of the human is very, very fast."
  • The agent boundary (Karpathy, Dwarkesh 2025-10; nanochat, HN 2025-10-13): agents handle "boilerplate… stuff that occurs very often on the Internet" but fail on "intellectually intense… precisely arranged" code — he hand-wrote nanochat because "the repo is too far off the data distribution."
  • Agentic engineering / on-rails-vs-off-roading (Karpathy, Sequoia AI Ascent 2026, 2026-04-29 — corroborated by Sequoia's channel + his bearblog summary + multiple transcripts): "vibe coding raised the floor; agentic engineering raises the ceiling… you are still responsible for your software just as before"; "you're either in the data distribution, on the rails of the RL circuits, and flying, or you're off-roading in the jungle with a machete" (and why: labs train on what's verifiable + commercially valuable); "we're not building animals, we are summoning ghosts… jagged intelligences shaped by data and reward functions"; "you can outsource your thinking, but you can't outsource your understanding." Also, agent-native ergonomics: "install .md skills, not .sh scripts."
  • Async vs sync (report, Claude Code team): "Fast prototyping with auto-accept mode … autonomous loops" vs "Synchronous coding for core features … detailed prompts with specific implementation instructions."
  • Slot machine (report, Data Science/ML team): "Treat it like a slot machine. Save your state… let it run for 30 minutes, then either accept the result or start fresh." Also: "stop Claude and ask 'why are you doing this? Try something simpler.' The model tends toward more complex solutions by default."
  • One-shot first (report, RL Eng): "Try one-shot first… If it works (about one-third of the time), you've saved significant time."
  • Vim mode (report): "roughly 70% of the final implementation came from Claude's autonomous work."
  • Slash commands compound (report, Security Eng): "Security engineering uses 50% of all custom slash command implementations in the entire monorepo."
  • Checkpoints / end-of-session / MCP-over-CLI (report): "starting from a clean git state and committing checkpoints"; "summarize completed work sessions and suggest improvements … to refine the CLAUDE.md documentation"; "use MCP servers rather than the BigQuery CLI to maintain better security control."
  • Author ≠ approver (GitHub): "the developer who asks the agent to open a pull request cannot be the one to approve it." MSR'26 LGTM!: 77.5% of agentic PRs merged by the submitter; maintainers tighten the gate on code-deleting PRs.
  • Context is the bottleneck (Sourcegraph, "The Coding Agent Is Dead", 2026-02): "the agent… is no longer the limiting factor… how you organize your codebase for agents… those are now the bottlenecks." DORA 2025: AI is "an amplifier."
  • Three-beat SDLC (How we Claude Code video — verified against the video): remove ambiguity ("the requirements are latent within you", Bitter Lesson) → dense HTML artifacts over long markdown ("if the markdown files get more than about 200 lines long, it's unlikely you're going to read it… certainly unlikely your colleagues are" — Tar, "the unreasonable effectiveness of HTML files") → "verify, not test" (three surfaces off one definition: human dashboard / agent-first via Playwright MCP / headless CI — bun verify).

Concept → Implementation: the research behind the strawman

Background read for the Wed 2026-06-03 session. This is the evidence layer — the strawman (concept-to-impl-strawman.md) is the argue-with layer, the one-pager (concept-to-impl-onepager.md) is the skim layer. Read this if you want the "why", or feed it to your own agent before the meeting.

Organised by theme, not by vendor — deliberately, because the biggest risk in this space is mistaking one lab's workflow for the industry's direction. We use Claude Code heavily and it shows up often below; but every load-bearing claim is cross-checked against a vendor-neutral source (Thoughtworks, DORA, METR, GitHub/Linux-Foundation standards, peer-reviewed arXiv) so the strawman rests on the industry's direction, not one lab's house style. Where a claim is a verbatim quote checked against the primary source it's marked [verified]; where it's a lead from a relay or a single secondary source it's marked [lead].


0. The shape of the argument

  1. The models crossed a line in late 2025: for a growing class of work, the human no longer needs to be in the driving seat. [verified] (Karpathy's own Nov→Dec 2025 inflection: ~80% hand-written → ~80% delegated.)
  2. But the evidence is sober, not hyped. The best controlled study to date (METR RCT) found experienced devs on familiar, complex repos went 19% slower with AI while believing they were faster. The benefit is real but uneven — it lands on the periphery, not the core (§2). This is the single most important caveat for a team of senior devs on a large familiar codebase.
  3. The discipline that replaces "vibe coding" is agentic engineering: go faster without dropping the quality bar. The control surface is a harness — feedforward controls (skills, specs, curated instructions) that aim the agent, plus feedback controls (linters, type-checkers, evals, mutation tests) that catch it before a human does (§3, Thoughtworks Vol 34).
  4. The industry has converged — across vendors — on a concrete operating loop inside that harness: spec → plan → tasks → implement, with a human gate at each boundary (§4), plus verify-not-test (§7), author≠approver at merge (§8), and context/codebase as the real bottleneck (§9). DORA 2025: AI is an amplifier — it magnifies whatever discipline already exists.
  5. Mellea is unusually well-positioned: we already ship the pieces (m decompose, a requirements/validation/repair core, an eval stack, an AGENTS.md, mellea-skills-compiler). The opportunity is to adopt the convergent loop using our own tools, not to import a vendor's workflow.

1. Autonomy as a slider (the column axis)

The cleanest framing is Karpathy's, and it predates and outlives any single tool.

  • The slider [verified, Software Is Changing (Again), YC, 2025-06-17]: "you are in charge of the autonomy slider, and depending on the complexity of the task at hand, you can tune the amount of autonomy that you're willing to give up."
  • The leash [verified, same]: "we have to keep the AI on the leash… it's not useful to me to get a diff of 10,000 lines of code… I'm still the bottleneck." Unconstrained, agents get "lost in the woods" [lead, relayed].
  • Suit, not robot [verified, same]: "less Iron Man robots and more Iron Man suits… partial autonomy products… so that the generation-verification loop of the human is very, very fast."
  • Agentic engineering [verified, Sequoia AI Ascent 2026, 2026-04-29]: "vibe coding raised the floor; agentic engineering raises the ceiling… you are still responsible for your software just as before." Also: "we're not building animals, we are summoning ghosts… jagged intelligences"; "you can outsource your thinking, but you can't outsource your understanding."
  • The capability boundary [verified, Sequoia 2026 + Dwarkesh 2025-10 + nanochat]: "you're either in the data distribution, on the rails of the RL circuits, and flying, or you're off-roading in the jungle with a machete." Agents are on-rails for code that recurs online and that labs train against (verifiable + commercially valuable); they go off-road on novel, precisely-arranged code — he hand-wrote nanochat because "the repo is too far off the data distribution."

Why this matters for us: the slider is per task-type, not per stage or per person. The same "Implement" stage is async for a docs typo and synchronous for the sampling loop. Our core (sampling, context management, generative-function machinery, novel intrinsic/adapter wiring) is the off-distribution code that stays human-led; our periphery (a backend mirroring an existing one, test scaffolding, docs, telemetry plumbing) is where delegation pays.
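One way to make "per task-type, not per stage or per person" concrete is a lookup table. The sketch below is illustrative only: the task-type labels and the two modes are assumptions lifted from the examples in this section, not an agreed team taxonomy.

```python
from enum import Enum


class Mode(Enum):
    ASYNC = "async"  # agent runs unattended; a human reviews the result
    SYNC = "sync"    # human leads, with detailed prompts and close review


# Hypothetical task-type labels, taken from the periphery/core examples above.
SLIDER = {
    "docs_typo": Mode.ASYNC,
    "test_scaffolding": Mode.ASYNC,
    "telemetry_plumbing": Mode.ASYNC,
    "backend_mirroring_existing": Mode.ASYNC,
    "sampling_loop": Mode.SYNC,
    "context_management": Mode.SYNC,
    "adapter_wiring": Mode.SYNC,
}


def autonomy_for(task_type: str) -> Mode:
    """Return the mode for a task type; unclassified work defaults to SYNC."""
    return SLIDER.get(task_type, Mode.SYNC)
```

Defaulting unknown work to SYNC is a deliberate choice in this sketch: a task has to be classified as periphery before it is delegated.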


2. The sober counterweight: where AI does not help (read this first)

The most industry-credible finding is also the most uncomfortable, and it is not from a vendor.

  • METR RCT [verified, METR, 2025-07-10]: a randomised controlled trial — 16 experienced open-source developers, 246 real tasks on their own mature repos. With AI tooling (Cursor Pro + Claude 3.5/3.7 Sonnet) they were 19% slower, while they believed they were ~20% faster. Forecasts were wildly off: the developers predicted 24% faster; ML and economics experts predicted 38–39% faster. The slowdown drivers were high repo familiarity, large/complex codebases, and AI unreliability — the agent's suggestions cost more time to read, verify and correct than they saved. The one developer who saw a speedup had >50 hours of prior Cursor experience. (A Feb 2026 follow-up on late-2025 tools narrows but does not erase the gap.)
  • Why this is the most relevant single study for us: we are exactly the population it measured — experienced developers on a large, familiar codebase. The naive read ("agents make everyone faster") is the read METR falsifies. The defensible read: agents pay off on the periphery and on unfamiliar code, and cost time on the core you know cold — which is precisely why the slider (§1) and task-classification (§5) matter more than raw adoption.
  • SPACE "Fast and Spurious" survey [verified, arXiv 2510.24265, 2025-10-28]: 415 software practitioners scored on the SPACE framework. Frequent GenAI users report faster task completion and higher output volume, but the gains are offset by increased code-review burden, persistent cognitive load from verifying output, and unchanged collaboration — so the authors conclude "perceived productivity gains may be spurious." (NSF mirror: par.nsf.gov/biblio/10677745.)
  • DORA 2025: "AI is an amplifier" [verified, DORA 2025, ~5,000 professionals]: 90% adoption, 80%+ believe productivity is up — yet AI adoption shows a negative relationship with delivery stability, and ~30% report little or no trust in AI-generated code. AI magnifies existing strengths and weaknesses; it does not manufacture discipline.
  • Anthropic skill-formation RCT [verified, arXiv 2601.20245 / anthropic.com/research/AI-assistance-coding-skills, 2026-01-29]: 52 mostly-junior engineers built an async Python library. The AI-assisted group scored 50% vs control 67% on a comprehension quiz (~2 letter grades lower; Cohen's d=0.738, p=0.01) with no significant speedup. A delegation interaction pattern predicted low scores; asking conceptual questions predicted high. The skill-atrophy caveat that pairs with METR — speed without understanding is the failure mode.
  • "Cognitive debt" [verified, Thoughtworks Tech Radar Vol 34, 2026-04-15]: one of the radar's four themes is retaining principles while relinquishing patterns — teams that let agents write everything lose the understanding Karpathy warns you can't outsource. The radar's response is a deliberate return to fundamentals: DORA metrics, pair programming, mutation testing, clean code — the disciplines that keep a human in command of an AI-amplified codebase.

Takeaway for the room: the goal is not "maximise autonomy". It's "move the specific task-types where agents are reliable leftward, and defend the core where they aren't." Everything downstream is in service of that distinction.


3. Harness engineering: the vendor-neutral umbrella

Thoughtworks' Vol 34 (2026-04-15) gives the cleanest non-vendor name for the whole strawman: "putting coding agents on a leash" via harness engineering. It splits the control surface into two halves — and everything in this document slots into one of them.

  • Feedforward controls — aim the agent before it runs: Agent Skills, spec-driven development, curated shared instructions. (= §4 SDD, §10 skills.)
  • Feedback controls — catch the agent after it runs, ideally before a human looks: linters, type-checkers, mutation testing, custom LSPs that trigger self-correction. (= §7 evals, our pytest tiers, ruff/mypy.)

The radar's other relevant calls (a useful neutral cross-check on our specifics):

  • Context engineering — ADOPT. Curating what the agent sees is now baseline practice, not an experiment.
  • Curated shared instructions for software teams — ADOPT. Verbatim: "relying on individual developers to write prompts from scratch is emerging as an anti-pattern." The recommended fix is to anchor CLAUDE.md / AGENTS.md into shared service templates — which is exactly our skills-as-a-product slice (§10), independently arrived at.
  • Agent instruction bloat — CAUTION. Hand-written AGENTS.md often beats LLM-generated; longer is not better. (Matches Anthropic's <200-line rule, §5, from a different source.)
  • Feedback flywheel — ASSESS. Treat spec → plan → implement + continuous harness improvement like a team retrospective: every session sharpens the harness. (= the end-of-session habit in §5, generalised and de-Anthropic'd.)
  • Agent Skills — TRIAL; MCP by default; progressive context disclosure; skills-as-executable-onboarding-docs. All consistent with our §10.
  • Securing permission-hungry agents — zero-trust, sandboxing, the "lethal trifecta" (untrusted input + private data + exfiltration path). A standing caution for any auto-merge or MCP-tool grant (§8).

The aggressive pole [verified, OpenAI Harness engineering: leveraging Codex in an agent-first world, openai.com/index/harness-engineering/, 2026-02-11]: OpenAI uses the same term — "harness engineering" — but pushes it toward far more autonomy: a literal "Ralph Wiggum Loop", mechanical "golden principles", and background tasks humans "aren't required to" review (companion pieces: Symphony with Linear as control plane, claimed +500% landed PRs; a separate Auto-review agent at ~99% approval). Same idea, opposite end of the slider — we take Thoughtworks' leash reading, not OpenAI's.

Why this matters for us: "harness engineering" is the framing to open the meeting with, because it is vendor-neutral and it makes the whole strawman one coherent idea — we are building Mellea's harness — rather than a list of Claude Code tricks. Our feedforward half is AGENTS.md + skills + m decompose; our feedback half is the eval stack. Both halves already exist; the work is wiring them into a loop.
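As a rough illustration of the feedback half, the sketch below runs the cheap always-on controls and reports which ones failed, so an agent (or a human) sees the failures before review. The command list is an assumption about a typical Python repo, not a description of our actual CI configuration; the `runner` parameter exists only so the logic can be exercised without the tools installed.

```python
import subprocess
from typing import Callable, Sequence

# Assumed commands; adjust to the project's real lint/type/test invocations.
FEEDBACK_CONTROLS: list[list[str]] = [
    ["ruff", "check", "."],
    ["mypy", "."],
    ["pytest", "-q"],
]


def run_command(cmd: Sequence[str]) -> int:
    """Run one command and return its exit code (non-zero means failure)."""
    return subprocess.run(list(cmd), check=False).returncode


def run_feedback_controls(
    controls: Sequence[Sequence[str]] = FEEDBACK_CONTROLS,
    runner: Callable[[Sequence[str]], int] = run_command,
) -> list[str]:
    """Run every control in order and return the names of those that failed."""
    failed = []
    for cmd in controls:
        if runner(cmd) != 0:
            failed.append(cmd[0])
    return failed
```

An empty return value means every control passed; anything else is the list to hand back to the agent for self-correction.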


4. The convergent operating loop: spec-driven development

SDD is the feedforward core of the harness, and it is the part that most reduces the strawman's Anthropic-centricity, because it is vendor-neutral by construction: every major vendor has shipped a flavour of it.

The four-phase loop (names differ, structure is identical across GitHub Spec Kit, AWS Kiro, OpenSpec, BMAD) [verified, GitHub blog 2025-09-02; arXiv 2602.00180, 2026-01-30]:

Specify  → what & why (user journeys, acceptance criteria; not the tech stack)
Plan     → tech stack, architecture, constraints
Tasks    → small, independently testable, dependency-ordered chunks
Implement→ task by task; review focused diffs, not 1,000-line dumps

…with a human review gate at every phase boundary ("your role isn't just to steer, it's to verify" — GitHub) and a constitution of durable project rules (language, testing, dependency policy) that every phase must respect — usually stored as AGENTS.md or .specify/memory/constitution.md.
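The gated loop can be sketched as a tiny state machine. This is a hypothetical illustration of "a human review gate at every phase boundary", not how Spec Kit or any other tool implements it; the class and method names are invented for the example.

```python
from dataclasses import dataclass, field
from typing import Optional

PHASES = ("specify", "plan", "tasks", "implement")


@dataclass
class SpecLoop:
    """Tracks the four phases; each boundary needs a named human sign-off."""

    approved: list[str] = field(default_factory=list)

    @property
    def current(self) -> Optional[str]:
        """The phase awaiting work or review, or None once all are approved."""
        if len(self.approved) == len(PHASES):
            return None
        return PHASES[len(self.approved)]

    def approve(self, phase: str, reviewer: str) -> None:
        """Record a sign-off; phases cannot be skipped or reordered."""
        if phase != self.current:
            raise ValueError(f"expected approval for {self.current!r}, got {phase!r}")
        if not reviewer:
            raise ValueError("a named human reviewer is required at every gate")
        self.approved.append(phase)
```

The point of the sketch is the ordering constraint: the agent cannot reach "implement" until a person has accepted the spec, the plan and the task list in turn.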

Three levels of rigour [verified, Martin Fowler / Böckeler 2025-10-15; arXiv 2602.00180]:

  • spec-first — write a spec for this task, then code (lightweight, today).
  • spec-anchored — keep the spec for the life of the feature, evolve it.
  • spec-as-source — the spec is the maintained artifact; code is regenerated, never hand-edited (Tessl; today only practical where generation is trusted, e.g. OpenAPI stubs, Simulink → certified C).

Golden rule [verified, arXiv 2602.00180]: "use the minimum level of specification rigor that removes ambiguity for your context." Spec-first for most of our work; spec-anchored for long-lived core; spec-as-source is not for us yet.

Tooling landscape [verified]:

  • GitHub Spec Kit — open-source CLI (specify), model-agnostic, works with Claude Code / Copilot / Cursor / Gemini CLI / Codex / opencode / Qwen / Kiro CLI and ~30 others. Slash commands: /constitution /specify /clarify /plan /tasks /analyze /implement /checklist. This is the portable reference implementation — Microsoft, Anthropic and Google have all converged on it as the interoperable layer.
  • AWS Kiro — agentic IDE; spec + plan + tasks + code in one workspace; "hooks" run test/lint/security after every agent action. Less portable.
  • Tessl — spec-as-source; code marked // GENERATED FROM SPEC - DO NOT EDIT; audit trails.

Claimed payoffs [lead, vendor-reported, treat as directional not measured]: GitHub — "roughly an order-of-magnitude fewer 'regenerate from scratch' cycles than ad-hoc prompting"; AWS Kiro — "40-hour features in under 8 hours of human time when authored as specs first."

Why this matters for us: SDD is the "how" the matrix is missing, and we already ship the engine for the Tasks phase: m decompose parses a prompt into dependency-ordered subtasks, extracts constraints, tags each "code" vs "llm"-judge validation, and emits a runnable m.instruct() script + JSON. The strawman's stage-2 "interview-the-author" is /specify + /clarify. Our "design-via-draft-PR" ceremony is a heavyweight, ad-hoc version of /plan. We can adopt the convergent loop with mostly our own parts.
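To show what "dependency-ordered" means for the Tasks phase, here is a minimal sketch using the standard-library `graphlib`. The task names and the dict shape are hypothetical; this is not the output format of m decompose, only the ordering idea behind it.

```python
from graphlib import TopologicalSorter

# Hypothetical subtasks: each key depends on the tasks in its set.
TASKS = {
    "write_spec": set(),
    "design_api": {"write_spec"},
    "implement_backend": {"design_api"},
    "write_tests": {"design_api"},
    "update_docs": {"implement_backend", "write_tests"},
}


def execution_order(deps: dict[str, set[str]]) -> list[str]:
    """Return the tasks in an order where every dependency comes first."""
    return list(TopologicalSorter(deps).static_order())
```

Tasks with no dependency between them (here `implement_backend` and `write_tests`) may come out in either order, which is also what makes them candidates for independent review.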


5. Task classification & the async/sync split (Anthropic — as one instance)

Anthropic's published patterns are a concrete instance of the slider, useful because they're specific and measured — but they are one team's house style, not the standard. Cite them as illustration, not authority; the vendor-neutral versions live in §1 (slider), §3 (harness), §4 (SDD).

  • Async vs sync [verified, How Anthropic teams use Claude Code report]: "Fast prototyping with auto-accept mode… autonomous loops" for peripheral work vs "Synchronous coding for core features… detailed prompts with specific implementation instructions." (This is METR's periphery/core split, §2, stated as a workflow rule.)
  • Vim mode [verified, report]: "roughly 70% of the final implementation came from Claude's autonomous work."
  • One-shot first [verified, report, RL Eng]: "Try one-shot first… If it works (about one-third of the time), you've saved significant time."
  • Slot machine [verified, report, DS/ML]: "Treat it like a slot machine. Save your state… let it run for 30 minutes, then either accept the result or start fresh." Plus: "the model tends toward more complex solutions by default" — stop it and ask for simpler.
  • Checkpoints / end-of-session / MCP-over-CLI [verified, report]: clean git state + frequent checkpoints (= DORA "work in small batches", §11); end-of-session "summarize work and suggest improvements to refine CLAUDE.md" (= Thoughtworks "feedback flywheel", §3); "use MCP servers rather than the BigQuery CLI to maintain better security control."

The three-beat SDLC [verified, How we Claude Code video, 2026-05-23] — note this is the same Specify/Verify shape as SDD, framed for UIs:

  1. Remove ambiguity — "the requirements are latent within you" (Bitter Lesson); let the agent interview you. (= SDD /specify + /clarify.)
  2. Dense artifacts over long markdown — "if the markdown files get more than about 200 lines long, it's unlikely you're going to read it… certainly unlikely your colleagues are"; condense into clickable HTML (Tar, "the unreasonable effectiveness of HTML files"). (= Thoughtworks "agent instruction bloat — CAUTION", §3, from an independent source.)
  3. Verify, not test — one definition, three surfaces: human dashboard / agent-first (Playwright MCP) / headless CI (bun verify). Demo: a React to-do app emitting data-verify unit/total/done/active attributes; presenter plants 4+3≠10 and an agent catches it via the data contract.

Translation for us: beat 3 is framed for a web app. Our "verify" surface is evals, not a DOM — see §7.


6. Decomposition & runtime-structured repair

  • m decompose [verified, first-party, docs.mellea.ai] — our own dependency-ordered decomposition CLI with per-constraint code-vs-llm-judge validation tagging. This is the Tasks phase of SDD, already shipped.
  • RSTD [verified, arXiv 2605.15425, May 2026 — Asthana et al., IBM Research, built on Mellea] — Runtime-Structured Task Decomposition: the LLM is invoked only as narrowly-scoped judgment operators with schema-validated outputs; on a validation failure it issues a targeted repair prompt rather than re-running the whole task. That is exactly Mellea's Instruct-Validate-Repair loop. Result: 73.2% retry-cost reduction vs static decomposition, 51.7% vs monolithic, ~18% framework overhead, 100% correctness across configs. Citation note: IBM Research, not the Mellea core team — same-house (Mellea is IBM/generative-computing) but a different team. Cite as "IBM Research, building on Mellea", not "our paper".
  • Decomposition today is a single-operator pattern [verified, our repo]: planetf1 + AI decomposed #929 and #891 into clean phase/wave sub-issues, each in one sitting. That is a strength, but it is also a single point of failure. The proposal is to generalise it into a shared skill that wraps m decompose and adds a critique pass (Factory Droid's coordinator/critique separation is the reference [lead]).
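The validate-then-repair idea in the bullets above can be sketched in a few lines. This is a generic illustration with a stand-in `generate` callable; it does not use Mellea's API or reproduce the RSTD implementation, and all names are hypothetical.

```python
from typing import Callable

Validator = Callable[[str], bool]


def instruct_validate_repair(
    generate: Callable[[str], str],
    prompt: str,
    requirements: dict[str, Validator],
    max_repairs: int = 2,
) -> tuple[str, list[str]]:
    """Generate once, then re-prompt only about the requirements that failed.

    Returns the final output and the names of requirements still failing.
    """
    output = generate(prompt)
    for _ in range(max_repairs):
        failed = [name for name, check in requirements.items() if not check(output)]
        if not failed:
            return output, []
        # Targeted repair: name only the failed requirements and keep the prior answer.
        repair_prompt = (
            f"{prompt}\n\nPrevious answer:\n{output}\n\n"
            f"Fix only these failed requirements: {', '.join(failed)}"
        )
        output = generate(repair_prompt)
    failed = [name for name, check in requirements.items() if not check(output)]
    return output, failed
```

The design choice worth arguing about is the repair prompt: it carries the previous answer and the specific failures forward instead of restarting the task from scratch, which is where the retry-cost saving would come from.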

7. Verification = evals (our most on-brand slice, the feedback half of §3)

The industry signal is "verify, not test" — make verification native to the artifact and runnable off one definition. In Thoughtworks' terms (§3) this is the feedback half of the harness. For a library (not a web app) the verification surface is our eval stack, which is the thing Mellea exists to provide:

  • m eval run / TestBasedEval (LLM-as-judge) for behaviour [verified, first-party].
  • BenchDrift (IBM/BenchDrift) for prompt/variation robustness [verified, first-party reference].
  • the pytest tier suite (unit/integration/e2e/qualitative) headless in CI [verified, our AGENTS.md] — plus ruff/mypy as the cheap always-on feedback controls Thoughtworks names explicitly.
  • RSTD's judgment-operator + targeted-repair pattern (§6) is the measured version of "validate then repair, don't rerun".

The point for the room: making an agent verify its own change with m eval is dogfooding our own thesis, and it is exactly the feedback control the vendor-neutral radar prescribes — not importing someone else's web-app harness.


8. Review & merge governance (strong cross-vendor consensus)

This is the most one-directional finding in the whole survey — useful precisely because it's not one vendor's opinion.

  • Author ≠ approver [verified, GitHub]: "the developer who asks the agent to open a pull request cannot be the one to approve it."
  • Self-merge is the dominant risk [verified, MSR'26 LGTM!]: 77.5% of agentic PRs were merged by the submitter; maintainers tighten the gate when an agent deletes code.
  • Creator–verifier with fresh context [verified/lead]: Factory two-pass; Cursor judge; GitHub self-review — the reviewer subagent sees only diff + criteria, not the reasoning that produced the change. A ship-now instantiation the practitioner community converged on [lead, X]: "find the riskiest line; name the missing test."
  • Tiered merge authority [lead, runcycles.io 2026]: merge-to-main is execution-equivalent, needs session-level authority, not per-call permission. Bring as a direction, not a policy we write.

Peer-library contribution policies — what comparable projects already enforce:

  • vLLM AGENTS.md [verified, github.com/vllm-project/vllm/blob/main/AGENTS.md]: the peer LLM-serving library is explicit — "Pure code-agent PRs are not allowed. A human submitter must understand and defend the change end-to-end." Reviewers read every changed line; one-off busywork PRs are banned; AGENTS.md is kept <200 lines and domain guides <300. (Added in PR #36877, inspired by HuggingFace transformers.)

  • Instructor CLAUDE.md [verified, github.com/instructor-ai/instructor/blob/main/CLAUDE.md]: the peer Python LLM library prescribes "Use stacked PRs for complex features", "Keep PRs small and focused", a changelog-per-PR rule, and a short PR-description template (What/Why/Changes/Testing) — peer precedent for the splitting-large-work challenge.

  • Zig — the contrarian counterweight [verified, ziglang.org/code-of-conduct + JetBrains podcast youtube.com/watch?v=iqddnwKF8HQ, 2026-05-27]: a "Strict No LLM / No AI Policy". Andrew Kelley calls AI contributions "invariably garbage" of "negative value" and frames review as "contributor poker" — you bet review time against contribution quality — driven by 200+ open PRs against limited review capacity. Context: only 4 of 112 surveyed OSS projects ban AI outright (Zig, NetBSD, GIMP, QEMU). The strongest contrarian read in the survey, plus a review-economics framing.

  • Securing permission-hungry agents [verified, Thoughtworks Vol 34]: any auto-merge or broad MCP grant has to respect the "lethal trifecta" — don't give an agent untrusted input + private data + an exfiltration path at once.

  • It's free on a public repo: GitHub branch-protection gives author≠approver + CI gates with no infrastructure to build. Adopt the free 90%, cite the tier model as the expensive 10%.
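The author ≠ approver rule is simple enough to state as a predicate. The sketch below is a hypothetical model for discussion; in practice GitHub branch protection enforces this, and the field names here are invented rather than taken from any GitHub API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PullRequest:
    requested_by: str          # the human who asked the agent to open the PR
    approvals: frozenset[str]  # logins of everyone who approved it


def may_merge(pr: PullRequest) -> bool:
    """True only if at least one approval comes from someone other than the requester."""
    return any(approver != pr.requested_by for approver in pr.approvals)
```

A PR approved only by the person who prompted the agent fails the check, which is the self-merge case the MSR'26 figure describes.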


9. Context/codebase is the bottleneck (not the model)

  • [verified, Sourcegraph "The Coding Agent Is Dead", 2026-02]: "the agent… is no longer the limiting factor… how you organize your codebase for agents… those are now the bottlenecks." CodeScaleBench: agents degrade past ~400K LOC; wiring code-intelligence/MCP retrieval gave +0.26 reward, 30% cheaper, 38% faster — "the difference… wasn't intelligence, it was efficient access to context."
  • [verified, DORA 2025]: AI is "an amplifier, magnifying an organization's existing strengths and weaknesses." A quality internal platform and AI-accessible internal context are two of DORA's seven AI-capabilities (§11).

For us: the highest-value paid bet is code-intelligence / MCP retrieval wired into the Investigate stage; the rest is making Mellea legible — a tight CLAUDE.md/AGENTS.md (pointers + gotchas, <200 lines), per-package context, LSP symbol search. This is the same "context engineering — ADOPT" the radar names.


10. Configuration & skills as a product — grounded in the open standard

This is where the "too Anthropic" risk is sharpest, and where the vendor-neutral answer is strongest. Thoughtworks puts curated shared instructions at ADOPT and calls hand-rolled per-developer prompting "an anti-pattern" (§3) — we reached the same conclusion from our own drift inventory.

  • AGENTS.md is the emerging interoperable standard [verified, agents.md; arXiv 2602.14690, 2,926 repos]: introduced by OpenAI Aug 2025; donated to the Agentic AI Foundation (Linux Foundation) Dec 2025 alongside MCP and goose; 60K+ repos and 10+ native agents by Mar 2026. The empirical study found Context Files dominate and "AGENTS.md emerging as an interoperable standard."
  • Skills are shallowly adopted and mostly static [verified, arXiv 2602.14690]: most repos define only one or two skills, and skills "predominantly rely on static instructions rather than executable workflows." Vercel's eval found repo-level AGENTS.md context outperformed tool-specific skills [lead, Harness/Vercel]. Implication: ground the team's shared conventions in AGENTS.md (portable across Claude, Bob, Codex, Gemini, Antigravity) first; treat Claude-specific skills as an optimisation layer on top, not the foundation.
  • The drift we found [verified, our inventory]: 17 of one machine's skills are silent symlinks into a colleague's clone (a git pull elsewhere mutates behaviour here); team skills vary; Bob's MCP config is empty. The fix is to treat shared config as a versioned, evaluated product — and the portable unit is AGENTS.md, not a vendor skill format. (Thoughtworks: anchor it in a shared service template.)
  • First-party governed version [verified, github.com/generative-computing/mellea-skills-compiler, 2026-04-23]: compiles a .md skill spec into a typed, instrumented Mellea pipeline (mellea-skills compile / /mellea-fy), then mellea-skills certify runs Granite Guardian + NIST AI RMF checks and emits a PolicyManifest + JSONL audit trail. This is "skills as a governed product" already realised in our ecosystem.
  • Karpathy, same signal [verified, Sequoia 2026]: "install .md skills, not .sh scripts" — the skill is the interface now.
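Both kinds of drift named above are cheap to detect. The sketch below assumes a skills directory with one entry per skill and reuses the <200-line figure cited earlier as a default budget; the function names and directory layout are hypothetical, not an existing tool.

```python
from pathlib import Path


def external_symlinks(skills_dir: Path) -> list[Path]:
    """Return entries in skills_dir that are symlinks resolving outside it."""
    root = skills_dir.resolve()
    found = []
    for entry in sorted(skills_dir.iterdir()):
        if entry.is_symlink():
            target = entry.resolve()
            if root not in target.parents:
                found.append(entry)
    return found


def over_budget(path: Path, max_lines: int = 200) -> bool:
    """True if the file has more lines than the budget."""
    return len(path.read_text(encoding="utf-8").splitlines()) > max_lines
```

Run periodically, the first function would surface the "silent symlinks into a colleague's clone" case; the second is a guard against instruction bloat in AGENTS.md or CLAUDE.md.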

11. Measurement (vendor-neutral, and we have baselines)

  • DORA is the neutral frame, and DORA 2025 gives an actionable model rather than just metrics. The AI Capabilities Model names seven organisational capabilities that amplify AI's benefit [verified, DORA 2025]:
    1. a clear AI policy/stance, 2. a healthy data ecosystem, 3. AI-accessible internal data, 4. a quality internal platform, 5. strong version control with easy rollback, 6. working in small batches, 7. a user-centric focus. Capabilities 5 and 6 map directly onto cheap habits we can adopt this iteration (frequent checkpoints, small PRs, easy revert) — and they're the same habits Anthropic's "clean git state + checkpoints" describes (§5), from a neutral source.
  • The amplifier warning [verified, DORA 2025]: AI adoption had a negative relationship with delivery stability and ~30% of practitioners distrust AI-generated code — so the metrics will magnify whatever discipline (or lack of it) we already have. Measure before scaling autonomy.
  • Our baselines [verified, our repo, last 60 days 2026-04-01→06-01]: 207 issues opened / 230 closed (net +94); 161 PRs merged (Apr 97, May 64); time-to-merge median 1.3d, mean 3.5d, p90 8.4d, max 34.3d; four people merged 67% of PRs; ~37% of commits carry an AI trailer (a floor — not everyone marks).
  • Candidate process metrics: TTM, AI-plan revision count, fraction of issues triaged without a human read, external-triage latency.
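For the time-to-merge metric, a minimal sketch of the summary statistics follows. It assumes a list of per-PR durations in days and uses a nearest-rank p90; the baselines above may have been computed with a different percentile method, so treat the numbers it produces as comparable only to themselves.

```python
import math
import statistics
from datetime import datetime


def days_between(opened: datetime, merged: datetime) -> float:
    """Elapsed time from PR opened to PR merged, in days."""
    return (merged - opened).total_seconds() / 86400


def ttm_summary(days: list[float]) -> dict[str, float]:
    """Median, mean, nearest-rank p90 and max of time-to-merge, in days."""
    ordered = sorted(days)
    p90_index = math.ceil(0.9 * len(ordered)) - 1
    return {
        "median": statistics.median(ordered),
        "mean": statistics.fmean(ordered),
        "p90": ordered[p90_index],
        "max": ordered[-1],
    }
```

The function raises on an empty list, which is intentional here: a period with no merged PRs should be reported as such, not as zeros.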

12. What being 6 people on an OSS LLM library rules out

  • No parallel agent fleets. Cursor runs "hundreds"; Factory warns they conflict and runs serially. At 6 on a shared main, serial-with-good-worktrees is the default; fleet orchestration is a non-goal.
  • No persistent watcher / runtime-authority agents. No prod telemetry to watch (we're a library); Sensing is a periodic digest, not a daemon. The Devin Auto-Triage "watch the error stream" shape doesn't map.
  • No bespoke governance engine. GitHub branch-protection gives author≠approver + CI gates free; cite runcycles as direction, don't build it.
  • Verification is evals, not a web-app harness — §7.

Bias everywhere: adopt the free 90%, cite the expensive 10% as direction.


13. Open tensions (bring as questions, not answers)

  1. Given METR (§2), where exactly is our periphery/core line — what task-types do we actually trust to an agent on this codebase?
  2. Parallel agent fleets vs serial (Cursor "hundreds" vs Factory "agents conflict") — at 6 people, probably serial.
  3. Editor vs cloud vs terminal as the locus of work.
  4. How autonomous at merge — is the smallest auto-mergeable unit nothing? (Poles: vLLM — "a human must defend the change end-to-end"; Zig — ban AI contributions outright.)
  5. Spec rigour — spec-first everywhere, or spec-anchored for core?
  6. Skills canonical vs personal; AGENTS.md-first vs vendor-skill-first.
  7. Do we publish how a 6-person generative-computing team does this?

Sources (with verification status)

Verified against primary source:

  • Karpathy — Software Is Changing (Again) (YC, 2025-06-17); From Vibe Coding to Agentic Engineering (Sequoia AI Ascent 2026, 2026-04-29); Dwarkesh interview (2025-10); nanochat (HN, 2025-10-13).
  • METR — Measuring the Impact of Early-2025 AI on Experienced OSS Developer Productivity (RCT, 2025-07-10; 16 devs, 246 tasks, −19%).
  • SPACE — Fast and Spurious (arXiv 2510.24265, 2025-10-28; 415 practitioners; NSF mirror par.nsf.gov/biblio/10677745).
  • Anthropic — AI assistance and coding skill formation RCT (arXiv 2601.20245 / anthropic.com/research/AI-assistance-coding-skills, 2026-01-29; 52 engineers; 50% vs 67%).
  • OpenAI — Harness engineering: leveraging Codex in an agent-first world (openai.com/index/harness-engineering/, 2026-02-11).
  • vLLM — AGENTS.md (github.com/vllm-project/vllm; PR #36877; pure code-agent PRs banned).
  • Instructor — CLAUDE.md (github.com/instructor-ai/instructor; stacked PRs, small-PR + changelog-per-PR rules).
  • Zig — Code of Conduct No-LLM policy (ziglang.org/code-of-conduct) + Kelley JetBrains podcast (youtube.com/watch?v=iqddnwKF8HQ, 2026-05-27).
  • Thoughtworks — Technology Radar Vol 34 (2026-04-15): harness engineering, feedback flywheel, curated shared instructions (ADOPT), agent instruction bloat (CAUTION), retaining-principles/cognitive-debt, securing permission-hungry agents.
  • DORA — State of DevOps 2025 (~5,000 respondents): AI-as-amplifier; AI Capabilities Model (7 capabilities); negative stability relationship.
  • GitHub — Spec-driven development with AI (blog, 2025-09-02); Spec Kit repo (github/spec-kit, model-agnostic, 30+ agents); Well-Architected Governing agents.
  • arXiv 2602.00180 — Spec-Driven Development: From Code to Contract (2026-01-30).
  • arXiv 2602.14690 — Configuring Agentic AI Coding Tools (2,926-repo study; AGENTS.md as interoperable standard; skills shallowly adopted).
  • agents.md / Agentic AI Foundation (Linux Foundation) — AGENTS.md standard.
  • Anthropic — How we Claude Code (video, 2026-05-23); How Anthropic teams use Claude Code (report).
  • Sourcegraph — The Coding Agent Is Dead + CodeScaleBench (2026-02).
  • MSR'26 — LGTM! Auto-Merged LLM-based Agentic PRs.
  • RSTD — arXiv 2605.15425 (Asthana et al., IBM Research, built on Mellea).
  • First-party: m decompose (docs.mellea.ai); mellea-skills-compiler (generative-computing); eval stack (m eval, TestBasedEval, BenchDrift, pytest tiers); our repo metrics + skill/MCP inventory.

Leads (single source / relay / vendor-reported — verify before quoting verbatim): "lost in the woods"; Spec Kit / Kiro payoff numbers; Vercel AGENTS.md-beats-skills eval; runcycles tier model; "find the riskiest line" reviewer prompt; Martin Fowler / Böckeler SDD taxonomy (secondary but reputable); Hashimoto / Willison / Ball framings.

Flagged likely-fabricated (do NOT cite): Claude Code "Dynamic Workflows / Ultracode parameter"; Microsoft "Agent Governance Toolkit (AGT)".

Martin Fowler / Böckeler — Understanding Spec-Driven Development: Kiro, spec-kit, and Tessl (martinfowler.com, 2025-10-15) — the clearest neutral taxonomy (spec-first / anchored / as-source).
