Summary — One laptop carries ~50 agent skills. The ~14 that market mellea
are already shared in a git repo; the ~20 that build it are siloed on
individual machines, copied nowhere. This page inventories both tiers and
proposes a one-off batch to consolidate the engineering skills into the shared
library (the how is on page 2).
The split is the whole argument:
Skills that market the project are already shared.
Skills that build it are siloed on individual machines.
(Skills below are grouped as they'd sit after refactoring, ~50 on one laptop.)
flowchart LR
subgraph SHARED["Already shared (git repo) ✓"]
C["Content / dev-rel<br/>14 skills"]
end
subgraph SILOED["Siloed on laptops ✗"]
E["Engineering<br/>~20 skills"]
end
C -.->|"the asymmetry"| E
style SHARED fill:#dff0d8,stroke:#3c763d,color:#1b3a1b
style SILOED fill:#f2dede,stroke:#a94442,color:#3a1b1b
The team repo today — and why it isn't enough yet
The shared content repo hosts 14 skills (blog / tweet / LinkedIn / YouTube
drafting, release notes, research, link previews, snippet validation). Consumed
by symlink → zero drift, your copy is the base. This proves the model.
Two cracks show it isn't yet a team library:
Single-author — one person feeds it, others only read. A library with one
contributor is a personal repo with an audience.
Stale — last commit ~7 weeks ago. No inflow, no refresh — the exact drift
the learning loop (page 2) is built to kill.
Symlink-sharing works for a solo author. It does not scale to many people
editing the same skill — that's what the tiered override / augment model
(page 2) is for.
The engineering tier — siloed, never shared
Local directories on one laptop, shared with no one. Grouped by phase so adoption
order is obvious. Right column = rough effort to generalise before a skill is
team-ready (estimates — validate before quoting):
Full per-skill list with descriptions: page 1b — personal catalog.
Consolidating into a shared library — the mechanism
One laptop holds ~20 engineering skills. Multiply by the team and that tier is
the real prize — but the contributions must be consolidated, validated, and
hosted first. The bootstrap, reusing the tiered model from page 2:
List first — everyone drops their skills into a shared table (name,
purpose, owner, share? y/n). Surfaces overlaps cheaply — five people will each
have a fix-bug — before anyone writes a PR.
Consolidate — per overlapping skill, pick one seed author; the rest
contribute by review / augment, not competing PRs.
Stand up the team-base repo (or widen the content repo with an eng set)
as the Team base tier.
One PR per skill — description: frontmatter mandatory so selection
works. A rotating gardener validates (generic, no per-laptop paths, cruft
stripped), bumps the version date, merges.
Add the repo to your skill path and pull. Local tweaks become overrides;
improvements flow back as the next PR.
A one-off batch to seed the base. After it lands, it's the steady-state loop
from page 2 — not a migration project.
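The gardener's checks in the PR step (mandatory description: frontmatter, no per-laptop paths) are mechanical enough to script. A minimal sketch, assuming a SKILL.md opens with a `---` delimited frontmatter block; the function names and the path patterns are hypothetical illustrations, not part of any existing tool:

```python
import re

# Hypothetical markers of per-laptop paths; extend for your own machines.
LOCAL_PATH_PATTERN = re.compile(r"(/Users/|/home/|[A-Za-z]:\\Users\\)")


def frontmatter_lines(text):
    """Return the lines between the opening and closing '---' markers."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return []
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            return lines[1:i]
    return []  # unterminated frontmatter is treated as missing


def validate_skill(text):
    """Return a list of problems; an empty list means the skill passes."""
    problems = []
    has_description = any(
        line.startswith("description:") and line.split(":", 1)[1].strip()
        for line in frontmatter_lines(text)
    )
    if not has_description:
        problems.append("missing description: frontmatter")
    if LOCAL_PATH_PATTERN.search(text):
        problems.append("contains a per-laptop path")
    return problems
```

Run against each SKILL.md in a PR, an empty result means the mechanical checks pass; judging whether the skill is generic enough still needs a human.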
Summary — I have a lot of skills — ~50 on this laptop, ~40 of them
mellea-relevant, all listed below with descriptions and a status column. The
point isn't any single skill; it's the sheer volume, and that almost every
engineering one is Personal — shared with no one.
This is my own inventory, not a proposal. A point-in-time snapshot of the
skills on one machine that touch mellea / core development — reference for
the breadth on page 1. Machine, other-project, and taste skills (firewall
tooling, disk cleanup, SSH helpers, GPU sizing, etc.) are omitted.
Descriptions are condensed from each skill's description: frontmatter (the text
the agent matches against to decide whether a skill fires). The Status column
is the page-1 story made concrete — almost everything engineering is
Personal, i.e. shared with no one.
Status legend
Team — unchanged — symlinked from the shared content repo; my copy is the base.
Project — unchanged — symlinked from mellea's committed .agents/skills/.
Personal — additional — real dir on my laptop only, in no shared repo. The consolidation candidates (page 1).
Overlaid — a local override of a shared skill. None today — symlinked shared skills can't diverge; this state only appears once the team-base repo exists.
Implement
| Skill | What it does | Status |
| --- | --- | --- |
| fix-bug | Structured diagnose-and-fix: rebase discipline, minimal fix, regression test, related-issue check. | Personal — additional |
| writing-tests | Tests that catch real regressions, not coverage padding — what to test, at what level, mock + marker discipline. | Personal — additional |
| stacked-pr | Manage PRs built on unreleased changes: branch construction, rebase lifecycle, focused review of stacked diffs. | Personal — additional |
| phased-pr-strategy | Decide phased small PRs vs one monolith — and defend the choice when a reviewer pushes back. | Personal — additional |
Review
| Skill | What it does | Status |
| --- | --- | --- |
| code-review | Multi-perspective review via 3 independent subagents synthesised into a consensus report. | Personal — additional |
| write-pr-body | Narrative-first PR body a cold reviewer grasps in 30s; epic-anchored "Where this fits" + testing block. | Personal — additional |
| respond-to-pr-comment | Triage and reply to review comments (yours or others'): claim verification, thread-resolution discipline. | Personal — additional |
Design & planning
Skill
What it does
Status
design-issue
Draft / review / file a design or epic GitHub issue with correct project + iteration metadata.
Personal — additional
design-proposal
Long-form design doc / RFC for a cross-cutting change needing agreement before decomposition.
Personal — additional
prior-art-research
Competitive-landscape + best-practice scan before tech choices; includes "reasons not to build this".
Personal — additional
project-scoping
Turn a rough idea into testable UC / TR / IC requirements plus an explicit out-of-scope list.
Generate Mintlify Markdown / MDX from Python / Rust docstrings, plans, and app logic.
Personal — additional
Workflow / meta
| Skill | What it does | Status |
| --- | --- | --- |
| catchup | Catch up on repo activity since your last engagement; deep-dive the high-priority diffs. | Personal — additional |
| inbox-triage | Cross-repo overview of unread GitHub notifications; surface action items before diving in. | Personal — additional |
| find-skills | Discover and install agent skills when you ask "is there a skill for X". | Personal — additional |
Content / dev-rel
The shared content repo is where this all started — most of these are symlinked
in, so my copy is the base. Note the two stragglers (blog-idea-scout,
blog-review) that I have locally but never pushed back: the asymmetry exists
even inside content.
Skill
What it does
Status
write-technical-blog
Write a good technical post about a feature or release (Stripe / GitHub / Cloudflare patterns).
Summary — Two ideas make this work. One — tiers. Skills live in four
layers — org → team → project → personal — and the nearest copy wins, so a
shared base can be overridden locally without forking. Two — an update loop.
Fixes found in daily use flow back to everyone, so the library improves instead
of rotting. Today only our content skills are shared; engineering skills sit on
individual laptops. The rest of this page is how the tiers and the loop work.
How the layering works
A skill is just a directory with a SKILL.md — YAML frontmatter plus a
markdown body that is the prompt. Each tier is a real directory on the agent's
skill search path:
Override = drop a same-named skill in a nearer tier.
Skills resolve most-specific-wins — the nearest copy shadows those below.
The only exception is the locked tier nobody may weaken.
flowchart TB
U["Personal — your laptop<br/>niche skills + personal overrides"]:::win
P["Project — committed to the repo<br/>project skills + overrides of the base"]
T["Team base — shared git repo<br/>engineering + content skills"]
O["Org ceiling — managed policy<br/>the few skills nobody may weaken"]:::lock
O --- T --- P --- U
classDef win fill:#dff0d8,stroke:#3c763d,color:#1b3a1b
classDef lock fill:#f2dede,stroke:#a94442,color:#3a1b1b
Personal shadows Project shadows Team base. The Org ceiling sits above all and
cannot be overridden.
Precedence is two things, not one:
Which copy you get — hard, deterministic. Every tier is scanned into one
registry; on a same-name collision the nearer tier wins, exactly like $PATH.
We pick the order. This is what the diagram shows.
Whether a skill fires at all — soft, model-driven. The agent matches the
task to each skill's description: frontmatter. Not in the diagram — but it's
why good descriptions matter.
So shadowing decides which fix-bug you run; descriptions decide whether fix-bug runs at all.
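The deterministic half (which copy you get) can be sketched in a few lines. This is an illustration under assumptions, not how any particular agent implements it: the tier names come from the diagram, the `locked` set stands in for however the org ceiling is declared, and `secret-scan` in the usage note is a made-up skill name.

```python
# Tiers ordered nearest-first, as in the diagram.
TIER_ORDER = ["personal", "project", "team", "org"]


def resolve_skills(tiers, locked=frozenset()):
    """Map each skill name to the tier whose copy wins.

    tiers:  dict of tier name -> set of skill names found in that tier
    locked: skill names the org ceiling pins (assumed to be an explicit list)
    """
    registry = {}
    # Locked skills always resolve to the org copy, whatever sits nearer.
    for name in tiers.get("org", set()) & set(locked):
        registry[name] = "org"
    # Everything else: the first (nearest) tier holding the name wins.
    for tier in TIER_ORDER:
        for name in tiers.get(tier, set()):
            registry.setdefault(name, tier)
    return registry
```

With `fix-bug` present in personal, project, and team, the registry maps it to `personal`; a locked `secret-scan` resolves to `org` even if a personal copy exists. Whether the agent then fires the skill is the soft, description-driven half and is not modelled here.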
Two ways to specialise a skill:
Override — same-named skill in a nearer tier; replaces the base entirely.
Free — just directory shadowing.
Augment — a thin delta that includes a shared fragment instead of copying
it. A convention you author, not automatic concatenation — but nothing to sweep
back later.
Where does a skill go? One question: who needs it?
Every project, not tied to mellea → team base (fix-bug, code-review)
Only meaningful inside mellea → project (mellea-iteration-status)
Only you, or your twist on a shared one → personal
The org must not be able to weaken it → org ceiling
fix-bug shows the cascade end to end: generic version in team base;
mellea commits a uv/pytest-aware override in its own repo; you can override
again on your laptop. Nearest wins — and the only thing yet to be created is
the team-base repo. The content repo, mellea's in-repo skills, and your home dir
all exist today.
One format, every agent
Skills are authored once in .agents/skills/ using the open
agentskills.io SKILL.md standard (our skill-author
skill enforces it). Each tool is just a view onto that one store — no per-tool
copies:
| Agent | Finds skills in | Config needed |
| --- | --- | --- |
| Claude Code | ~/.claude/skills/ + project skillLocations | symlink ~/.claude/skills → ~/.agents/skills; "skillLocations": [".agents/skills"] in .claude/settings.json |
| IBM Bob | ~/.bob/skills/ | symlink ~/.bob/skills → ~/.agents/skills (and .bob/skills → .agents/skills per repo) |
| OpenCode | ~/.agents/skills/ | none — auto-discovered |
| Copilot / VS Code | project .agents/skills/ | none |
The canonical store at each tier is .agents/skills/; the per-tool dirs are
symlinks or one config line pointing at it. This is why "sync" isn't "copying" —
the agents share one source, they don't each hold a copy.
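To make the "view, not copy" point concrete, here is a small sketch that builds the symlinks inside a throwaway directory standing in for a home directory. The helper name `link_tool_dir` is hypothetical; the directory names are the ones listed for Claude Code and IBM Bob above. It assumes a platform where unprivileged symlinks are allowed.

```python
import tempfile
from pathlib import Path


def link_tool_dir(home, tool_dir):
    """Point <home>/<tool_dir>/skills at <home>/.agents/skills via a symlink."""
    store = home / ".agents" / "skills"
    store.mkdir(parents=True, exist_ok=True)
    view = home / tool_dir / "skills"
    view.parent.mkdir(parents=True, exist_ok=True)
    # Leave an existing directory or link alone rather than clobbering it.
    if not view.exists() and not view.is_symlink():
        view.symlink_to(store, target_is_directory=True)
    return view


# Demo in a temporary directory so no real home directory is touched.
with tempfile.TemporaryDirectory() as tmp:
    home = Path(tmp)
    skill = home / ".agents" / "skills" / "fix-bug"
    skill.mkdir(parents=True)
    (skill / "SKILL.md").write_text("---\ndescription: demo\n---\n")

    claude_view = link_tool_dir(home, ".claude")
    bob_view = link_tool_dir(home, ".bob")

    # Both tool directories resolve to the very same file on disk.
    same_file = (claude_view / "fix-bug" / "SKILL.md").samefile(
        bob_view / "fix-bug" / "SKILL.md"
    )
```

Because both views resolve to one file, an edit made through either tool's directory is immediately visible through the other, which is why no sync step is needed.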
This is the Kustomize / Claude Code config cascade applied to skills — a
20-year-old config pattern, not a new invention.
How it incorporates learning
Improvements found in use flow back to everyone. Two nested loops at different
speeds:
flowchart LR
A[Use skill on real work] --> B{Friction?}
B -- no --> A
B -- yes --> C["improve-skill drafts a PR<br/>against the base (same session)"]
C --> D[(Open skill PRs)]
D --> E["Gardener merges<br/>one approval, bump version date"]
E --> F[Everyone re-pulls]
F --> A
Inner loop (continuous, individual) — hit friction using a skill, capture
the fix as a PR there and then. The agent that just failed is the best-placed
author.
Outer loop (per iteration, team) — a rotating gardener (a throughput
role, not a taste authority) merges the open skill PRs and bumps the version.
Everyone re-pulls.
Two health signals say the loop is alive:
Capture rate — did any friction become a PR this iteration? Zero means
people are silently working around bad skills.
Staleness — median days a clone lags the base. A checkout left to drift
weeks behind upstream is the exact failure this prevents.
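Both signals reduce to simple arithmetic once the inputs are logged. A rough sketch, assuming friction events and PRs are counted per iteration and each clone's last pull date is known; how those inputs are collected is left open, and treating capture rate as a ratio (rather than a yes/no) is an assumption made here for illustration:

```python
from datetime import date
from statistics import median


def capture_rate(friction_events, prs_opened):
    """Fraction of logged friction events that became a PR this iteration."""
    if friction_events == 0:
        return None  # nothing to capture; distinct from a rate of zero
    return prs_opened / friction_events


def staleness_days(base_date, clone_dates):
    """Median days each clone's last pull lags the base's latest merge."""
    lags = [max((base_date - pulled).days, 0) for pulled in clone_dates]
    return median(lags)
```

For example, with the base last merged on 2026-06-01 and three clones last pulled on 2026-06-01, 2026-05-25, and 2026-04-13, the lags are 0, 7, and 49 days, so the median staleness is 7 days.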
The first step
One skill (fix-bug), two people, one full round-trip in one iteration:
base → both adopt → both use on real work → first friction → PR → gardener
merges → both re-pull. If that turns once, the model is proven and the rest is
scaling.
A teaser. This deserves its own meeting — please book one.
Summary — Shared skills are step one on a path from ad-hoc prompts to a
process that largely runs itself: conventions that execute rather than sit in a
wiki, knowledge that compounds across the team, and an autonomy slider set
per task-type so agents propose and humans dispose. This page sketches the
direction and asks for a follow-up; it does not try to settle it here.
Skills are our conventions made executable. Instead of a wiki page telling
people to use uv, add the right commit trailer, and mark slow tests, the agent
simply does it. The team's process stops depending on memory and starts running
itself.
Shared skills are one early, concrete step on a longer path:
We are between stages 2 and 3 today. The shared-skills proposal moves us to 3.
What "AI-led" starts to mean
Conventions execute, not just document. Standards are enforced by being
run, not by being remembered.
Knowledge compounds. Every fix to a skill improves everyone's next task,
not just the author's.
People direct; agents do the routine. We spend our judgement on what
matters and hand the repetitive work to skills that already encode how we
like it done.
What "AI-led" looks like, concretely — the autonomy matrix
The point isn't "automate everything." It's setting the autonomy slider per
task-type: push specific, bounded work into AI proposes, human disposes,
gate it, and keep human judgement on the core we know cold. A taste of the
fuller picture (detail in the linked docs):
◉ today ★ target this iteration ☆ stretch (later)
| Stage | Human-led | AI proposes, human disposes | Agent-led |
| --- | --- | --- | --- |
| 1 · Sense | ◉ | ★ weekly digest agent | ☆ |
| 2 · Frame | ◉ | ★ interview-the-author | |
| 3 · Triage | ◉ | ★ triage skill (comment-only) | ☆ auto-label trivia |
| 4 · Investigate | ◉ | ★ reproduce + locate skills | |
| 5 · Decompose | ◉ (1 person) | ★ shared skill + critique | |
| 6 · Implement | ◉ (novel core) | ★ bounded features | ☆ trivia, auto-accept |
| 7 · Review & verify | ◉ | ★ creator–verifier + evals | |
| 8 · Merge | ◉ | ★ distinct-approver + CI | ☆ Tier-1 auto-merge |
Shared skills are what make the ★ column real: each one is a bounded task-type
encoded so the agent can propose and a human can dispose.
Why a separate meeting
This page is deliberately thin. The bigger picture — how far AI-led development
goes, what we automate next, where the human stays in the loop, and what it
means for how we plan and review work — is a conversation, not a slide.
The ask: agree the shared-skills pilot now (see pages 1–2), and book a
follow-up to talk through the bigger picture.
Full background: the deeper strawman and research notes sit in the same
gist as this set — see the team-session gist.
Diagram note: these render from Mermaid as-is in GitHub, VS Code, and most
slide tools. For a polished graphic, export from mermaid.live
or hand the concept to a designer.
Skim layer for Wed 2026-06-03. The argue-with layer is concept-to-impl-strawman.md;
the evidence is concept-to-impl-research.md. Diagrams render in any Mermaid-aware
viewer (GitHub, VS Code, Obsidian).
The idea in one breath
Today every stage of our lifecycle runs in one mode — a human drives, AI
assists. The move is to push specific, bounded task-types into AI proposes
/ human disposes, gate them, and spend the freed attention on the work that
needs a brain. Not "automate everything" — set the autonomy slider per task.
The sober caveat (METR RCT): experienced devs on familiar code went 19%
slower with AI. So agents pay off on the periphery, not the core we know
cold. A 415-developer survey (SPACE, Fast and Spurious) corroborates this —
GenAI's speed gains get swallowed by review burden and verification load, so the
gains "may be spurious." The whole strawman is about drawing that line on purpose.
Blue = stages where the strawman pushes a bounded task-type into AI-proposes
this iteration. Grey = stays human-led (the off-distribution core).
The autonomy matrix
◉ today ★ target this iteration ☆ stretch (later)
| Stage | H · human-led | P · AI proposes, human disposes | A · agent-led |
| --- | --- | --- | --- |
| 1 · Sense | ◉ | ★ weekly digest agent | ☆ |
| 2 · Frame | ◉ | ★ interview-the-author | |
| 3 · Triage | ◉ | ★ triage skill (comment-only) | ☆ auto-label trivia |
| 4 · Investigate | ◉ | ★ reproduce + locate skills | |
| 5 · Decompose | ◉ (1 person) | ★ shared skill + critique | |
| 6 · Implement | ◉ (novel core) | ★ bounded features | ☆ trivia, auto-accept |
| 7 · Review & verify | ◉ | ★ creator–verifier + evals | |
| 8 · Merge | ◉ | ★ distinct-approver + CI | ☆ Tier-1 auto-merge |
Autonomy belongs to the task-type, not the stage: a docs typo is async, the
sampling loop is synchronous — same "Implement" row.
The loop inside every stage (vendor-neutral: spec-driven development)
flowchart LR
A[Specify<br/>what & why] -->|gate| B[Plan<br/>how]
B -->|gate| C[Tasks<br/>small, ordered]
C -->|gate| D[Implement<br/>focused diffs]
D -.->|verify fails| C
K[(AGENTS.md<br/>constitution)] --- A
K --- B
K --- C
K --- D
Same shape across GitHub Spec Kit / AWS Kiro / Tessl. We already ship the Tasks
engine (m decompose); "interview-the-author" is /specify + /clarify. Golden
rule: minimum spec rigour that removes ambiguity.
The harness (Thoughtworks' frame: feedforward + feedback)
flowchart LR
FF["FEEDFORWARD — aim it<br/>AGENTS.md · skills · m decompose · specs"] --> AG((agent runs))
AG --> FB["FEEDBACK — catch it<br/>ruff · mypy · m eval · pytest tiers"]
FB -.->|self-correct before human| AG
FB --> H[human review]
Both halves already exist in Mellea. The work is wiring them into a loop.
Merge gate (the one-directional industry consensus)
flowchart TD
PR[Agent opens PR] --> CI{CI gates pass?}
CI -->|no| FIX[back to author]
CI -->|yes| APV{Independent approver?<br/>author ≠ approver}
APV -->|no| WAIT[blocked]
APV -->|yes| DEL{Deletes code?}
DEL -->|yes| EXTRA[extra scrutiny]
DEL -->|no| MERGE[merge]
EXTRA --> MERGE
All native GitHub branch-protection — free on a public repo. GitHub: "the
developer who asks the agent to open a PR cannot be the one to approve it." MSR'26:
77.5% of agentic PRs were self-merged (the dominant risk). A peer LLM library,
vLLM, goes further — its AGENTS.md states "a human submitter must understand and
defend the change end-to-end" (pure code-agent PRs banned).
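The gate in the flowchart is a three-question decision, which can be written down as a plain function for discussion. This is a sketch of the logic only; in practice the checks would be enforced by branch protection, not by code like this, and the function and argument names are made up for illustration.

```python
def merge_decision(ci_passed, author, approver, deletes_code):
    """Mirror the merge-gate flowchart and return the outcome as a string.

    approver is None when nobody has approved yet.
    """
    if not ci_passed:
        return "back to author"
    # The independent-approver rule: the author cannot approve their own PR.
    if approver is None or approver == author:
        return "blocked"
    if deletes_code:
        return "merge after extra scrutiny"
    return "merge"
```

A PR whose author approves it stays `blocked` even with green CI, which is exactly the self-merge case the MSR'26 figure warns about.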
What we actually do next
flowchart LR
subgraph NOW[Do now - no debate]
N1[Promote 4 skills canonical]
N2[Vendor dev-rel-skills pinned]
N3[fewer-permission-prompts]
N4[.bob/mcp.json]
N5[Branch protection]
N6[End-of-session skill habit]
N7[Triage external bugs 24h]
N8[Second model on the proxy]
end
subgraph NEXT[Next iteration]
X1[Triage skill]
X2[Decomposition skill]
X3[Reproduce + locate]
X4[Frame-issue interview]
X5[Pre-merge-readiness]
X6[Commit hygiene + linters]
X7[Skill evals]
end
subgraph DECIDE[Decide in the meeting]
D1[Periphery/core line]
D2[Decomp: shared vs specialist]
D3[Review norm + rota SPOF]
D4[Smallest auto-mergeable unit]
D5[Process metrics]
D6[Publish as case study]
end
NOW --> NEXT --> DECIDE
| # | Action | Stage | Effort | Owner | Decide |
| --- | --- | --- | --- | --- | --- |
| 1 | Promote 4 skills to project-canonical (writing-tests, code-review, respond-to-pr-comment, write-pr-body) | cross-cut | afternoon | skills owner | now |
| 2 | Vendor / submodule dev-rel-skills, commit-pinned | cross-cut | ½ day | skills owner | now |
| 3 | fewer-permission-prompts → tracked settings.json | cross-cut | ½ hr | settings owner | now |
| 4 | Wire .bob/mcp.json (git, github-ibm) | cross-cut | 15 min | Bob users | now |
| 5 | Branch protection: independent approver + same CI + provenance line | 8 | 15 min | repo admin | now |
| 6 | End-of-session "improve the skill / AGENTS.md" habit | cross-cut | free | team norm | now |
| 7 | External bug reports triaged ≤ 24h + count | 3 | decision | team norm | now |
| 8 | Try a second model on the proxy (deliberate Codex/Gemini second-opinion or long-context pass, via the same LiteLLM proxy we already run Claude through) | | | | |
| | Decide in meeting: periphery/core line (METR+SPACE), decomp shared-vs-specialist, review norm + reviewer rota (planetf1 SPOF), smallest auto-mergeable unit or nothing (vLLM vs Zig), process metrics, publish-as-case-study | cross-cut | — | — | meeting |
Argue about these
| # | Tension | Evidence |
| --- | --- | --- |
| 1 | Where's our periphery/core line — what do we actually trust to an agent here? | METR: −19% on familiar core |
| 2 | Decomposition: shared skill or specialist's craft? | 1 person + AI does ~all |
| 3 | Close the external-triage gap how? | #775/#885/#911 sit 30–60d |
| 4 | Change the review norm to kill the long tail? | 26-commit PRs = fatigue |
| 5 | Skills canonical or personal? AGENTS.md-first or vendor-skill-first? | 17 silent symlinks; AGENTS.md is the open standard |
| 6 | Smallest unit an agent may merge — or nothing? | Consensus: don't auto-merge unreviewed |
| 7 | Which process metrics? | Baselines: TTM median 1.3d, p90 8.4d |
| 8 | Publish how a 6-person gen-computing team does this? | |
Mellea: Concept → Implementation — a strawman to argue with
For the team session, Wed 2026-06-03. This is a strawman, not a proposal I'm
attached to. Read it to disagree with it. The goal is to leave with 2–3 things
we actually change this iteration, and a shared map of where we're heading.
The one-sentence problem
We use AI heavily and well, but every stage of our lifecycle sits in the same
mode — a human drives, AI assists, whoever picked up the work decides how.
That's fine, and it's also the ceiling. The models are now good enough that some
of this work doesn't need a human in the driving seat — and some of it needs us
more, not less. We don't currently distinguish.
The thesis
Karpathy's 2026 name for where we're heading is agentic engineering: "vibe
coding raised the floor; agentic engineering raises the ceiling" — using agents
to go genuinely faster without dropping the quality bar ("you are still
responsible for your software just as before"). The mechanism is an autonomy
slider: "you are in charge of the autonomy slider, and depending on the
complexity of the task at hand, you can tune the amount of autonomy that you're
willing to give up." You keep the AI on a
leash as tight as the task warrants (unconstrained, they get "lost in the woods")
— "it's not useful to me to get a diff of 10,000 lines of code… I'm still the
bottleneck." The goal is the Iron Man suit
(augmentation with a fast human verification loop), not the Iron Man robot that
runs off alone — "more building partial autonomy products… so that the
generation-verification loop of the human is very, very fast." This isn't one
lab's idea — it's the convergent industry signal. Thoughtworks' April 2026 Tech
Radar names the same discipline "harness engineering": putting agents on a
leash with feedforward controls that aim them (skills, specs, shared
instructions) and feedback controls that catch them before a human does
(linters, type-checkers, evals). Anthropic's own teams are one concrete instance
— they split work into asynchronous ("auto-accept mode", the agent runs and you
review the result) and synchronous ("detailed prompts… for core business logic",
supervised turn by turn) — but we use them as an illustration, not the authority.
The sober part, up front. The autonomy story is real but uneven, and the best
controlled evidence is a caution, not a cheer. METR's 2025 RCT put 16 experienced
open-source developers on their own mature repos and measured them 19%
slower with AI — while they believed they were ~20% faster. The slowdown
tracked exactly the conditions we live in: high repo familiarity, large complex
codebase, time spent verifying unreliable suggestions. The honest read isn't "AI
makes us faster"; it's "AI pays off on the periphery and on unfamiliar code, and
can cost us time on the core we know cold" — which is the whole reason the slider
matters. This isn't just our anecdote: SPACE's measured "Fast and Spurious"
survey (415 practitioners) found GenAI's speed and output gains are offset by a
heavier code-review burden and the cognitive load of verifying output — "gains may
be spurious." And there's a skill-atrophy signal to watch — Anthropic's RCT found
AI-assisted engineers scored 50% vs 67% on a comprehension quiz (d=0.738). DORA
2025 says it more bluntly: AI is "an amplifier" — it magnifies the discipline (or
the mess) we already have.
So picture our lifecycle as a grid: rows are the stages from "notice a need" to
"release"; the columns are the slider — who drives:
H — Human-led, AI assists. You drive. AI is a power tool. Where we all
are today, for everything.
P — AI proposes, human disposes. The agent produces a candidate — a
triaged issue, a decomposition, a review — and a human approves it through a
gate. The human's job shrinks to judgement.
A — Agent-led (async). The agent runs to completion on a tightly bounded
class of work; the human reviews output, or a rule lets it through.
The move is not "automate everything". It's to push specific, bounded
task-types rightward into P, a few trivia into A, and to spend the freed human
attention on the things that genuinely need a brain: ambiguous framing,
architectural decomposition, core-logic review, and the merge button.
The matrix
◉ = where we are today ★ = strawman target ☆ = stretch (later)
| Stage | H · human-led | P · AI proposes, human disposes | A · agent-led |
| --- | --- | --- | --- |
| 1 · Sense a need | ◉ | ★ weekly digest agent | ☆ |
| 2 · Frame the issue | ◉ | ★ Claude interviews the author | |
| 3 · Triage | ◉ | ★ triage skill (comment-only) | ☆ auto-label trivia |
| 4 · Investigate | ◉ | ★ reproduce + locate skills | |
| 5 · Decompose epics | ◉ (one person) | ★ shared skill + critique pass | |
| 6 · Implement | ◉ (novel core) | ★ bounded features | ☆ trivia, auto-accept |
| 7 · Review & verify | ◉ | ★ creator–verifier + agent-native verify | |
| 8 · Merge & release | ◉ | ★ distinct-approver + same CI gates | ☆ Tier-1 auto-merge |
The same picture as a flow — blue = stages where the strawman pushes a bounded
task-type into AI-proposes this iteration; grey = stays human-led (the
off-distribution core):
Two things to notice. First, nearly everything we do is in the leftmost
column — that's the finding, not a criticism. Second, the autonomy level
belongs to the task-type, not the stage. "Implement" isn't async or sync; a
docs typo is async, a change to the sampling loop is sync. The grid is a
decision aid for each piece of work, not a fixed rota.
The loop inside every stage (the vendor-neutral "how")
The matrix says who drives each stage; it doesn't say how the work runs once
an agent is involved. The whole industry has converged on one answer, and it's
deliberately not a Claude-specific one — spec-driven development:
Specify → Plan → Tasks → Implement (human review gate at every →)
Specify the what & why, plan the how, break it into small dependency-ordered
tasks, implement them as focused diffs you can actually review — with a human gate
at each boundary and a "constitution" of durable project rules (ours is
AGENTS.md). GitHub Spec Kit (open-source, model-agnostic, 30+ agents),
AWS Kiro and Tessl are all the same shape; Microsoft, Anthropic and Google
converged on Spec Kit as the interoperable layer. The golden rule is
"use the minimum specification rigour that removes ambiguity" — spec-first for
most work, spec-anchored for long-lived core.
We already ship the engine for the Tasks phase — m decompose — and our
"interview-the-author" framing (stage 2) is just /specify + /clarify. So
adopting the loop is mostly wiring our own parts together, not importing a vendor
workflow.
What being 6 people on an OSS LLM library rules out
A lot of the 2026 literature is written for large product teams with hosted
services, paid agent fleets, and dedicated platform engineers. We are not that,
and the strawman is deliberately scoped down to fit what we are:
No parallel agent fleets. Cursor runs "hundreds"; Factory warns they
conflict and runs serially. At 6 people on a shared main, serial-with-good-
worktrees is the right default — fleet orchestration is a non-goal.
No persistent watcher / runtime-authority agents. We have no prod
telemetry to watch (we're a library), and a 6-person repo doesn't need a
standing autonomous service with its own session-authority budget. Sensing is
a periodic digest, not a daemon.
No bespoke governance engine. GitHub branch-protection gives us
author≠approver and CI gates for free on a public repo; we don't build a
runcycles-style authority broker — we cite it as the direction, adopt the free
90%.
Verification is evals, not a web-app harness. The Anthropic three-beat is
framed for UIs (DOM contracts, Playwright). Our equivalent is our own eval
stack (m eval, TestBasedEval, BenchDrift, pytest tiers) — and it happens to
be exactly the thing Mellea exists to provide.
The bias everywhere: adopt the free 90%, cite the expensive 10% as direction.
Walking the stages
Each one: where we are → what I'd propose → and the evidence or precedent.
1 · Sense. Today — we over-serve this with personal scouting skills
(hn-scout, blog-idea-scout, research-project) but nothing systematically feeds
the backlog. Proposed (P) — one weekly digest agent watching the Granite
ecosystem, our dependents, and the paper/HN feeds, producing "things Mellea
might want to respond to" that seed iteration planning. Caveat for us — we're
a library, not a hosted product, so we have no prod telemetry to watch; the
Devin Auto-Triage "watch the error stream" shape doesn't map. Our signal is
external (ecosystem releases, dependents, issues, papers), which is exactly what
our scouting skills already read — so this is the cheapest stage to make
periodic. Precedent — Devin Auto-Triage proposes-never-merges (May 2026);
Cherny: "Claude is starting to come up with ideas."
2 · Frame. Today — issues are written free-hand; quality varies wildly.
Proposed (P) — a flow where Claude interviews the author into a
well-formed issue (problem-not-solution, reproducer, scope) before it's filed.
Precedent — this is Anthropic's first beat: "the requirements are latent
within you; Claude is better at extracting them than you are at stating them."
Cheap, no infrastructure, high leverage.
3 · Triage. Today — internal issues get a same-day response; external
bug reports sit 30–60 days (#775, #885, #911 among them). Proposed (P) — a
triage skill that classifies (bug/feat/docs/area), proposes labels + a size +
linked duplicates, and posts a comment for a human to approve — it does not
silently mutate the tracker. Plus a norm: every external report triaged within
24h, and we track that number. Precedent — Devin Auto-Triage's read-only-first
shape.
4 · Investigate. Today — our barest stage; only a partial fix-bug skill.
Proposed (P) — reproduce-bug and locate (call-stack / file-finder) skills,
paired with fix-bug. Precedent — Anthropic's API and Inference teams use
Claude as the "first stop" to find which files a task touches: seconds, instead
of a colleague round-trip.
5 · Decompose. Today — it works, but it's one person. planetf1 + AI
decomposed #929 and #891 into clean phase/wave sub-issues, each in a single
sitting. That's a single point of failure dressed up as a strength. Proposed
(P) — generalise it into a team-shared decomposition skill with a built-in
critique pass, so any reviewer can drive it. We already ship the engine for
this: m decompose parses a prompt into dependency-ordered subtasks, extracts
the constraints, tags each one "code" vs "llm"-judge validation, and emits a
runnable m.instruct() script + JSON — so the shared skill wraps our own CLI and
adds the critique pass on top, rather than reinventing decomposition. Precedent
— Factory Droid's coordinator/critique separation. Also worth a look — our
design-via-draft-PR
ceremony is heavy (#1080 was 1,813 lines across 29 commits for a single-shot
artefact). Anthropic's second beat suggests a denser, click-through artefact
(even HTML) as a lighter feedback surface than a giant markdown PR.
6 · Implement. Today — strong, Mellea-specific, and already ~37% of our
commits carry an AI trailer. Proposed — make the async-vs-sync call
explicit per task: bounded/peripheral work (edge cases, tests, docs) goes to P
or A (auto-accept loops, checkpoint from a clean git state — Anthropic's DS team:
"treat it like a slot machine… let it run for 30 minutes, then either accept the
result or start fresh"); core logic stays in H with real-time supervision.
The boundary that matters — Karpathy's rule (Sequoia 2026): "you're either in
the data distribution, on the rails of the RL circuits, and flying, or you're
off-roading in the jungle with a machete." Agents are on-rails for code that
recurs online and that labs train against (verifiable + commercially valuable);
they go off-road on novel, precisely-arranged code — which is why he hand-wrote
nanochat ("too far off the data distribution"). Much of Mellea's core is
exactly that off-distribution code — the sampling loop, context management,
the generative-function machinery, novel intrinsic/adapter wiring — so it stays
H. Our P/A payoff is the periphery that does occur all over the internet: a new
backend that mirrors an existing one (HF/OpenAI/Ollama/Watsonx/LiteLLM all share
a shape), test scaffolding, docs, example files, telemetry-field plumbing.
Precedent — Vim mode was "roughly 70%… Claude's autonomous work"; RL team: "try
one-shot first… works about a third of the time."
7 · Review & verify. Today — strong review skills, but our long tail is
review fatigue, not complexity: 26-commit PRs where the author makes fix-up
commits in response to feedback (squash-merge hides the cycle count). This review
burden is an industry finding, not just ours — SPACE measures it across 415 devs;
and a peer LLM library, vLLM, goes further and bans pure-agent PRs ("a human
must understand and defend the change end-to-end"). Proposed
(P) — (a) a creator–verifier split: a fresh-context review subagent that
sees only the diff + criteria, not the reasoning that produced it (Factory's
two-pass pipeline; Anthropic's adversarial /code-review), with apply-mode — in
practice as concrete as the standing prompt the X practitioner community
converged on: "find the riskiest line; name the missing test";
(b) agent-native verification grounded in our own tooling — we're an LLM
library, not a web app, so the "verify" surface isn't a DOM dashboard, it's
evals. The same requirement runs off one definition three ways: m eval run
/ TestBasedEval (LLM-as-judge) for behaviour, BenchDrift for prompt/variation
robustness, and the pytest tier suite (unit/integration/e2e/qualitative)
headless in CI; (c) a commit-hygiene norm to kill the 20-commit tail; (d) extra
scrutiny on any PR that deletes code (MSR'26). Precedent — Anthropic's third
beat ("verify, not test"), translated to our domain. This is the most on-brand
slice we have: Mellea is a requirements + verification framework — making an
agent verify its own change with m eval is dogfooding our own thesis, not
importing someone else's web-app pattern. And there's a published result behind
it: RSTD (arXiv 2605.15425, May 2026 — IBM Research, built on Mellea) invokes the LLM only as
narrowly-scoped judgment operators with schema-validated outputs, and on a
validation failure issues a targeted repair prompt rather than re-running the
whole task — exactly Mellea's Instruct-Validate-Repair loop — cutting retry cost
73.2% vs static decomposition and 51.7% vs monolithic at ~18% framework
overhead. Our verify-and-repair pattern isn't just on-brand, it's measured.
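The repair-not-rerun shape described above can be sketched generically. This is an illustration of the pattern only, under stated assumptions: `generate`, the validator signature and the toy model are hypothetical stand-ins, and none of it is Mellea's API or the RSTD implementation.

```python
def instruct_validate_repair(generate, prompt, validators, max_repairs=2):
    """Generate once, then repair only what the validators reject.

    `generate` is any callable prompt -> text (a stand-in for a model call).
    Each validator returns None on success or a short failure message.
    """
    output = generate(prompt)
    for attempt in range(max_repairs + 1):
        failures = []
        for check in validators:
            message = check(output)
            if message is not None:
                failures.append(message)
        if not failures:
            return output
        if attempt == max_repairs:
            break
        # Targeted repair: keep the previous answer and name the failed checks,
        # instead of re-running the original task from scratch.
        repair_prompt = (
            f"{prompt}\n\nPrevious answer:\n{output}\n\n"
            "Fix only these problems:\n"
            + "\n".join(f"- {m}" for m in failures)
        )
        output = generate(repair_prompt)
    raise ValueError(f"still failing after {max_repairs} repairs: {failures}")


# Toy stand-ins so the sketch runs without a model.
calls = []


def fake_model(prompt):
    calls.append(prompt)
    # First answer misses the trailing period; the repair call supplies it.
    return "Retries are targeted" if len(calls) == 1 else "Retries are targeted."


def ends_with_period(text):
    return None if text.endswith(".") else "answer must end with a period"


result = instruct_validate_repair(fake_model, "Summarise the loop.", [ends_with_period])
print(result, "| model calls:", len(calls))
```

The second call carries the failed check by name, which is the part that keeps retry cost down: the model is asked to fix one thing, not to redo the task.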
8 · Merge & release. Today — entirely human. The industry consensus here
is unusually one-directional: don't auto-merge unreviewed, and the author may not
approve their own change. GitHub enforces it ("the developer who asks the agent to
open a pull request cannot be the one to approve it"); MSR'26 found 77.5% of
agentic PRs were merged by the submitter and flags it as the dominant risk;
Devin and Factory keep human merge authority as a design invariant; the peer
library vLLM makes it explicit — a human must understand and defend the change
end-to-end. Proposed (P — cheap and consensus-backed) — a branch ruleset
requiring an independent approver (requires_distinct_approver), the same CI
gates for agent PRs as human ones, an agent-provenance field in the PR template,
and extra scrutiny on code-deleting PRs. All of this is native GitHub
branch-protection — free on a public repo, no infrastructure to build or run.
Stretch (A — argue about it) — auto-merging a tiny class
(doc typos, dependency bumps with green CI) is further than Devin, Factory and
GitHub will go today; bring it as a tension, not a default. Precedent —
runcycles.io tier model; GitHub Well-Architected "governing agents"; MSR'26
LGTM!.
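For a sense of how small the ruleset is, here is a sketch of a payload for GitHub's repository rulesets REST API, built as a plain Python dict. Treat the field names as an assumption to check against the current GitHub docs before applying: they are written from memory, the mapping of "independent approver" onto `require_last_push_approval` plus one required approval is my reading rather than something this page specifies, and the ruleset name and the `ci/tests` check are hypothetical.

```python
import json

# Assumed field names from GitHub's repository rulesets REST API; verify before use.
ruleset = {
    "name": "independent-approver",  # hypothetical ruleset name
    "target": "branch",
    "enforcement": "active",
    "conditions": {"ref_name": {"include": ["~DEFAULT_BRANCH"], "exclude": []}},
    "rules": [
        {
            "type": "pull_request",
            "parameters": {
                "required_approving_review_count": 1,
                # The latest push must be approved by someone other than the pusher.
                "require_last_push_approval": True,
                "dismiss_stale_reviews_on_push": True,
                "require_code_owner_review": False,
                "required_review_thread_resolution": False,
            },
        },
        {
            # Same CI gate for every PR, agent-authored or not.
            "type": "required_status_checks",
            "parameters": {
                "strict_required_status_checks_policy": True,
                "required_status_checks": [{"context": "ci/tests"}],  # hypothetical check name
            },
        },
    ],
}

print(json.dumps(ruleset, indent=2))
```

Nothing here distinguishes agent PRs from human ones, which is the point: the gate is the same for both, and provenance lives in the PR template rather than in the ruleset.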
Application to us — what our GitHub actually shows
The matrix is generic; this part isn't. I pulled the last ~90 merged PRs
(window from 2026-03-01) to ground the three challenges we named — who/when to
review, how to split large work, better tool use — in our own data rather than
industry anecdote.
Who/when to review — the merge button is loose, the reviewer is a SPOF. Good
news first: only 1 of 90 PRs had no external review — we do review each
other's work. But 76 of 90 (84%) were merged by the author themselves, and
planetf1 reviewed essentially all 90 (jakelorocco 55, ajbozarth 53, psschwei
36 behind). So the review habit is healthy; the merge-authority habit is not
(industry consensus: author≠approver), and review load sits on one person. The
cheap fixes map straight onto stage 8: a branch ruleset for independent approver +
a lightweight reviewer rota so planetf1 isn't the bottleneck for every change.
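The review and merge figures above are simple counts over merged-PR records. A minimal sketch of that arithmetic follows; the record layout and the `dev_*` sample rows are hypothetical, and only the 76-of-90 figure is taken from this page.

```python
from collections import Counter

# Hypothetical records; real ones would come from the merged-PR export.
prs = [
    {"author": "dev_a", "merged_by": "dev_a", "reviewers": ["dev_b"]},
    {"author": "dev_b", "merged_by": "dev_b", "reviewers": ["dev_a", "dev_c"]},
    {"author": "dev_c", "merged_by": "dev_a", "reviewers": ["dev_a"]},
    {"author": "dev_a", "merged_by": "dev_a", "reviewers": []},
]


def external_reviewers(pr):
    """Reviewers other than the PR's own author."""
    return set(pr["reviewers"]) - {pr["author"]}


self_merged = sum(pr["author"] == pr["merged_by"] for pr in prs)
unreviewed = sum(not external_reviewers(pr) for pr in prs)
review_load = Counter(r for pr in prs for r in external_reviewers(pr))

print(f"self-merged: {self_merged}/{len(prs)}")
print(f"no external review: {unreviewed}/{len(prs)}")
print("review load:", review_load.most_common())

# The share quoted above is the same calculation on the real window.
reported_share = 76 / 90
print(f"reported self-merge share: {reported_share:.0%}")
```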
Splitting large work — a long tail we already feel. Median PR is +58
lines — small and reviewable. But the tail is heavy: p90 +1,220, max +2,436,
and 14 PRs over +500 lines. Those are the 20-/26-commit fix-up marathons. This
is the concrete case for the decomposition skill (stage 5) and the commit-hygiene
norm (stage 7): the median shows we can ship small; the tail shows we don't
always choose to. There's peer precedent for the discipline: Instructor's
CLAUDE.md tells its agents to "keep PRs small and focused" and to "use stacked
PRs for complex features."
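The size figures are ordinary distribution statistics over added lines per PR. A sketch with Python's standard `statistics` module is below; the sample list is hypothetical, and percentile conventions differ between tools, so a p90 from this method may not match another tool's p90 exactly.

```python
import statistics

# Hypothetical added-line counts per merged PR, not the real window.
additions = [12, 30, 45, 58, 60, 75, 120, 640, 1300, 2436]

median = statistics.median(additions)
# quantiles(n=10) returns the nine decile cut points; the last one is p90.
p90 = statistics.quantiles(additions, n=10)[-1]
over_500 = sum(1 for lines in additions if lines > 500)

print(f"median +{median}, p90 +{p90:.0f}, max +{max(additions)}, {over_500} PRs over +500")
```

Reporting median and tail together is deliberate: the median alone would hide exactly the PRs that cause the review fatigue.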
Better tool use — we own a frontier toolbox; are we using its breadth? The
one signal I can measure is commit trailers: of AI-attributed commits, almost
all credit Claude (Bob a distant second; Codex/GPT and Gemini barely appear).
Big caveat: a trailer is an attribution marker, not a usage meter — Codex and
Gemini work that simply isn't trailer-marked is invisible here, so this is not
evidence those models are unused. (If we want the real answer, LiteLLM proxy logs
would tell us; I'm not going to guess at proxy usage.) What the trailer data does
suggest is a habit worth examining: when we mark AI help, it's overwhelmingly one
model. The cheap experiment — no new tool, no new spend — is to deliberately
reach for a second model where its shape fits: Codex/GPT for a second-opinion
review pass, Gemini's long context for whole-subsystem investigation (stage 4), the
batch-cluster models for offline eval/judge runs that don't need to be
interactive.
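The trailer signal can be reproduced with a few lines over commit message bodies. This sketch assumes trailers look like `Co-authored-by:` or `Assisted-by:` lines that name the tool; that format, the keyword list and the sample messages are assumptions, not a description of how the numbers on this page were produced.

```python
import re
from collections import Counter

# Assumed trailer keys; adjust to whatever the team actually writes.
TRAILER = re.compile(
    r"^(?:Co-authored-by|Assisted-by):\s*(.+)$", re.IGNORECASE | re.MULTILINE
)
# Keyword match is crude: a human called "Bob" would be counted as the tool.
TOOLS = ("claude", "bob", "codex", "gemini")


def count_ai_trailers(messages):
    """Count commits crediting each tool, at most once per tool per commit."""
    counts = Counter()
    for message in messages:
        credited = set()
        for value in TRAILER.findall(message):
            lowered = value.lower()
            credited.update(tool for tool in TOOLS if tool in lowered)
        counts.update(credited)
    return counts


# Hypothetical commit messages.
messages = [
    "Fix sampler\n\nCo-authored-by: Claude <noreply@example.com>",
    "Add docs\n\nAssisted-by: Gemini",
    "Refactor backend\n\nCo-authored-by: Claude <noreply@example.com>\n"
    "Co-authored-by: Claude <noreply@example.com>",
    "Hand-written change with no trailer",
    "Pairing\n\nCo-authored-by: Dev A <dev_a@example.com>",
]
counts = count_ai_trailers(messages)
print(counts.most_common())
```

As the caveat above says, this measures marking, not usage: a commit with no trailer contributes nothing here regardless of which model helped write it.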
These three are the spine of the "tangible steps" half of the session — each one
is a quick win with a number behind it.
The cross-cutting one: skills as a product
The thing the inventory surfaced that I didn't expect: our skills are
drifting silently. 17 of mine are symlinks into a colleague's clone — a
git pull in someone else's repo changes how my agent behaves, with no notice.
Team skills vary person to person. Bob's MCP config is empty, so Bob users are
flying with less.
This is exactly where the vendor-neutral evidence is strongest, so we don't have
to lean on Anthropic for it. Thoughtworks' April 2026 radar puts "curated
shared instructions for software teams" at ADOPT and calls the thing we're doing
today — every developer hand-rolling their own prompts and skills — "an emerging
anti-pattern"; the fix it recommends is anchoring AGENTS.md into a shared
template. And the portable unit is a real open standard: AGENTS.md (introduced
by OpenAI Aug 2025, donated to the Linux Foundation's Agentic AI Foundation Dec
2025 alongside MCP; 60K+ repos, 10+ native agents by Mar 2026). A 2,926-repo study
(arXiv 2602.14690) found context files dominate and skills are mostly shallow,
static instructions — and Vercel's own eval found repo-level AGENTS.md context
beat tool-specific skills.
Proposal: treat shared AGENTS.md + skills + MCP config as a versioned product
with evals, ground it in the portable AGENTS.md standard first (works across
Claude, Bob, Codex, Gemini, Antigravity) with vendor skills as an optimisation
layer on top, and adopt an end-of-session "refine the skill / AGENTS.md" habit
(Thoughtworks calls this the feedback flywheel — a retrospective for the
harness). Most of the quick wins below live here. (Karpathy's "install .md
skills, not .sh scripts" is the same signal — the skill is the interface now,
so it deserves the rigour we'd give any shipped artefact.)
We even have a first-party tool for the governed version of this: mellea-skills-compiler (generative-computing, Apr 2026) compiles a .md skill spec into
a typed, instrumented Mellea pipeline (mellea-skills compile / /mellea-fy),
then mellea-skills certify runs it through Granite Guardian + NIST AI RMF checks
and emits a PolicyManifest + JSONL audit trail. That's "skills as a governed,
evaluated, versioned product" already realised in our own ecosystem — the
question for the meeting is how much of that rigour we adopt for the team's shared
skills now, versus later.
The feedback loop is the habit that makes all of this compound. The single
highest-leverage thing already in practice on the team is asking the agent, at the
end of a session, "what did we learn that should be written down — a skill, an
AGENTS.md rule, a gotcha?" and then actually committing the answer. That is the
feedback flywheel (Thoughtworks, ADOPT) made concrete: every session that hits
friction leaves the harness a little sharper, so the next agent doesn't repeat the
mistake. It costs nothing and needs no infrastructure — it's a norm, not a tool.
Three patterns worth standardising:
Capture-on-exit. End-of-session retro prompt → diff to a skill / AGENTS.md
/ CLAUDE.md. Make it a step in respond-to-pr-comment and the PR-wrap flow so
it isn't reliant on one person remembering.
Recurring-review-comments → automation. Periodically ask an agent for the
top-N review comments we keep making, then promote the mechanical ones into
linters / a reviewer config so they stop costing human review cycles. (Federico
Paolinelli does exactly this on MetalLB; it's the cleanest small-team instance
of the flywheel feeding the feedback half of the harness — though note his talk
is an idiomatic-Go cognitive-load talk, not an AI-agent one, cited here for the
cognitive-load framing only.)
Agent memory — experimental, low-prescription. Persistent cross-session
memory (project notes, decisions, branch/PR state) clearly helps but is hard
to prescribe — it drifts, goes stale, and what's worth keeping is a judgement
call. The honest team position: encourage individual experimentation, share what
sticks, but only promote a pattern to a team norm once it's earned it. The
durable stuff belongs in AGENTS.md/skills (reviewable, versioned); memory is
the scratch layer for the things not yet stable enough to commit.
The bigger 2026 signal sits underneath this: the codebase, not the model, is
the bottleneck. Sourcegraph's CodeScaleBench shows agents degrade past ~400K
LOC, and that wiring in code-intelligence/MCP retrieval gave a +0.26 reward delta
while running 30% cheaper and 38% faster — "the difference between complete
failure and near-perfect completion wasn't intelligence, it was efficient access
to context." DORA 2025: AI is "an amplifier, magnifying an organization's existing
strengths and weaknesses." So our highest-value paid bet is code-intelligence /
MCP retrieval wired into Investigate; the rest is making Mellea legible to
models — a tight CLAUDE.md (pointers + gotchas, <200 lines), per-package
context, LSP symbol search.
What we could ship this iteration (≈ a day, total)
Not perfect, just progress:
Promote 4 skills to project-canonical — writing-tests, code-review,
respond-to-pr-comment, write-pr-body into mellea/.agents/skills/. Now
everyone gets the project's conventions on git pull. (an afternoon)
Vendor / submodule dev-rel-skills into a Mellea-owned, commit-pinned
location — kills the silent-drift ring entirely. (half a day)
Run fewer-permission-prompts and fold the canonical permissions into a
tracked settings.json. (half an hour)
Wire .bob/mcp.json (git, github-ibm at minimum) so Bob users are on
par. (15 min)
Norm: external bug reports triaged within 24h, and start counting. (no
tooling, just a decision)
Adopt the end-of-session "improve the skill" habit. (free)
Turn on branch protection: independent approver + same CI gates for agent
PRs, and add an agent-provenance line to the PR template. (15 min — and
it's the single highest-consensus practice in the 2026 industry.)
Try a second model on the proxy. One deliberate Codex (GPT) or Gemini pass
as a second-opinion reviewer / long-context investigator this iteration — same
LiteLLM proxy we already run Claude through, just a model we reach for less.
(per-PR, no setup)
What I want us to actually argue about
These are the tensions; I have evidence for each, not answers.
| # | Tension | The evidence |
|---|---|---|
| 1 | Is decomposition a shared skill or a specialist's craft? | One person + AI does ~all of it today |
| 2 | How do we close the external-triage gap? | #775, #885, #911 sit 30–60 days |
| 3 | Do we change our review norm to kill the long tail? | 26-commit PRs are fatigue, not complexity; SPACE measures review burden across 415 devs; Instructor mandates small, focused PRs |
| 4 | Do skills go canonical, or stay personal? | 17 silent symlinks; 2026 signal = skills-with-evals |
| 5 | What's the smallest thing an agent may merge alone — or is the answer nothing? | Consensus is "don't auto-merge; enforce author≠approver" (GitHub; MSR'26: 77.5% self-merged). The poles: vLLM (human must defend end-to-end) vs Zig (bans AI contributions outright — "invariably garbage", "contributor poker") |
| 6 | Which process metrics do we adopt? | We now have baselines: TTM median 1.3d, p90 8.4d |
| 7 | Do we publish how a 6-person team does this? | We're literally a generative-computing project |
| 8 | Parallel agent fleets, or serial? | Cursor runs hundreds; Factory runs serially ("agents conflict") — at 6 people, probably serial |
| 9 | Do we spread review load off one person, and use the breadth of models we already pay for? | planetf1 reviewed ~all 90 PRs; AI trailers credit Claude almost exclusively (a habit signal, not a usage meter) |
Conclusions & actions
Three things I take away from all of the above, then the master action list.
1. We are all in one column — that's the opportunity, but the controlled evidence
says move carefully. Every stage runs human-led today; the win is pushing
bounded, peripheral task-types into "AI proposes / human disposes" behind a gate.
The two best controlled studies — METR's −19% on familiar code, SPACE's "spurious"
gains swallowed by review burden — agree: the payoff is on the periphery and on
unfamiliar code, not the off-distribution core we know cold. Draw that line on
purpose.
2. The cheapest wins are governance and skills-as-a-product, and they're
high-consensus. Author≠approver is free on a public repo and backed by everyone —
GitHub, MSR'26, and peer libraries like vLLM that flatly ban pure-agent PRs. Our
skill drift is real, measurable, and fixable this iteration. None of it needs new
infrastructure — it needs a decision.
3. The deep bets are where our own tools already put us ahead.
Decomposition-as-a-shared-skill (we ship m decompose), evals-as-verification (we
are an eval framework; RSTD measured the repair-not-rerun payoff), and codebase
legibility (the real bottleneck) are where Mellea is better positioned than most.
The strategic move is to dogfood our own stack, not import a vendor's.
The master action list
The shape of it — three lanes, left to right:
flowchart LR
subgraph NOW[Do now - no debate]
N1[Promote 4 skills canonical]
N2[Vendor dev-rel-skills pinned]
N3[fewer-permission-prompts]
N4[.bob/mcp.json]
N5[Branch protection]
N6[End-of-session skill habit]
N7[Triage external bugs 24h]
N8[Second model on the proxy]
end
subgraph NEXT[Next iteration]
X1[Triage skill]
X2[Decomposition skill]
X3[Reproduce + locate]
X4[Frame-issue interview]
X5[Pre-merge-readiness]
X6[Commit hygiene + linters]
X7[Skill evals]
end
subgraph DECIDE[Decide in the meeting]
D1[Periphery/core line]
D2[Decomp: shared vs specialist]
D3[Review norm + rota SPOF]
D4[Smallest auto-mergeable unit]
D5[Process metrics]
D6[Publish as case study]
end
NOW --> NEXT --> DECIDE
Do now — no meeting needed, owners suggested:
| # | Action | Stage | Effort | Owner | Decide |
|---|---|---|---|---|---|
| 1 | Promote 4 skills (writing-tests, code-review, respond-to-pr-comment, write-pr-body) to project-canonical | cross-cut | afternoon | skills owner | now |
| 2 | Vendor/submodule dev-rel-skills, commit-pinned | cross-cut | ½ day | skills owner | now |
| 3 | Run fewer-permission-prompts → tracked settings.json | cross-cut | ½ hr | settings owner | now |
| 4 | Wire .bob/mcp.json (git, github-ibm) | cross-cut | 15 min | Bob users | now |
| 5 | Branch protection: independent approver + same CI for agent PRs + provenance line | 8 | 15 min | repo admin | now |
| 6 | End-of-session "improve the skill / AGENTS.md" habit | | | | |
| | Try a second model on the proxy — one deliberate Codex (GPT) or Gemini pass as second-opinion reviewer / long-context investigator, via the same LiteLLM proxy we already run Claude Code through | 4/7 | per-PR | anyone | now |
Next iteration — decide to do them now, assign owners in the meeting:
| # | Action | Stage | Effort | Owner | Decide |
|---|---|---|---|---|---|
| 9 | Triage skill (comment-only) | 3 | — | assign in meeting | now |
| 10 | Shared decomposition skill wrapping m decompose + critique pass | | | | |
| | Where to draw the periphery/core line (METR + SPACE) | 5/6 | — | team | meeting |
| 18 | Decomposition: shared skill vs specialist's craft | 5 | — | team | meeting |
| 19 | Review-norm change + reviewer rota (planetf1 is the review SPOF; SPACE) | 7 | — | team | meeting |
| 20 | Smallest auto-mergeable unit — or nothing (vLLM vs Zig poles) | 8 | — | team | meeting |
| 21 | Which process metrics we adopt | cross-cut | — | team | meeting |
| 22 | Publish-as-case-study | cross-cut | — | team | meeting |
The one-pager (concept-to-impl-onepager.md) holds the visual version of this
action set — now → next → decide — at a glance.
Appendix — the numbers behind the claims
Mellea, last 60 days (2026-04-01 → 06-01): 207 issues opened / 230 closed
(net backlog +94). 161 PRs merged (Apr 97, May 64). Time-to-merge: median 1.3d,
mean 3.5d, p90 8.4d, max 34.3d. Four people merged 67% of PRs. ~37% of commits
carry an AI trailer (a floor — not everyone marks consistently).
Merged-PR analysis (≈90 PRs, window from 2026-03-01): 84% self-merged (76/90)
yet only 1/90 had no external review — review habit healthy, merge-authority
habit loose. Reviewer load concentrated: planetf1 reviewed ~all 90 (jakelorocco
55, ajbozarth 53, psschwei 36). PR size: median +58 lines, p90 +1,220, max
+2,436, 14 PRs over +500. AI-commit trailers: Claude ≈222, Bob 27, Codex/GPT
1–2, Gemini 0 — but trailers measure attribution marking, not model usage, so
this is a habit signal, not proof the other models are idle (proxy logs would
settle it).
Vendor-neutral sources (the industry-led spine): METR — Measuring the Impact
of Early-2025 AI on Experienced OSS Developer Productivity (RCT, 2025-07-10,
−19%); Thoughtworks — Technology Radar Vol 34 (2026-04-15: harness engineering,
curated-shared-instructions ADOPT, feedback flywheel, agent-instruction-bloat
CAUTION); DORA — State of DevOps 2025 (AI-as-amplifier; 7-capability model);
GitHub — Spec-driven development + Spec Kit (model-agnostic, 30+ agents); arXiv
2602.00180 (spec-driven development) + 2602.14690 (AGENTS.md as interoperable
standard); AGENTS.md / Agentic AI Foundation (Linux Foundation); SPACE — Fast and
Spurious (arXiv 2510.24265; 415 practitioners, review burden offsets speed gains);
Anthropic — skill-formation RCT (arXiv 2601.20245; 50% vs 67% comprehension,
d=0.738). Peer LLM libraries: vLLM — AGENTS.md ("Pure code-agent PRs are not
allowed. A human submitter must understand and defend the change end-to-end.",
github.com/vllm-project/vllm); Instructor — CLAUDE.md ("Use stacked PRs for
complex features"; "Keep PRs small and focused", github.com/instructor-ai/instructor);
Zig — Code of Conduct (bans AI contributions outright: "invariably garbage",
"contributor poker", ziglang.org).
Key 2026 sources: Karpathy — Software Is Changing (Again) (YC, 2025-06-17),
From Vibe Coding to Agentic Engineering (Sequoia AI Ascent 2026, 2026-04-29) +
Dwarkesh interview (2025-10); Anthropic — How we Claude Code (Claude channel,
2026-05-23) and How Anthropic teams use Claude Code (report); runcycles.io —
When Coding Agents Press Merge; Sourcegraph — The Coding Agent Is Dead +
CodeScaleBench; GitHub — Well-Architected Governing agents; Cognition — Devin
Auto-Triage; Factory — Droid review/creator-verifier; MSR'26 — LGTM!;
Cherny/Wu on Lenny's Podcast. First-party / Mellea: m decompose (docs.mellea.ai);
mellea-skills-compiler (github.com/generative-computing/mellea-skills-compiler);
RSTD — Runtime-Structured Task Decomposition for Agentic Coding Systems
(arXiv 2605.15425; Asthana et al., IBM Research, built on Mellea).
Verified quotes (checked against primary source, for the team-share):
Autonomy slider (Karpathy, "Software 3.0", 2025-06-17): "you are in charge
of the autonomy slider, and depending on the complexity of the task at hand,
you can tune the amount of autonomy that you're willing to give up"; "we have
to keep the AI on the leash… it's not useful to me to get a diff of 10,000
lines of code… I'm still the bottleneck"; "less Iron Man robots and more Iron
Man suits… partial autonomy products… so that the generation-verification loop
of the human is very, very fast."
The agent boundary (Karpathy, Dwarkesh 2025-10; nanochat, HN 2025-10-13):
agents handle "boilerplate… stuff that occurs very often on the Internet" but
fail on "intellectually intense… precisely arranged" code — he hand-wrote
nanochat because "the repo is too far off the data distribution."
Agentic engineering / on-rails-vs-off-roading (Karpathy, Sequoia AI Ascent
2026, 2026-04-29 — corroborated by Sequoia's channel + his bearblog summary +
multiple transcripts): "vibe coding raised the floor; agentic engineering
raises the ceiling… you are still responsible for your software just as before";
"you're either in the data distribution, on the rails of the RL circuits, and
flying, or you're off-roading in the jungle with a machete" (and why: labs
train on what's verifiable + commercially valuable); "we're not building
animals, we are summoning ghosts… jagged intelligences shaped by data and
reward functions"; "you can outsource your thinking, but you can't outsource
your understanding." Also, agent-native ergonomics: "install .md skills, not
.sh scripts."
Async vs sync (report, Claude Code team): "Fast prototyping with
auto-accept mode … autonomous loops" vs "Synchronous coding for core features …
detailed prompts with specific implementation instructions."
Slot machine (report, Data Science/ML team): "Treat it like a slot
machine. Save your state… let it run for 30 minutes, then either accept the
result or start fresh." Also: "stop Claude and ask 'why are you doing this? Try
something simpler.' The model tends toward more complex solutions by default."
One-shot first (report, RL Eng): "Try one-shot first… If it works (about
one-third of the time), you've saved significant time."
Vim mode (report): "roughly 70% of the final implementation came from
Claude's autonomous work."
Slash commands compound (report, Security Eng): "Security engineering
uses 50% of all custom slash command implementations in the entire monorepo."
Checkpoints / end-of-session / MCP-over-CLI (report): "starting from a
clean git state and committing checkpoints"; "summarize completed work sessions
and suggest improvements … to refine the CLAUDE.md documentation"; "use MCP
servers rather than the BigQuery CLI to maintain better security control."
Author ≠ approver (GitHub): "the developer who asks the agent to open a
pull request cannot be the one to approve it." MSR'26 LGTM!: 77.5% of agentic
PRs merged by the submitter; maintainers tighten the gate on code-deleting PRs.
Context is the bottleneck (Sourcegraph, "The Coding Agent Is Dead",
2026-02): "the agent… is no longer the limiting factor… how you organize your
codebase for agents… those are now the bottlenecks." DORA 2025: AI is "an
amplifier."
Three-beat SDLC (How we Claude Code video — verified against the video):
remove ambiguity ("the requirements are latent within you", Bitter Lesson) →
dense HTML artifacts over long markdown ("if the markdown files get more than
about 200 lines long, it's unlikely you're going to read it… certainly unlikely
your colleagues are" — Tar, "the unreasonable effectiveness of HTML files") →
"verify, not test" (three surfaces off one definition: human dashboard /
agent-first via Playwright MCP / headless CI — bun verify).
Concept → Implementation: the research behind the strawman
Background read for the Wed 2026-06-03 session. This is the evidence layer —
the strawman (concept-to-impl-strawman.md) is the argue-with layer, the
one-pager (concept-to-impl-onepager.md) is the skim layer. Read this if you
want the "why", or feed it to your own agent before the meeting.
Organised by theme, not by vendor — deliberately, because the biggest risk
in this space is mistaking one lab's workflow for the industry's direction. We
use Claude Code heavily and it shows up often below; but every load-bearing claim
is cross-checked against a vendor-neutral source (Thoughtworks, DORA, METR,
GitHub/Linux-Foundation standards, peer-reviewed arXiv) so the strawman rests on
the industry's direction, not one lab's house style. Where a claim is a
verbatim quote checked against the primary source it's marked [verified];
where it's a lead from a relay or a single secondary source it's marked
[lead].
0. The shape of the argument
The models crossed a line in late 2025: for a growing class of work, the
human no longer needs to be in the driving seat. [verified] (Karpathy's
own Nov→Dec 2025 inflection: ~80% hand-written → ~80% delegated.)
But the evidence is sober, not hyped. The best controlled study to date
(METR RCT) found experienced devs on familiar, complex repos went 19%
slower with AI while believing they were faster. The benefit is real but
uneven — it lands on the periphery, not the core (§2). This is the single most
important caveat for a team of senior devs on a large familiar codebase.
The discipline that replaces "vibe coding" is agentic engineering: go
faster without dropping the quality bar. The control surface is a harness
— feedforward controls (skills, specs, curated instructions) that aim the
agent, plus feedback controls (linters, type-checkers, evals, mutation tests)
that catch it before a human does (§3, Thoughtworks Vol 34).
The industry has converged — across vendors — on a concrete operating loop
inside that harness: spec → plan → tasks → implement, with a human gate at
each boundary (§4), plus verify-not-test (§7), author≠approver at
merge (§8), and context/codebase as the real bottleneck (§9). DORA 2025:
AI is an amplifier — it magnifies whatever discipline already exists.
Mellea is unusually well-positioned: we already ship the pieces (m decompose, a requirements/validation/repair core, an eval stack, an
AGENTS.md, mellea-skills-compiler). The opportunity is to adopt the
convergent loop using our own tools, not to import a vendor's workflow.
1. Autonomy as a slider (the column axis)
The cleanest framing is Karpathy's, and it predates and outlives any single
tool.
The slider [verified, Software Is Changing (Again), YC, 2025-06-17]:
"you are in charge of the autonomy slider, and depending on the complexity of
the task at hand, you can tune the amount of autonomy that you're willing to
give up."
The leash [verified, same]: "we have to keep the AI on the leash… it's not
useful to me to get a diff of 10,000 lines of code… I'm still the bottleneck."
Unconstrained, agents get "lost in the woods" [lead, relayed].
Suit, not robot [verified, same]: "less Iron Man robots and more Iron Man
suits… partial autonomy products… so that the generation-verification loop of
the human is very, very fast."
Agentic engineering [verified, Sequoia AI Ascent 2026, 2026-04-29]: "vibe
coding raised the floor; agentic engineering raises the ceiling… you are still
responsible for your software just as before." Also: "we're not building
animals, we are summoning ghosts… jagged intelligences"; "you can outsource
your thinking, but you can't outsource your understanding."
The capability boundary [verified, Sequoia 2026 + Dwarkesh 2025-10 +
nanochat]: "you're either in the data distribution, on the rails of the RL
circuits, and flying, or you're off-roading in the jungle with a machete."
Agents are on-rails for code that recurs online and that labs train against
(verifiable + commercially valuable); they go off-road on novel,
precisely-arranged code — he hand-wrote nanochat because "the repo is too far
off the data distribution."
Why this matters for us: the slider is per task-type, not per stage or per
person. The same "Implement" stage is async for a docs typo and synchronous for
the sampling loop. Our core (sampling, context management, generative-function
machinery, novel intrinsic/adapter wiring) is the off-distribution code that
stays human-led; our periphery (a backend mirroring an existing one, test
scaffolding, docs, telemetry plumbing) is where delegation pays.
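"Per task-type, not per stage" can be made concrete as a lookup keyed on the kind of task. The sketch below uses the H / P / A column labels from the strawman; the wording of each mode, the task-type keys and the default-to-H rule are my assumptions for illustration, not an agreed classification.

```python
from enum import Enum


class Mode(Enum):
    H = "human-led, real-time supervision"
    P = "AI proposes, human disposes"
    A = "auto-accept loop behind a gate"


# Hypothetical task-type labels drawn from the core/periphery examples above.
TASK_MODES = {
    # Off-distribution core: stays human-led.
    "sampling_loop": Mode.H,
    "context_management": Mode.H,
    "generative_function_machinery": Mode.H,
    "intrinsic_adapter_wiring": Mode.H,
    # Periphery that recurs all over the internet: delegation pays.
    "mirrored_backend": Mode.P,
    "test_scaffolding": Mode.P,
    "docs": Mode.P,
    "telemetry_plumbing": Mode.P,
}


def mode_for(task_type):
    """Unclassified work defaults to human-led until someone argues otherwise."""
    return TASK_MODES.get(task_type, Mode.H)


for task in ("docs", "sampling_loop", "new_scheduler"):
    print(f"{task}: {mode_for(task).name}")
```

Defaulting unknown task-types to H is the conservative reading of the METR result: delegation is something a task-type earns, not the starting position.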
2. The sober counterweight: where AI does not help (read this first)
The most industry-credible finding is also the most uncomfortable, and it is
not from a vendor.
METR RCT [verified, METR, 2025-07-10]: a randomised controlled trial — 16
experienced open-source developers, 246 real tasks on their own mature
repos. With AI tooling (Cursor Pro + Claude 3.5/3.7 Sonnet) they were 19%
slower, while they believed they were ~20% faster. Forecasts were wildly
off: the developers predicted 24% faster; ML and economics experts predicted
38–39% faster. The slowdown drivers were high repo familiarity, large/complex
codebases, and AI unreliability — the agent's suggestions cost more time to
read, verify and correct than they saved. The one developer who saw a speedup
had >50 hours of prior Cursor experience. (A Feb 2026 follow-up on late-2025
tools narrows but does not erase the gap.)
Why this is the most relevant single study for us: we are exactly the
population it measured — experienced developers on a large, familiar codebase.
The naive read ("agents make everyone faster") is the read METR falsifies. The
defensible read: agents pay off on the periphery and on unfamiliar code, and
cost time on the core you know cold — which is precisely why the slider (§1)
and task-classification (§5) matter more than raw adoption.
SPACE "Fast and Spurious" survey [verified, arXiv 2510.24265, 2025-10-28]:
415 software practitioners scored on the SPACE framework. Frequent GenAI users
report faster task completion and higher output volume, but the gains are offset
by increased code-review burden, persistent cognitive load from verifying
output, and unchanged collaboration — so the authors conclude "perceived
productivity gains may be spurious." (NSF mirror: par.nsf.gov/biblio/10677745.)
DORA 2025: "AI is an amplifier" [verified, DORA 2025, ~5,000
professionals]: 90% adoption, 80%+ believe productivity is up — yet AI
adoption shows a negative relationship with delivery stability, and ~30%
report little or no trust in AI-generated code. AI magnifies existing strengths
and weaknesses; it does not manufacture discipline.
Anthropic skill-formation RCT [verified, arXiv 2601.20245 /
anthropic.com/research/AI-assistance-coding-skills, 2026-01-29]: 52 mostly-junior
engineers built an async Python library. The AI-assisted group scored 50% vs
control 67% on a comprehension quiz (~2 letter grades lower; Cohen's d=0.738,
p=0.01) with no significant speedup. A delegation interaction pattern
predicted low scores; asking conceptual questions predicted high scores. This is
the skill-atrophy caveat that pairs with METR: speed without understanding is the
failure mode.
"Cognitive debt" [verified, Thoughtworks Tech Radar Vol 34, 2026-04-15]:
one of the radar's four themes is retaining principles while relinquishing
patterns — teams that let agents write everything lose the understanding
Karpathy warns you can't outsource. The radar's response is a deliberate
return to fundamentals: DORA metrics, pair programming, mutation testing,
clean code — the disciplines that keep a human in command of an
AI-amplified codebase.
Takeaway for the room: the goal is not "maximise autonomy". It's "move
the specific task-types where agents are reliable leftward, and defend the core
where they aren't." Everything downstream is in service of that distinction.
3. Harness engineering: the vendor-neutral umbrella
Thoughtworks' Vol 34 (2026-04-15) gives the cleanest non-vendor name for the
whole strawman: "putting coding agents on a leash" via harness engineering.
It splits the control surface into two halves — and everything in this document
slots into one of them.
Feedforward controls — aim the agent before it runs: Agent Skills,
spec-driven development, curated shared instructions. (= §4 SDD, §10 skills.)
Feedback controls — catch the agent after it runs, ideally before a human
looks: linters, type-checkers, mutation testing, custom LSPs that trigger
self-correction. (= §7 evals, our pytest tiers, ruff/mypy.)
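The feedback half can be sketched as a small loop: run each cheap check and, if any fails, hand the failure text back to the agent as a self-correction prompt before a human looks. This is a minimal sketch under stated assumptions, not a real harness: `Check`, `run_feedback_controls` and the two stub checks are hypothetical names, and in practice the callables would wrap ruff, mypy and the pytest tiers.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# A check takes source text and returns (passed, report).
CheckFn = Callable[[str], Tuple[bool, str]]


@dataclass
class Check:
    name: str
    run: CheckFn


def run_feedback_controls(code: str, checks: List[Check]) -> Optional[str]:
    """Return None if every check passes, else a self-correction prompt."""
    failures = []
    for check in checks:
        passed, report = check.run(code)
        if not passed:
            failures.append(f"[{check.name}] {report}")
    if not failures:
        return None
    return "Fix the following before requesting review:\n" + "\n".join(failures)


# Hypothetical stand-ins for a linter and a type-checker.
def no_tabs(code: str) -> Tuple[bool, str]:
    return ("\t" not in code, "tab character found")


def has_return_annotation(code: str) -> Tuple[bool, str]:
    return ("->" in code, "missing return annotation")


CHECKS = [Check("lint", no_tabs), Check("types", has_return_annotation)]
```

A `None` result means the change can go to a human; any other result goes back to the agent first.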
The radar's other relevant calls (a useful neutral cross-check on our specifics):
Context engineering — ADOPT. Curating what the agent sees is now baseline
practice, not an experiment.
Curated shared instructions for software teams — ADOPT. Verbatim: "relying
on individual developers to write prompts from scratch is emerging as an
anti-pattern." The recommended fix is to anchor CLAUDE.md / AGENTS.md into
shared service templates — which is exactly our skills-as-a-product slice
(§10), independently arrived at.
Agent instruction bloat — CAUTION. Hand-written AGENTS.md often beats
LLM-generated; longer is not better. (Matches Anthropic's <200-line rule, §5,
from a different source.)
Feedback flywheel — ASSESS. Treat spec → plan → implement + continuous
harness improvement like a team retrospective: every session sharpens the
harness. (= the end-of-session habit in §5, generalised and de-Anthropic'd.)
Agent Skills — TRIAL; MCP by default; progressive context disclosure;
skills-as-executable-onboarding-docs. All consistent with our §10.
Securing permission-hungry agents — zero-trust, sandboxing, the "lethal
trifecta" (untrusted input + private data + exfiltration path). A standing
caution for any auto-merge or MCP-tool grant (§8).
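The lethal trifecta reduces to a three-way conjunction, which makes it easy to check before granting a tool or MCP permission set. A minimal sketch, assuming a hypothetical `AgentGrant` record that a team would fill in by hand for each agent configuration:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentGrant:
    reads_untrusted_input: bool  # e.g. issue text, web pages, third-party PRs
    reads_private_data: bool     # e.g. secrets, private repos, internal docs
    can_exfiltrate: bool         # e.g. outbound network, posting comments


def violates_lethal_trifecta(grant: AgentGrant) -> bool:
    """True only when all three legs are granted at once."""
    return (
        grant.reads_untrusted_input
        and grant.reads_private_data
        and grant.can_exfiltrate
    )
```

Removing any one leg clears the check, which is why sandboxing (no exfiltration path) is usually the cheapest fix.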
The aggressive pole [verified, OpenAI Harness engineering: leveraging Codex
in an agent-first world, openai.com/index/harness-engineering/, 2026-02-11]:
OpenAI uses the same term — "harness engineering" — but pushes it toward far
more autonomy: a literal "Ralph Wiggum Loop", mechanical "golden principles", and
background tasks humans "aren't required to" review (companion pieces: Symphony
with Linear as control plane, claimed +500% landed PRs; a separate Auto-review
agent at ~99% approval). Same idea, opposite end of the slider — we take
Thoughtworks' leash reading, not OpenAI's.
Why this matters for us: "harness engineering" is the framing to open the
meeting with, because it is vendor-neutral and it makes the whole strawman one
coherent idea — we are building Mellea's harness — rather than a list of Claude
Code tricks. Our feedforward half is AGENTS.md + skills + m decompose; our
feedback half is the eval stack. Both halves already exist; the work is wiring
them into a loop.
4. The convergent operating loop: spec-driven development
SDD is the feedforward core of the harness, and it does the most to reduce the
strawman's Anthropic-centricity because it is vendor-neutral by construction —
every major vendor has shipped a flavour of it.
The four-phase loop (names differ, structure is identical across GitHub Spec
Kit, AWS Kiro, OpenSpec, BMAD) [verified, GitHub blog 2025-09-02; arXiv
2602.00180, 2026-01-30]:
Specify → what & why (user journeys, acceptance criteria; not the tech stack)
Plan → tech stack, architecture, constraints
Tasks → small, independently testable, dependency-ordered chunks
Implement → task by task; review focused diffs, not 1,000-line dumps
…with a human review gate at every phase boundary ("your role isn't just to
steer, it's to verify" — GitHub) and a constitution of durable project rules
(language, testing, dependency policy) that every phase must respect — usually
stored as AGENTS.md or .specify/memory/constitution.md.
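The loop with its review gates can be sketched as a short driver: each phase drafts an artifact from the approved artifacts before it, and nothing proceeds past a boundary the human rejects. This is an illustrative sketch only; `run_sdd`, `produce` and `approve` are hypothetical names and do not correspond to any Spec Kit or Kiro API.

```python
from typing import Callable, Dict

PHASES = ["specify", "plan", "tasks", "implement"]


def run_sdd(
    produce: Callable[[str, Dict[str, str]], str],
    approve: Callable[[str, str], bool],
) -> Dict[str, str]:
    """Run the phases in order; stop at the first artifact the human rejects."""
    artifacts: Dict[str, str] = {}
    for phase in PHASES:
        draft = produce(phase, artifacts)  # agent drafts, seeing approved artifacts
        if not approve(phase, draft):      # human review gate at the phase boundary
            break
        artifacts[phase] = draft
    return artifacts
```

The returned dictionary holds only approved artifacts, so a rejected plan never reaches the tasks or implement phases.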
Three levels of rigour [verified, Martin Fowler / Böckeler 2025-10-15; arXiv
2602.00180]:
spec-first — write a spec for this task, then code (lightweight, today).
spec-anchored — keep the spec for the life of the feature, evolve it.
spec-as-source — the spec is the maintained artifact; code is regenerated,
never hand-edited (Tessl; today only practical where generation is trusted,
e.g. OpenAPI stubs, Simulink → certified C).
Golden rule [verified, arXiv 2602.00180]: "use the minimum level of
specification rigor that removes ambiguity for your context." Spec-first for
most of our work; spec-anchored for long-lived core; spec-as-source is not for
us yet.
Tooling landscape [verified]:
GitHub Spec Kit — open-source CLI (specify), model-agnostic, works with
Claude Code / Copilot / Cursor / Gemini CLI / Codex / opencode / Qwen / Kiro
CLI and ~30 others. Slash commands: /constitution /specify /clarify /plan /tasks /analyze /implement /checklist. This is the portable reference
implementation — Microsoft, Anthropic and Google have all converged on it as
the interoperable layer.
AWS Kiro — agentic IDE; spec + plan + tasks + code in one workspace;
"hooks" run test/lint/security after every agent action. Less portable.
Tessl — spec-as-source; code marked // GENERATED FROM SPEC - DO NOT EDIT; audit trails.
Claimed payoffs [lead, vendor-reported, treat as directional not measured]:
GitHub — "roughly an order-of-magnitude fewer 'regenerate from scratch' cycles
than ad-hoc prompting"; AWS Kiro — "40-hour features in under 8 hours of human
time when authored as specs first."
Why this matters for us: SDD is the "how" the matrix is missing, and we
already ship the engine for the Tasks phase: m decompose parses a prompt into
dependency-ordered subtasks, extracts constraints, tags each "code" vs
"llm"-judge validation, and emits a runnable m.instruct() script + JSON. The
strawman's stage-2 "interview-the-author" is /specify + /clarify. Our
"design-via-draft-PR" ceremony is a heavyweight, ad-hoc version of /plan. We
can adopt the convergent loop with mostly our own parts.
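To make "dependency-ordered subtasks, each tagged code vs llm" concrete, here is a minimal sketch using the standard library's `graphlib`. It is not the implementation or output format of m decompose; the subtask names, the tuple layout and `execution_plan` are all hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical subtasks: name -> (prerequisites, validation kind).
# "code" means a programmatic check; "llm" means an LLM-as-judge check.
SUBTASKS = {
    "extract_constraints": (set(), "code"),
    "draft_outline": ({"extract_constraints"}, "llm"),
    "write_sections": ({"draft_outline"}, "llm"),
    "check_word_limit": ({"write_sections"}, "code"),
}


def execution_plan(subtasks):
    """Return (name, validation kind) pairs with prerequisites first."""
    graph = {name: deps for name, (deps, _kind) in subtasks.items()}
    order = TopologicalSorter(graph).static_order()
    return [(name, subtasks[name][1]) for name in order]
```

`TopologicalSorter` raises `graphlib.CycleError` on circular dependencies, which is a useful early failure for a decomposition that does not hang together.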
5. Task classification & the async/sync split (Anthropic — as one instance)
Anthropic's published patterns are a concrete instance of the slider, useful
because they're specific and measured — but they are one team's house style, not
the standard. Cite them as illustration, not authority; the vendor-neutral
versions live in §1 (slider), §3 (harness), §4 (SDD).
Async vs sync [verified, How Anthropic teams use Claude Code report]:
"Fast prototyping with auto-accept mode… autonomous loops" for peripheral work
vs "Synchronous coding for core features… detailed prompts with specific
implementation instructions." (This is METR's periphery/core split, §2, stated
as a workflow rule.)
Vim mode (the report's async example): "roughly 70%… Claude's autonomous work" [verified, report].
One-shot first [verified, report, RL Eng]: "Try one-shot first… If it
works (about one-third of the time), you've saved significant time."
Slot machine [verified, report, DS/ML]: "Treat it like a slot machine.
Save your state… let it run for 30 minutes, then either accept the result or
start fresh." Plus: "the model tends toward more complex solutions by default"
— stop it and ask for simpler.
Checkpoints / end-of-session / MCP-over-CLI [verified, report]: clean git
state + frequent checkpoints (= DORA "work in small batches", §11);
end-of-session "summarize work and suggest improvements to refine CLAUDE.md"
(= Thoughtworks "feedback flywheel", §3); "use MCP servers rather than the
BigQuery CLI to maintain better security control."
The three-beat SDLC [verified, How we Claude Code video, 2026-05-23] — note
this is the same Specify/Verify shape as SDD, framed for UIs:
Remove ambiguity — "the requirements are latent within you" (Bitter
Lesson); let the agent interview you. (= SDD /specify + /clarify.)
Dense artifacts over long markdown — "if the markdown files get more than
about 200 lines long, it's unlikely you're going to read it… certainly
unlikely your colleagues are"; condense into clickable HTML (Tar, "the
unreasonable effectiveness of HTML files"). (= Thoughtworks "agent instruction
bloat — CAUTION", §3, from an independent source.)
Verify, not test — one definition, three surfaces: human dashboard /
agent-first (Playwright MCP) / headless CI (bun verify). Demo: a React
to-do app emitting data-verify unit/total/done/active attributes; presenter
plants 4+3≠10 and an agent catches it via the data contract.
Translation for us: beat 3 is framed for a web app. Our "verify" surface is
evals, not a DOM — see §7.
6. Decomposition & runtime-structured repair
m decompose [verified, first-party, docs.mellea.ai] — our own
dependency-ordered decomposition CLI with per-constraint code-vs-llm-judge
validation tagging. This is the Tasks phase of SDD, already shipped.
RSTD [verified, arXiv 2605.15425, May 2026 — Asthana et al., IBM
Research, built on Mellea] — Runtime-Structured Task Decomposition: the LLM
is invoked only as narrowly-scoped judgment operators with schema-validated
outputs; on a validation failure it issues a targeted repair prompt rather
than re-running the whole task. That is exactly Mellea's
Instruct-Validate-Repair loop. Result: 73.2% retry-cost reduction vs static
decomposition, 51.7% vs monolithic, ~18% framework overhead, 100% correctness
across configs. Citation note: IBM Research, not the Mellea core team —
same-house (Mellea is IBM/generative-computing) but a different team. Cite as
"IBM Research, building on Mellea", not "our paper".
Decomposition today is a single-operator pattern [verified, our repo]:
planetf1 + AI decomposed #929 and #891 into clean phase/wave sub-issues, each
in one sitting. That is a strength, but also a single point of failure — the proposal is
to generalise it into a shared skill that wraps m decompose and adds a
critique pass (Factory Droid's coordinator/critique separation is the
reference [lead]).
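The validate-then-targeted-repair pattern described for RSTD above can be sketched in a few lines. This is a generic illustration, not the RSTD code or Mellea's API: `validate`, `run_operator` and the `llm` callable are hypothetical, and the schema check is reduced to "a JSON object with the required keys".

```python
import json
from typing import Callable, Dict, Optional, Set, Tuple


def validate(reply: str, required_keys: Set[str]) -> Tuple[Optional[Dict], Optional[str]]:
    """Return (data, None) when valid, else (None, description of the problem)."""
    try:
        data = json.loads(reply)
    except ValueError as exc:  # json.JSONDecodeError subclasses ValueError
        return None, f"invalid JSON: {exc}"
    if not isinstance(data, dict):
        return None, "expected a JSON object"
    missing = required_keys - set(data)
    if missing:
        return None, f"missing keys: {sorted(missing)}"
    return data, None


def run_operator(llm: Callable[[str], str], prompt: str,
                 required_keys: Set[str], max_repairs: int = 2) -> Dict:
    """Call one narrowly-scoped operator; on invalid output, repair only that output."""
    reply = llm(prompt)
    for attempt in range(max_repairs + 1):
        data, problem = validate(reply, required_keys)
        if problem is None:
            return data
        if attempt < max_repairs:
            # Targeted repair: send back the faulty output and what is wrong with it,
            # rather than re-running the whole task.
            reply = llm(
                f"Your previous output was:\n{reply}\n"
                f"Problem: {problem}\nReturn corrected JSON only."
            )
    raise RuntimeError(f"operator failed after {max_repairs} repairs: {problem}")
```

The retry cost stays proportional to the one operator that failed, which is the property the retry-cost figures above measure.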
7. Verification = evals (our most on-brand slice, the feedback half of §3)
The industry signal is "verify, not test" — make verification native to the
artifact and runnable off one definition. In Thoughtworks' terms (§3) this is the
feedback half of the harness. For a library (not a web app) the
verification surface is our eval stack, which is the thing Mellea exists to
provide:
m eval run / TestBasedEval (LLM-as-judge) for behaviour [verified,
first-party].
BenchDrift (IBM/BenchDrift) for prompt/variation robustness [verified,
first-party reference].
the pytest tier suite (unit/integration/e2e/qualitative) headless in CI
[verified, our AGENTS.md] — plus ruff/mypy as the cheap always-on feedback
controls Thoughtworks names explicitly.
RSTD's judgment-operator + targeted-repair pattern (§6) is the measured
version of "validate then repair, don't rerun".
The point for the room: making an agent verify its own change with m eval
is dogfooding our own thesis, and it is exactly the feedback control the
vendor-neutral radar prescribes — not importing someone else's web-app harness.
8. Review & merge authority
This is the most one-directional finding in the whole survey — useful precisely
because it's not one vendor's opinion.
Author ≠ approver [verified, GitHub]: "the developer who asks the agent to
open a pull request cannot be the one to approve it."
Self-merge is the dominant risk [verified, MSR'26 LGTM!]: 77.5% of
agentic PRs were merged by the submitter; maintainers tighten the gate when an
agent deletes code.
Creator–verifier with fresh context [verified/lead]: Factory two-pass;
Cursor judge; GitHub self-review — the reviewer subagent sees only diff +
criteria, not the reasoning that produced the change. A ship-now instantiation
the practitioner community converged on [lead, X]: "find the riskiest line;
name the missing test."
Tiered merge authority [lead, runcycles.io 2026]: merge-to-main is
execution-equivalent, so it needs session-level authority, not per-call permission.
Bring as a direction, not a policy we write.
Peer-library contribution policies — what comparable projects already enforce:
vLLM AGENTS.md [verified, github.com/vllm-project/vllm/blob/main/AGENTS.md]:
the peer LLM-serving library is explicit — "Pure code-agent PRs are not allowed.
A human submitter must understand and defend the change end-to-end." Reviewers
read every changed line; one-off busywork PRs are banned; AGENTS.md is kept
<200 lines and domain guides <300. (Added in PR #36877, inspired by HuggingFace
transformers.)
Instructor CLAUDE.md [verified,
github.com/instructor-ai/instructor/blob/main/CLAUDE.md]: the peer Python LLM
library prescribes "Use stacked PRs for complex features", "Keep PRs small and
focused", a changelog-per-PR rule, and a short PR-description template
(What/Why/Changes/Testing) — peer precedent for the splitting-large-work
challenge.
Zig — the contrarian counterweight [verified, ziglang.org/code-of-conduct +
JetBrains podcast youtube.com/watch?v=iqddnwKF8HQ, 2026-05-27]: a "Strict No LLM
/ No AI Policy". Andrew Kelley calls AI contributions "invariably garbage" of
"negative value" and frames review as "contributor poker" — you bet review time
against contribution quality — driven by 200+ open PRs against limited review
capacity. Context: only 4 of 112 surveyed OSS projects ban AI outright (Zig,
NetBSD, GIMP, QEMU). The strongest contrarian read in the survey, plus a
review-economics framing.
Securing permission-hungry agents [verified, Thoughtworks Vol 34]: any
auto-merge or broad MCP grant has to respect the "lethal trifecta" — don't give
an agent untrusted input + private data + an exfiltration path at once.
It's free on a public repo: GitHub branch-protection gives
author≠approver + CI gates with no infrastructure to build. Adopt the free 90%,
cite the tier model as the expensive 10%.
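The author ≠ approver rule plus a CI gate can be stated as one predicate. On GitHub the platform enforces this through branch protection, so nothing like this needs to be built; the sketch below only pins down the rule, and `PullRequest` and `may_merge` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class PullRequest:
    author: str        # account that opened the PR (may be an agent)
    requested_by: str  # human who asked the agent to open it
    approvers: Set[str] = field(default_factory=set)
    ci_green: bool = False


def may_merge(pr: PullRequest) -> bool:
    """Require green CI and an approval from someone who is neither the
    PR author nor the person who asked the agent for the change."""
    independent = pr.approvers - {pr.author, pr.requested_by}
    return pr.ci_green and bool(independent)
```

Tracking `requested_by` separately from `author` matters for agent-opened PRs, where the account that opened the PR is not the human who asked for the change.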
9. Context/codebase is the bottleneck (not the model)
[verified, Sourcegraph "The Coding Agent Is Dead", 2026-02]: "the agent… is no
longer the limiting factor… how you organize your codebase for agents… those
are now the bottlenecks." CodeScaleBench: agents degrade past ~400K LOC;
wiring code-intelligence/MCP retrieval gave +0.26 reward, 30% cheaper, 38%
faster — "the difference… wasn't intelligence, it was efficient access to
context."
[verified, DORA 2025]: AI is "an amplifier, magnifying an organization's
existing strengths and weaknesses." A quality internal platform and
AI-accessible internal context are two of DORA's seven AI-capabilities (§11).
For us: the highest-value paid bet is code-intelligence / MCP retrieval
wired into the Investigate stage; the rest is making Mellea legible — a tight
CLAUDE.md/AGENTS.md (pointers + gotchas, <200 lines), per-package context, LSP
symbol search. This is the same "context engineering — ADOPT" the radar names.
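The <200-line budget is cheap to enforce as a CI check. A minimal sketch; `over_budget` is a hypothetical helper and the default of 200 simply mirrors the rule quoted above.

```python
from pathlib import Path


def over_budget(path: Path, max_lines: int = 200) -> bool:
    """True if the file has max_lines or more lines (the rule is '<200')."""
    return len(path.read_text(encoding="utf-8").splitlines()) >= max_lines
```

Wired into CI, a `True` result for AGENTS.md or CLAUDE.md would fail the build and force a trim rather than an append.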
10. Configuration & skills as a product — grounded in the open standard
This is where the "too Anthropic" risk is sharpest, and where the vendor-neutral
answer is strongest. Thoughtworks puts curated shared instructions at ADOPT
and calls hand-rolled per-developer prompting "an anti-pattern" (§3) — we reached
the same conclusion from our own drift inventory.
AGENTS.md is the emerging interoperable standard [verified, agents.md;
arXiv 2602.14690, 2,926 repos]: introduced by OpenAI Aug 2025; donated to the
Agentic AI Foundation (Linux Foundation) Dec 2025 alongside MCP and goose;
60K+ repos and 10+ native agents by Mar 2026. The empirical study found
Context Files dominate and "AGENTS.md emerging as an interoperable standard."
Skills are shallowly adopted and mostly static [verified, arXiv
2602.14690]: most repos define only one or two skills, and skills "predominantly
rely on static instructions rather than executable workflows." Vercel's eval
found repo-level AGENTS.md context outperformed tool-specific skills [lead,
Harness/Vercel]. Implication: ground the team's shared conventions in
AGENTS.md (portable across Claude, Bob, Codex, Gemini, Antigravity) first;
treat Claude-specific skills as an optimisation layer on top, not the
foundation.
The drift we found [verified, our inventory]: 17 of one machine's skills
are silent symlinks into a colleague's clone (a git pull elsewhere mutates
behaviour here); team skills vary; Bob's MCP config is empty. The fix is to
treat shared config as a versioned, evaluated product — and the portable
unit is AGENTS.md, not a vendor skill format. (Thoughtworks: anchor it in a
shared service template.)
First-party governed version [verified,
github.com/generative-computing/mellea-skills-compiler, 2026-04-23]:
compiles a .md skill spec into a typed,
instrumented Mellea pipeline (mellea-skills compile / /mellea-fy), then
mellea-skills certify runs Granite Guardian + NIST AI RMF checks and emits a
PolicyManifest + JSONL audit trail. This is "skills as a governed product"
already realised in our ecosystem.
Karpathy, same signal [verified, Sequoia 2026]: "install .md skills, not
.sh scripts" — the skill is the interface now.
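The silent-symlink drift from the inventory above can be detected mechanically. A minimal sketch, assuming skills live as entries in a single directory; `external_symlinks` is a hypothetical helper and the directory layout is an assumption, not a statement about any particular agent's config location.

```python
from pathlib import Path
from typing import List, Tuple


def external_symlinks(skills_dir: Path) -> List[Tuple[Path, Path]]:
    """List (link, target) pairs for entries that resolve outside skills_dir."""
    root = skills_dir.resolve()
    found = []
    for entry in sorted(skills_dir.iterdir()):
        if entry.is_symlink():
            target = entry.resolve()
            # A target outside the skills directory can change under us
            # whenever someone updates the clone it points into.
            if target != root and root not in target.parents:
                found.append((entry, target))
    return found
```

Each reported pair is a skill whose behaviour can change without any commit in the local skills directory.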
11. Measurement (vendor-neutral, and we have baselines)
DORA is the neutral frame, and DORA 2025 gives an actionable model rather
than just metrics. The AI Capabilities Model names seven organisational
capabilities that amplify AI's benefit [verified, DORA 2025]:
1. a clear AI policy/stance, 2. a healthy data ecosystem, 3. AI-accessible
internal data, 4. a quality internal platform, 5. strong version control with
easy rollback, 6. working in small batches, 7. a user-centric focus.
Capabilities 5 and 6 map directly onto cheap habits we can adopt this iteration
(frequent checkpoints, small PRs, easy revert) — and they're the same habits
Anthropic's "clean git state + checkpoints" describes (§5), from a neutral
source.
The amplifier warning [verified, DORA 2025]: AI adoption had a negative
relationship with delivery stability and ~30% of practitioners distrust
AI-generated code — so the metrics will magnify whatever discipline (or lack of
it) we already have. Measure before scaling autonomy.
Our baselines [verified, our repo, last 60 days 2026-04-01→06-01]: 207
issues opened / 230 closed (net +94); 161 PRs merged (Apr 97, May 64);
time-to-merge median 1.3d, mean 3.5d, p90 8.4d, max 34.3d; four people merged
67% of PRs; ~37% of commits carry an AI trailer (a floor — not everyone marks).
Candidate process metrics: TTM, AI-plan revision count, fraction of issues
triaged without a human read, external-triage latency.
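For repeatable baselines, the time-to-merge summary can be computed from (opened, merged) timestamp pairs. A minimal sketch: `ttm_summary` is a hypothetical helper, and the nearest-rank p90 is an assumption, since the method behind the figures above is not stated.

```python
import math
import statistics
from datetime import datetime
from typing import Dict, List, Tuple


def ttm_summary(prs: List[Tuple[datetime, datetime]]) -> Dict[str, float]:
    """Summarise time-to-merge in days from (opened, merged) pairs.

    Expects at least one PR; p90 uses the nearest-rank method.
    """
    days = sorted((merged - opened).total_seconds() / 86400 for opened, merged in prs)
    p90_rank = math.ceil(0.9 * len(days))
    return {
        "median": statistics.median(days),
        "mean": statistics.fmean(days),
        "p90": days[p90_rank - 1],
        "max": days[-1],
    }
```

Reporting median and p90 together matters here because the mean is pulled up by a few long-running PRs, as the gap between the 1.3d median and the 3.5d mean shows.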
12. What being 6 people on an OSS LLM library rules out
No parallel agent fleets. Cursor runs "hundreds"; Factory warns they
conflict and runs serially. At 6 on a shared main, serial-with-good-worktrees
is the default; fleet orchestration is a non-goal.
No persistent watcher / runtime-authority agents. No prod telemetry to
watch (we're a library); Sensing is a periodic digest, not a daemon. The
Devin Auto-Triage "watch the error stream" shape doesn't map.
No bespoke governance engine. GitHub branch-protection gives
author≠approver + CI gates free; cite runcycles as direction, don't build it.
Verification is evals, not a web-app harness — §7.
Bias everywhere: adopt the free 90%, cite the expensive 10% as direction.
13. Open tensions (bring as questions, not answers)
Given METR (§2), where exactly is our periphery/core line — what task-types do
we actually trust to an agent on this codebase?
Parallel agent fleets vs serial (Cursor "hundreds" vs Factory "agents
conflict") — at 6 people, probably serial.
Editor vs cloud vs terminal as the locus of work.
How autonomous at merge — is the smallest auto-mergeable unit nothing?
(Poles: vLLM — "a human must defend the change end-to-end"; Zig — ban AI
contributions outright.)
Spec rigour — spec-first everywhere, or spec-anchored for core?
Skills canonical vs personal; AGENTS.md-first vs vendor-skill-first.
Do we publish how a 6-person generative-computing team does this?
Sources (with verification status)
Verified against primary source:
Karpathy — Software Is Changing (Again) (YC, 2025-06-17); From Vibe Coding
to Agentic Engineering (Sequoia AI Ascent 2026, 2026-04-29); Dwarkesh
interview (2025-10); nanochat (HN, 2025-10-13).
METR — Measuring the Impact of Early-2025 AI on Experienced OSS Developer
Productivity (RCT, 2025-07-10; 16 devs, 246 tasks, −19%).
SPACE — Fast and Spurious (arXiv 2510.24265, 2025-10-28; 415 practitioners;
NSF mirror par.nsf.gov/biblio/10677745).
Anthropic — AI assistance and coding skill formation RCT (arXiv 2601.20245 /
anthropic.com/research/AI-assistance-coding-skills, 2026-01-29; 52 engineers;
50% vs 67%).
OpenAI — Harness engineering: leveraging Codex in an agent-first world
(openai.com/index/harness-engineering/, 2026-02-11).
vLLM — AGENTS.md (github.com/vllm-project/vllm; PR #36877; pure code-agent PRs
banned).