@ahoward
Last active April 16, 2026 05:18

bunny2 — design doc

a dark factory that never forgets


problem statement

bunny1 works. it can spec, challenge, plan, test, and build. but it forgets everything between hops. every session starts cold. the LLM reconstructs context from file scraps and sounds confident while confabulating. the brane exists but nothing forces the loop to read it before starting or write to it after finishing. memory is opt-in. forgetting is the default.

the result: hop 5 makes the same mistake hop 2 made. the factory doesn't learn. it rebuilds from scratch every time. this is the single biggest barrier to dark factory utility.

goals

  1. never forget — every decision, defect, surprise, and architectural choice persists and is recalled
  2. bomber reliability — russian train. runs unattended. crashes are detected and recovered. no silent failures
  3. gets better — each hop feeds the next. test strategies that work get reinforced. patterns that fail get avoided
  4. accepts steering — human can redirect mid-run without restarting
  5. greenfield and brownfield — build from zero or evolve existing code. same loop

non-goals

  • not a distributed system. one machine, one repo, one factory
  • not a framework. not extensible by plugins. KISS
  • not real-time. phase-boundary granularity is fine
  • not multi-repo. one repo per bunny instance

composable skill pipeline

bunny1 hardcodes the phase sequence in hop.ts. bunny2 makes it data.

each skill is a function with a standard signature:

type Skill = (ctx: HopContext) => SkillResult

the pipeline is a list of skill names:

// hop pipeline
const hop_pipeline = [
  "brainstorm",
  "specify",
  "challenge",
  "review:spec-compliance",
  "plan",
  "tasks",
  "review:spec-compliance",
  "test-gen",        // 3×3 narrowing (compound skill)
  "implement",
  "verify",
  "memorize",        // mandatory gate — cannot skip
]

// spike pipeline — same skills, different selection
const spike_pipeline = [
  "brainstorm",
  "specify",
  "plan",
  "tasks",
  "test-gen",
  "implement",
  "memorize",
]

skills live in bny/skills/, one file each:

bny/skills/
├── brainstorm.ts       # socratic pre-spec exploration
├── specify.ts          # write behavioral spec
├── challenge.ts        # adversarial spec review (gemini)
├── plan.ts             # implementation plan
├── tasks.ts            # task breakdown
├── test-gen.ts         # 3×3 narrowing (compound: 3 rounds)
├── implement.ts        # code with ralph retries
├── verify.ts           # adversarial + behavioral review (gemini)
├── review.ts           # mid-pipeline spec-compliance check
└── memorize.ts         # mandatory memory write (fresh agent)

the pipeline is an array you can read, diff, and rearrange. want to add double-challenge? insert another "challenge" entry. want to skip verify on a spike? omit it from the list. the orchestrator iterates the array — it doesn't know what the skills do.

compound skills: test-gen is inherently stateful (round 1 → 2 → 3). it's still one skill entry in the pipeline, but internally it manages its own sub-loop. the pipeline doesn't need to know about rounds.
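a minimal sketch of what "compound" means here — test-gen stays one pipeline entry but drives its own three-round sub-loop internally. `makeTestGen` and the `runRound` hook are illustrative names, not bunny2's actual API:

```typescript
// compound-skill sketch: one pipeline entry, internal 3-round sub-loop.
// runRound stands in for executing one narrowing round (hypothetical hook).
function makeTestGen(runRound: (round: number) => boolean) {
  return function testGen(): { ok: boolean } {
    for (let round = 1; round <= 3; round++) {
      if (!runRound(round)) return { ok: false }; // sub-loop fails fast
    }
    return { ok: true }; // the pipeline only ever sees one skill result
  };
}
```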

steering checks happen between every skill. the orchestrator reads bny/steering.md between each skill invocation. no special checkpoint logic needed — it's just part of the iteration.

gates: some skills are mandatory. memorize cannot be removed from any pipeline. the orchestrator enforces this — if the pipeline definition doesn't end with memorize, it appends it.
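the orchestration described above — iterate the array, check steering between skills, enforce the memorize gate — fits in a few lines. a sketch, assuming illustrative shapes for `HopContext`, `SkillResult`, and the `readSteering` hook (none of these are bunny2's real API):

```typescript
// orchestrator sketch — the loop doesn't know what skills do
type SkillResult = { ok: boolean };
type HopContext = { feature: string; steering: string[] };
type Skill = (ctx: HopContext) => SkillResult;

// gate enforcement: if the pipeline doesn't end with memorize, append it
function withMemorizeGate(pipeline: string[]): string[] {
  return pipeline[pipeline.length - 1] === "memorize"
    ? pipeline
    : [...pipeline, "memorize"];
}

function runPipeline(
  pipeline: string[],
  skills: Record<string, Skill>,
  ctx: HopContext,
  readSteering: () => string | null, // bny/steering.md, read between skills
): SkillResult {
  for (const name of withMemorizeGate(pipeline)) {
    const directive = readSteering(); // steering check between every skill
    if (directive) ctx.steering.push(directive);
    const result = skills[name](ctx);
    if (!result.ok) return result;
  }
  return { ok: true };
}
```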


worktree isolation

this is a bug fix from bunny1, not a feature.

bunny1 shares the working directory between the hop and everything else. this causes:

  1. no parallel hops — can't work on two features simultaneously
  2. dirty tree corruption — failed implement retries leave partial changes. the next retry sees corrupted state
  3. main is polluted — half-built features in the working tree block other work
  4. no clean rollback — can't discard a failed hop without manual cleanup

bunny2 creates a git worktree for every hop:

# hop start
git worktree add .worktrees/003-auth-flow -b 003-auth-flow

# all work happens in the worktree
cd .worktrees/003-auth-flow
# spec, plan, test, build — fully isolated

# ralph retries: clean slate per attempt
git checkout -- .

# on success: merge back to main
git checkout main
git merge 003-auth-flow
git worktree remove .worktrees/003-auth-flow

# on failure: worktree can be inspected or discarded
git worktree remove --force .worktrees/003-auth-flow

rules:

  • every bny hop creates a worktree. no exceptions
  • main is sacred. never modified directly by the factory
  • each worktree is disposable. failed hops leave no trace on main
  • ralph resets the worktree between retries (git checkout -- .)
  • parallel hops are possible (different worktrees, different branches)
  • the worktree path is stored in bny/state.json for the tick to find

brainstorm phase

bunny1 jumps straight to spec. bunny2 starts with a socratic brainstorm.

why: specs written without exploration drift. the LLM latches onto the first interpretation of the description and builds from there. a 2-minute brainstorm surfaces alternatives, clarifies intent, and reduces spec churn.

how it works:

  1. claude receives the feature description + routed memory
  2. prompt: "before writing a spec, explore this. what are we actually building? what are 3 different approaches? what are the tradeoffs? what questions should we ask?"
  3. output: a short design exploration document saved to specs/{feature}/brainstorm.md
  4. this feeds into specify as additional context

for unattended hops: the brainstorm runs without human input. it's not interactive — it's the LLM arguing with itself about what to build before committing to a direction.

for attended hops (--interactive): the brainstorm is presented to the human for refinement before proceeding. steering can redirect here.


mid-pipeline spec-compliance review

bunny1 only verifies at the end (phase 4). by then, drift from the spec has compounded through plan → tasks → tests → implementation. catching drift early is cheaper.

where it runs: after plan (did the plan drift from spec?) and after test-gen (do the tests actually cover the spec?).

how it works:

  1. gemini receives: spec.md + challenge.md + the artifact being reviewed (plan.md or test files)
  2. prompt: "does this artifact faithfully implement the spec? identify any drift, missing requirements, or scope creep. cite specific spec sections."
  3. output: pass/fail with findings
  4. on fail: the finding is injected as steering into the next skill. the pipeline doesn't stop — it self-corrects

this is NOT the adversarial verify. verify (phase 4) looks for bugs, security issues, and implementation flaws. spec-compliance review looks for drift — "the spec says X but the plan says Y."


fresh-agent memorize

bunny1's ruminate runs in the same context as the build. it's biased by the implementation struggle — over-indexes on what was hard for claude, under-indexes on what matters for the project.

bunny2's memorize is a fresh agent that sees only artifacts:

what the memorize agent sees:

  • specs/{feature}/spec.md — what we intended
  • specs/{feature}/challenge.md — what gemini warned about
  • specs/{feature}/verify.md — what gemini found post-build
  • git diff main...HEAD — what actually changed
  • test results summary — what passed/failed
  • current bny/memory/* — what we already know

what it does NOT see:

  • implementation prompts
  • ralph retry logs
  • intermediate failures
  • the "conversation" of the build

why isolation matters: same principle as the adversarial design. the memorize agent is a historian, not a participant. it judges the outcome, not the process. this produces cleaner, less biased memory entries.

implementation: spawn a fresh claude subprocess with only the artifact files. no session continuity from the build phase. outputs JSON operations that a deterministic function applies to bny/memory/.


mandatory TDD gate

bunny1's test-first design is structural (gemini writes tests, claude implements) but there's no guard against claude sneaking implementation changes into earlier phases.

bunny2 enforces: no src/ changes until the test-gen skill runs.

implementation:

  • the orchestrator tracks which files each skill is allowed to modify
  • specify, challenge, plan, tasks: may only write to specs/
  • test-gen: may write to tests/
  • implement: may write to src/ and tests/
  • memorize: may write to bny/memory/
  • after each skill, git diff --name-only is checked against the allowed paths
  • violations are logged as warnings (not hard failures — the LLM might legitimately need to create a config file)

this is a tripwire, not a jail. but it catches the common failure mode: claude "helpfully" writing implementation code during the planning phase.
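the tripwire check itself is a small allowlist comparison. a sketch — the allowlist mirrors the rules above, `checkTripwire` is an assumed name, and the changed-file list would come from `git diff --name-only`:

```typescript
// per-skill write allowlist, as described in the rules above
const allowed: Record<string, string[]> = {
  specify: ["specs/"],
  challenge: ["specs/"],
  plan: ["specs/"],
  tasks: ["specs/"],
  "test-gen": ["tests/"],
  implement: ["src/", "tests/"],
  memorize: ["bny/memory/"],
};

// returns violating paths; callers log these as warnings, not hard failures
function checkTripwire(skill: string, changedFiles: string[]): string[] {
  const prefixes = allowed[skill] ?? [];
  return changedFiles.filter((f) => !prefixes.some((p) => f.startsWith(p)));
}
```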


architecture: local-first, github as projection

gemini's review exposed a critical flaw in the original design: making github the primary state store creates a fragile distributed system. the corrected architecture:

┌─────────────────────────────────────────────────────┐
│                LOCAL (source of truth)                │
│                                                       │
│  bny/memory/      = persistent memory (versioned)     │
│  bny/state.json   = pipeline cursor                   │
│  bny/steering.md  = queued human direction             │
│  specs/           = feature artifacts                  │
│  src/ + tests/    = code                               │
│                                                       │
├─────────────────────────────────────────────────────┤
│                GITHUB (projection + communication)    │
│                                                       │
│  Issues    = intent (what to build, why)               │
│  PRs       = attempts (branch, phases, diff)           │
│  Comments  = narrative + steering input                │
│  Labels    = state projection for humans               │
│                                                       │
├─────────────────────────────────────────────────────┤
│                SUPERVISOR (heartbeat)                  │
│                                                       │
│  bny tick  = stateless health check                    │
│  runs via  = cron, systemd, or manual loop             │
│  reads     = local state + gh (for steering only)      │
│  decides   = continue, retry, escalate, idle           │
│                                                       │
└─────────────────────────────────────────────────────┘

why local-first?

gemini nailed this: if you unplug ethernet, the factory should still build. only pushing the PR requires network. memory, state, steering — all local files, versioned with the code. when you branch, memory branches. when you revert, memory reverts. no split-brain.

github is a projection: PRs show humans what's happening, comments let humans steer, issues track intent. but the factory doesn't need github to think or remember.

memory lives in the repo

bny/memory/
├── architecture.md      # how the system works
├── patterns.md          # what works (test strategies, code patterns)
├── anti-patterns.md     # what doesn't work (failed approaches)
├── decisions.md         # why things are the way they are (ADRs)
├── defects.md           # bug catalog with root causes
├── vocabulary.md        # domain terms
└── index.md             # table of contents (auto-generated)

why not wiki? gemini was right — wiki is a separate git repo. memory diverges from code. merge conflicts in a shadow repo nobody checks out. memory MUST be versioned with the code it describes. bny/memory/ is committed, branched, merged, and diffed alongside src/.

why not brane worldview? brane's worldview/ directory is essentially this, but nothing enforces reading or writing to it. bunny2 makes the same directory structure mandatory with gates.

state is a local json file

{
  "feature": "003-auth-flow",
  "pipeline": "hop",
  "phase": "test",
  "phase_status": "running",
  "narrow_round": 2,
  "pr_num": 27,
  "issue_num": 15,
  "started_at": "2026-04-05T10:00:00Z",
  "updated_at": "2026-04-05T10:15:00Z",
  "pid": 12345,
  "steering_consumed_at": "2026-04-05T10:12:00Z"
}

github labels mirror this for humans. but the factory reads bny/state.json, not github.


the supervisor: bny tick

a stateless script that checks health and takes action. how you schedule it is your business.

# option 1: cron (survives reboot)
* * * * * /path/to/bny tick >> /tmp/bny-tick.log 2>&1

# option 2: systemd (restarts on crash)
# option 3: manual loop
while true; do bny tick; sleep 30; done

each tick:

1. read bny/state.json
2. is a hop subprocess alive? (kill -0 $pid)
   - alive + running → check for steering, relay it, exit
   - dead + no done status → CRASHED. retry or escalate
   - done:0 → SUCCESS. post-hop memory write if not done. exit
   - done:N → FAILED. escalate. exit
3. no active hop?
   - check bny/state.json for queued work
   - or check github issues for `bunny` + `ready` (if online)
   - start new hop if work found
4. exit
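the decision core of a tick can be written as a pure function, keeping scheduling and process-probing outside. a sketch — field names follow bny/state.json above, but the `TickAction` type and `phase_status` encoding (`"running"` / `"done:0"` / `"done:N"`) are assumptions for illustration:

```typescript
type TickAction =
  | "relay-steering"      // alive + running
  | "retry-or-escalate"   // dead + no done status = crashed
  | "memorize-and-close"  // done:0
  | "escalate"            // done:N, N != 0
  | "start-or-idle";      // no active hop

interface HopState {
  phase_status: string;
  pid: number | null;
}

// pidAlive is injected (kill -0 equivalent) so the logic stays testable
function decideTick(state: HopState | null, pidAlive: (pid: number) => boolean): TickAction {
  if (!state || state.pid === null) return "start-or-idle";
  if (state.phase_status === "running") {
    return pidAlive(state.pid) ? "relay-steering" : "retry-or-escalate";
  }
  if (state.phase_status === "done:0") return "memorize-and-close";
  if (state.phase_status.startsWith("done:")) return "escalate";
  return "start-or-idle";
}
```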

why not a daemon? gemini suggested a supervisor loop, and that's fine as one scheduling option. the tick itself is stateless either way — it reads local state, acts, exits. no long-running process to leak memory or zombie.

pidfile guard: yes, stale pids are a real problem. the tick checks kill -0 $pid and also checks /proc/$pid/cmdline on linux to verify it's actually a bunny process, not a recycled PID. if stale, clean up and continue.
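the two-step liveness check can be sketched with injected probes — `signalOk` standing in for `kill -0` and `readCmdline` for reading `/proc/$pid/cmdline` on linux (the function name and substring match are assumptions):

```typescript
// stale-pid guard: a recycled PID passes kill -0 but belongs to an
// unrelated process, so verify the cmdline actually looks like bunny
function isLiveBunnyPid(
  pid: number,
  signalOk: (pid: number) => boolean,          // kill -0 equivalent
  readCmdline: (pid: number) => string | null, // /proc/<pid>/cmdline, null if unreadable
): boolean {
  if (!signalOk(pid)) return false; // process is gone — definitely stale
  const cmdline = readCmdline(pid);
  return cmdline !== null && cmdline.includes("bny");
}
```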


the memory loop

every hop reads memory before starting and writes memory after finishing. not optional. enforced by gates.

pre-hop: load (mandatory)

before phase 1, the factory MUST:

  1. classify the task — what area of the codebase does this touch? (quick LLM call or keyword match)
  2. load relevant memory — not everything. routed by area:
    • bny/memory/architecture.md — always (it's the map)
    • bny/memory/defects.md — grep for entries matching the area
    • bny/memory/patterns.md — grep for entries matching the area
    • bny/memory/anti-patterns.md — grep for entries matching the area
    • bny/memory/decisions.md — grep for entries matching the area
  3. load recent hop context — last 2-3 hop summaries from git log or PR comments (if online)
  4. load steering — bny/steering.md if non-empty


this is injected as ## what you already know in the spec prompt.

why routing instead of dumping? gemini flagged "lost in the middle" syndrome. shoving 50k tokens of memory into every prompt degrades attention. route by area — a database feature reads database defects, not UI patterns. keeps context tight.

token budget: hard cap of ~8k tokens for memory injection. if routed content exceeds this, truncate by recency (newest entries first).
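routing plus the recency cutoff is a filter-sort-truncate. a sketch, assuming a pre-split `MemoryEntry` shape and a rough ~4-chars-per-token ratio (both assumptions, not measured):

```typescript
interface MemoryEntry { area: string; written_at: string; text: string }

const TOKEN_BUDGET = 8000;      // the hard cap from above — a guess to tune
const CHARS_PER_TOKEN = 4;      // crude approximation

function routeMemory(entries: MemoryEntry[], area: string): string[] {
  const relevant = entries
    .filter((e) => e.area === area)                              // route by area
    .sort((a, b) => b.written_at.localeCompare(a.written_at));   // newest first
  const out: string[] = [];
  let budget = TOKEN_BUDGET * CHARS_PER_TOKEN;
  for (const e of relevant) {
    if (e.text.length > budget) break; // truncate by recency when over budget
    out.push(e.text);
    budget -= e.text.length;
  }
  return out;
}
```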

post-hop: persist (mandatory gate)

after phase 4, before the PR can be marked done, the factory MUST:

  1. update defects — new defects from verify get added to bny/memory/defects.md
  2. update patterns — if a test strategy caught a real bug, record it
  3. update anti-patterns — if an approach failed, record why
  4. update architecture — if module boundaries changed
  5. update decisions — any significant "why" choices
  6. commit memory changes — in the same branch, pushed with the PR

the PR cannot be marked done until memory is committed. this is the gate.

who writes memory? claude, via a structured prompt at end of build phase. replaces the optional ruminate step with a mandatory memorize step that outputs structured updates to each memory file.

format degradation: gemini flagged that LLMs corrupt markdown structure over time. mitigation: each memory file has a strict format. the memorize step outputs JSON operations ({ file, action: "append"|"update", entry: {...} }). a deterministic function applies them. the LLM never directly edits the markdown.
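a sketch of the deterministic apply, using the `{ file, action, entry }` shape described above. the in-memory map stands in for bny/memory/ on disk, and the markdown rendering (one `##` block per entry) is illustrative:

```typescript
interface MemoryOp {
  file: string;                 // e.g. "bny/memory/defects.md"
  action: "append" | "update";
  entry: { id: string; body: string };
}

// the LLM emits ops; this function is the only thing that touches markdown
function applyOps(files: Map<string, string>, ops: MemoryOp[]): void {
  for (const op of ops) {
    const current = files.get(op.file) ?? "";
    const block = `## ${op.entry.id}\n${op.entry.body}\n`;
    if (op.action === "append") {
      files.set(op.file, current + block);
    } else {
      // update: replace the existing block with the same heading, else append
      // (assumes ids like "D-001" that are regex-safe)
      const re = new RegExp(`## ${op.entry.id}\\n[^#]*`, "m");
      files.set(op.file, re.test(current) ? current.replace(re, block) : current + block);
    }
  }
}
```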

mid-hop: accumulate

during the hop, observations accumulate in PR comments (if online) and local files:

  • challenge findings → specs/{feature}/challenge.md (already exists)
  • test failures → logged by ralph, summarized at phase end
  • verify findings → specs/{feature}/verify.md
  • ralph retries → iteration count per round recorded in state

these feed the post-hop memorize step.


steering

local steering (always works)

bny steer "skip the cache optimization, focus on error handling"

writes to bny/steering.md. the tick or the running hop checks this file at phase boundaries. consumed entries are removed.

remote steering (requires network)

human comments on the active PR:

@bunny focus on error handling, skip the caching optimization

the tick polls the active PR for new comments (once per tick, not continuously). new @bunny comments are:

  1. appended to bny/steering.md
  2. marked with a reaction (👀) so they're not re-processed
  3. consumed at the next phase boundary

steering granularity: phase boundaries only. you cannot interrupt a running LLM call. if a human says @bunny stop, the current phase finishes, then the factory reads the stop directive and halts. this is honest about what's possible.

parsing: simple prefix match. @bunny or /bunny at start of comment. free-form text after. no command syntax — the LLM interprets the intent. keeps it simple.
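the prefix match is one regex. a sketch — `parseSteering` is an illustrative name; it returns the free-form directive, or null if the comment isn't addressed to the factory:

```typescript
// @bunny or /bunny at the start of a comment, free-form text after.
// no command syntax — the extracted text goes to the LLM to interpret.
function parseSteering(comment: string): string | null {
  const m = comment.trim().match(/^(?:@bunny|\/bunny)\s+(.+)/s);
  return m ? m[1].trim() : null;
}
```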


defect memory

defects are the highest-value memories. structured format in bny/memory/defects.md:

# Defect Catalog

## D-001: off-by-one in pagination
- **area:** data-access
- **found-by:** verify:adversarial, PR #23
- **root-cause:** boundary condition on last page
- **caught-by:** boundary test (round 3)
- **pattern:** always test N, N-1, N+1 for count/index operations
- **status:** fixed

## D-002: race condition in cache invalidation
- **area:** caching
- **found-by:** verify:adversarial, PR #25
- **root-cause:** missing lock on shared map
- **caught-by:** property test (round 2)
- **pattern:** any shared mutable state needs concurrency test
- **status:** fixed

how it compounds:

  • pre-hop: test-gen reads defects for the relevant area → writes targeted regression tests
  • round 3 (boundaries): specifically targets patterns from past defects
  • verify: checks if known defect patterns recur in new code
  • the factory literally learns from its mistakes

the hop loop (revised)

issue (intent) or `bny hop "description"`
  ↓
create worktree + branch
  ↓
classify task area
  ↓
pre-hop: load routed memory + recent hops + steering
  ↓
create draft PR (attempt)
  ↓
brainstorm (claude) — explore alternatives, clarify intent
  ↓ steering check
specify (claude) + challenge (gemini)
  ↓ steering check
review:spec-compliance (gemini) — did challenge drift from spec?
  ↓ steering check
plan (claude) + tasks (claude)
  ↓ steering check
review:spec-compliance (gemini) — did plan drift from spec?
  ↓ steering check
test-gen 3×3 narrowing (gemini gen, claude impl)
  ↓ steering check [TDD gate: src/ changes allowed from here]
implement (claude + ralph)
  ↓ steering check
verify (gemini — adversarial + behavioral)
  ↓
memorize (fresh agent — mandatory gate)
  ↓
commit memory + merge worktree + close PR + close issue

the pipeline is data, not code. the above is the default hop pipeline. spike drops challenge, review, and verify. custom pipelines can rearrange skills freely. memorize is always appended.

worktree lifecycle:

  • created at hop start, on a feature branch
  • ralph resets to clean state between retries (git checkout -- .)
  • on success: merged to main, worktree removed
  • on failure: worktree preserved for inspection, main untouched

failure at any phase:

  1. write partial learnings to memory (what worked before the failure)
  2. post error to PR comment (if online)
  3. set failed state in bny/state.json + PR labels
  4. leave issue open for next attempt
  5. worktree preserved — human can inspect, the next hop starts with memory of what went wrong

what stays from bunny1

  • adversarial multi-LLM — claude builds, gemini breaks
  • 3×3 narrowing — progressive test hardening
  • ralph retry loops — exponential backoff
  • assassin process management — child cleanup on signals
  • guardrails — constraint injection
  • labels as state machine — for human visibility on PRs
  • sanitization — strip paths/secrets before github
  • POD, snake_case, guard-early — coding conventions

what changes from bunny1

| bunny1 | bunny2 | why |
|---|---|---|
| hardcoded phase sequence in hop.ts | composable skill pipeline (array of functions) | rearrange without rewiring |
| shared working directory | worktree per hop (main is sacred) | isolation, parallel hops, clean rollback |
| no pre-spec exploration | brainstorm skill before specify | reduce spec drift, surface alternatives |
| verify only at end | spec-compliance review mid-pipeline | catch drift early, self-correct |
| no file-path enforcement | TDD gate: no src/ changes until test-gen | prevent premature implementation |
| memory opt-in | memory mandatory (gates) | never forget |
| no pre-hop context | routed memory load | start warm |
| full memory dump | area-routed, budget-capped | avoid "lost in the middle" |
| ruminate (in build context) | memorize (fresh agent, artifacts only) | unbiased memory, historian not participant |
| LLM edits markdown directly | JSON ops → deterministic apply | prevent format rot |
| github as state source | local-first, github as projection | works offline, no split-brain |
| wiki for memory | bny/memory/ in-repo | versioned with code |
| dev/bg daemon | stateless tick (any scheduler) | no long-running process |
| no steering | local + PR comment steering | mid-run direction |
| defects not tracked | structured defect catalog | learn from mistakes |
| roadmap.md only | issues as backlog (optional) | shared, linked |
| brane = separate system | memory = part of the loop | unified |

what gets dropped

  • wiki as state — memory is in-repo, not a separate git repo
  • github as primary state — local-first. github is projection
  • dev/bg as daemon — replaced by stateless tick
  • ruminate as optional — replaced by mandatory memorize gate (fresh agent)
  • brane as separate subsystem — memory IS the brane, built into the loop
  • shared working directory — replaced by worktree per hop

open questions

  1. memory routing accuracy — keyword grep vs. LLM classification for area matching? grep is faster and dumber. LLM is slower and smarter. start with grep, upgrade if needed
  2. memory file growth — what happens when defects.md has 500 entries? pagination? archival? prune old fixed defects?
  3. multi-hop features — a feature spanning 3 hops creates 3 PRs. how do we link the narrative? issue as anchor (multiple PRs reference one issue)?
  4. brownfield bootstrap — for existing projects, bny digest . to seed memory? or manual curation?
  5. token budget tuning — 8k cap is a guess. need to measure actual useful context size
  6. memorize prompt engineering — the quality of memory depends entirely on the memorize prompt. this is the hardest prompt to get right

success criteria

  1. run 10 hops sequentially. hop 10 should reference learnings from hop 1
  2. kill the process mid-hop. tick should detect and resume within 2 minutes
  3. drop steering via bny steer. next phase should reflect it
  4. introduce a known defect pattern. factory should catch it based on defect catalog
  5. unplug ethernet. factory should still complete the current hop (minus PR updates)
  6. git log bny/memory/ should show memory growing and evolving across hops
  7. run two hops in parallel on different features. worktrees isolate them. main stays clean
  8. a failed hop leaves main untouched. worktree can be inspected. next hop starts clean
  9. ralph retries start from clean worktree state, not corrupted partial changes
  10. spec-compliance review catches plan drift before test-gen runs
  11. brainstorm.md shows alternatives explored before spec was written
  12. rearrange the skill pipeline array. factory executes the new order without code changes

gemini review findings (incorporated above)

for transparency, these are the key criticisms from gemini's adversarial review that shaped this converged design:

  1. "github as primary state is a distributed system in disguise" — accepted. flipped to local-first
  2. "wiki is a separate git repo, creates split-brain" — accepted. moved memory in-repo
  3. "cron pidfile guards are fragile" — accepted. added /proc/$pid/cmdline verification
  4. "LLMs will corrupt markdown structure" — accepted. added JSON ops → deterministic apply
  5. "memory dump causes lost-in-the-middle" — accepted. added area routing + token budget
  6. "factory should work offline" — accepted. github is projection, not dependency
  7. "steering only works at phase boundaries" — accepted. documented honestly
  8. "use a supervisor loop, not cron" — partially accepted. tick is scheduler-agnostic

rejected:

  • "local steering only" — remote steering via PR comments is too valuable for unattended runs. worth the polling cost
  • "labels are insane for state" — labels are fine for projection. they're not the source of truth, just the human-readable mirror

this is the converged design. ready for implementation planning when you are.
