a dark factory that never forgets
bunny1 works. it can spec, challenge, plan, test, and build. but it forgets everything between hops. every session starts cold. the LLM reconstructs context from file scraps and sounds confident while confabulating. the brane exists but nothing forces the loop to read it before starting or write to it after finishing. memory is opt-in. forgetting is the default.
the result: hop 5 makes the same mistake hop 2 made. the factory doesn't learn. it rebuilds from scratch every time. this is the single biggest barrier to dark factory utility.
- never forget — every decision, defect, surprise, and architectural choice persists and is recalled
- bomber reliability — russian train. runs unattended. crashes are detected and recovered. no silent failures
- gets better — each hop feeds the next. test strategies that work get reinforced. patterns that fail get avoided
- accepts steering — human can redirect mid-run without restarting
- greenfield and brownfield — build from zero or evolve existing code. same loop
- not a distributed system. one machine, one repo, one factory
- not a framework. not extensible by plugins. KISS
- not real-time. phase-boundary granularity is fine
- not multi-repo. one repo per bunny instance
bunny1 hardcodes the phase sequence in hop.ts. bunny2 makes it data.
each skill is a function with a standard signature:
```ts
type Skill = (ctx: HopContext) => SkillResult
```

the pipeline is a list of skill names:

```ts
// hop pipeline
const hop_pipeline = [
  "brainstorm",
  "specify",
  "challenge",
  "review:spec-compliance",
  "plan",
  "tasks",
  "review:spec-compliance",
  "test-gen",  // 3×3 narrowing (compound skill)
  "implement",
  "verify",
  "memorize",  // mandatory gate — cannot skip
]

// spike pipeline — same skills, different selection
const spike_pipeline = [
  "brainstorm",
  "specify",
  "plan",
  "tasks",
  "test-gen",
  "implement",
  "memorize",
]
```

skills live in bny/skills/, one file each:
bny/skills/
├── brainstorm.ts # socratic pre-spec exploration
├── specify.ts # write behavioral spec
├── challenge.ts # adversarial spec review (gemini)
├── plan.ts # implementation plan
├── tasks.ts # task breakdown
├── test-gen.ts # 3×3 narrowing (compound: 3 rounds)
├── implement.ts # code with ralph retries
├── verify.ts # adversarial + behavioral review (gemini)
├── review.ts # mid-pipeline spec-compliance check
└── memorize.ts       # mandatory memory write (fresh agent)
the pipeline is an array you can read, diff, and rearrange. want to add double-challenge? insert another "challenge" entry. want to skip verify on a spike? omit it from the list. the orchestrator iterates the array — it doesn't know what the skills do.
compound skills: test-gen is inherently stateful (round 1 → 2 → 3). it's still one skill entry in the pipeline, but internally it manages its own sub-loop. the pipeline doesn't need to know about rounds.
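the compound-skill idea can be sketched minimally; `runRound` here is an illustrative stand-in for one narrowing round (generate tests, run them), not a real API:

```typescript
// a compound skill is one pipeline entry that runs its own sub-loop;
// runRound stands in for a single narrowing round
function runCompound(
  rounds: number,
  runRound: (round: number) => boolean,
): boolean {
  for (let r = 1; r <= rounds; r++) {
    if (!runRound(r)) return false // stop the sub-loop on a failed round
  }
  return true // the pipeline only ever sees this single result
}
```

the orchestrator calls it like any other skill; the round count never leaks into the pipeline array.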
steering checks happen between every skill. the orchestrator reads bny/steering.md between each skill invocation. no special checkpoint logic needed — it's just part of the iteration.
gates: some skills are mandatory. memorize cannot be removed from any pipeline. the orchestrator enforces this — if the pipeline definition doesn't end with memorize, it appends it.
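the gate enforcement and the iteration can be sketched together. `HopContext`, `SkillResult`, and `readSteering` are illustrative stand-ins here, not the real shapes:

```typescript
// hypothetical shapes; the real HopContext/SkillResult live elsewhere
type SkillResult = { ok: boolean; findings?: string }
type HopContext = { feature: string; steering: string[] }
type Skill = (ctx: HopContext) => SkillResult

// enforce the memorize gate: every pipeline ends with "memorize"
function withGates(pipeline: string[]): string[] {
  return pipeline[pipeline.length - 1] === "memorize"
    ? pipeline
    : [...pipeline, "memorize"]
}

// the orchestrator iterates the array; it doesn't know what skills do.
// readSteering stands in for reading bny/steering.md between skills.
function runPipeline(
  pipeline: string[],
  skills: Record<string, Skill>,
  ctx: HopContext,
  readSteering: () => string[],
): void {
  for (const name of withGates(pipeline)) {
    ctx.steering.push(...readSteering()) // steering check between every skill
    const skill = skills[name]
    if (!skill) continue
    const result = skill(ctx)
    if (!result.ok && result.findings) {
      // findings become steering for the next skill: self-correct, don't stop
      ctx.steering.push(result.findings)
    }
  }
}
```

note that steering and gating fall out of plain iteration; there is no checkpoint machinery to maintain.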
this is a bug fix from bunny1, not a feature.
bunny1 shares the working directory between the hop and everything else. this causes:
- no parallel hops — can't work on two features simultaneously
- dirty tree corruption — failed implement retries leave partial changes. the next retry sees corrupted state
- main is polluted — half-built features in the working tree block other work
- no clean rollback — can't discard a failed hop without manual cleanup
bunny2 creates a git worktree for every hop:
```sh
# hop start
git worktree add .worktrees/003-auth-flow -b 003-auth-flow

# all work happens in the worktree
cd .worktrees/003-auth-flow
# spec, plan, test, build — fully isolated

# ralph retries: clean slate per attempt
git checkout -- .

# on success: merge back to main
git checkout main
git merge 003-auth-flow
git worktree remove .worktrees/003-auth-flow

# on failure: worktree can be inspected or discarded
git worktree remove --force .worktrees/003-auth-flow
```

rules:
- every `bny hop` creates a worktree. no exceptions
- main is sacred. never modified directly by the factory
- each worktree is disposable. failed hops leave no trace on main
- ralph resets the worktree between retries (`git checkout -- .`)
- parallel hops are possible (different worktrees, different branches)
- the worktree path is stored in `bny/state.json` for the tick to find
bunny1 jumps straight to spec. bunny2 starts with a socratic brainstorm.
why: specs written without exploration drift. the LLM latches onto the first interpretation of the description and builds from there. a 2-minute brainstorm surfaces alternatives, clarifies intent, and reduces spec churn.
how it works:
- claude receives the feature description + routed memory
- prompt: "before writing a spec, explore this. what are we actually building? what are 3 different approaches? what are the tradeoffs? what questions should we ask?"
- output: a short design exploration document saved to `specs/{feature}/brainstorm.md`
- this feeds into specify as additional context
for unattended hops: the brainstorm runs without human input. it's not interactive — it's the LLM arguing with itself about what to build before committing to a direction.
for attended hops (--interactive): the brainstorm is presented to the human for refinement before proceeding. steering can redirect here.
bunny1 only verifies at the end (phase 4). by then, drift from the spec has compounded through plan → tasks → tests → implementation. catching drift early is cheaper.
where it runs: after plan (did the plan drift from spec?) and after test-gen (do the tests actually cover the spec?).
how it works:
- gemini receives: spec.md + challenge.md + the artifact being reviewed (plan.md or test files)
- prompt: "does this artifact faithfully implement the spec? identify any drift, missing requirements, or scope creep. cite specific spec sections."
- output: pass/fail with findings
- on fail: the finding is injected as steering into the next skill. the pipeline doesn't stop — it self-corrects
this is NOT the adversarial verify. verify (phase 4) looks for bugs, security issues, and implementation flaws. spec-compliance review looks for drift — "the spec says X but the plan says Y."
bunny1's ruminate runs in the same context as the build. it's biased by the implementation struggle — over-indexes on what was hard for claude, under-indexes on what matters for the project.
bunny2's memorize is a fresh agent that sees only artifacts:
what the memorize agent sees:
- `specs/{feature}/spec.md` — what we intended
- `specs/{feature}/challenge.md` — what gemini warned about
- `specs/{feature}/verify.md` — what gemini found post-build
- `git diff main...HEAD` — what actually changed
- test results summary — what passed/failed
- current `bny/memory/*` — what we already know
what it does NOT see:
- implementation prompts
- ralph retry logs
- intermediate failures
- the "conversation" of the build
why isolation matters: same principle as the adversarial design. the memorize agent is a historian, not a participant. it judges the outcome, not the process. this produces cleaner, less biased memory entries.
implementation: spawn a fresh claude subprocess with only the artifact files. no session continuity from the build phase. outputs JSON operations that a deterministic function applies to bny/memory/.
bunny1's test-first design is structural (gemini writes tests, claude implements) but there's no guard against claude sneaking implementation changes into earlier phases.
bunny2 enforces: no src/ changes until the test-gen skill runs.
implementation:
- the orchestrator tracks which files each skill is allowed to modify
- specify, challenge, plan, tasks: may only write to `specs/`
- test-gen: may write to `tests/`
- implement: may write to `src/` and `tests/`
- memorize: may write to `bny/memory/`
- after each skill, `git diff --name-only` is checked against the allowed paths
- violations are logged as warnings (not hard failures — the LLM might legitimately need to create a config file)
this is a tripwire, not a jail. but it catches the common failure mode: claude "helpfully" writing implementation code during the planning phase.
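the tripwire is small enough to sketch; the path table and names here are assumptions, not the final implementation:

```typescript
// allowed write paths per skill (illustrative table)
const allowed_paths: Record<string, string[]> = {
  specify: ["specs/"],
  challenge: ["specs/"],
  plan: ["specs/"],
  tasks: ["specs/"],
  "test-gen": ["tests/"],
  implement: ["src/", "tests/"],
  memorize: ["bny/memory/"],
}

// changed comes from `git diff --name-only`; returns violations
// to log as warnings. a tripwire, not a jail.
function tripwire(skill: string, changed: string[]): string[] {
  const prefixes = allowed_paths[skill] ?? []
  return changed.filter((f) => !prefixes.some((p) => f.startsWith(p)))
}
```

anything the table doesn't cover (a new config file, say) surfaces as a warning for the human to judge.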
gemini's review exposed a critical flaw in the original design: making github the primary state store creates a fragile distributed system. the corrected architecture:
┌─────────────────────────────────────────────────────┐
│ LOCAL (source of truth) │
│ │
│ bny/memory/ = persistent memory (versioned) │
│ bny/state.json = pipeline cursor │
│ bny/steering.md = queued human direction │
│ specs/ = feature artifacts │
│ src/ + tests/ = code │
│ │
├─────────────────────────────────────────────────────┤
│ GITHUB (projection + communication) │
│ │
│ Issues = intent (what to build, why) │
│ PRs = attempts (branch, phases, diff) │
│ Comments = narrative + steering input │
│ Labels = state projection for humans │
│ │
├─────────────────────────────────────────────────────┤
│ SUPERVISOR (heartbeat) │
│ │
│ bny tick = stateless health check │
│ runs via = cron, systemd, or manual loop │
│ reads = local state + gh (for steering only) │
│ decides = continue, retry, escalate, idle │
│ │
└─────────────────────────────────────────────────────┘
gemini nailed this: if you unplug ethernet, the factory should still build. only pushing the PR requires network. memory, state, steering — all local files, versioned with the code. when you branch, memory branches. when you revert, memory reverts. no split-brain.
github is a projection: PRs show humans what's happening, comments let humans steer, issues track intent. but the factory doesn't need github to think or remember.
bny/memory/
├── architecture.md # how the system works
├── patterns.md # what works (test strategies, code patterns)
├── anti-patterns.md # what doesn't work (failed approaches)
├── decisions.md # why things are the way they are (ADRs)
├── defects.md # bug catalog with root causes
├── vocabulary.md # domain terms
└── index.md # table of contents (auto-generated)
why not wiki? gemini was right — wiki is a separate git repo. memory diverges from code. merge conflicts in a shadow repo nobody checks out. memory MUST be versioned with the code it describes. bny/memory/ is committed, branched, merged, and diffed alongside src/.
why not brane worldview? brane's worldview/ directory is essentially this, but nothing enforces reading or writing to it. bunny2 makes the same directory structure mandatory with gates.
```json
{
  "feature": "003-auth-flow",
  "pipeline": "hop",
  "phase": "test",
  "phase_status": "running",
  "narrow_round": 2,
  "pr_num": 27,
  "issue_num": 15,
  "started_at": "2026-04-05T10:00:00Z",
  "updated_at": "2026-04-05T10:15:00Z",
  "pid": 12345,
  "steering_consumed_at": "2026-04-05T10:12:00Z"
}
```

github labels mirror this for humans. but the factory reads bny/state.json, not github.
a stateless script that checks health and takes action. how you schedule it is your business.
```sh
# option 1: cron (survives reboot)
* * * * * /path/to/bny tick >> /tmp/bny-tick.log 2>&1

# option 2: systemd (restarts on crash)

# option 3: manual loop
while true; do bny tick; sleep 30; done
```

each tick:
1. read bny/state.json
2. is a hop subprocess alive? (kill -0 $pid)
- alive + running → check for steering, relay it, exit
- dead + no done status → CRASHED. retry or escalate
- done:0 → SUCCESS. post-hop memory write if not done. exit
- done:N → FAILED. escalate. exit
3. no active hop?
- check bny/state.json for queued work
- or check github issues for `bunny` + `ready` (if online)
- start new hop if work found
4. exit
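the branch logic above can be sketched as a pure function of local state. the field names are assumptions, not the real `bny/state.json` schema:

```typescript
// one tick's decision, from local state only (illustrative shape)
type HopState = {
  pid: number | null
  status: "running" | "done"
  exit_code: number | null
}

type TickAction =
  | "relay_steering"    // alive + running
  | "recover_crash"     // dead + no done status
  | "finalize_success"  // done:0
  | "escalate_failure"  // done:N
  | "look_for_work"     // no active hop

function decideTick(s: HopState | null, pidAlive: boolean): TickAction {
  if (s === null || s.pid === null) return "look_for_work"
  if (s.status === "running") {
    return pidAlive ? "relay_steering" : "recover_crash"
  }
  // status === "done": the exit code decides
  return s.exit_code === 0 ? "finalize_success" : "escalate_failure"
}
```

keeping the decision pure makes the tick trivially testable; the side effects (relay, retry, escalate) hang off the returned action.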
why not a daemon? gemini suggested a supervisor loop, and that's fine as one scheduling option. the tick itself is stateless either way — it reads local state, acts, exits. no long-running process to leak memory or zombie.
pidfile guard: yes, stale pids are a real problem. the tick checks kill -0 $pid and also checks /proc/$pid/cmdline on linux to verify it's actually a bunny process, not a recycled PID. if stale, clean up and continue.
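a sketch of that guard. both probes are injected so the linux-only `/proc` read (and the `kill -0` check) can be stubbed; the real tick would wire in `process.kill(pid, 0)` and a read of `/proc/$pid/cmdline`:

```typescript
// stale-pid guard: a pid is only "ours" if the process is alive
// AND its cmdline looks like a bunny process (recycled PIDs fail this)
function isLiveBunny(
  pid: number,
  isAlive: (pid: number) => boolean,            // e.g. kill -0
  readCmdline: (pid: number) => string | null,  // e.g. /proc/$pid/cmdline
): boolean {
  if (!isAlive(pid)) return false
  const cmdline = readCmdline(pid)
  return cmdline !== null && cmdline.includes("bny")
}
```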
every hop reads memory before starting and writes memory after finishing. not optional. enforced by gates.
before phase 1, the factory MUST:
- classify the task — what area of the codebase does this touch? (quick LLM call or keyword match)
- load relevant memory — not everything. routed by area:
  - `bny/memory/architecture.md` — always (it's the map)
  - `bny/memory/defects.md` — grep for entries matching the area
  - `bny/memory/patterns.md` — grep for entries matching the area
  - `bny/memory/anti-patterns.md` — grep for entries matching the area
  - `bny/memory/decisions.md` — grep for entries matching the area
- load recent hop context — last 2-3 hop summaries from git log or PR comments (if online)
- load steering — `bny/steering.md` if non-empty
this is injected as `## what you already know` in the spec prompt.
why routing instead of dumping? gemini flagged "lost in the middle" syndrome. shoving 50k tokens of memory into every prompt degrades attention. route by area — a database feature reads database defects, not UI patterns. keeps context tight.
token budget: hard cap of ~8k tokens for memory injection. if routed content exceeds this, truncate by recency (newest entries first).
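routing plus the budget cap can be sketched together. the entry shape and the chars/4 token heuristic are assumptions for illustration:

```typescript
// illustrative memory entry: which file it lives in, which area it
// concerns, and when it was added
type MemEntry = { file: string; area: string; text: string; added: number }

function routeMemory(
  entries: MemEntry[],
  area: string,
  budgetTokens = 8000,
  tokensOf = (s: string) => Math.ceil(s.length / 4), // rough heuristic
): string[] {
  // architecture.md always loads (it's the map); everything else by area
  const relevant = entries.filter(
    (e) => e.file === "architecture.md" || e.area === area,
  )
  relevant.sort((a, b) => b.added - a.added) // newest first for truncation
  const out: string[] = []
  let used = 0
  for (const e of relevant) {
    const t = tokensOf(e.text)
    if (used + t > budgetTokens) break // truncate by recency at the cap
    used += t
    out.push(e.text)
  }
  return out
}
```

a database feature pulls database defects and patterns; UI memory never enters the prompt.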
after phase 4, before the PR can be marked done, the factory MUST:
- update defects — new defects from verify get added to `bny/memory/defects.md`
- update patterns — if a test strategy caught a real bug, record it
- update anti-patterns — if an approach failed, record why
- update architecture — if module boundaries changed
- update decisions — any significant "why" choices
- commit memory changes — in the same branch, pushed with the PR
the PR cannot be marked done until memory is committed. this is the gate.
who writes memory? claude, via a structured prompt at end of build phase. replaces the optional ruminate step with a mandatory memorize step that outputs structured updates to each memory file.
format degradation: gemini flagged that LLMs corrupt markdown structure over time. mitigation: each memory file has a strict format. the memorize step outputs JSON operations ({ file, action: "append"|"update", entry: {...} }). a deterministic function applies them. the LLM never directly edits the markdown.
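a sketch of the deterministic applier, using an in-memory map as a stand-in for `bny/memory/*`. the op shape follows the design (`{ file, action, entry }`); the entry shape (an id plus text) is an assumption:

```typescript
// op format emitted by the memorize agent; the LLM never touches
// the markdown, only this structured output
type MemOp = {
  file: string
  action: "append" | "update"
  entry: { id: string; text: string }
}

// files: filename -> (entry id -> entry text). stand-in for bny/memory/.
function applyOps(
  files: Map<string, Map<string, string>>,
  ops: MemOp[],
): void {
  for (const op of ops) {
    const entries = files.get(op.file) ?? new Map<string, string>()
    if (op.action === "append" && entries.has(op.entry.id)) {
      continue // append never overwrites; that's what update is for
    }
    entries.set(op.entry.id, op.entry.text)
    files.set(op.file, entries)
  }
}
```

because the applier is deterministic, the markdown rendering can stay byte-stable regardless of how the LLM phrases its ops.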
during the hop, observations accumulate in PR comments (if online) and local files:
- challenge findings → `specs/{feature}/challenge.md` (already exists)
- test failures → logged by ralph, summarized at phase end
- verify findings → `specs/{feature}/verify.md`
- ralph retries → iteration count per round recorded in state
these feed the post-hop memorize step.
`bny steer "skip the cache optimization, focus on error handling"` writes to `bny/steering.md`. the tick or the running hop checks this file at phase boundaries. consumed entries are removed.
human comments on the active PR:
@bunny focus on error handling, skip the caching optimization
the tick polls the active PR for new comments (once per tick, not continuously). new @bunny comments are:
- appended to `bny/steering.md`
- marked with a reaction (👀) so they're not re-processed
- consumed at the next phase boundary
steering granularity: phase boundaries only. you cannot interrupt a running LLM call. if a human says @bunny stop, the current phase finishes, then the factory reads the stop directive and halts. this is honest about what's possible.
parsing: simple prefix match. @bunny or /bunny at start of comment. free-form text after. no command syntax — the LLM interprets the intent. keeps it simple.
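that parse rule is small enough to sketch; the exact regex is an assumption:

```typescript
// prefix match: @bunny or /bunny at start of comment, free-form text
// after. returns the steering text, or null if it's not addressed to us.
function parseSteering(comment: string): string | null {
  const m = comment.trimStart().match(/^[@/]bunny\b\s*([\s\S]*)$/)
  return m ? m[1].trim() : null
}
```

mentions mid-comment deliberately don't match, so ordinary discussion about the bot isn't treated as a directive.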
defects are the highest-value memories. structured format in bny/memory/defects.md:
```md
# Defect Catalog

## D-001: off-by-one in pagination
- **area:** data-access
- **found-by:** verify:adversarial, PR #23
- **root-cause:** boundary condition on last page
- **caught-by:** boundary test (round 3)
- **pattern:** always test N, N-1, N+1 for count/index operations
- **status:** fixed

## D-002: race condition in cache invalidation
- **area:** caching
- **found-by:** verify:adversarial, PR #25
- **root-cause:** missing lock on shared map
- **caught-by:** property test (round 2)
- **pattern:** any shared mutable state needs concurrency test
- **status:** fixed
```

how it compounds:
- pre-hop: test-gen reads defects for the relevant area → writes targeted regression tests
- round 3 (boundaries): specifically targets patterns from past defects
- verify: checks if known defect patterns recur in new code
- the factory literally learns from its mistakes
issue (intent) or `bny hop "description"`
↓
create worktree + branch
↓
classify task area
↓
pre-hop: load routed memory + recent hops + steering
↓
create draft PR (attempt)
↓
brainstorm (claude) — explore alternatives, clarify intent
↓ steering check
specify (claude) + challenge (gemini)
↓ steering check
review:spec-compliance (gemini) — did challenge drift from spec?
↓ steering check
plan (claude) + tasks (claude)
↓ steering check
review:spec-compliance (gemini) — did plan drift from spec?
↓ steering check
test-gen 3×3 narrowing (gemini gen, claude impl)
↓ steering check [TDD gate: src/ changes allowed from here]
implement (claude + ralph)
↓ steering check
verify (gemini — adversarial + behavioral)
↓
memorize (fresh agent — mandatory gate)
↓
commit memory + merge worktree + close PR + close issue
the pipeline is data, not code. the above is the default hop pipeline. spike drops challenge, review, and verify. custom pipelines can rearrange skills freely. memorize is always appended.
worktree lifecycle:
- created at hop start, on a feature branch
- ralph resets to clean state between retries (`git checkout -- .`)
- on success: merged to main, worktree removed
- on failure: worktree preserved for inspection, main untouched
failure at any phase:
- write partial learnings to memory (what worked before the failure)
- post error to PR comment (if online)
- set `failed` state in `bny/state.json` + PR labels
- leave issue open for next attempt
- worktree preserved — human can inspect, the next hop starts with memory of what went wrong
- adversarial multi-LLM — claude builds, gemini breaks
- 3×3 narrowing — progressive test hardening
- ralph retry loops — exponential backoff
- assassin process management — child cleanup on signals
- guardrails — constraint injection
- labels as state machine — for human visibility on PRs
- sanitization — strip paths/secrets before github
- POD, snake_case, guard-early — coding conventions
| bunny1 | bunny2 | why |
|---|---|---|
| hardcoded phase sequence in hop.ts | composable skill pipeline (array of functions) | rearrange without rewiring |
| shared working directory | worktree per hop (main is sacred) | isolation, parallel hops, clean rollback |
| no pre-spec exploration | brainstorm skill before specify | reduce spec drift, surface alternatives |
| verify only at end | spec-compliance review mid-pipeline | catch drift early, self-correct |
| no file-path enforcement | TDD gate: no src/ changes until test-gen | prevent premature implementation |
| memory opt-in | memory mandatory (gates) | never forget |
| no pre-hop context | routed memory load | start warm |
| full memory dump | area-routed, budget-capped | avoid "lost in the middle" |
| ruminate (in build context) | memorize (fresh agent, artifacts only) | unbiased memory, historian not participant |
| LLM edits markdown directly | JSON ops → deterministic apply | prevent format rot |
| github as state source | local-first, github as projection | works offline, no split-brain |
| wiki for memory | bny/memory/ in-repo | versioned with code |
| dev/bg daemon | stateless tick (any scheduler) | no long-running process |
| no steering | local + PR comment steering | mid-run direction |
| defects not tracked | structured defect catalog | learn from mistakes |
| roadmap.md only | issues as backlog (optional) | shared, linked |
| brane = separate system | memory = part of the loop | unified |
- wiki as state — memory is in-repo, not a separate git repo
- github as primary state — local-first. github is projection
- dev/bg as daemon — replaced by stateless tick
- ruminate as optional — replaced by mandatory memorize gate (fresh agent)
- brane as separate subsystem — memory IS the brane, built into the loop
- shared working directory — replaced by worktree per hop
- memory routing accuracy — keyword grep vs. LLM classification for area matching? grep is faster and dumber. LLM is slower and smarter. start with grep, upgrade if needed
- memory file growth — what happens when defects.md has 500 entries? pagination? archival? prune old fixed defects?
- multi-hop features — a feature spanning 3 hops creates 3 PRs. how do we link the narrative? issue as anchor (multiple PRs reference one issue)?
- brownfield bootstrap — for existing projects, `bny digest .` to seed memory? or manual curation?
- token budget tuning — 8k cap is a guess. need to measure actual useful context size
- memorize prompt engineering — the quality of memory depends entirely on the memorize prompt. this is the hardest prompt to get right
- run 10 hops sequentially. hop 10 should reference learnings from hop 1
- kill the process mid-hop. tick should detect and resume within 2 minutes
- drop steering via `bny steer`. next phase should reflect it
- introduce a known defect pattern. factory should catch it based on defect catalog
- unplug ethernet. factory should still complete the current hop (minus PR updates)
- `git log bny/memory/` should show memory growing and evolving across hops
- run two hops in parallel on different features. worktrees isolate them. main stays clean
- a failed hop leaves main untouched. worktree can be inspected. next hop starts clean
- ralph retries start from clean worktree state, not corrupted partial changes
- spec-compliance review catches plan drift before test-gen runs
- brainstorm.md shows alternatives explored before spec was written
- rearrange the skill pipeline array. factory executes the new order without code changes
for transparency, these are the key criticisms from gemini's adversarial review that shaped this converged design:
- "github as primary state is a distributed system in disguise" — accepted. flipped to local-first
- "wiki is a separate git repo, creates split-brain" — accepted. moved memory in-repo
- "cron pidfile guards are fragile" — accepted. added `/proc/$pid/cmdline` verification
- "LLMs will corrupt markdown structure" — accepted. added JSON ops → deterministic apply
- "memory dump causes lost-in-the-middle" — accepted. added area routing + token budget
- "factory should work offline" — accepted. github is projection, not dependency
- "steering only works at phase boundaries" — accepted. documented honestly
- "use a supervisor loop, not cron" — partially accepted. tick is scheduler-agnostic
rejected:
- "local steering only" — remote steering via PR comments is too valuable for unattended runs. worth the polling cost
- "labels are insane for state" — labels are fine for projection. they're not the source of truth, just the human-readable mirror
this is the converged design. ready for implementation planning when you are.