I've been running a pattern called /probe against AI-generated code before I write anything, and it keeps catching bugs the AI had no idea it was about to cause.
The shape is simple. Before I write code based on AI output, I force each AI-asserted fact into a numbered CLAIM with an EXPECTED value, then run a command against the real system to check. I capture the delta. Surprises become tests.
The core move: claims are the AI's own prior confidence, made auditable.
My tmux prefix+v binding (capture current pane scrollback, pipe through vim) stopped working for Claude Code sessions. Root cause: I'd set CLAUDE_CODE_NO_FLICKER=1 weeks ago to stop the scroll-jump flicker during streaming. That flag switches Claude into the terminal's alternate screen buffer. No scrollback. prefix+v captures one visible page, nothing more.
Fine. Pivot. Claude Code persists each session as a JSONL file under ~/.claude/projects/<encoded-cwd>/<session-uuid>.jsonl. I'll parse that directly. I asked Claude Code to describe the JSONL format and propose a shell function that pipes the most recent session through jq into vim. The AI confidently described the format. Two top-level types. Assistant content = text + tool_use. User content = array. Folder names replace / with -.
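The helper I had in mind is roughly this; a sketch only, with the function name and the naive cwd encoding as placeholders (that encoding is one of the things the probe corrects below):

```sh
# Sketch: open the most recent Claude Code session for the current project in vim.
# Uses the naive cwd encoding (straight / to - replacement) from the AI's description;
# the probe below shows the real scheme differs.
claude-last-session() {
  local dir latest
  dir="$HOME/.claude/projects/$(pwd | tr '/' '-')"
  latest=$(ls -t "$dir"/*.jsonl 2>/dev/null | head -n 1)
  [ -n "$latest" ] || { echo "no session found under $dir" >&2; return 1; }
  vim -R "$latest"   # the jq filter slots in here once the format is verified
}
```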
I ran /probe against that description before writing the jq filter. Four hallucinations fell out:
- AI said 2 top-level JSONL types (user, assistant). Reality: 7 types.
- AI said assistant content = text + tool_use. Missed `thinking` blocks, about a third of output in extended thinking mode.
- AI said user content is always an array. Actually polymorphic: string OR array.
- AI said folder naming replaces `/` with `-`. Actually: prepend a dash, then replace.
Each would have been a silent bug. The jq filter would have errored on string-form user content, dumped thinking blocks as garbage, and missed 5 of 7 message types entirely.
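For contrast, here is roughly what a filter looks like once it respects what the probe found rather than what the AI described. The `.message.content` path and the per-block field names (`.text`, `.thinking`, `.name`) are what I see in my own session files, so treat them as assumptions to probe, not gospel:

```sh
# Pull readable text out of a session, tolerating the shapes the probe actually found:
# polymorphic user content (string OR array) and thinking blocks alongside text/tool_use.
# The five other top-level types are skipped here on purpose.
jq -r '
  select(.type == "user" or .type == "assistant")
  | .message.content
  | if type == "string" then .
    elif type == "array" then
      .[] | if   .type == "text"     then .text
            elif .type == "thinking" then "[thinking] " + .thinking
            elif .type == "tool_use" then "[tool_use] " + .name
            else empty end
    else empty end
' session.jsonl
```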
The probe caught them because the AI had to write "EXPECTED: 2 types" before running `jq -r '.type' file.jsonl | sort -u`. Saying the number first makes the delta visible.
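In transcript form, a claim plus its probe is only a few lines (the filename is a stand-in for whichever session you point it at):

```sh
# CLAIM 1: the JSONL has two top-level types (user, assistant)
# EXPECTED: 2
jq -r '.type' file.jsonl | sort -u
# ACTUAL: 7 distinct types; delta recorded, and the surprise becomes the schema test below
```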
This is the generalization of the pattern. Mocks describe what you think the system does. Static analysis describes what the code says it does. The REPL shows what the system actually does when you poke it.
For the JSONL probe above, the "REPL" was just jq on a real file. For a Clojure service, it's a live nREPL. For a database, it's sqlite3 or psql. For an HTTP API, it's curl. The concrete tool varies. The discipline is the same: write the claim, then hit the running system, then diff.
If you already work in a Lisp, Python, or Ruby REPL, you recognize this as REPL-driven development applied to AI-generated specs. The AI proposes a shape of the world. You evaluate a cheap expression against the real world. The expression's output overrides the AI's description. That override value becomes the oracle that your tests lock in.
The pattern doesn't require a REPL in the Lisp sense. It requires a cheap way to ask the real system a question and get back a string you can trust. Whatever that is for your stack, use it.
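The same move against an HTTP API, for example, is a single curl pipeline. The endpoint and field name here are invented; only the shape matters:

```sh
# CLAIM: every record from /users carries an "email" field
# EXPECTED: 0 records missing it
curl -s https://api.example.com/users \
  | jq '[.[] | select(has("email") | not)] | length'
# Whatever this prints is the oracle. If it is not 0, the claim was wrong.
```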
If the AI says "I'm not sure, check this one," you already know to check. If it flatly states X with no hedge, you don't. The high-confidence claims are the ones worth probing, because that's where the hallucinations hide: the places where the AI is confident and wrong in some small load-bearing way.
This is the payoff I keep coming back to.
Traditional TDD: you write the test based on what you THINK should happen. Probe-driven TDD: you write the test based on what you VERIFIED happens.
The probe step converts "my mental model of the system" into "an oracle value from the actual system." The test's expected value is no longer your guess. It is the system's observed behavior, captured once, reused forever. Mocks test your model of the system. The probe tests the system itself.
The point of the probe is not the probe. It's what the probe becomes.
| Probe finding | Guardrail |
|---|---|
| "7 top-level types: [list]" | schema test that fails CI if a new type appears |
| "content blocks: text / thinking / tool_use" | exhaustiveness test in the parser's case dispatch |
| "user content is string OR array" | property test that fuzzes both shapes |
| "folder naming: prepend dash + replace /" | unit test with oracle value |
When the upstream format changes, the test fails, I re-probe, and the oracle updates. That is how probe findings become durable.
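The first row of that table is maybe ten lines of shell. The known_types.txt oracle file and the glob are mine; substitute whatever your probe actually captured:

```sh
# Fails CI when a new top-level type shows up in any session file.
# known_types.txt holds the oracle from the probe; when this fails, re-probe and update it.
observed=$(jq -r '.type' ~/.claude/projects/*/*.jsonl | sort -u)
known=$(sort -u known_types.txt)
if [ "$observed" != "$known" ]; then
  echo "top-level JSONL types changed; re-probe and update the oracle:" >&2
  printf 'expected:\n%s\nobserved:\n%s\n' "$known" "$observed" >&2
  exit 1
fi
```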
The probe only catches claims the AI thinks to make. A few things help narrow the gap:
- Discover before asserting. Run `jq 'keys'` or `jq '.type' | sort -u` first to enumerate reality, then generate claims about what you observed. This converts unknowns into knowns before the claim-generation step.
- Generate the gap list. Dex Horthy's CRISPY pattern (HumanLayer) pushes the AI to surface questions it cannot answer from the codebase. The valuable output is the gap list.
- [NEEDS CLARIFICATION] markers. GitHub's Spec Kit uses this convention to force the AI to write the string literally where it's guessing. It makes blind spots textual.
- Human veto on the claim list. The AI can't see its own blind spots. I can. A thirty-second scan of the claim list catches things the AI would never have surfaced.
Running this specific probe exposed a gap in my own /probe skill definition: the protocol jumped straight from "state what's unknown" to "generate claims." No enumeration step. Claims were being generated from memory and context, not from enumerated reality. That is exactly the unknown-unknowns failure mode above, baked into the tool.
I added "Step 2: Enumerate-First Discovery" to the skill. Before writing claims, run jq 'keys', .schema, hexdump, or whatever the domain calls for. Enumerate first. THEN write claims about the observed shape, not the imagined shape. Rule I wrote into the skill: "never write a claim about a field, type, or key you have not first enumerated from a real instance."
The probe session caught a flaw in the probe skill itself. That is also the pattern working.
There's no /probe binary. Any AI will do this if you ask it to. The steps:
- Extract the AI's factual claims, numbered.
- Attach an EXPECTED value to each.
- Run a real command against the real system.
- Diff.
- Turn surprises into tests.
The full skill file (the one I load into Claude Code as a slash command) is in probe-skill.md in this gist. It has the 7-step protocol, enumeration commands by domain, the gate check, and integration points with other skills I built in my workflow.
The AI is a better collaborator when its confidence is legible. /probe is the cheapest way I've found to make it legible.