
The /probe Pattern: Catching AI Hallucinations Before They Become Code

I've been running a pattern called /probe against AI-generated code before I write anything, and it keeps catching bugs the AI had no idea it was about to cause.

The shape is simple. Before I write code based on AI output, I force each AI-asserted fact into a numbered CLAIM with an EXPECTED value, then run a command against the real system to check. I capture the delta. Surprises become tests.

The core move: claims are the AI's own prior confidence, made auditable.
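In miniature, using the record format from the skill file at the end of this post (the claim and values here are hypothetical):

```
CLAIM 1: the session log is JSONL, one JSON object per line
METHOD: head -3 session.jsonl | jq type
EXPECTED: "object" on every line
RESULT: "object" (x3)
STATUS: PROVEN
```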

The anchor story

My tmux prefix+v binding (capture current pane scrollback, pipe through vim) stopped working for Claude Code sessions. Root cause: I'd set CLAUDE_CODE_NO_FLICKER=1 weeks ago to stop the scroll-jump flicker during streaming. That flag switches Claude into the terminal's alternate screen buffer. No scrollback. prefix+v captures one visible page, nothing more.

Fine. Pivot. Claude Code persists each session as a JSONL file under ~/.claude/projects/<encoded-cwd>/<session-uuid>.jsonl. I'll parse that directly. I asked Claude Code to describe the JSONL format and propose a shell function that pipes the most recent session through jq into vim. The AI confidently described the format. Two top-level types. Assistant content = text + tool_use. User content = array. Folder names replace / with -.

I ran /probe against that description before writing the jq filter. Four hallucinations fell out:

  1. AI said 2 top-level JSONL types (user, assistant). Reality: 7 types.
  2. AI said assistant content = text + tool_use. Missed thinking blocks, about a third of output in extended thinking mode.
  3. AI said user content is always an array. Actually polymorphic: string OR array.
  4. AI said folder naming replaces / with -. Actually prepend dash, then replace.

Each would have been a silent bug. The jq filter would have errored on string-form user content, dumped thinking blocks as garbage, and missed 5 of 7 message types entirely.

The probe caught them because the AI had to write "EXPECTED: 2 types" before running jq -r '.type' file.jsonl | sort -u. Saying the number first makes the delta visible.
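For reference, here's roughly the shape the corrected filter has to take to survive all four findings. This is a sketch under assumptions, not the exact function I ship: the field paths (`.message.content`, `.thinking`, `.name`) are guesses you should re-probe on your own files before trusting.

```bash
# Sketch: pipe the most recent Claude Code session through jq into vim.
# Field paths are assumptions -- probe your own JSONL first.
claude-scrollback() {
  local f
  f=$(ls -t ~/.claude/projects/*/*.jsonl 2>/dev/null | head -1) || return 1
  jq -r '
    # only 2 of the 7 top-level types carry conversation content
    select(.type == "user" or .type == "assistant")
    | (.message.content // empty)
    | if type == "string" then .                  # user content: bare string form
      else map(                                   # ...or array-of-blocks form
        if   .type == "text"     then .text
        elif .type == "thinking" then "[thinking]\n" + (.thinking // "")
        elif .type == "tool_use" then "[tool_use: " + .name + "]"
        else "[" + .type + "]"                    # unknown block kinds stay visible
        end
      ) | join("\n\n")
      end
  ' "$f" | vim -R -
}
```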

The REPL is the ground-truth oracle

This is the generalization of the pattern. Mocks describe what you think the system does. Static analysis describes what the code says it does. The REPL shows what the system actually does when you poke it.

For the JSONL probe above, the "REPL" was just jq on a real file. For a Clojure service, it's a live nREPL. For a database, it's sqlite3 or psql. For an HTTP API, it's curl. The concrete tool varies. The discipline is the same: write the claim, then hit the running system, then diff.
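For example, probing a database claim looks like this (table and column names hypothetical):

```bash
# CLAIM: users.email is NOT NULL
# EXPECTED: is_nullable = NO
psql mydb -c "SELECT column_name, data_type, is_nullable
              FROM information_schema.columns
              WHERE table_name = 'users' AND column_name = 'email';"
# RESULT goes in the probe doc; a surprise here becomes a test.
```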

If you already work in a Lisp, Python, or Ruby REPL, you recognize this as REPL-driven development applied to AI-generated specs. The AI proposes a shape of the world. You evaluate a cheap expression against the real world. The expression's output overrides the AI's description. That override value becomes the oracle that your tests lock in.

The pattern doesn't require a REPL in the Lisp sense. It requires a cheap way to ask the real system a question and get back a string you can trust. Whatever that is for your stack, use it.

Why the highest-confidence claims are where the hallucinations live

If the AI says "I'm not sure, check this one," you already know to check. If it flatly states X with no hedge, you don't. The high-confidence claims are the ones worth probing, because that's where the hallucinations hide: the places where the AI is confident and wrong in some small load-bearing way.

Traditional TDD vs probe-driven TDD

This is the payoff I keep coming back to.

Traditional TDD: you write the test based on what you THINK should happen. Probe-driven TDD: you write the test based on what you VERIFIED happens.

The probe step converts "my mental model of the system" into "an oracle value from the actual system." The test's expected value is no longer your guess. It is the system's observed behavior, captured once, reused forever. Mocks test your model of the system. The probe tests the system itself.

One probe, N permanent guardrails

The point of the probe is not the probe. It's what the probe becomes.

| Probe finding | Guardrail |
|---|---|
| "7 top-level types: [list]" | schema test that fails CI if a new type appears |
| "content blocks: text / thinking / tool_use" | exhaustiveness test in the parser's case dispatch |
| "user content is string OR array" | property test that fuzzes both shapes |
| "folder naming: prepend dash + replace /" | unit test with oracle value |

When the upstream format changes, the test fails, I re-probe, and the oracle updates. That is how probe findings become durable.
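The first row of that table, as a CI sketch. The known-types list has to come from your own probe doc; only two of the seven types are named in this post, so the rest are left as a placeholder:

```bash
#!/usr/bin/env bash
# Guardrail sketch: fail CI if the JSONL grows a top-level type the probe
# never enumerated. Fill KNOWN from the probe doc's oracle list.
set -euo pipefail
file="$1"
KNOWN="assistant
user"                                   # ...plus the 5 other probed types
actual=$(jq -r '.type' "$file" | sort -u)
unknown=$(comm -13 <(echo "$KNOWN" | sort) <(echo "$actual"))
if [ -n "$unknown" ]; then
  echo "top-level type(s) not in oracle: $unknown" >&2
  exit 1
fi
```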

The limit: unknown unknowns

The probe only catches claims the AI thinks to make. A few things help narrow the gap:

  • Discover before asserting. Run jq 'keys' or jq '.type' | sort -u first to enumerate reality, then generate claims about what you observed. This converts unknowns into knowns before the claim-generation step.
  • Generate the gap list. Dex Horthy's CRISPY pattern (HumanLayer) pushes the AI to surface questions it cannot answer from the codebase. The valuable output is the gap list.
  • [NEEDS CLARIFICATION] markers. GitHub's Spec Kit uses this convention to force the AI to write the string literally where it's guessing. It makes blind spots textual.
  • Human veto on the claim list. The AI can't see its own blind spots. I can. A thirty-second scan of the claim list catches things the AI would never have surfaced.

Postscript: the probe improved itself

Running this specific probe exposed a gap in my own /probe skill definition: the protocol jumped straight from "state what's unknown" to "generate claims." No enumeration step. Claims were being generated from memory and context, not from enumerated reality. That is exactly the unknown-unknowns failure mode above, baked into the tool.

I added "Step 2: Enumerate-First Discovery" to the skill. Before writing claims, run jq 'keys', .schema, hexdump, or whatever the domain calls for. Enumerate first. THEN write claims about the observed shape, not the imagined shape. Rule I wrote into the skill: "never write a claim about a field, type, or key you have not first enumerated from a real instance."

The probe session caught a flaw in the probe skill itself. That is also the pattern working.

It's a workflow, not a tool

There's no /probe binary. Any AI will do this if you ask it to. The steps:

  1. Extract the AI's factual claims, numbered.
  2. Attach an EXPECTED value to each.
  3. Run a real command against the real system.
  4. Diff.
  5. Turn surprises into tests.

The full skill file (the one I load into Claude Code as a slash command) is in probe-skill.md in this gist. It has the 7-step protocol, enumeration commands by domain, the gate check, and integration points with other skills I built in my workflow.

The AI is a better collaborator when its confidence is legible. /probe is the cheapest way I've found to make it legible.

---
name: probe
description: Structured REPL spike to verify facts before writing specs. Produces verified oracle values, not prose. Triggers on: probe, spike first, verify facts, check assumptions, what don't we know.
---

/probe: Verify Facts Against Reality Before Writing Specs

A mandatory pre-spec phase that produces REPL-verified facts, not opinions. The output is scratch/<slug>-probe.md, a short file of empirical claims, each tagged with how it was verified. This file becomes the oracle source for downstream task generation.

Guiding Principles

  • Boris Cherny (built Claude Code): "80% Plan Mode, 3 annotation rounds, you don't trust, you instrument"
  • Dex Horthy (HumanLayer): "A bad line of research produces thousands of bad lines of code"
  • Addy Osmani (Google): "Waterfall in 15 minutes, have AI interrogate YOU about edge cases before you finalize the spec"
  • Martin Fowler (Thoughtworks): "Match spec complexity to task complexity"; a one-line bug fix doesn't need a full probe
  • Rich Hickey: Hammock time: sit with the problem before coding. The probe IS the hammock, but with a REPL.
  • CRISPY (Dex Horthy): Generate questions the AI CAN'T answer from the codebase. The valuable output is the gap list, not the knowledge.
  • Spec Kit (GitHub): Use [NEEDS CLARIFICATION] markers, force the AI to flag ambiguity instead of guessing.

Why This Exists

Specs fail when they contain unverified assumptions. When the expected values in a test come from what the AI thought the system does (instead of what the system actually does when you hit it), you get wrong oracle values, wrong struct sizes, wrong field names. Every one of those would have been caught by 10 minutes of probing real data.

When to Use

REQUIRED when the spec will contain:

  • Binary format details (byte offsets, struct sizes, field types)
  • Oracle values (expected outputs for differential testing)
  • Integration assumptions (what does the other system expect?)
  • Legacy code behavior (what does the old code actually return?)
  • API contracts (what does the endpoint actually send?)

SKIP when:

  • Pure refactor (no new behavior, existing tests cover it)
  • Documentation only
  • UI tweak with no backend changes

The Protocol

Step 1: Human States What's Unknown (2 min)

Ask the user: "What facts does the spec need that we have NOT yet verified against reality?"

If the user isn't sure, help by scanning the work order and identifying claims that depend on:

  • Byte layouts, struct sizes, field offsets
  • Return values of existing functions
  • Behavior on edge cases (NULL, empty, deleted, out-of-range)
  • Performance characteristics at production scale
  • Integration contracts (what format does the caller expect?)

Step 2: Enumerate-First Discovery (5 min)

Before generating claims, enumerate reality to discover what exists.

AI agents can only probe claims they think to make. A field or type the AI has never seen will never appear in its claim list. Enumeration converts unknown-unknowns into known-knowns BEFORE the claim-generation step.

For each unknown from Step 1, run a cheap shape-probe FIRST. Record what you find. THEN write claims against the observed shape, not the imagined shape.

Enumeration commands by domain:

| Domain | Enumeration command |
|---|---|
| JSON/JSONL | `jq 'keys'`, `jq '.field' \| sort -u`, `jq '.type' \| sort \| uniq -c` |
| SQLite | `.schema`, `.tables`, `PRAGMA table_info(T)` |
| PostgreSQL | `\d+ tablename`, `SELECT column_name, data_type FROM information_schema.columns WHERE table_name='T'` |
| API response | `curl ... \| jq 'keys'`, then drill into each key |
| Binary format | `hexdump -C -n 512 file \| head`, `file <path>` |
| Directory | `find . -type f -name '*.X' \| wc -l`, `ls -la` |
| C/C++ header | `grep -E '^(class\|struct\|enum)' foo.h` |
| Clojure | `(keys m)`, `(type x)`, `(set (map type coll))` |
| YAML/config | `yq 'keys'`, `env \| grep PREFIX` |

Rule: never write a claim about a field, type, or key you have not first enumerated from a real instance. The discovery output IS the claim-generation input. You are not allowed to invent content-shape names from memory.

Record enumeration results inline in the probe doc under a ## Enumeration section (before ## Verified Facts) so future readers can see what the claim list was generated FROM.
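Recorded, that looks something like this (the counts and the third type name are hypothetical, and the block-kind path is an assumption to re-verify):

```
## Enumeration
$ jq -r '.type' session.jsonl | sort | uniq -c
    412 assistant
      9 summary        <- hypothetical; use whatever your file actually shows
    398 user
    ...
$ jq -r 'select(.type=="assistant") | .message.content[]?.type' session.jsonl | sort -u
text
thinking
tool_use
```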

Step 3: Generate Probe Plan (3 min)

Produce a numbered list of 5-15 specific empirical claims to verify. Each claim has:

CLAIM N: [specific factual statement]
METHOD: [exact command, REPL eval, hexdump, curl, grep, sizeof, ls]
EXPECTED: [what we think the answer is, or "UNKNOWN"]

The probe plan is environment-aware:

  • Clojure service: REPL eval, slurp+parse, verify against live state
  • HTTP API: curl endpoints, check rate limits, verify response schema
  • Binary format: hex-dump the file, count bytes, check magic numbers
  • Database: run actual queries, inspect schema, compare to assumptions
  • Browser: Chrome CDP or playwright to inspect running app behavior
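A filled-in entry from the JSONL probe in the story above (the `.message.content` path is my assumption of where content lives):

```
CLAIM 3: user message content is always an array
METHOD: jq -r 'select(.type=="user") | .message.content | type' session.jsonl | sort -u
EXPECTED: array
```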

Step 4: Execute Probes (5-20 min)

Run each probe. For REPL-accessible probes, generate and execute the commands. For probes requiring human action (open an app, look at a hex dump, check running system), ask the human to do it and report back.

Critical rule: The human MUST touch the real system at least once. If it's a binary format, hex-dump the file. If it's an API, curl it. If it's legacy code, open the header/source file. The AI cannot substitute for physical contact with reality.

Record each result immediately:

CLAIM N: [statement]
METHOD: [command run]
RESULT: [actual output]
STATUS: PROVEN | DISPROVEN | MODIFIED | UNKNOWN
NOTES: [if disproven, what's the correct value]
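Continuing the Step 3 example, the executed record for the polymorphism claim:

```
CLAIM 3: user message content is always an array
METHOD: jq -r 'select(.type=="user") | .message.content | type' session.jsonl | sort -u
RESULT: array
        string
STATUS: MODIFIED
NOTES: content is polymorphic (string OR array); the parser must branch on type
```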

Step 5: Write Probe Findings (2 min)

Save to scratch/<slug>-probe.md in the project directory. Format:

# Probe: <feature name>
Date: YYYY-MM-DD

## Verified Facts
| # | Claim | Verified Value | Method | Status |
|---|-------|---------------|--------|--------|
| 1 | JSONL has N top-level types | 7 types | jq '.type' file.jsonl | sort -u | PROVEN |
| 2 | assistant content block kinds | text, thinking, tool_use | jq on 100 rows | PROVEN |
| 3 | user content polymorphism | string OR array | type check on .content | MODIFIED |

## Surprises
- [anything that contradicted expectations]
- [anything the work order didn't mention but matters]

## What Could Go Wrong (max 10 lines)
1. [risk 1]
2. [risk 2]
3. [risk 3]

## Oracle Values for Spec
[Specific expected outputs that the spec should use in acceptance criteria]

## [NEEDS CLARIFICATION] (Spec Kit pattern)
- [questions that remain unanswered; these MUST be resolved before writing the spec]

Step 6: Gate Check

Before proceeding to spec/task generation, verify:

  • Zero [NEEDS CLARIFICATION] items remaining (all resolved or explicitly deferred)
  • All oracle values have REPL-verified sources (not code comments, not docs, not AI guesses)
  • "What Could Go Wrong" list reviewed by human
  • Human has physically touched the real system at least once

If the gate fails, resolve the remaining items or explicitly mark them as risks.
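The first gate item is mechanically checkable. A sketch, assuming the Step 5 template above (items are `- ` bullets under the final section; the file path is hypothetical):

```bash
# Gate sketch: fail while unresolved bullets remain under the
# [NEEDS CLARIFICATION] section of the probe doc.
remaining=$(awk '/^## \[NEEDS CLARIFICATION\]/{f=1; next} f && /^- /' scratch/my-feature-probe.md)
if [ -n "$remaining" ]; then
  printf 'unresolved clarification items:\n%s\n' "$remaining" >&2
  exit 1
fi
```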

Step 7: Optional Hammock Time

Ask: "Proceed to spec generation, or sleep on it?"

Hickey's hammock time is real. If the probe found surprises, sleeping on it and coming back tomorrow with fresh eyes costs 2 minutes of re-reading the probe doc but may prevent a fundamental design mistake.

Integration with Other Skills

This skill sits in a pipeline of custom Claude Code skills I built. None of these are Claude Code built-ins. You can wire your own equivalents; what matters is that /probe occupies the slot BETWEEN "requirements are clear" and "tasks are generated."

Upstream: /intent-to-contract

A custom skill that turns rough user intent into a locked work order (requirements clarification). The locked work order feeds /probe as context. The probe does NOT re-ask "what are we building?"; that's settled. The probe asks "what don't we know about the system we're building ON?"

Downstream: /prd-tasks

A custom skill that generates a spec + prescriptive task list from a work order (task generation). /prd-tasks MUST read the probe file if it exists.

  • Oracle values in acceptance criteria come from the probe, not AI guesses
  • [NEEDS CLARIFICATION] items from the probe become explicit spike tasks in the spec
  • The probe's "What Could Go Wrong" feeds into the spec's risk section

Related: /prove

A custom skill that proves code matches oracle values via REPL verification at implementation time. The probe identifies WHAT to prove. /prove executes the proof. The probe's oracle values become /prove's expected outputs.

Related: /linus-review

A custom skill that runs a Linus-flavored code review after implementation. /linus-review checks code quality AFTER implementation. /probe checks assumptions BEFORE implementation. They are complementary, not competing.

Anti-Patterns

  • Don't write a design doc. The probe is facts, not architecture decisions.
  • Don't spend >60 min. If you're still probing after an hour, the feature is too big, split it.
  • Don't let the AI guess oracle values. Every value must have a REPL session excerpt or command output backing it.
  • Don't skip the human. The AI can generate REPL commands, but the human must review the results and name what's surprising.
  • Don't probe trivial features. If the spec has no oracle values and no integration boundaries, skip /probe and go straight to task generation.