How to dry-run a multi-step AI agent pipeline against recorded fixtures so go test and CI never hit the model, the CRM, or a real inbox.
Last tested: June 2026. See Changelog at the bottom.
If this saves you from burning tokens in CI, follow @renezander030 — production notes on testing agents that take real actions.
Working implementation: github.com/renezander030/draftcat (Go, MIT) — the draftcat test command below ships in the repo.
The problem: your agent pipeline calls an LLM, then writes to a CRM, then emails a customer. You cannot put that in CI. Mocking the SDK client is brittle, and a recorded HTTP cassette (VCR-style) breaks the moment a prompt changes. What you actually want is to drive each step from a small JSON file and assert the pipeline routes correctly, with no network at all.
| Goal | Do this |
|---|---|
| Run a pipeline with no network | draftcat test <pipeline> — reads fixtures/<pipeline>/<step>.json |
| Feed a deterministic step its data | fixtures/<pipeline>/<step>.json = {"data": {...}} |
| Feed an AI step a model response | {"text": "<verbatim model JSON>"} (validated against the skill schema) |
| Drive a human-approval step | {"action": "approve|skip|adjust", "text": "..."} |
| Seed the pipeline before step 1 | optional fixtures/<pipeline>/_input.json |
| Force a rejection path | draftcat test <pipeline> --reject |
| Fail CI on a missing AI fixture | AI step with no fixture exits non-zero (by design) |
The rule of thumb: one fixture file per step that needs input, named after the step. Deterministic steps load a data map, AI steps load a recorded response string, approval steps load a decision. No fixture, no network — the runner stubs the channel.
One command, no flags, against the example pipeline shipped in the repo:
git clone https://github.com/renezander030/draftcat && cd draftcat
go build -o draftcat .
./draftcat test test-screener

That walks a 4-step pipeline (deterministic input > AI classify > human approval > deterministic log) entirely from fixtures/test-screener/. Exit code is 0 on success, non-zero on any step error, so it drops straight into go test or a CI job.
fixtures/<pipeline-name>/
  _input.json        (optional) merged into the data map before any step runs
  <step-name>.json   one file per step that needs input
The runner joins fixtures/ + the pipeline name, then looks up <step>.json as it walks each step. Anything it does not find, it stubs.
Each step type reads a different shape. This is the whole contract:
| Step type | Fixture shape | If the file is missing |
|---|---|---|
| deterministic | {"data": { ... }} — merged into the pipeline data map | no-op (skipped) |
| ai | {"text": "<verbatim model response>"} — validated against the skill's output schema | error, non-zero exit — you must supply it |
| approval | {"action": "approve|skip|adjust", "text": "<optional feedback>"} | auto-approve (or skip with --reject) |
The asymmetry is deliberate. A missing AI fixture is a hard error because a silently-skipped model call would make a green test meaningless. A missing approval fixture defaults to approve so the happy path needs zero ceremony, and you opt into the rejection path explicitly.
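The missing-fixture rules can be written down as one small decision function. This is an illustrative sketch only: the outcome type, its constants, and onMissingFixture are hypothetical names, and the reject parameter stands in for the --reject flag.

```go
package main

import "fmt"

// outcome is what a test runner could do when a step has no fixture file.
// All names in this block are illustrative, not the repo's API.
type outcome string

const (
	outcomeNoop    outcome = "noop"    // deterministic: skip the step
	outcomeFail    outcome = "fail"    // ai: hard error, non-zero exit
	outcomeApprove outcome = "approve" // approval: happy-path default
	outcomeSkip    outcome = "skip"    // approval with --reject
)

// onMissingFixture encodes the "file is missing" column of the table.
func onMissingFixture(stepType string, reject bool) (outcome, error) {
	switch stepType {
	case "deterministic":
		return outcomeNoop, nil
	case "ai":
		return outcomeFail, nil
	case "approval":
		if reject {
			return outcomeSkip, nil
		}
		return outcomeApprove, nil
	default:
		return "", fmt.Errorf("unknown step type %q", stepType)
	}
}

func main() {
	for _, t := range []string{"deterministic", "ai", "approval"} {
		o, _ := onMissingFixture(t, false)
		fmt.Printf("%s -> %s\n", t, o)
	}
}
```

Only the approval branch looks at the reject flag, which matches the idea that the rejection path is something you opt into.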
fixtures/test-screener/mock-input.json — seed a deterministic step:
{
"data": {
"input": "Title: Senior AI/LLM Engineer - Build RAG Pipeline for Legal Documents\nRate: $80-120/hr\nSkills: RAG, Vector Databases, Claude API, TypeScript"
}
}

fixtures/test-screener/classify.json — a recorded model response. Note the response is a JSON string, exactly what the model would emit, so the schema validator runs on it the same way it does in production:

{ "text": "{\"score\": 4, \"reason\": \"Strong fit: RAG + Claude API + TypeScript\", \"reject\": false}" }

fixtures/test-screener/review.json — the operator decision:

{ "action": "approve" }

The fixture names map 1:1 to step names in config.yaml:
- name: test-screener
  schedule: manual
  steps:
    - name: mock-input        # deterministic > mock-input.json
      type: deterministic
    - name: classify          # ai > classify.json (validated vs skill schema)
      type: ai
      skill: classify-job
      vars:
        profile: "AI/LLM, TypeScript, Go, cloud infra. $60/hr min."
    - name: review            # approval > review.json
      type: approval
      mode: hitl
      channel: telegram
    - name: log-result        # deterministic, no fixture > no-op in test mode
      type: deterministic

The whole trick is swapping the AI call, the connectors, and the approval channel for fixture lookups while walking the real step list. Production routing logic runs; only the side effects are stubbed.
// runTestPipeline walks pipeline steps using fixtures instead of real
// connectors / AI / approval. Returns a process exit code.
func runTestPipeline(cfg *config.Config, p config.PipelineConfig,
	skills *skillsapi.SkillRegistry, ch *stubChannel, fixDir string) int {
	data := map[string]interface{}{}
	for _, step := range p.Steps {
		switch step.Type {
		case "ai":
			// resolve the skill prompt + output schema, render {{vars}}...
			fix, found, err := loadFixture(fixDir, step.Name)
			if err != nil {
				fmt.Fprintf(os.Stderr, " [error] %v\n", err)
				return 1
			}
			if !found {
				fmt.Fprintf(os.Stderr, " [error] ai step %q has no fixture "+
					"%s/%s.json — supply {\"text\": \"...\"}\n", step.Name, fixDir, step.Name)
				return 1 // missing AI fixture is fatal, not a silent pass
			}
			text, _ := fix["text"].(string)
			if text == "" {
				fmt.Fprintf(os.Stderr, " [error] fixture %s/%s.json: "+
					"missing or empty 'text' field\n", fixDir, step.Name)
				return 1
			}
			// same schema validator as production:
			parsed, err := validateOutput(text, schema)
			if err != nil {
				fmt.Fprintf(os.Stderr, " [error] output validation failed: %v\n", err)
				return 1
			}
			data["ai_output"] = parsed
		case "approval":
			fix, found, _ := loadFixture(fixDir, step.Name)
			if found {
				action, _ := fix["action"].(string)
				if action == "" {
					action = "approve"
				}
				ch.decisions = append(ch.decisions, OperatorDecision{Action: action})
			}
			decision, _ := ch.SendForApproval(context.Background(), draftMsg)
			data["approved"] = decision.Action == "approve"
		case "deterministic":
			// load {"data": {...}} if present, else no-op
		}
	}
	return 0
}

stubChannel implements the same approval-channel interface as the Telegram/Slack channel, but answers from the fixture instead of a real human. The schema validator (validateOutput) is the identical function production uses, so a recorded response that would fail validation in prod also fails the test.
$ ./draftcat test test-screener
[test] pipeline=test-screener fixtures=fixtures/test-screener
[step:mock-input] type=deterministic
[loaded] fixture into data
[step:classify] type=ai
[prompt] Score this job 1-5 for fit. Profile: AI/LLM, TypeScript, Go... Job: Senior AI/LLM Engineer...
[output] map[reason:Strong fit: RAG + Claude API + TypeScript score:4 reject:false]
[step:review] type=approval
[approval-draft]
[test] draft: map[reason:Strong fit... score:4 reject:false]
[approval-decision] approve (from fixture)
[step:log-result] type=deterministic
[skip] no fixture fixtures/test-screener/log-result.json (deterministic step is a no-op in test mode)
[final data]
{ "ai_output": {"reason":"Strong fit...","reject":false,"score":4}, "approved": true, ... }
[test] OK
Pass criteria: last line is [test] OK and exit code is 0. Pipe it into a test:
func TestScreenerPipeline(t *testing.T) {
	if code := runTest([]string{"test-screener"}); code != 0 {
		t.Fatalf("test-screener pipeline failed, exit=%d", code)
	}
}

No secrets, no model key, no network. GitHub Actions:
name: pipeline-fixtures
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with: { go-version: '1.22' }
      - run: go build -o draftcat .
      - run: ./draftcat test test-screener            # happy path
      - run: ./draftcat test test-screener --reject   # rejection path routes correctly

Because there is no API key in the job, a leaked-fixture PR can never spend money or email a customer from CI.
test: pipeline "X" not found in config.yaml
The pipeline name argument does not match any name: under pipelines:. Names are case-sensitive.
[error] ai step "classify" has no fixture fixtures/test-screener/classify.json — supply {"text": "..."}
Every ai step needs a recorded response. Create the file with a text field containing the verbatim model output.
[error] fixture .../classify.json: missing or empty 'text' field
The file exists but text is absent or empty. The runner will not invent a response.
[error] output validation failed: ...
Your recorded text does not satisfy the skill's output schema (wrong type, missing required key, enum violation). This is the test doing its job: the same response would fail in production.
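To see how the three failure classes (wrong type, missing required key, enum violation) surface, here is a minimal sketch of a validator. It is not the repo's validateOutput: the schema and fieldSpec types are hypothetical, and the classify-job schema is inferred from the fixture shown earlier rather than taken from the skill definition.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// fieldSpec and schema are hypothetical types for this sketch.
type fieldSpec struct {
	Kind string   // "number", "string", or "boolean"
	Enum []string // optional allowed values, strings only
}

// Every key listed in a schema is treated as required.
type schema map[string]fieldSpec

// classifySchema is inferred from the classify.json fixture (assumption).
var classifySchema = schema{
	"score":  {Kind: "number"},
	"reason": {Kind: "string"},
	"reject": {Kind: "boolean"},
}

// validateOutput decodes the recorded text (a JSON string inside the
// fixture) and checks each required key against its spec.
func validateOutput(text string, s schema) (map[string]interface{}, error) {
	var out map[string]interface{}
	if err := json.Unmarshal([]byte(text), &out); err != nil {
		return nil, fmt.Errorf("response is not a JSON object: %w", err)
	}
	for key, spec := range s {
		v, ok := out[key]
		if !ok {
			return nil, fmt.Errorf("missing required key %q", key)
		}
		if err := checkField(key, v, spec); err != nil {
			return nil, err
		}
	}
	return out, nil
}

func checkField(key string, v interface{}, spec fieldSpec) error {
	switch spec.Kind {
	case "number":
		if _, ok := v.(float64); !ok { // encoding/json decodes numbers as float64
			return fmt.Errorf("key %q: want number, got %T", key, v)
		}
	case "boolean":
		if _, ok := v.(bool); !ok {
			return fmt.Errorf("key %q: want boolean, got %T", key, v)
		}
	case "string":
		str, ok := v.(string)
		if !ok {
			return fmt.Errorf("key %q: want string, got %T", key, v)
		}
		if len(spec.Enum) == 0 {
			return nil
		}
		for _, allowed := range spec.Enum {
			if str == allowed {
				return nil
			}
		}
		return fmt.Errorf("key %q: %q is not one of %v", key, str, spec.Enum)
	default:
		return fmt.Errorf("key %q: unsupported kind %q", key, spec.Kind)
	}
	return nil
}

func main() {
	good := `{"score": 4, "reason": "Strong fit: RAG + Claude API + TypeScript", "reject": false}`
	bad := `{"score": "4", "reason": "Strong fit", "reject": false}`
	for _, text := range []string{good, bad} {
		_, err := validateOutput(text, classifySchema)
		fmt.Println(err)
	}
}
```

The second input fails because the score is a string. A model that starts quoting its numbers after a prompt edit is exactly the drift a recorded fixture plus a real validator is meant to catch.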
- [test] pipeline=... fixtures=<dir> not printed > pipeline name typo, see error above.
- Stops at an ai step > missing or empty fixture, or schema mismatch. Check the [error] line.
- Pipeline ends early at an approval step > a fixture set "action": "skip", or you passed --reject.
- A deterministic step shows [skip] no fixture > expected; deterministic steps are no-ops without a {"data": ...} file.
- Green locally, red in CI > you have an untracked fixture file. Check git status fixtures/.
- Vs mocking the SDK client: you mock the boundary you do not control and it drifts from the real client's behavior. Fixtures mock the data, and the real schema validator still runs.
- Vs recorded HTTP cassettes (VCR): a cassette is keyed on the exact request bytes, so it shatters when you tweak a prompt. A fixture is keyed on the step name, so prompt edits do not invalidate it.
- Vs a live "cheap model" in CI: still nondeterministic, still costs money, still needs a key in the runner. Fixtures are byte-stable and free.
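The cassette-versus-fixture difference comes down to what the lookup key is derived from. The sketch below assumes a strict cassette matcher that hashes the full request body. Real VCR libraries usually offer configurable matchers, so treat this as the worst case. Both function names are hypothetical.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// cassetteKey mimics a strict request matcher: the key is derived from
// the exact request bytes, so any prompt edit produces a different key.
func cassetteKey(requestBody []byte) string {
	sum := sha256.Sum256(requestBody)
	return hex.EncodeToString(sum[:])
}

// fixtureKey depends only on the pipeline and step names, never on the prompt.
func fixtureKey(pipeline, step string) string {
	return "fixtures/" + pipeline + "/" + step + ".json"
}

func main() {
	before := []byte(`{"prompt":"Score this job 1-5 for fit."}`)
	after := []byte(`{"prompt":"Score this job 1-5 for fit. Be strict."}`)

	// A two-word prompt tweak invalidates the recorded cassette entry.
	fmt.Println(cassetteKey(before) == cassetteKey(after)) // false

	// The fixture path is the same before and after the tweak.
	fmt.Println(fixtureKey("test-screener", "classify")) // fixtures/test-screener/classify.json
}
```

The trade-off is that a name-keyed fixture will not notice a prompt change at all, so the schema validator is the only thing standing between a stale fixture and a green test.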
This is Production AI Automation Notes #11. The series covers approval gates, token budgets, SQLite dedup, prompt-injection defense, PDF cite verification, and deterministic step pipelines — the discipline of running LLM agents outside a demo.
- #1 Agent Approval Gates — proposed actions, schema validation, audit log
- #9 LLM cost tracking — per-model price model on top of token budgets
- #10 Deterministic step pipelines — fixed typed steps; the LLM never picks the next action
Reference implementation: draftcat (Go, MIT). Follow @renezander030 for new entries.
How do you test agent pipelines today? Drop a comment with: language, framework (LangGraph / CrewAI / custom), how you stub the model (mock / VCR / fixture / live), and what broke last time a prompt changed.
- Initial publish. Covers fixture layout, the three step-type shapes, the runner, smoke test, CI config, verbatim error strings, debug flow.
- Skipped gates: hardware matrix (not hardware-bound), model-picks table (topic is testing harness, not model selection).