Test an LLM agent pipeline with fixtures: zero API calls, zero tokens, deterministic CI (Go)

How to dry-run a multi-step AI agent pipeline against recorded fixtures so go test and CI never hit the model, the CRM, or a real inbox.

Last tested: June 2026. See Changelog at the bottom.

If this saves you from burning tokens in CI, follow @renezander030 — production notes on testing agents that take real actions.

Working implementation: github.com/renezander030/draftcat (Go, MIT) — the draftcat test command below ships in the repo.

The problem: your agent pipeline calls an LLM, then writes to a CRM, then emails a customer. You cannot put that in CI. Mocking the SDK client is brittle, and a recorded HTTP cassette (VCR-style) breaks the moment a prompt changes. What you actually want is to drive each step from a small JSON file and assert the pipeline routes correctly, with no network at all.

TL;DR cheat sheet

| Goal | Do this |
| --- | --- |
| Run a pipeline with no network | `draftcat test <pipeline>` (reads `fixtures/<pipeline>/<step>.json`) |
| Feed a deterministic step its data | `fixtures/<pipeline>/<step>.json` = `{"data": {...}}` |
| Feed an AI step a model response | `{"text": "<verbatim model JSON>"}` (validated against the skill schema) |
| Drive a human-approval step | `{"action": "approve\|skip\|adjust", "text": "..."}` |
| Seed the pipeline before step 1 | optional `fixtures/<pipeline>/_input.json` |
| Force a rejection path | `draftcat test <pipeline> --reject` |
| Fail CI on a missing AI fixture | AI step with no fixture exits non-zero (by design) |

The rule of thumb: one fixture file per step that needs input, named after the step. Deterministic steps load a data map, AI steps load a recorded response string, approval steps load a decision. No fixture, no network — the runner stubs the channel.

Recommended setup

One command, no flags, against the example pipeline shipped in the repo:

git clone https://github.com/renezander030/draftcat && cd draftcat
go build -o draftcat .
./draftcat test test-screener

That walks a 4-step pipeline (deterministic input > AI classify > human approval > deterministic log) entirely from fixtures/test-screener/. Exit code is 0 on success, non-zero on any step error, so it drops straight into go test or a CI job.

1. The fixture layout

fixtures/<pipeline-name>/
  _input.json         (optional) merged into the data map before any step runs
  <step-name>.json    one file per step that needs input

The runner joins fixtures/ + the pipeline name, then looks up <step>.json as it walks each step. Anything it does not find, it stubs.
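
The lookup described above can be sketched as follows. This is a minimal illustration, not the repo's code: only the call shape `loadFixture(fixDir, step)` returning `(fixture, found, err)` comes from the runner excerpt later in this post. The helpers `fixtureDir` and `fixturePath` and the error wording are assumptions.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// fixtureDir joins fixtures/ with the pipeline name (hypothetical helper).
func fixtureDir(pipeline string) string {
	return filepath.Join("fixtures", pipeline)
}

// fixturePath builds <fixDir>/<step>.json (hypothetical helper).
func fixturePath(fixDir, step string) string {
	return filepath.Join(fixDir, step+".json")
}

// loadFixture reads one step fixture. A missing file is not an error:
// found=false lets the caller decide whether to stub the step or fail.
func loadFixture(fixDir, step string) (map[string]interface{}, bool, error) {
	path := fixturePath(fixDir, step)
	raw, err := os.ReadFile(path)
	if errors.Is(err, fs.ErrNotExist) {
		return nil, false, nil
	}
	if err != nil {
		return nil, false, fmt.Errorf("read fixture %s: %w", path, err)
	}
	var fix map[string]interface{}
	if err := json.Unmarshal(raw, &fix); err != nil {
		// A file that exists but does not parse is always an error.
		return nil, false, fmt.Errorf("parse fixture %s: %w", path, err)
	}
	return fix, true, nil
}

func main() {
	fix, found, err := loadFixture(fixtureDir("test-screener"), "classify")
	fmt.Println(fix, found, err)
}
```

Separating "not found" from "unreadable or malformed" is what allows each step type to apply its own policy for a missing file while a corrupt file still fails the run.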

2. The three fixture shapes

Each step type reads a different shape. This is the whole contract:

| Step type | Fixture shape | If the file is missing |
| --- | --- | --- |
| `deterministic` | `{"data": { ... }}`, merged into the pipeline data map | no-op (skipped) |
| `ai` | `{"text": "<verbatim model response>"}`, validated against the skill's output schema | error, non-zero exit: you must supply it |
| `approval` | `{"action": "approve\|skip\|adjust", "text": "<optional feedback>"}` | auto-approve (or skip with `--reject`) |

The asymmetry is deliberate. A missing AI fixture is a hard error because a silently-skipped model call would make a green test meaningless. A missing approval fixture defaults to approve so the happy path needs zero ceremony, and you opt into the rejection path explicitly.
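
To make the asymmetry concrete, here is a small sketch of the missing-file policy as one function. The function name `onMissingFixture`, the outcome strings, and the sentinel error are hypothetical; only the three behaviors themselves come from the table above.

```go
package main

import (
	"errors"
	"fmt"
)

// errMissingAIFixture is a hypothetical sentinel used only in this sketch.
var errMissingAIFixture = errors.New("ai step has no fixture")

// onMissingFixture reports what a runner could do when <step>.json is absent.
// The outcome is "noop", "approve", or "skip"; a non-nil error fails the run.
func onMissingFixture(stepType string, reject bool) (string, error) {
	switch stepType {
	case "deterministic":
		return "noop", nil // nothing to merge into the data map
	case "approval":
		if reject {
			return "skip", nil // --reject flips the default decision
		}
		return "approve", nil
	case "ai":
		// Never invent a model response: a green run must mean something.
		return "", errMissingAIFixture
	default:
		return "", fmt.Errorf("unknown step type %q", stepType)
	}
}

func main() {
	for _, t := range []string{"deterministic", "ai", "approval"} {
		outcome, err := onMissingFixture(t, false)
		fmt.Printf("%-13s outcome=%q err=%v\n", t, outcome, err)
	}
}
```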

Steal-able fixtures

fixtures/test-screener/mock-input.json — seed a deterministic step:

{
  "data": {
    "input": "Title: Senior AI/LLM Engineer - Build RAG Pipeline for Legal Documents\nRate: $80-120/hr\nSkills: RAG, Vector Databases, Claude API, TypeScript"
  }
}

fixtures/test-screener/classify.json — a recorded model response. Note the response is a JSON string, exactly what the model would emit, so the schema validator runs on it the same way it does in production:

{ "text": "{\"score\": 4, \"reason\": \"Strong fit: RAG + Claude API + TypeScript\", \"reject\": false}" }

fixtures/test-screener/review.json — the operator decision:

{ "action": "approve" }
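
Hand-escaping the quotes in an AI fixture is error-prone, so it can help to generate the file from a captured response. The sketch below shows the two JSON layers: the fixture envelope, then the model's own JSON inside `text`. The names `aiFixture`, `encodeAIFixture`, and `decodeAIFixture` are hypothetical and not part of draftcat.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// aiFixture mirrors the {"text": "..."} envelope used for ai steps.
type aiFixture struct {
	Text string `json:"text"`
}

// encodeAIFixture wraps a raw model response in the envelope.
// json.Marshal takes care of escaping the inner quotes.
func encodeAIFixture(modelOutput string) ([]byte, error) {
	return json.Marshal(aiFixture{Text: modelOutput})
}

// decodeAIFixture undoes both layers: first the envelope,
// then the model's JSON object stored as a string in "text".
func decodeAIFixture(raw []byte) (map[string]interface{}, error) {
	var fix aiFixture
	if err := json.Unmarshal(raw, &fix); err != nil {
		return nil, fmt.Errorf("fixture envelope: %w", err)
	}
	if fix.Text == "" {
		return nil, errors.New("missing or empty 'text' field")
	}
	var out map[string]interface{}
	if err := json.Unmarshal([]byte(fix.Text), &out); err != nil {
		return nil, fmt.Errorf("model output is not a JSON object: %w", err)
	}
	return out, nil
}

func main() {
	raw, err := encodeAIFixture(`{"score": 4, "reason": "Strong fit", "reject": false}`)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(raw)) // ready to save as classify.json

	out, err := decodeAIFixture(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(out["score"], out["reject"])
}
```

Note that `encoding/json` decodes JSON numbers into `float64` when the target is `interface{}`, so `score` arrives as `float64(4)`, not an `int`.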

3. The pipeline these fixtures drive

The fixture names map 1:1 to step names in config.yaml:

- name: test-screener
  schedule: manual
  steps:
    - name: mock-input            # deterministic > mock-input.json
      type: deterministic
    - name: classify              # ai > classify.json (validated vs skill schema)
      type: ai
      skill: classify-job
      vars:
        profile: "AI/LLM, TypeScript, Go, cloud infra. $60/hr min."
    - name: review                # approval > review.json
      type: approval
      mode: hitl
      channel: telegram
    - name: log-result            # deterministic, no fixture > no-op in test mode
      type: deterministic
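
Because the mapping is purely by name, a typo in a fixture file name goes unnoticed for deterministic and approval steps, which simply fall back to their defaults. A small lint pass can catch that. The sketch below is an assumption-laden illustration: `step` and `lintFixtures` are hypothetical names, and the function takes plain file base names so it does not depend on how draftcat parses config.yaml.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// step is a reduced stand-in for one entry under steps: in config.yaml.
type step struct {
	Name string
	Type string
}

// lintFixtures compares step names with fixture file base names
// (for example "classify.json"). It returns ai steps that lack a fixture
// and fixture files that match no step. _input.json is always allowed.
func lintFixtures(steps []step, files []string) (missing, orphans []string) {
	have := map[string]bool{}
	for _, f := range files {
		if strings.HasSuffix(f, ".json") {
			have[strings.TrimSuffix(f, ".json")] = true
		}
	}
	known := map[string]bool{"_input": true}
	for _, s := range steps {
		known[s.Name] = true
		if s.Type == "ai" && !have[s.Name] {
			missing = append(missing, s.Name)
		}
	}
	for name := range have {
		if !known[name] {
			orphans = append(orphans, name+".json")
		}
	}
	sort.Strings(orphans) // map iteration order is random; keep output stable
	return missing, orphans
}

func main() {
	steps := []step{
		{Name: "mock-input", Type: "deterministic"},
		{Name: "classify", Type: "ai"},
		{Name: "review", Type: "approval"},
		{Name: "log-result", Type: "deterministic"},
	}
	files := []string{"mock-input.json", "classify.json", "reveiw.json"}
	missing, orphans := lintFixtures(steps, files)
	fmt.Println("missing ai fixtures:", missing)
	fmt.Println("orphan fixture files:", orphans)
}
```

In this example the misspelled `reveiw.json` is reported as an orphan. Without the check, the review step would quietly auto-approve.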

4. The runner, condensed

The whole trick is swapping the AI call, the connectors, and the approval channel for fixture lookups while walking the real step list. Production routing logic runs; only the side effects are stubbed.

// runTestPipeline walks pipeline steps using fixtures instead of real
// connectors / AI / approval. Returns a process exit code.
func runTestPipeline(cfg *config.Config, p config.PipelineConfig,
	skills *skillsapi.SkillRegistry, ch *stubChannel, fixDir string) int {

	data := map[string]interface{}{}

	for _, step := range p.Steps {
		switch step.Type {
		case "ai":
			// resolve the skill prompt + output schema, render {{vars}}...
			fix, found, err := loadFixture(fixDir, step.Name)
			if err != nil {
				fmt.Fprintf(os.Stderr, "  [error] %v\n", err)
				return 1
			}
			if !found {
				fmt.Fprintf(os.Stderr, "  [error] ai step %q has no fixture "+
					"%s/%s.json — supply {\"text\": \"...\"}\n", step.Name, fixDir, step.Name)
				return 1 // missing AI fixture is fatal, not a silent pass
			}
			text, _ := fix["text"].(string)
			if text == "" {
				fmt.Fprintf(os.Stderr, "  [error] fixture %s/%s.json: "+
					"missing or empty 'text' field\n", fixDir, step.Name)
				return 1
			}
			// same schema validator as production:
			parsed, err := validateOutput(text, schema)
			if err != nil {
				fmt.Fprintf(os.Stderr, "  [error] output validation failed: %v\n", err)
				return 1
			}
			data["ai_output"] = parsed

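		case "approval":
			fix, found, err := loadFixture(fixDir, step.Name)
			if err != nil {
				// a malformed approval fixture must not fall through to auto-approve
				fmt.Fprintf(os.Stderr, "  [error] %v\n", err)
				return 1
			}
			if found {
				action, _ := fix["action"].(string)
				if action == "" {
					action = "approve"
				}
				ch.decisions = append(ch.decisions, OperatorDecision{Action: action})
			}
			decision, err := ch.SendForApproval(context.Background(), draftMsg)
			if err != nil {
				fmt.Fprintf(os.Stderr, "  [error] approval: %v\n", err)
				return 1
			}
			data["approved"] = decision.Action == "approve"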

		case "deterministic":
			// load {"data": {...}} if present, else no-op
		}
	}
	return 0
}

stubChannel implements the same approval-channel interface as the Telegram/Slack channel, but answers from the fixture instead of a real human. The schema validator (validateOutput) is the identical function production uses, so a recorded response that would fail validation in prod also fails the test.
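
A stub channel of this kind might look like the sketch below. The field `decisions`, the type `OperatorDecision`, and the method name `SendForApproval` appear in the runner excerpt. The interface name `ApprovalChannel`, the `reject` field, the `string` draft type, and the `decide` helper are assumptions made for illustration.

```go
package main

import (
	"context"
	"fmt"
)

// OperatorDecision is what a human (or a fixture) answers with.
type OperatorDecision struct {
	Action string // "approve", "skip", or "adjust"
	Text   string // optional feedback
}

// ApprovalChannel is a hypothetical name for the interface that the
// real Telegram/Slack channel and the stub would both satisfy.
type ApprovalChannel interface {
	SendForApproval(ctx context.Context, draft string) (OperatorDecision, error)
}

// stubChannel answers from queued fixture decisions instead of a human.
type stubChannel struct {
	decisions []OperatorDecision
	reject    bool // set by --reject: default to "skip" when the queue is empty
}

// Compile-time check that the stub satisfies the interface.
var _ ApprovalChannel = (*stubChannel)(nil)

func (c *stubChannel) SendForApproval(ctx context.Context, draft string) (OperatorDecision, error) {
	if err := ctx.Err(); err != nil {
		return OperatorDecision{}, err
	}
	if len(c.decisions) > 0 {
		d := c.decisions[0]
		c.decisions = c.decisions[1:] // consume one decision per approval step
		return d, nil
	}
	if c.reject {
		return OperatorDecision{Action: "skip"}, nil
	}
	return OperatorDecision{Action: "approve"}, nil
}

// decide is a small demo helper that returns only the action.
func decide(ch ApprovalChannel, draft string) (string, error) {
	d, err := ch.SendForApproval(context.Background(), draft)
	return d.Action, err
}

func main() {
	queued := &stubChannel{decisions: []OperatorDecision{{Action: "adjust", Text: "shorter"}}}
	for _, ch := range []ApprovalChannel{queued, &stubChannel{}, &stubChannel{reject: true}} {
		action, err := decide(ch, "draft reply")
		fmt.Println(action, err)
	}
}
```

Since production code depends only on the interface, the routing logic after the approval step runs unchanged whether the decision came from a person or from a fixture.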

5. Smoke test: what a passing run prints

$ ./draftcat test test-screener
[test] pipeline=test-screener fixtures=fixtures/test-screener
[step:mock-input] type=deterministic
  [loaded] fixture into data
[step:classify] type=ai
  [prompt] Score this job 1-5 for fit. Profile: AI/LLM, TypeScript, Go... Job: Senior AI/LLM Engineer...
  [output] map[reason:Strong fit: RAG + Claude API + TypeScript score:4 reject:false]
[step:review] type=approval
  [approval-draft]
    [test] draft: map[reason:Strong fit... score:4 reject:false]
  [approval-decision] approve (from fixture)
[step:log-result] type=deterministic
  [skip] no fixture fixtures/test-screener/log-result.json (deterministic step is a no-op in test mode)
[final data]
  { "ai_output": {"reason":"Strong fit...","reject":false,"score":4}, "approved": true, ... }
[test] OK

Pass criteria: the last line is [test] OK and the exit code is 0. To enforce that from go test, call the runner and fail on a non-zero code:

func TestScreenerPipeline(t *testing.T) {
	if code := runTest([]string{"test-screener"}); code != 0 {
		t.Fatalf("test-screener pipeline failed, exit=%d", code)
	}
}

6. Drop it into CI (steal-able config)

No secrets, no model key, no network. GitHub Actions:

name: pipeline-fixtures
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with: { go-version: '1.22' }
      - run: go build -o draftcat .
      - run: ./draftcat test test-screener            # happy path
      - run: ./draftcat test test-screener --reject    # rejection path routes correctly

Because the job holds no API key and no connector credentials, nothing in a PR, including a bad or malicious fixture, can spend money or email a customer from CI.

7. Common errors, verbatim

test: pipeline "X" not found in config.yaml
The pipeline name argument does not match any name: under pipelines:. Names are case-sensitive.

[error] ai step "classify" has no fixture fixtures/test-screener/classify.json — supply {"text": "..."}
Every ai step needs a recorded response. Create the file with a text field containing the verbatim model output.

[error] fixture .../classify.json: missing or empty 'text' field
The file exists but text is absent or empty. The runner will not invent a response.

[error] output validation failed: ...
Your recorded text does not satisfy the skill's output schema (wrong type, missing required key, enum violation). This is the test doing its job: the same response would fail in production.
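
To show what those three failure modes look like in code, here is a heavily reduced validator. It is not draftcat's `validateOutput`: the real schema format is not shown in this post, so the `fieldRule` type and the `map[string]fieldRule` schema are invented for this sketch. Only the call shape (text in, parsed map or error out) follows the runner excerpt.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// fieldRule is a hypothetical, minimal stand-in for a skill output schema.
type fieldRule struct {
	Kind     string   // "number", "string", or "bool"
	Required bool
	Enum     []string // allowed values for string fields; empty means any
}

// kindOf maps a decoded JSON value to the kind names used in fieldRule.
func kindOf(v interface{}) string {
	switch v.(type) {
	case float64:
		return "number"
	case string:
		return "string"
	case bool:
		return "bool"
	default:
		return "other"
	}
}

func contains(list []string, s string) bool {
	for _, e := range list {
		if e == s {
			return true
		}
	}
	return false
}

// validateOutput parses the recorded text and checks it against the rules.
func validateOutput(text string, schema map[string]fieldRule) (map[string]interface{}, error) {
	var out map[string]interface{}
	if err := json.Unmarshal([]byte(text), &out); err != nil {
		return nil, fmt.Errorf("not a JSON object: %w", err)
	}
	for name, rule := range schema {
		v, ok := out[name]
		if !ok {
			if rule.Required {
				return nil, fmt.Errorf("missing required key %q", name)
			}
			continue
		}
		if got := kindOf(v); got != rule.Kind {
			return nil, fmt.Errorf("key %q: want %s, got %s", name, rule.Kind, got)
		}
		if s, isStr := v.(string); isStr && len(rule.Enum) > 0 && !contains(rule.Enum, s) {
			return nil, fmt.Errorf("key %q: value %q not in enum", name, s)
		}
	}
	return out, nil
}

func main() {
	schema := map[string]fieldRule{
		"score":  {Kind: "number", Required: true},
		"reason": {Kind: "string", Required: true},
		"reject": {Kind: "bool", Required: true},
	}
	_, err := validateOutput(`{"score": "4", "reason": "ok", "reject": false}`, schema)
	fmt.Println("wrong type:", err)
}
```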

8. Debug flow

[test] pipeline=... fixtures=<dir> not printed > pipeline name typo, see error above.
Stops at an ai step > missing or empty fixture, or schema mismatch. Check the [error] line.
Pipeline ends early at an approval step > a fixture set "action": "skip", or you passed --reject.
A deterministic step shows [skip] no fixture > expected; deterministic steps are no-ops without a {"data": ...} file.
Green locally, red in CI > you have an untracked fixture file. Run git status fixtures/ and commit whatever it lists.

Why fixtures instead of mocks or HTTP cassettes

Vs mocking the SDK client: you mock the boundary you do not control and it drifts from the real client's behavior. Fixtures mock the data, and the real schema validator still runs.
Vs recorded HTTP cassettes (VCR): a cassette is keyed on the exact request bytes, so it shatters when you tweak a prompt. A fixture is keyed on the step name, so prompt edits do not invalidate it.
Vs a live "cheap model" in CI: still nondeterministic, still costs money, still needs a key in the runner. Fixtures are byte-stable and free.
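
The difference in keying can be shown in a few lines. This sketch assumes a cassette matcher that includes the request body, modelled here as a SHA-256 of the body. Real replay libraries usually let you configure matching, so treat `cassetteKey` as an illustration of the failure mode, not as any library's behavior. The `request` type and both key functions are hypothetical.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"path"
)

// request is a reduced stand-in for one model call made by a pipeline step.
type request struct {
	Pipeline string
	Step     string
	Body     string // rendered prompt plus parameters
}

// cassetteKey models a replay cache keyed on the exact request bytes.
func cassetteKey(r request) string {
	sum := sha256.Sum256([]byte(r.Body))
	return hex.EncodeToString(sum[:])
}

// fixtureKey ignores the body and depends only on where the step lives.
func fixtureKey(r request) string {
	return path.Join("fixtures", r.Pipeline, r.Step+".json")
}

func main() {
	before := request{Pipeline: "test-screener", Step: "classify", Body: "Score this job 1-5 for fit."}
	after := before
	after.Body = "Score this job 1-5 for fit. Be strict."

	fmt.Println("cassette key survives prompt edit:", cassetteKey(before) == cassetteKey(after))
	fmt.Println("fixture key survives prompt edit: ", fixtureKey(before) == fixtureKey(after))
}
```

The trade-off is that a step-keyed fixture will not notice a prompt change at all, so it tests routing and schema handling, not prompt quality.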

Series

This is Production AI Automation Notes #11. The series covers approval gates, token budgets, SQLite dedup, prompt-injection defense, PDF cite verification, and deterministic step pipelines — the discipline of running LLM agents outside a demo.

#1 Agent Approval Gates — proposed actions, schema validation, audit log
#9 LLM cost tracking — per-model price model on top of token budgets
#10 Deterministic step pipelines — fixed typed steps; the LLM never picks the next action

Reference implementation: draftcat (Go, MIT). Follow @renezander030 for new entries.

Reader contributions

How do you test agent pipelines today? Drop a comment with: language, framework (LangGraph / CrewAI / custom), how you stub the model (mock / VCR / fixture / live), and what broke last time a prompt changed.

Changelog

2026-06-15

Initial publish. Covers fixture layout, the three step-type shapes, the runner, smoke test, CI config, verbatim error strings, debug flow.
Skipped gates: hardware matrix (not hardware-bound), model-picks table (topic is testing harness, not model selection).
