@renezander030
Created June 15, 2026 18:07
Test an LLM agent pipeline with fixtures: zero API calls, zero tokens, deterministic CI (Go)


How to dry-run a multi-step AI agent pipeline against recorded fixtures so go test and CI never hit the model, the CRM, or a real inbox.

Last tested: June 2026. See Changelog at the bottom.

If this saves you from burning tokens in CI, follow @renezander030 — production notes on testing agents that take real actions.

Working implementation: github.com/renezander030/draftcat (Go, MIT) — the draftcat test command below ships in the repo.

The problem: your agent pipeline calls an LLM, then writes to a CRM, then emails a customer. You cannot put that in CI. Mocking the SDK client is brittle, and a recorded HTTP cassette (VCR-style) breaks the moment a prompt changes. What you actually want is to drive each step from a small JSON file and assert the pipeline routes correctly, with no network at all.

TL;DR cheat sheet

| Goal | Do this |
|---|---|
| Run a pipeline with no network | draftcat test <pipeline> (reads fixtures/<pipeline>/<step>.json) |
| Feed a deterministic step its data | fixtures/<pipeline>/<step>.json = {"data": {...}} |
| Feed an AI step a model response | {"text": "<verbatim model JSON>"} (validated against the skill schema) |
| Drive a human-approval step | {"action": "approve|skip|adjust", "text": "..."} |
| Seed the pipeline before step 1 | optional fixtures/<pipeline>/_input.json |
| Force a rejection path | draftcat test <pipeline> --reject |
| Fail CI on a missing AI fixture | AI step with no fixture exits non-zero (by design) |

The rule of thumb: one fixture file per step that needs input, named after the step. Deterministic steps load a data map, AI steps load a recorded response string, approval steps load a decision. Nothing touches the network either way: a step without a fixture is skipped, auto-approved, or (for AI steps) fails the run, and the approval channel is always a stub.

Recommended setup

One command, no flags, against the example pipeline shipped in the repo:

git clone https://github.com/renezander030/draftcat && cd draftcat
go build -o draftcat .
./draftcat test test-screener

That walks a 4-step pipeline (deterministic input > AI classify > human approval > deterministic log) entirely from fixtures/test-screener/. Exit code is 0 on success, non-zero on any step error, so it drops straight into go test or a CI job.


1. The fixture layout

fixtures/<pipeline-name>/
  _input.json         (optional) merged into the data map before any step runs
  <step-name>.json    one file per step that needs input

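The runner joins fixtures/ and the pipeline name, then looks up <step>.json as it walks each step. What happens when a file is missing depends on the step type: deterministic and approval steps fall back to a stub, while AI steps fail the run (section 2 has the details).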

2. The three fixture shapes

Each step type reads a different shape. This is the whole contract:

| Step type | Fixture shape | If the file is missing |
|---|---|---|
| deterministic | {"data": { ... }}, merged into the pipeline data map | no-op (skipped) |
| ai | {"text": "<verbatim model response>"}, validated against the skill's output schema | error, non-zero exit (you must supply it) |
| approval | {"action": "approve|skip|adjust", "text": "<optional feedback>"} | auto-approve (or skip with --reject) |

The asymmetry is deliberate. A missing AI fixture is a hard error because a silently-skipped model call would make a green test meaningless. A missing approval fixture defaults to approve so the happy path needs zero ceremony, and you opt into the rejection path explicitly.
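
The missing-fixture policy fits in one small function. The sketch below is illustrative only: the function name onMissingFixture and the returned action strings are hypothetical, and rejecting unknown step types is an assumption rather than documented draftcat behavior.

```go
package main

import (
	"errors"
	"fmt"
)

// onMissingFixture returns the fallback for a step whose fixture file is
// absent. The name and the action strings are hypothetical.
func onMissingFixture(stepType string, reject bool) (string, error) {
	switch stepType {
	case "deterministic":
		return "noop", nil // nothing to merge, step is skipped
	case "approval":
		if reject {
			return "skip", nil // --reject flips the default decision
		}
		return "approve", nil
	case "ai":
		// A silently skipped model call would make a green run meaningless.
		return "", errors.New(`ai step has no fixture: supply {"text": "..."}`)
	default:
		// Assumption: unknown step types are treated as errors.
		return "", fmt.Errorf("unknown step type %q", stepType)
	}
}

func main() {
	for _, t := range []string{"deterministic", "approval", "ai"} {
		action, err := onMissingFixture(t, false)
		fmt.Printf("%-13s action=%q err=%v\n", t, action, err)
	}
}
```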

Steal-able fixtures

fixtures/test-screener/mock-input.json — seed a deterministic step:

{
  "data": {
    "input": "Title: Senior AI/LLM Engineer - Build RAG Pipeline for Legal Documents\nRate: $80-120/hr\nSkills: RAG, Vector Databases, Claude API, TypeScript"
  }
}
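
How a {"data": {...}} fixture might be folded into the pipeline data map, as a rough sketch. The helper name mergeFixtureData is hypothetical, and the shallow merge where fixture keys overwrite existing keys is an assumption; the repo may merge differently.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// mergeFixtureData copies the keys under fix["data"] into data.
// Assumption: shallow merge, and fixture values win on key collisions.
func mergeFixtureData(data, fix map[string]interface{}) error {
	raw, ok := fix["data"]
	if !ok {
		return nil // no "data" key: nothing to merge
	}
	m, ok := raw.(map[string]interface{})
	if !ok {
		return fmt.Errorf("fixture 'data' must be a JSON object, got %T", raw)
	}
	for k, v := range m {
		data[k] = v
	}
	return nil
}

func main() {
	raw := []byte(`{"data": {"input": "Title: Senior AI/LLM Engineer"}}`)
	var fix map[string]interface{}
	if err := json.Unmarshal(raw, &fix); err != nil {
		panic(err)
	}
	data := map[string]interface{}{}
	if err := mergeFixtureData(data, fix); err != nil {
		panic(err)
	}
	fmt.Println(data["input"])
}
```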

fixtures/test-screener/classify.json — a recorded model response. Note the response is a JSON string, exactly what the model would emit, so the schema validator runs on it the same way it does in production:

{ "text": "{\"score\": 4, \"reason\": \"Strong fit: RAG + Claude API + TypeScript\", \"reject\": false}" }

fixtures/test-screener/review.json — the operator decision:

{ "action": "approve" }
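
The double encoding in classify.json (JSON inside a JSON string) is easy to get wrong by hand. One way to produce and read it with encoding/json is sketched below; the type and function names are hypothetical and are not part of draftcat.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// aiFixture mirrors the {"text": "..."} shape of an AI-step fixture.
type aiFixture struct {
	Text string `json:"text"`
}

// encodeAIFixture wraps a verbatim model response in the fixture shape.
// json.Marshal handles the inner quote escaping.
func encodeAIFixture(modelOutput string) ([]byte, error) {
	return json.Marshal(aiFixture{Text: modelOutput})
}

// decodeAIFixture unwraps the fixture, then parses the inner string the
// way a validator would have to before checking it against a schema.
func decodeAIFixture(raw []byte) (map[string]interface{}, error) {
	var fix aiFixture
	if err := json.Unmarshal(raw, &fix); err != nil {
		return nil, fmt.Errorf("fixture is not valid JSON: %w", err)
	}
	var out map[string]interface{}
	if err := json.Unmarshal([]byte(fix.Text), &out); err != nil {
		return nil, fmt.Errorf("text field is not valid JSON: %w", err)
	}
	return out, nil
}

func main() {
	raw, err := encodeAIFixture(`{"score": 4, "reason": "Strong fit", "reject": false}`)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(raw)) // paste this into fixtures/<pipeline>/<step>.json
	out, err := decodeAIFixture(raw)
	fmt.Println(out, err)
}
```

Generating the file this way, rather than escaping quotes manually, keeps the recorded response byte-for-byte identical to what the model emitted.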

3. The pipeline these fixtures drive

The fixture names map 1:1 to step names in config.yaml:

- name: test-screener
  schedule: manual
  steps:
    - name: mock-input            # deterministic > mock-input.json
      type: deterministic
    - name: classify              # ai > classify.json (validated vs skill schema)
      type: ai
      skill: classify-job
      vars:
        profile: "AI/LLM, TypeScript, Go, cloud infra. $60/hr min."
    - name: review                # approval > review.json
      type: approval
      mode: hitl
      channel: telegram
    - name: log-result            # deterministic, no fixture > no-op in test mode
      type: deterministic

4. The runner, in ~40 lines

The whole trick is swapping the AI call, the connectors, and the approval channel for fixture lookups while walking the real step list. Production routing logic runs; only the side effects are stubbed.

// runTestPipeline walks pipeline steps using fixtures instead of real
// connectors / AI / approval. Returns a process exit code.
func runTestPipeline(cfg *config.Config, p config.PipelineConfig,
	skills *skillsapi.SkillRegistry, ch *stubChannel, fixDir string) int {

	data := map[string]interface{}{}

	for _, step := range p.Steps {
		switch step.Type {
		case "ai":
			// resolve the skill prompt + output schema, render {{vars}}...
			fix, found, err := loadFixture(fixDir, step.Name)
			if err != nil {
				fmt.Fprintf(os.Stderr, "  [error] %v\n", err)
				return 1
			}
			if !found {
				fmt.Fprintf(os.Stderr, "  [error] ai step %q has no fixture "+
					"%s/%s.json — supply {\"text\": \"...\"}\n", step.Name, fixDir, step.Name)
				return 1 // missing AI fixture is fatal, not a silent pass
			}
			text, _ := fix["text"].(string)
			if text == "" {
				fmt.Fprintf(os.Stderr, "  [error] fixture %s/%s.json: "+
					"missing or empty 'text' field\n", fixDir, step.Name)
				return 1
			}
			// same schema validator as production:
			parsed, err := validateOutput(text, schema)
			if err != nil {
				fmt.Fprintf(os.Stderr, "  [error] output validation failed: %v\n", err)
				return 1
			}
			data["ai_output"] = parsed

		case "approval":
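			fix, found, err := loadFixture(fixDir, step.Name)
			if err != nil {
				// an unreadable fixture must fail, not fall through to auto-approve
				fmt.Fprintf(os.Stderr, "  [error] %v\n", err)
				return 1
			}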
			if found {
				action, _ := fix["action"].(string)
				if action == "" {
					action = "approve"
				}
				ch.decisions = append(ch.decisions, OperatorDecision{Action: action})
			}
			decision, _ := ch.SendForApproval(context.Background(), draftMsg)
			data["approved"] = decision.Action == "approve"

		case "deterministic":
			// load {"data": {...}} if present, else no-op
		}
	}
	return 0
}

stubChannel implements the same approval-channel interface as the Telegram/Slack channel, but answers from the fixture instead of a real human. The schema validator (validateOutput) is the identical function production uses, so a recorded response that would fail validation in prod also fails the test.

5. Smoke test: what a passing run prints

$ ./draftcat test test-screener
[test] pipeline=test-screener fixtures=fixtures/test-screener
[step:mock-input] type=deterministic
  [loaded] fixture into data
[step:classify] type=ai
  [prompt] Score this job 1-5 for fit. Profile: AI/LLM, TypeScript, Go... Job: Senior AI/LLM Engineer...
  [output] map[reason:Strong fit: RAG + Claude API + TypeScript score:4 reject:false]
[step:review] type=approval
  [approval-draft]
    [test] draft: map[reason:Strong fit... score:4 reject:false]
  [approval-decision] approve (from fixture)
[step:log-result] type=deterministic
  [skip] no fixture fixtures/test-screener/log-result.json (deterministic step is a no-op in test mode)
[final data]
  { "ai_output": {"reason":"Strong fit...","reject":false,"score":4}, "approved": true, ... }
[test] OK

Pass criteria: the last line is [test] OK and the exit code is 0. To run it under go test, wrap the runner in a test function:

func TestScreenerPipeline(t *testing.T) {
	if code := runTest([]string{"test-screener"}); code != 0 {
		t.Fatalf("test-screener pipeline failed, exit=%d", code)
	}
}

6. Drop it into CI (steal-able config)

No secrets, no model key, no network. GitHub Actions:

name: pipeline-fixtures
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with: { go-version: '1.22' }
      - run: go build -o draftcat .
      - run: ./draftcat test test-screener            # happy path
      - run: ./draftcat test test-screener --reject    # rejection path routes correctly

Because the job carries no API key or other secrets, no pull request, however broken its fixtures, can spend money or email a customer from CI.

7. Common errors, verbatim

test: pipeline "X" not found in config.yaml
The pipeline name argument does not match any name: under pipelines:. Names are case-sensitive.

[error] ai step "classify" has no fixture fixtures/test-screener/classify.json — supply {"text": "..."}
Every ai step needs a recorded response. Create the file with a text field containing the verbatim model output.

[error] fixture .../classify.json: missing or empty 'text' field
The file exists but text is absent or empty. The runner will not invent a response.

[error] output validation failed: ...
Your recorded text does not satisfy the skill's output schema (wrong type, missing required key, enum violation). This is the test doing its job: the same response would fail in production.
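
To make the three failure classes concrete (wrong type, missing required key, enum violation), here is a deliberately tiny validator. It is not draftcat's validateOutput: the schema and field types below are hypothetical stand-ins, since the skill schema format is not shown in this note.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// field describes one required key. Hypothetical format.
type field struct {
	Type string   // "number", "string", or "boolean"
	Enum []string // optional; only checked for strings
}

// schema maps each required key to its expected field description.
type schema map[string]field

// jsonType names the JSON type of a decoded value.
func jsonType(v interface{}) string {
	switch v.(type) {
	case float64:
		return "number"
	case string:
		return "string"
	case bool:
		return "boolean"
	default:
		return fmt.Sprintf("%T", v)
	}
}

// validateOutput parses text as a JSON object and checks it against s.
func validateOutput(text string, s schema) (map[string]interface{}, error) {
	var out map[string]interface{}
	if err := json.Unmarshal([]byte(text), &out); err != nil {
		return nil, fmt.Errorf("not a JSON object: %w", err)
	}
	for key, f := range s {
		v, ok := out[key]
		if !ok {
			return nil, fmt.Errorf("missing required key %q", key)
		}
		if got := jsonType(v); got != f.Type {
			return nil, fmt.Errorf("key %q: want %s, got %s", key, f.Type, got)
		}
		if str, isStr := v.(string); isStr && len(f.Enum) > 0 {
			allowed := false
			for _, e := range f.Enum {
				if e == str {
					allowed = true
				}
			}
			if !allowed {
				return nil, fmt.Errorf("key %q: %q is not one of %v", key, str, f.Enum)
			}
		}
	}
	return out, nil
}

func main() {
	s := schema{"score": {Type: "number"}, "reject": {Type: "boolean"}}
	_, err := validateOutput(`{"score": "4", "reject": false}`, s)
	fmt.Println(err)
}
```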

8. Debug flow

  1. [test] pipeline=... fixtures=<dir> not printed > pipeline name typo, see error above.
  2. Stops at an ai step > missing or empty fixture, or schema mismatch. Check the [error] line.
  3. Pipeline ends early at an approval step > a fixture set "action": "skip", or you passed --reject.
  4. A deterministic step shows [skip] no fixture > expected; deterministic steps are no-ops without a {"data": ...} file.
  5. Green locally, red in CI > a fixture file exists on your machine but was never committed. Run git status fixtures/ to find it.

Why fixtures instead of mocks or HTTP cassettes

  • Vs mocking the SDK client: you mock the boundary you do not control and it drifts from the real client's behavior. Fixtures mock the data, and the real schema validator still runs.
  • Vs recorded HTTP cassettes (VCR): a cassette is keyed on the exact request bytes, so it shatters when you tweak a prompt. A fixture is keyed on the step name, so prompt edits do not invalidate it.
  • Vs a live "cheap model" in CI: still nondeterministic, still costs money, still needs a key in the runner. Fixtures are byte-stable and free.
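
The keying difference in the second bullet can be shown in a few lines. This sketch assumes a cassette matcher that hashes the full request body (matchers vary by library, and some match on less); cassetteKey and fixtureKey are hypothetical helpers.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// cassetteKey models a VCR-style lookup key derived from the request body.
// Assumption: the matcher hashes the whole body, prompt included.
func cassetteKey(requestBody string) string {
	sum := sha256.Sum256([]byte(requestBody))
	return hex.EncodeToString(sum[:8])
}

// fixtureKey models the fixture lookup: pipeline and step name only.
func fixtureKey(pipeline, step string) string {
	return "fixtures/" + pipeline + "/" + step + ".json"
}

func main() {
	before := `{"prompt": "Score this job 1-5 for fit."}`
	after := `{"prompt": "Score this job from 1 to 5 for fit."}`

	// The cassette key changes with the prompt wording...
	fmt.Println(cassetteKey(before) == cassetteKey(after))
	// ...while the fixture path depends only on the step name.
	fmt.Println(fixtureKey("test-screener", "classify"))
}
```

The trade-off is that a fixture will not notice when a prompt change should have produced a different response; it only tests what the pipeline does with a given response.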

Series

This is Production AI Automation Notes #11. The series covers approval gates, token budgets, SQLite dedup, prompt-injection defense, PDF cite verification, and deterministic step pipelines — the discipline of running LLM agents outside a demo.

Reference implementation: draftcat (Go, MIT). Follow @renezander030 for new entries.

Reader contributions

How do you test agent pipelines today? Drop a comment with: language, framework (LangGraph / CrewAI / custom), how you stub the model (mock / VCR / fixture / live), and what broke last time a prompt changed.

Changelog

2026-06-15

  • Initial publish. Covers fixture layout, the three step-type shapes, the runner, smoke test, CI config, verbatim error strings, debug flow.
  • Skipped gates: hardware matrix (not hardware-bound), model-picks table (topic is testing harness, not model selection).