FTL Evaluation Harness

A system for evaluating whether AI agents actually learn.


The Problem

A typical framing of AI agent evaluation focuses on task completion: did the agent produce the right output? As usual with such framings, the word "right" is doing enormous work. Task completion tells you whether an agent succeeded once. It says nothing about whether the agent will succeed better next time.

This harness measures something different: learning. Did patterns extracted from one campaign transfer to another? Did failures encountered once become failures prevented thereafter?


Two Tracks of Transfer

We chart two extremes. At one end: an agent that starts fresh every time, making the same mistakes across runs. At the other: an agent that accumulates knowledge, surfaces relevant patterns, avoids known pitfalls.

The harness tests transfer along two tracks:

Transfer Track (Sequential)

webhook-handler → adapter-builder → sync-service → transform-pipe
       │                 │                │               │
       └─────────────────┴────────────────┴───────────────┘
                                 │
                     memory.json accumulates

Each template runs sequentially. Patterns extracted from webhook-handler are available to adapter-builder. Failures catalogued in sync-service become pre-flight checks for transform-pipe.

Does accumulated knowledge compound?
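The actual orchestration lives in scripts/transfer_eval.sh; the sketch below is only a conceptual Python rendering of the accumulation step, with a hypothetical run_template() standing in for a full campaign run and the shared memory path assumed to be memory/memory.json.

import json
from pathlib import Path

MEMORY = Path("memory/memory.json")   # assumed location of the shared memory file
TEMPLATES = ["webhook-handler", "adapter-builder", "sync-service", "transform-pipe"]

def run_template(template: str, memory: dict) -> dict:
    # Hypothetical stand-in: the real campaign is driven by scripts/run.sh.
    # It should return whatever new patterns and failures the run extracted.
    return {"patterns": [], "failures": []}

memory = (json.loads(MEMORY.read_text()) if MEMORY.exists()
          else {"version": "2.0", "patterns": [], "failures": []})
MEMORY.parent.mkdir(parents=True, exist_ok=True)

for template in TEMPLATES:
    delta = run_template(template, memory)           # run sees everything learned so far
    memory["patterns"].extend(delta["patterns"])
    memory["failures"].extend(delta["failures"])
    MEMORY.write_text(json.dumps(memory, indent=2))  # next template starts from richer memory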

Generalization Track (Parallel)

         ┌─────────────────────────┐
         │   anki    errors        │
         │   pipeline  refactor    │
         └─────────────────────────┘
                     │
         patterns from transfer track

Different domains, same accumulated memory. Integration patterns learned from webhook pipelines are injected into a flashcard app, an error parser, and a refactoring task.

Do patterns transfer across domains?


What We Measure

Six dimensions of orchestration quality:

  1. Cognitive Efficiency — Tokens per useful output
  2. Structural Fidelity — Protocol adherence
  3. Decision Quality — Right choices at branch points
  4. Pattern Emergence — Knowledge extracted and reused (SAVE)
  5. Error Recovery — Graceful failure handling
  6. Knowledge Accumulation — System improves over time (LOAD)

The first two are measurable directly. The remaining four emerge through reflection: the tooling generates prompts, and humans do the noticing.


Quick Start

cd ftl_eval

# Full transfer learning evaluation
./scripts/transfer_eval.sh

# Single template run
./meta_eval.sh v1 anki

# See what exists
./eval.sh status

# Compare runs
./eval.sh compare anki-v1 anki-v2

The Evaluation Flow

run → capture → reflect → learn → integrate
                   ↑                    ↓
                   └──── FTL improves ──┘

  1. Run: Execute a campaign against a template
  2. Capture: Extract metrics, transcripts, memory deltas
  3. Reflect: Generate prompts surfacing what to look for
  4. Learn: Extract insights, update the chronicle
  5. Integrate: Create decision records that feed back into FTL
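For one template, the first three phases map directly onto documented eval.sh commands; a minimal sketch, assuming the run-id follows the <template>-<version> pattern seen in examples like anki-v1:

import subprocess

template, version = "anki", "v1"
run_id = f"{template}-{version}"   # assumed naming, matching examples like anki-v1

# run -> capture -> reflect, each a documented eval.sh subcommand
subprocess.run(["./eval.sh", "run", template, version], check=True)
subprocess.run(["./eval.sh", "capture", run_id], check=True)
subprocess.run(["./eval.sh", "reflect", run_id], check=True)

# learn and integrate are reflection-driven: read the generated prompts,
# update the chronicle and memory, and record decisions that feed back into FTL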

Memory Format

Knowledge lives in a single file. Flat structure, tagged entries:

{
  "version": "2.0",
  "patterns": [
    {
      "name": "transform-plus-isoformat",
      "when": "Delta includes date fields stored in SQLite",
      "do": "Use db.create(transform=True) AND compare with .isoformat()",
      "signal": 7,
      "tags": ["date", "sqlite"]
    }
  ],
  "failures": [
    {
      "name": "date-string-mismatch",
      "symptom": "date comparison returns TypeError",
      "fix": "Add .isoformat() to date.today()",
      "prevent": "grep -E 'date\\.today\\(\\)' *.py | grep -v isoformat"
    }
  ]
}

Patterns encode when and do. Failures encode symptom, fix, and prevent.
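Because prevent entries are shell commands, turning catalogued failures into pre-flight checks can be as simple as running each one and flagging any output; a minimal sketch, assuming the memory file sits at memory/memory.json:

import json
import subprocess
from pathlib import Path

memory = json.loads(Path("memory/memory.json").read_text())

# Any output from a prevent command means a known failure mode is present in the code.
for failure in memory["failures"]:
    check = failure.get("prevent")
    if not check:
        continue
    result = subprocess.run(check, shell=True, capture_output=True, text=True)
    if result.stdout.strip():
        print(f"[preflight] {failure['name']}: {failure['fix']}")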

The signal field tracks reinforcement: patterns used successfully gain weight; patterns that fail lose it.
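A minimal sketch of that update, assuming the caller knows which patterns the run applied and whether they held up (the reinforce helper and its used argument are illustrative):

def reinforce(memory: dict, used: dict[str, bool]) -> dict:
    """Adjust pattern signals after a run.

    `used` maps pattern name -> True if the pattern was applied and worked,
    False if it was applied and misled the run. Unused patterns are untouched.
    """
    for pattern in memory["patterns"]:
        outcome = used.get(pattern["name"])
        if outcome is True:
            pattern["signal"] += 1                              # gains weight
        elif outcome is False:
            pattern["signal"] = max(0, pattern["signal"] - 1)   # loses weight
    return memory

# e.g. reinforce(memory, {"transform-plus-isoformat": True})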


Evidence Structure

Each run produces artifacts at two locations:

Results (scratch, not committed):

scratch/results/{version}/{template}/
├── campaign.log
├── memory_before.json
├── memory_after.json
├── injection.json
└── agent-*.jsonl

Evidence (committed):

evidence/runs/{run-id}/
├── metrics.json          # Epiplexity, learning, memory delta
├── transcript.md         # Human-readable agent trace
├── info_theory.json      # Structural analysis
└── evaluation/           # Evaluator outputs
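The memory-delta portion of metrics.json can be derived by diffing the two snapshots a run leaves in scratch/results/; a minimal sketch (the patterns_added/failures_added keys are illustrative, not the exact metrics.json schema):

import json
from pathlib import Path

def memory_delta(results_dir: str) -> dict:
    # Diff the before/after memory snapshots from scratch/results/{version}/{template}/
    root = Path(results_dir)
    before = json.loads((root / "memory_before.json").read_text())
    after = json.loads((root / "memory_after.json").read_text())

    def names(mem: dict, key: str) -> set:
        return {entry["name"] for entry in mem.get(key, [])}

    return {
        "patterns_added": sorted(names(after, "patterns") - names(before, "patterns")),
        "failures_added": sorted(names(after, "failures") - names(before, "failures")),
    }

# e.g. memory_delta("scratch/results/v1/anki")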

Epiplexity

Three values characterize run structure:

  • ST (Structural): How much follows learnable patterns
  • HT (Entropy): Retries, fallbacks, unexpected branches
  • IGR (Information Gain Ratio): ST / (ST + HT)

jq '.epiplexity' evidence/runs/anki-v1/metrics.json
# {"ST": 58.7, "HT": 7.4, "IGR": 0.89, "interpretation": "highly structured"}

IGR above 0.8 means predictable execution. Below 0.5, the run is exploring more than exploiting. The question is whether the ratio moves in the right direction across runs.
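The ratio itself is simple arithmetic over the two components; a minimal sketch using the thresholds above (the interpretation labels are illustrative):

def igr(st: float, ht: float) -> dict:
    # Information Gain Ratio: the share of a run that followed learnable structure.
    ratio = st / (st + ht) if (st + ht) else 0.0
    if ratio > 0.8:
        label = "highly structured"   # predictable execution
    elif ratio < 0.5:
        label = "exploring"           # more exploration than exploitation
    else:
        label = "mixed"
    return {"ST": st, "HT": ht, "IGR": round(ratio, 2), "interpretation": label}

# igr(58.7, 7.4) -> {"ST": 58.7, "HT": 7.4, "IGR": 0.89, "interpretation": "highly structured"}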


Templates

Template   What It Tests
anki       Flashcard app with spaced repetition. Protocol fidelity, date handling.
pipeline   CSV data pipeline. Multi-task lineage, error propagation.
errors     Config parser with validation. Error recovery, edge cases.
refactor   Task manager enhancement. Existing tests must pass.

Each template defines a campaign.md (what to build) and test_app.py (verification).


Directory Structure

ftl_eval/
├── eval.sh                 # Unified entry point
├── meta_eval.sh            # 8-phase single-template loop
├── meta_eval_suite.sh      # Multi-template parallel suite
│
├── scripts/
│   ├── run.sh              # propagate → setup → campaign → collect
│   ├── setup.sh            # Create environment, seed memory
│   ├── campaign.sh         # Run Claude campaign
│   ├── collect.sh          # Collect agent logs
│   ├── transfer_eval.sh    # Sequential transfer evaluation
│   ├── eval-save.sh        # Evaluate memory SAVE quality
│   └── eval-load.sh        # Evaluate memory LOAD quality
│
├── instruments/
│   ├── capture.py          # Metrics, transcripts, memory delta
│   ├── compare.py          # Delta analysis between runs
│   └── info_theory.py      # Epiplexity computation
│
├── templates/              # Test environments
├── memory/                 # Cross-run learning
├── evidence/               # Captured artifacts
└── reflections/            # Human-driven observation

Commands

Command                               Description
./eval.sh run <template> <version>    Full evaluation cycle
./eval.sh capture <run-id>            Extract evidence from results
./eval.sh compare <old> <new>         Delta analysis
./eval.sh reflect <run-id>            Generate reflection prompts
./eval.sh status                      Show available runs and evidence
./scripts/transfer_eval.sh            Full transfer learning evaluation
./meta_eval.sh <version> <template>   8-phase single-template loop

Success Criteria

Metric                Target             Interpretation
Save Quality          >7/10              Patterns extracted are reusable
Load Quality          >7/10              Injected patterns influenced behavior
Pattern Utilization   >60%               Available patterns were used
Token Reduction       >20% vs baseline   Memory reduces exploration cost
Epiplexity IGR        >0.8               Structured execution
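A small gate that compares a run's metrics against these targets might look like the sketch below (the metric keys are illustrative; the real layout lives in metrics.json):

TARGETS = {
    "save_quality": 7.0,           # out of 10
    "load_quality": 7.0,           # out of 10
    "pattern_utilization": 0.60,   # fraction of available patterns actually used
    "token_reduction": 0.20,       # vs. the no-memory baseline
    "igr": 0.80,
}

def check(metrics: dict) -> dict:
    # Pass/fail per criterion; a missing metric counts as a failure.
    return {name: metrics.get(name, 0) > target for name, target in TARGETS.items()}

# e.g. check({"save_quality": 8, "load_quality": 7.5, "pattern_utilization": 0.7,
#             "token_reduction": 0.25, "igr": 0.89})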

The Point

We're measuring whether patterns learned from building webhook handlers help build flashcard apps. Whether failures encountered in one domain become warnings in another. Whether FTL—the full orchestration of agents, memory, and protocol—actually improves over time.

We log everything. We look at the data. The findings feed back into the system being measured.
