FTL Evaluation Harness

A system for evaluating whether AI agents actually learn.


The Problem

A typical framing of AI agent evaluation focuses on task completion: did the agent produce the right output? As usual with such framings, the word "right" is doing enormous work. Task completion tells you whether an agent succeeded once. It says nothing about whether the agent will succeed better next time.

This harness measures something different: learning. Did patterns extracted from one campaign transfer to another? Did failures encountered once become failures prevented thereafter?


Two Tracks of Transfer

We chart two extremes. At one end: an agent that starts fresh every time, making the same mistakes across runs. At the other: an agent that accumulates knowledge, surfaces relevant patterns, avoids known pitfalls.

The harness tests transfer along two tracks:

Transfer Track (Sequential)

webhook-handler → adapter-builder → sync-service → transform-pipe
       │                 │                │               │
       └─────────────────┴────────────────┴───────────────┘
                                 │
                     memory.json accumulates

Each template runs sequentially. Patterns extracted from webhook-handler are available to adapter-builder. Failures catalogued in sync-service become pre-flight checks for transform-pipe.

Does accumulated knowledge compound?
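The actual orchestration lives in scripts/transfer_eval.sh; the sketch below is only a conceptual Python rendering of the accumulation step, with a hypothetical run_template() standing in for a full campaign run and the shared memory path assumed to be memory/memory.json.

import json
from pathlib import Path

MEMORY = Path("memory/memory.json")   # assumed location of the shared memory file
TEMPLATES = ["webhook-handler", "adapter-builder", "sync-service", "transform-pipe"]

def run_template(template: str, memory: dict) -> dict:
    # Hypothetical stand-in: the real campaign is driven by scripts/run.sh.
    # It should return whatever new patterns and failures the run extracted.
    return {"patterns": [], "failures": []}

memory = (json.loads(MEMORY.read_text()) if MEMORY.exists()
          else {"version": "2.0", "patterns": [], "failures": []})
MEMORY.parent.mkdir(parents=True, exist_ok=True)

for template in TEMPLATES:
    delta = run_template(template, memory)           # run sees everything learned so far
    memory["patterns"].extend(delta["patterns"])
    memory["failures"].extend(delta["failures"])
    MEMORY.write_text(json.dumps(memory, indent=2))  # next template starts from richer memory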

Generalization Track (Parallel)

         ┌─────────────────────────┐
         │   anki    errors        │
         │   pipeline  refactor    │
         └─────────────────────────┘
                     │
         patterns from transfer track

Different domains, same accumulated memory. Integration patterns learned from webhook pipelines are injected into a flashcard app, an error parser, and a refactoring task.

Do patterns transfer across domains?


What We Measure

Six dimensions of orchestration quality:

  1. Cognitive Efficiency — Tokens per useful output
  2. Structural Fidelity — Protocol adherence
  3. Decision Quality — Right choices at branch points
  4. Pattern Emergence — Knowledge extracted and reused (SAVE)
  5. Error Recovery — Graceful failure handling
  6. Knowledge Accumulation — System improves over time (LOAD)

The first two are measurable directly. The remaining four emerge through reflection: the tooling generates prompts, and humans do the noticing.


Quick Start

cd ftl_eval

# Full transfer learning evaluation
./scripts/transfer_eval.sh

# Single template run
./meta_eval.sh v1 anki

# See what exists
./eval.sh status

# Compare runs
./eval.sh compare anki-v1 anki-v2

The Evaluation Flow

run → capture → reflect → learn → integrate
                   ↑                    ↓
                   └──── FTL improves ──┘

  1. Run: Execute a campaign against a template
  2. Capture: Extract metrics, transcripts, memory deltas
  3. Reflect: Generate prompts surfacing what to look for
  4. Learn: Extract insights, update the chronicle
  5. Integrate: Create decision records that feed back into FTL
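For one template, the first three phases map directly onto documented eval.sh commands; a minimal sketch, assuming the run-id follows the <template>-<version> pattern seen in examples like anki-v1:

import subprocess

template, version = "anki", "v1"
run_id = f"{template}-{version}"   # assumed naming, matching examples like anki-v1

# run -> capture -> reflect, each a documented eval.sh subcommand
subprocess.run(["./eval.sh", "run", template, version], check=True)
subprocess.run(["./eval.sh", "capture", run_id], check=True)
subprocess.run(["./eval.sh", "reflect", run_id], check=True)

# learn and integrate are reflection-driven: read the generated prompts,
# update the chronicle and memory, and record decisions that feed back into FTL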

Memory Format

Knowledge lives in a single file. Flat structure, tagged entries:

{
  "version": "2.0",
  "patterns": [
    {
      "name": "transform-plus-isoformat",
      "when": "Delta includes date fields stored in SQLite",
      "do": "Use db.create(transform=True) AND compare with .isoformat()",
      "signal": 7,
      "tags": ["date", "sqlite"]
    }
  ],
  "failures": [
    {
      "name": "date-string-mismatch",
      "symptom": "date comparison returns TypeError",
      "fix": "Add .isoformat() to date.today()",
      "prevent": "grep -E 'date\\.today\\(\\)' *.py | grep -v isoformat"
    }
  ]
}

Patterns encode when and do. Failures encode symptom, fix, and prevent.
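Because prevent entries are shell commands, turning catalogued failures into pre-flight checks can be as simple as running each one and flagging any output; a minimal sketch, assuming the memory file sits at memory/memory.json:

import json
import subprocess
from pathlib import Path

memory = json.loads(Path("memory/memory.json").read_text())

# Any output from a prevent command means a known failure mode is present in the code.
for failure in memory["failures"]:
    check = failure.get("prevent")
    if not check:
        continue
    result = subprocess.run(check, shell=True, capture_output=True, text=True)
    if result.stdout.strip():
        print(f"[preflight] {failure['name']}: {failure['fix']}")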

The signal field tracks reinforcement: patterns used successfully gain weight; patterns that fail lose it.
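A minimal sketch of that update, assuming the caller knows which patterns the run applied and whether they held up (the reinforce helper and its used argument are illustrative):

def reinforce(memory: dict, used: dict[str, bool]) -> dict:
    """Adjust pattern signals after a run.

    `used` maps pattern name -> True if the pattern was applied and worked,
    False if it was applied and misled the run. Unused patterns are untouched.
    """
    for pattern in memory["patterns"]:
        outcome = used.get(pattern["name"])
        if outcome is True:
            pattern["signal"] += 1                              # gains weight
        elif outcome is False:
            pattern["signal"] = max(0, pattern["signal"] - 1)   # loses weight
    return memory

# e.g. reinforce(memory, {"transform-plus-isoformat": True})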


Evidence Structure

Each run produces artifacts at two locations:

Results (scratch, not committed):

scratch/results/{version}/{template}/
├── campaign.log
├── memory_before.json
├── memory_after.json
├── injection.json
└── agent-*.jsonl

Evidence (committed):

evidence/runs/{run-id}/
├── metrics.json          # Epiplexity, learning, memory delta
├── transcript.md         # Human-readable agent trace
├── info_theory.json      # Structural analysis
└── evaluation/           # Evaluator outputs
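The memory-delta portion of metrics.json can be derived by diffing the two snapshots a run leaves in scratch/results/; a minimal sketch (the patterns_added/failures_added keys are illustrative, not the exact metrics.json schema):

import json
from pathlib import Path

def memory_delta(results_dir: str) -> dict:
    # Diff the before/after memory snapshots from scratch/results/{version}/{template}/
    root = Path(results_dir)
    before = json.loads((root / "memory_before.json").read_text())
    after = json.loads((root / "memory_after.json").read_text())

    def names(mem: dict, key: str) -> set:
        return {entry["name"] for entry in mem.get(key, [])}

    return {
        "patterns_added": sorted(names(after, "patterns") - names(before, "patterns")),
        "failures_added": sorted(names(after, "failures") - names(before, "failures")),
    }

# e.g. memory_delta("scratch/results/v1/anki")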

Epiplexity

Three values characterize run structure:

  • ST (Structural): How much follows learnable patterns
  • HT (Entropy): Retries, fallbacks, unexpected branches
  • IGR (Information Gain Ratio): ST / (ST + HT)

jq '.epiplexity' evidence/runs/anki-v1/metrics.json
# {"ST": 58.7, "HT": 7.4, "IGR": 0.89, "interpretation": "highly structured"}

IGR above 0.8 means predictable execution. Below 0.5, the run is exploring more than exploiting. The question is whether the ratio moves in the right direction across runs.
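The ratio itself is simple arithmetic over the two components; a minimal sketch using the thresholds above (the interpretation labels are illustrative):

def igr(st: float, ht: float) -> dict:
    # Information Gain Ratio: the share of a run that followed learnable structure.
    ratio = st / (st + ht) if (st + ht) else 0.0
    if ratio > 0.8:
        label = "highly structured"   # predictable execution
    elif ratio < 0.5:
        label = "exploring"           # more exploration than exploitation
    else:
        label = "mixed"
    return {"ST": st, "HT": ht, "IGR": round(ratio, 2), "interpretation": label}

# igr(58.7, 7.4) -> {"ST": 58.7, "HT": 7.4, "IGR": 0.89, "interpretation": "highly structured"}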


Templates

Template   What It Tests
anki       Flashcard app with spaced repetition. Protocol fidelity, date handling.
pipeline   CSV data pipeline. Multi-task lineage, error propagation.
errors     Config parser with validation. Error recovery, edge cases.
refactor   Task manager enhancement. Existing tests must pass.

Each template defines a campaign.md (what to build) and test_app.py (verification).


Directory Structure

ftl_eval/
├── eval.sh                 # Unified entry point
├── meta_eval.sh            # 8-phase single-template loop
├── meta_eval_suite.sh      # Multi-template parallel suite
│
├── scripts/
│   ├── run.sh              # propagate → setup → campaign → collect
│   ├── setup.sh            # Create environment, seed memory
│   ├── campaign.sh         # Run Claude campaign
│   ├── collect.sh          # Collect agent logs
│   ├── transfer_eval.sh    # Sequential transfer evaluation
│   ├── eval-save.sh        # Evaluate memory SAVE quality
│   └── eval-load.sh        # Evaluate memory LOAD quality
│
├── instruments/
│   ├── capture.py          # Metrics, transcripts, memory delta
│   ├── compare.py          # Delta analysis between runs
│   └── info_theory.py      # Epiplexity computation
│
├── templates/              # Test environments
├── memory/                 # Cross-run learning
├── evidence/               # Captured artifacts
└── reflections/            # Human-driven observation

Commands

Command                               Description
./eval.sh run <template> <version>    Full evaluation cycle
./eval.sh capture <run-id>            Extract evidence from results
./eval.sh compare <old> <new>         Delta analysis
./eval.sh reflect <run-id>            Generate reflection prompts
./eval.sh status                      Show available runs and evidence
./scripts/transfer_eval.sh            Full transfer learning evaluation
./meta_eval.sh <version> <template>   8-phase single-template loop

Success Criteria

Metric                Target             Interpretation
Save Quality          >7/10              Patterns extracted are reusable
Load Quality          >7/10              Injected patterns influenced behavior
Pattern Utilization   >60%               Available patterns were used
Token Reduction       >20% vs baseline   Memory reduces exploration cost
Epiplexity IGR        >0.8               Structured execution
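A small gate that compares a run's metrics against these targets might look like the sketch below (the metric keys are illustrative; the real layout lives in metrics.json):

TARGETS = {
    "save_quality": 7.0,           # out of 10
    "load_quality": 7.0,           # out of 10
    "pattern_utilization": 0.60,   # fraction of available patterns actually used
    "token_reduction": 0.20,       # vs. the no-memory baseline
    "igr": 0.80,
}

def check(metrics: dict) -> dict:
    # Pass/fail per criterion; a missing metric counts as a failure.
    return {name: metrics.get(name, 0) > target for name, target in TARGETS.items()}

# e.g. check({"save_quality": 8, "load_quality": 7.5, "pattern_utilization": 0.7,
#             "token_reduction": 0.25, "igr": 0.89})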

The Point

We're measuring whether patterns learned from building webhook handlers help build flashcard apps. Whether failures encountered in one domain become warnings in another. Whether FTL—the full orchestration of agents, memory, and protocol—actually improves over time.

We log everything. We look at the data. The findings feed back into the system being measured.
