A system for evaluating whether AI agents actually learn.
A typical framing of AI agent evaluation focuses on task completion: did the agent produce the right output? As usual with such framings, the word "right" is doing enormous work. Task completion tells you whether an agent succeeded once. It says nothing about whether the agent will succeed better next time.
This harness measures something different: learning. Did patterns extracted from one campaign transfer to another? Did failures encountered once become failures prevented thereafter?
We chart two extremes. At one end: an agent that starts fresh every time, making the same mistakes across runs. At the other: an agent that accumulates knowledge, surfaces relevant patterns, avoids known pitfalls.
The harness tests both questions, one track apiece:
```
webhook-handler → adapter-builder → sync-service → transform-pipe
       │                 │               │               │
       └─────────────────┴───────────────┴───────────────┘
                                │
                     memory.json accumulates
```
The templates run in sequence. Patterns extracted from webhook-handler are available to adapter-builder. Failures catalogued in sync-service become pre-flight checks for transform-pipe.
Does accumulated knowledge compound?
```
          ┌─────────────────────────┐
          │   anki        errors    │
          │   pipeline    refactor  │
          └─────────────────────────┘
                       │
          patterns from transfer track
```
Different domains, same accumulated memory. Integration patterns learned from webhook pipelines are injected into a flashcard app, an error parser, and a refactoring task.
Do patterns transfer across domains?
Six dimensions of orchestration quality:
- Cognitive Efficiency — Tokens per useful output
- Structural Fidelity — Protocol adherence
- Decision Quality — Right choices at branch points
- Pattern Emergence — Knowledge extracted and reused (SAVE)
- Error Recovery — Graceful failure handling
- Knowledge Accumulation — System improves over time (LOAD)
The first two are directly measurable. The remaining four emerge through reflection: the tooling generates the prompts; humans do the noticing.
```bash
cd ftl_eval
# Full transfer learning evaluation
./scripts/transfer_eval.sh
# Single template run
./meta_eval.sh v1 anki
# See what exists
./eval.sh status
# Compare runs
./eval.sh compare anki-v1 anki-v2
```

```
run → capture → reflect → learn → integrate
 ↑                                    ↓
 └─────────── FTL improves ──────────┘
```
- Run: Execute a campaign against a template
- Capture: Extract metrics, transcripts, memory deltas
- Reflect: Generate prompts surfacing what to look for
- Learn: Extract insights, update the chronicle
- Integrate: Create decision records that feed back into FTL
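The scripted phases of that loop can be driven from a few lines of glue code. This is a minimal sketch, assuming run ids follow the `template-version` form seen in the examples above; it wraps the `eval.sh` subcommands documented below, and the learn/integrate phases remain human-driven.

```python
# Sketch of one pass through the loop; not the harness's own driver.
import subprocess

def eval_cycle(template: str, version: str) -> None:
    run_id = f"{template}-{version}"                       # e.g. "anki-v1" (assumed format)
    # Run: execute a campaign against the template
    subprocess.run(["./eval.sh", "run", template, version], check=True)
    # Capture: extract metrics, transcripts, and the memory delta
    subprocess.run(["./eval.sh", "capture", run_id], check=True)
    # Reflect: generate prompts surfacing what to look for
    subprocess.run(["./eval.sh", "reflect", run_id], check=True)
    # Learn + Integrate happen offline: read the prompts, update the
    # chronicle, write decision records that feed back into FTL.

eval_cycle("anki", "v1")
```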
Knowledge lives in a single file. Flat structure, tagged entries:
```json
{
  "version": "2.0",
  "patterns": [
    {
      "name": "transform-plus-isoformat",
      "when": "Delta includes date fields stored in SQLite",
      "do": "Use db.create(transform=True) AND compare with .isoformat()",
      "signal": 7,
      "tags": ["date", "sqlite"]
    }
  ],
  "failures": [
    {
      "name": "date-string-mismatch",
      "symptom": "date comparison returns TypeError",
      "fix": "Add .isoformat() to date.today()",
      "prevent": "grep -E 'date\\.today\\(\\)' *.py | grep -v isoformat"
    }
  ]
}
```

Patterns encode `when` and `do`. Failures encode `symptom`, `fix`, and `prevent`. The `signal` field tracks reinforcement: patterns used successfully gain weight; patterns that fail lose it.
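To make the schema concrete, here is a rough sketch of how an injector could select and reinforce entries from this file. The function names, the tag-overlap rule, and the `memory/memory.json` path are illustrative assumptions, not the harness's actual API.

```python
# Illustrative memory access, assuming the memory.json schema shown above.
# Helper names and the reinforcement rule are hypothetical.
import json

def load_memory(path: str = "memory/memory.json") -> dict:
    with open(path) as f:
        return json.load(f)

def relevant_patterns(memory: dict, task_tags: set) -> list:
    """Pick patterns whose tags overlap the task, strongest signal first."""
    hits = [p for p in memory["patterns"] if task_tags & set(p.get("tags", []))]
    return sorted(hits, key=lambda p: p.get("signal", 0), reverse=True)

def reinforce(memory: dict, name: str, worked: bool) -> None:
    """Patterns that help gain weight; patterns that fail lose it."""
    for p in memory["patterns"]:
        if p["name"] == name:
            p["signal"] = max(0, p.get("signal", 0) + (1 if worked else -1))

memory = load_memory()
for pattern in relevant_patterns(memory, {"date", "sqlite"}):
    print(pattern["when"], "->", pattern["do"])
```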
Each run produces artifacts at two locations:
Results (scratch, not committed):
```
scratch/results/{version}/{template}/
├── campaign.log
├── memory_before.json
├── memory_after.json
├── injection.json
└── agent-*.jsonl
```
Evidence (committed):
```
evidence/runs/{run-id}/
├── metrics.json        # Epiplexity, learning, memory delta
├── transcript.md       # Human-readable agent trace
├── info_theory.json    # Structural analysis
└── evaluation/         # Evaluator outputs
```
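The memory delta reported in metrics.json can be pictured as a diff over the before/after snapshots captured above. A minimal sketch under the memory schema shown earlier; it makes no claim about capture.py's actual implementation.

```python
# Hypothetical memory-delta computation over the before/after snapshots;
# capture.py's real logic may differ.
import json

def names(entries: list) -> set:
    return {e["name"] for e in entries}

def memory_delta(before_path: str, after_path: str) -> dict:
    with open(before_path) as f:
        before = json.load(f)
    with open(after_path) as f:
        after = json.load(f)
    return {
        "patterns_added": sorted(names(after["patterns"]) - names(before["patterns"])),
        "failures_added": sorted(names(after["failures"]) - names(before["failures"])),
        "signal_changes": {
            a["name"]: a["signal"] - b["signal"]
            for a in after["patterns"]
            for b in before["patterns"]
            if a["name"] == b["name"] and a["signal"] != b["signal"]
        },
    }

print(json.dumps(memory_delta("memory_before.json", "memory_after.json"), indent=2))
```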
Three values characterize run structure:
- ST (Structural): How much follows learnable patterns
- HT (Entropy): Retries, fallbacks, unexpected branches
- IGR (Information Gain Ratio): ST / (ST + HT)
```bash
jq '.epiplexity' evidence/runs/anki-v1/metrics.json
# {"ST": 58.7, "HT": 7.4, "IGR": 0.89, "interpretation": "highly structured"}
```

IGR above 0.8 means predictable execution. Below 0.5, exploring more than exploiting. The question is whether the ratio moves in the right direction across runs.
| Template | What It Tests |
|---|---|
| `anki` | Flashcard app with spaced repetition. Protocol fidelity, date handling. |
| `pipeline` | CSV data pipeline. Multi-task lineage, error propagation. |
| `errors` | Config parser with validation. Error recovery, edge cases. |
| `refactor` | Task manager enhancement. Existing tests must pass. |
Each template defines a `campaign.md` (what to build) and a `test_app.py` (verification).
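As an illustration only (the templates' contents are not reproduced here), a verification check in the anki template's test_app.py might pin down the date-handling behavior that the failure catalogue warns about. The `app` interface below is invented for the example.

```python
# Hypothetical fragment of a template's test_app.py; the real tests differ.
from datetime import date
from app import create_card, next_review   # assumed interface of the built app

def test_review_date_round_trips_as_iso_string():
    card = create_card(front="2 + 2", back="4")
    # Dates stored in SQLite should compare as ISO strings, which is exactly
    # the pitfall the date-string-mismatch failure entry records.
    assert next_review(card) == date.today().isoformat()
```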
```
ftl_eval/
├── eval.sh                 # Unified entry point
├── meta_eval.sh            # 8-phase single-template loop
├── meta_eval_suite.sh      # Multi-template parallel suite
│
├── scripts/
│   ├── run.sh              # propagate → setup → campaign → collect
│   ├── setup.sh            # Create environment, seed memory
│   ├── campaign.sh         # Run Claude campaign
│   ├── collect.sh          # Collect agent logs
│   ├── transfer_eval.sh    # Sequential transfer evaluation
│   ├── eval-save.sh        # Evaluate memory SAVE quality
│   └── eval-load.sh        # Evaluate memory LOAD quality
│
├── instruments/
│   ├── capture.py          # Metrics, transcripts, memory delta
│   ├── compare.py          # Delta analysis between runs
│   └── info_theory.py      # Epiplexity computation
│
├── templates/              # Test environments
├── memory/                 # Cross-run learning
├── evidence/               # Captured artifacts
└── reflections/            # Human-driven observation
```
| Command | Description |
|---|---|
| `./eval.sh run <template> <version>` | Full evaluation cycle |
| `./eval.sh capture <run-id>` | Extract evidence from results |
| `./eval.sh compare <old> <new>` | Delta analysis |
| `./eval.sh reflect <run-id>` | Generate reflection prompts |
| `./eval.sh status` | Show available runs and evidence |
| `./scripts/transfer_eval.sh` | Full transfer learning evaluation |
| `./meta_eval.sh <version> <template>` | 8-phase single template loop |
| Metric | Target | Interpretation |
|---|---|---|
| Save Quality | >7/10 | Patterns extracted are reusable |
| Load Quality | >7/10 | Injected patterns influenced behavior |
| Pattern Utilization | >60% | Available patterns were used |
| Token Reduction | >20% vs baseline | Memory reduces exploration cost |
| Epiplexity IGR | >0.8 | Structured execution |
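A sketch of how these targets could be checked between a baseline run and a memory-backed run. The metrics.json field names used here (`save_quality`, `load_quality`, `pattern_utilization`, `igr`, `tokens`) are assumptions for illustration, not the documented schema.

```python
# Hypothetical success-criteria check; metrics.json field names are assumed.
import json

TARGETS = {"save_quality": 7, "load_quality": 7, "pattern_utilization": 0.60, "igr": 0.80}

def check_targets(baseline_path: str, run_path: str) -> dict:
    with open(baseline_path) as f:
        base = json.load(f)
    with open(run_path) as f:
        run = json.load(f)
    results = {name: run.get(name, 0) > floor for name, floor in TARGETS.items()}
    # Token Reduction: memory should cut exploration cost by >20% vs baseline
    results["token_reduction"] = run["tokens"] < 0.8 * base["tokens"]
    return results

print(check_targets("evidence/runs/anki-v1/metrics.json",
                    "evidence/runs/anki-v2/metrics.json"))
```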
We're measuring whether patterns learned from building webhook handlers help build flashcard apps. Whether failures encountered in one domain become warnings in another. Whether FTL—the full orchestration of agents, memory, and protocol—actually improves over time.
We log everything. We look at the data. The findings feed back into the system being measured.