- 12-Factor Agents: Patterns of reliable LLM applications — Dex Horthy, HumanLayer —
What are the principles we can use to build LLM-powered software that is actually good enough to put in the hands of production customers?
- No Vibes Allowed: Solving Hard Problems in Complex Codebases — Dex Horthy, HumanLayer —
This is about the intersection between context engineering and harness engineering.
- Böckeler, B. (2026, April) Harness engineering for coding agent users. MartinFowler.com.
- Greyling, C. (2026, March) The Rise of AI Harness Engineering. Cobus Greyling on LLMs, NLU, NLP, chatbots & voicebots.
- Trivedy, V. (2026, March) The Anatomy of an Agent Harness. LangChain Blog.
- Hashimoto, M. (2026, February) My AI Adoption Journey. Personal blog.
- Purohit, R. (2026, February) How OpenAI’s Codex Team Uses Their Coding Agent
- OpenAI. (2026, February) Harness engineering: leveraging Codex in an agent-first world. OpenAI Engineering.
- Anthropic. (2025, November) Effective harnesses for long-running agents. Engineering at Anthropic.
- Li, H., Wang, Z., Dai, Q., Nie, Y., Peng, J., Liu, R., ... & Song, D. (2026). OpenSage: Self-programming Agent Generation Engine. arXiv preprint arXiv:2602.16891. —
Agent development kits (ADKs) provide effective platforms and tooling for constructing agents, and their designs are critical to the constructed agents' performance, especially the functionality for agent topology, tools, and memory. However, current ADKs either lack sufficient functional support or rely on humans to manually design these components, limiting agents' generalizability and overall performance. We propose OpenSage, the first ADK that enables LLMs to automatically create agents with self-generated topology and toolsets while providing comprehensive and structured memory support. OpenSage offers effective functionality for agents to create and manage their own sub-agents and toolkits. It also features a hierarchical, graph-based memory system for efficient management and a specialized toolkit tailored to software engineering tasks. Extensive experiments across three state-of-the-art benchmarks with various backbone models demonstrate the advantages of OpenSage over existing ADKs. We also conduct rigorous ablation studies to demonstrate the effectiveness of our design for each component. We believe OpenSage can pave the way for the next generation of agent development, shifting the focus from human-centered to AI-centered paradigms.
- Lee, Y., Nair, R., Zhang, Q., Lee, K., Khattab, O., & Finn, C. (2026). Meta-Harness: End-to-End Optimization of Model Harnesses. —
Meta-Harness takes a different approach: it gives the proposer a filesystem containing the full source code, scores, and execution traces of every prior candidate. The proposer is a coding agent (Claude Code) that reads what it needs via grep, cat, and other standard tools. In practice, this means up to 10M tokens of diagnostic context per step, vs. at most 26K for all prior methods we surveyed. The result is that the proposer can trace a failure back to the specific harness decision that caused it, rather than guessing from a score.
- He, Z., Huang, S., Qu, X., Li, Y., Zhu, T., Cheng, Y., & Yang, Y. (2026). GEMS: Agent-Native Multimodal Generation with Memory and Skills. arXiv preprint arXiv:2603.28088.
- Pan, L., Zou, L., Guo, S., Ni, J., & Zheng, H. T. (2026, March). Natural-Language Agent Harnesses. arXiv preprint arXiv:2603.25723.
- Jiang, P., Lin, J., Shi, Z., Wang, Z., He, L., Wu, Y., ... & Han, J. (2025). Adaptation of agentic AI. arXiv preprint arXiv:2512.16301.
- Xia, C. S., Deng, Y., Dunn, S., & Zhang, L. (2025). Demystifying LLM-based software engineering agents. Proceedings of the ACM on Software Engineering, 2(FSE), 801-824.
- https://github.com/VoltAgent/awesome-ai-agent-papers
- Asadi, M., O'Sullivan, J. W., Cao, F., Nedaee, T., Fardi, K., Li, F. F., ... & Ashley, E. (2026). Mirage: The Illusion of Visual Understanding. arXiv preprint arXiv:2603.21687. —
Multimodal AI systems have achieved remarkable performance across a broad range of real-world tasks, yet the mechanisms underlying visual–language reasoning remain surprisingly poorly understood. We report three findings that challenge prevailing assumptions about how these systems process and integrate visual information. First, frontier models readily generate detailed image descriptions and elaborate reasoning traces, including pathology-biased clinical findings, for images never provided; we term this phenomenon mirage reasoning. Second, without any image input, models also attain strikingly high scores across general and medical multimodal benchmarks, bringing into question their utility and design. In the most extreme case, our model achieved the top rank on a standard chest X-ray question-answering benchmark without access to any images. Third, when models were explicitly instructed to guess answers without image access, rather than being implicitly prompted to assume images were present, performance declined markedly. Explicit guessing appears to engage a more conservative response regime, in contrast to the mirage regime in which models behave as though images have been provided. These findings expose fundamental vulnerabilities in how visual–language models reason and are evaluated, pointing to an urgent need for private benchmarks that eliminate textual cues enabling non-visual inference, particularly in medical contexts where miscalibrated AI carries the greatest consequence. We introduce B-Clean as a principled solution for fair, vision-grounded evaluation of multimodal AI systems.
- Li, Z., Yang, Z., Zhao, H., Zhao, A., Tang, S., Yang, K., ... & Jin, C. (2026). Goedel-Code-Prover: Hierarchical Proof Search for Open State-of-the-Art Code Verification. arXiv preprint arXiv:2603.19329. —
Large language models (LLMs) can generate plausible code but offer limited guarantees of correctness. Formally verifying that implementations satisfy specifications requires constructing machine-checkable proofs, a task that remains beyond current automation. We propose a hierarchical proof search framework for automated code verification in Lean 4 that decomposes complex verification goals into structurally simpler subgoals before attempting tactic-level proving. Central to our approach is a principled decomposition score that combines constructive justification with structural effectiveness. Crucially, this score serves as both the training reward and the inference-time ranking criterion, ensuring strict alignment between optimization and deployment. We train Goedel-Code-Prover-8B, a single unified policy for both decomposition and completion, via supervised initialization followed by hybrid reinforcement learning, where a continuous decomposition reward drives planning exploration while supervised replay stabilizes proof generation. On three Lean-based code verification benchmarks comprising 427 tasks, our 8B-parameter model achieves a 62.0% prove success rate, a 2.6× improvement over the strongest baseline, surpassing neural provers up to 84× larger. We further observe consistent inference-time scaling: success rates improve monotonically with search iterations and sampling budget, with our trained model achieving greater efficiency than frontier off-the-shelf models of comparable scale.
- Formal verification
- Fisher, K., Launchbury, J., & Richards, R. (2017). The HACMS program: using formal methods to eliminate exploitable bugs. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 375(2104), 20150401.
- Your AI Just Wrote 500 Lines of Code. Can You Prove Any of It Works? A framework for figuring out when AI-generated code can be formally verified — and when you're kidding yourself.
OpenSage (Open Self-programming Agent Generation Engine) is an AI-centric agent framework designed to shift agent development from a human-engineered, fixed paradigm to an AI-driven, self-programming one. Instead of requiring developers to hand-design workflows, tool lists, and memory logic for each task, OpenSage provides a minimal scaffold that lets the model create and orchestrate these components at runtime.
OpenSage is built around three core systems that strongly influence agent performance:
- Self-generating agent topology: the agent can dynamically create, execute, and terminate sub-agents during task execution, supporting both vertical agent topology (decomposing a complex task into sequential sub-tasks handled by specialized sub-agents) and horizontal agent topology (multiple sub-agents execute the same task using distinct plans, then merge results via an agent ensemble mechanism).
- Dynamic tool synthesis and management: the agent can create tools during execution (e.g., scripts, analyzers, generators), supported by a tooling runtime with tool-specific sandboxing and state management. Agents can also create skills with the OpenSage framework.
- Hierarchical Memory Management: target-level long-term memory (a graph database for shareable knowledge) plus execution-based short-term memory (a graph structure for tracking agent runs), with a built-in, dedicated memory agent for memory management that can be enabled with a single line of code.
https://www.opensage-agent.ai/ https://github.com/opensage-agent/opensage-adk
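The two topologies can be pictured with a toy sketch. This is not the OpenSage API — `run_subagent`, `vertical`, and `horizontal` are hypothetical names, used only to illustrate the control flow of sequential decomposition versus parallel ensemble merging.

```python
# Hypothetical illustration of vertical vs. horizontal agent topology.
# Not OpenSage code: run_subagent stands in for spawning an LLM sub-agent.

def run_subagent(plan, task):
    # Stand-in for an LLM sub-agent run; here it just tags the task.
    return f"{plan}:{task}"

def vertical(subtasks):
    """Vertical topology: decompose a task into sequential sub-tasks,
    each handled by a specialized sub-agent, collecting results in order."""
    return [run_subagent("specialist", sub) for sub in subtasks]

def horizontal(task, plans, merge):
    """Horizontal topology: multiple sub-agents execute the same task
    under distinct plans, then an ensemble function merges the candidates."""
    candidates = [run_subagent(plan, task) for plan in plans]
    return merge(candidates)

steps = vertical(["locate", "patch", "test"])
winner = horizontal("fix bug", ["plan-a", "plan-b"], merge=lambda c: c[0])
```

The interesting design point is the merge step: a real ensemble mechanism would score or vote over candidate results rather than take the first.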
Harbor is a framework for evaluating and optimizing agents and models in container environments.
When we released Terminal-Bench in May, we were surprised to see it used in unexpected ways like building custom evals, optimizing prompts, running RL, generating SFT traces, and CI/CD agent testing.
We also learned that defining and managing containerized tasks at scale is hard. We built Harbor to make it easy.
Harbor provides:
- Simple, modular interfaces for environments, agents, and tasks
- All popular CLI agents pre-integrated
- A registry of popular benchmarks and datasets
- Integrations with cloud sandbox providers like Daytona, Modal, E2B and Runloop for horizontal scaling
- Integrations with frameworks like SkyRL and GEPA for optimizing agents
uv tool install harbor
Terminus-2 is Harbor's reference agent implementation, designed as a research-preview agent for evaluating language models' capabilities in terminal environments. It operates entirely autonomously within sandboxed environments and serves as a high-performance neutral test bed for understanding language model agent capabilities.
https://www.harborframework.com/docs/agents/terminus-2
Meta-Harness: 76.4% on Terminal-Bench 2.0 (Claude Opus 4.6)
Meta-Harness extends the Terminus-KIRA agent with environment bootstrapping: before the agent loop starts, it gathers a snapshot of the sandbox environment (working directory, file listing, available languages/tools, package managers, memory) and injects it into the initial prompt. This saves the 2-5 early exploration turns the agent normally spends on `ls`, `which python3`, etc.
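A minimal sketch of that bootstrapping step, using only the Python standard library. `environment_snapshot` is a hypothetical name, not Meta-Harness's actual code; the point is simply that the snapshot is computed once and prepended to the prompt.

```python
# Sketch of environment bootstrapping: snapshot the sandbox before the
# agent loop, then inject the summary into the initial prompt.
import os
import platform
import shutil

def environment_snapshot(max_files=20):
    cwd = os.getcwd()
    files = sorted(os.listdir(cwd))[:max_files]
    # Probe for common tools/package managers on PATH.
    probes = ("python3", "git", "make", "pip", "apt-get", "cargo")
    available = [t for t in probes if shutil.which(t)]
    return (
        f"Working directory: {cwd}\n"
        f"Files: {', '.join(files)}\n"
        f"Platform: {platform.system()} {platform.machine()}\n"
        f"Available tools: {', '.join(available)}"
    )

initial_prompt = "You are a terminal agent.\n\n" + environment_snapshot()
```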
The agent was discovered through automated harness evolution. More details coming soon.
https://github.com/stanford-iris-lab/meta-harness-tbench2-artifact
Terminus-KIRA is an agent harness for Terminal-Bench, built on top of Terminus 2. It boosts frontier model performance on Terminal-Bench through a set of minimal but effective harness-level improvements — native tool calling, multimodal support, execution optimization, and smarter completion verification.
https://github.com/krafton-ai/KIRA
The Rust SDK for building coding agents. Tool execution, LLM streaming, graph memory, sub-agent orchestration, MCP — as composable library functions.
https://cersei.pacifio.dev/docs
Abstract is a complete CLI coding agent built on the Cersei SDK. One binary, zero runtime dependencies, graph memory by default.
cargo install --git https://github.com/pacifio/cersei abstract-cli
The idea: give an AI agent a small but real LLM training setup and let it experiment autonomously overnight. It modifies the code, trains for 5 minutes, checks if the result improved, keeps or discards, and repeats. You wake up in the morning to a log of experiments and (hopefully) a better model. The training code here is a simplified single-GPU implementation of nanochat. The core idea is that you're not touching any of the Python files like you normally would as a researcher. Instead, you are programming the program.md Markdown files that provide context to the AI agents and set up your autonomous research org. The default program.md in this repo is intentionally kept as a bare bones baseline, though it's obvious how one would iterate on it over time to find the "research org code" that achieves the fastest research progress, how you'd add more agents to the mix, etc. A bit more context on this project is here in this tweet and this tweet.
https://github.com/karpathy/autoresearch
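The overnight loop described above amounts to a keep-or-discard search. The sketch below is purely illustrative: perturbing a learning rate stands in for the agent editing training code, and a scoring stub stands in for the 5-minute training run.

```python
# Illustrative keep-or-discard research loop (not the autoresearch code):
# propose a change, run a short "training" job, keep only improvements, log.
import random

def propose_change(state):
    # Stand-in for "agent edits the code": perturb a hyperparameter.
    new = dict(state)
    new["lr"] = state["lr"] * random.choice([0.5, 1.0, 2.0])
    return new

def train_and_score(state):
    # Stand-in for a 5-minute training run; lower is better.
    return abs(state["lr"] - 0.01)  # pretend lr=0.01 is optimal

def autoresearch(state, steps=50, seed=0):
    random.seed(seed)
    best_score, log = train_and_score(state), []
    for step in range(steps):
        candidate = propose_change(state)
        score = train_and_score(candidate)
        kept = score < best_score
        if kept:  # keep the change, otherwise discard it
            state, best_score = candidate, score
        log.append((step, kept, score))
    return state, log

final, log = autoresearch({"lr": 0.1})
```

Because a change is kept only when the metric improves, the final state is never worse than the starting one — the log is what you read in the morning.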
https://github.com/krafton-ai/Prompt2Policy
Mux is a desktop & browser application for parallel agentic development. It enables developers to plan and execute tasks with multiple AI agents on local or remote compute.
- Isolated workspaces with central view on git divergence (docs)
- Multi-model (`sonnet-4-*`, `grok-*`, `gpt-5-*`, `opus-4-*`)
- VS Code Extension: Jump into Mux workspaces directly from VS Code (docs)
- Supporting UI and keybinds for efficiently managing a suite of agents
- Rich markdown outputs (mermaid diagrams, LaTeX, etc.)
Mux has a custom agent loop but much of the core UX is inspired by Claude Code. You'll find familiar features like Plan/Exec mode, vim inputs, and `/compact`, and new ones like opportunistic compaction and mode prompts.
https://github.com/coder/mux https://mux.coder.com/
https://github.com/VoltAgent/voltagent
Pochi is an AI agent designed for software development. It operates within your IDE, using a toolkit of commands to execute complex tasks, from code generation to project-wide refactoring.
https://github.com/TabbyML/pochi
Works with pochi.
Tabby is a self-hosted AI coding assistant, offering an open-source and on-premises alternative to GitHub Copilot. It boasts several key features:
- Self-contained, with no need for a DBMS or cloud service.
- OpenAPI interface, easy to integrate with existing infrastructure (e.g., Cloud IDE).
- Supports consumer-grade GPUs.
https://github.com/TabbyML/tabby
Pi is a minimal terminal coding harness. Adapt pi to your workflows, not the other way around. Extend it with TypeScript extensions, skills, prompt templates, and themes. Bundle them as pi packages and share via npm or git.
Pi ships with powerful defaults but skips features like sub-agents and plan mode. Ask pi to build what you want, or install a package that does it your way.
Four modes: interactive, print/JSON, RPC, and SDK. See clawdbot for a real-world integration.
Reads papers, searches the web, writes drafts, runs experiments, and cites every claim. All locally on your computer.
Has skills that could probably be adapted to fit any agent.
goose is your on-machine AI agent, capable of automating complex development tasks from start to finish. More than just code suggestions, goose can build entire projects from scratch, write and execute code, debug failures, orchestrate workflows, and interact with external APIs - autonomously.
https://github.com/block/goose
- Original
- Free Code — Maybe not exactly above water, but it is the result of the code map being accidentally shipped to NPM.
- Claw Code —
The fastest repo in history to surpass 100K stars ⭐. Better harness tools that get real things done. Built in Rust using oh-my-codex.
- kivo360/OmoiOS —
Open-source orchestration runtime that turns specs into PRs using parallel agent swarms in isolated sandboxes.
The open source coding agent.
💻 Plandex is a terminal-based AI development tool that can plan and execute large coding tasks that span many steps and touch dozens of files. It can handle up to 2M tokens of context directly (~100k per file), and can index directories with 20M tokens or more using tree-sitter project maps.
🔬 A cumulative diff review sandbox keeps AI-generated changes separate from your project files until they are ready to go. Command execution is controlled so you can easily roll back and debug. Plandex helps you get the most out of AI without leaving behind a mess in your project.
🧠 Combine the best models from Anthropic, OpenAI, Google, and open source providers to build entire features and apps with a robust terminal-based workflow.
🚀 Plandex is capable of full autonomy—it can load relevant files, plan and implement changes, execute commands, and automatically debug—but it's also highly flexible and configurable, giving developers fine-grained control and a step-by-step review process when needed.
💪 Plandex is designed to be resilient to large projects and files. If you've found that other tools struggle once your project gets past a certain size or the changes are too complex, give Plandex a shot.
Live-SWE-agent is the first live, runtime self-evolving software engineering agent that expands and revises its own capabilities on the fly while working on a real-world issue. Our key insight is that software agents are themselves software systems, and modern LLM-based agents already possess the intrinsic capability to extend or modify their own behavior at runtime.
https://github.com/OpenAutoCoder/live-swe-agent
Agentless is an agentless approach to automatically solve software development problems. To solve each issue, Agentless follows a simple three phase process: localization, repair, and patch validation.
- 🙀 Localization: Agentless employs a hierarchical process to first localize the fault to specific files, then to relevant classes or functions, and finally to fine-grained edit locations
- 😼 Repair: Agentless takes the edit locations and samples multiple candidate patches per bug in a simple diff format
- 😸 Patch Validation: Agentless selects the regression tests to run and generates additional reproduction tests to reproduce the original error. Using the test results, Agentless re-ranks all remaining patches and selects one to submit
https://github.com/OpenAutoCoder/Agentless
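The three phases can be sketched with stand-in heuristics in place of LLM calls. `localize`, `repair`, and `validate` below are hypothetical simplifications of the pipeline's shape, not the Agentless codebase.

```python
# Toy sketch of the Agentless pipeline: localization -> repair -> validation,
# with keyword matching and string "patches" standing in for LLM calls.

def localize(repo, issue_keywords):
    """Hierarchically narrow down to edit locations: here, functions
    whose body mentions any keyword from the issue report."""
    hits = []
    for path, functions in repo.items():
        for name, body in functions.items():
            if any(kw in body for kw in issue_keywords):
                hits.append((path, name))
    return hits

def repair(locations, n_samples=3):
    """Sample multiple candidate patches (diff-like strings) per location."""
    return [f"patch-{i} @ {path}:{fn}"
            for path, fn in locations
            for i in range(n_samples)]

def validate(patches, passes_tests):
    """Filter patches by test results and pick one surviving patch."""
    survivors = [p for p in patches if passes_tests(p)]
    return survivors[0] if survivors else None

repo = {"calc.py": {"div": "return a / b  # crashes on b == 0"}}
locs = localize(repo, ["b == 0"])
patches = repair(locs)
chosen = validate(patches, passes_tests=lambda p: p.startswith("patch-1"))
```

The real system's `passes_tests` runs regression and reproduction tests; the ranking over survivors is where most of the selection signal comes from.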
The OG of terminal agents.
- aider —
AI pair programming in your terminal
- obra/superpowers —
An agentic skills framework & software development methodology that works.
- Syndicate —
A Claude Code plugin that spins up a self-governing outfit to do a job. Give it a goal (build an app, write a contract, design a system) and it stands up an organization that attempts the work, scores itself, evolves its approach, and ships when it's done.
- AI-Assisted Development Methodology —
This is a deterministic, document-driven methodology for human engineers and AI assistants to collaborate on software development. It solves the critical problem of context continuity across AI sessions while maintaining quality through explicit gates.
- Jerry Framework —
Behavioral guardrails and workflow orchestration for Claude Code. Accrues knowledge, wisdom, experience.
- I tested 30+ community Claude Skills for a week. Here’s what actually works (complete list + GitHub links)
- hesreallyhim/awesome-claude-code —
A curated list of awesome skills, hooks, slash-commands, agent orchestrators, applications, and plugins for Claude Code by Anthropic
- https://github.com/VoltAgent/awesome-agent-skills
- BehiSecc/awesome-claude-skills —
A curated list of Claude Skills.
- https://github.com/Jeffallan/claude-skills
- https://github.com/VoltAgent/awesome-design-md
- FactoryAI Skill
- FactoryAI Plugins
- simple-codex — is this the Terminal-Bench 2.0 Simple Codex? There's no Harbor harness I could find.
- https://github.com/VoltAgent/awesome-claude-code-subagents