@adaburrows
Last active April 6, 2026 18:45

AI Harness Engineering

Presentations

Reading

Research

Related Research

Agentic Tools/Environments

OpenSage (from the research paper above)

OpenSage (Open Self-programming Agent Generation Engine) is an AI-centric agent framework designed to shift agent development from a human-engineered, fixed paradigm to an AI-driven, self-programming one. Instead of requiring developers to hand-design workflows, tool lists, and memory logic for each task, OpenSage provides a minimal scaffold that lets the model create and orchestrate these components at runtime.

OpenSage is built around three core systems that strongly influence agent performance:

  • Self-generating agent topology: the agent can dynamically create, execute, and terminate sub-agents during task execution, supporting both vertical agent topology (decomposing a complex task into sequential sub-tasks handled by specialized sub-agents) and horizontal agent topology (multiple sub-agents execute the same task using distinct plans, then merge results via an agent ensemble mechanism).
  • Dynamic tool synthesis and management: the agent can create tools during execution (e.g., scripts, analyzers, generators), supported by a tooling runtime with tool-specific sandboxing and state management. Agents can also create skills with the OpenSage framework.
  • Hierarchical memory management: target-level long-term memory (a graph database for shareable knowledge) plus execution-based short-term memory (a graph structure for tracking agent runs), with a built-in, dedicated memory agent for memory management that can be enabled with a single line of code.
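
The horizontal topology described above can be illustrated with a toy sketch. This is not OpenSage's API: `run_horizontal`, `run_sub_agent`, and the majority-vote merge are hypothetical stand-ins, and the sketch assumes each sub-agent returns a plain answer string that can be compared for equality.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def run_horizontal(task, plans, run_sub_agent):
    """Run one sub-agent per plan on the same task, then merge by majority vote.

    `run_sub_agent(task, plan)` is a hypothetical callable supplied by the caller;
    majority voting is only one possible ensemble mechanism.
    """
    if not plans:
        raise ValueError("at least one plan is required")
    # Each plan gets its own worker so the sub-agents run side by side.
    with ThreadPoolExecutor(max_workers=len(plans)) as pool:
        answers = list(pool.map(lambda plan: run_sub_agent(task, plan), plans))
    # Merge step: the most common answer wins.
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes, answers
```

A vertical topology would instead chain sub-agents, feeding each one's output into the next sub-task.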

https://www.opensage-agent.ai/

https://github.com/opensage-agent/opensage-adk

Harbor

Harbor is a framework for evaluating and optimizing agents and models in container environments.

When we released Terminal-Bench in May, we were surprised to see it used in unexpected ways like building custom evals, optimizing prompts, running RL, generating SFT traces, and CI/CD agent testing.

We also learned that defining and managing containerized tasks at scale is hard. We built Harbor to make it easy.

Harbor provides:

  • Simple, modular interfaces for environments, agents, and tasks
  • All popular CLI agents pre-integrated
  • A registry of popular benchmarks and datasets
  • Integrations with cloud sandbox providers like Daytona, Modal, E2B and Runloop for horizontal scaling
  • Integrations with frameworks like SkyRL and GEPA for optimizing agents

uv tool install harbor

https://harborframework.com/

Terminus-2

Terminus-2 is Harbor's reference agent implementation, designed as a research-preview agent for evaluating language models' capabilities in terminal environments. It operates entirely autonomously within sandboxed environments and serves as a high-performance neutral test bed for understanding language model agent capabilities.

https://www.harborframework.com/docs/agents/terminus-2

Meta-Harness (from the research paper above)

Meta-Harness: 76.4% on Terminal-Bench 2.0 (Claude Opus 4.6)

Meta-Harness extends the Terminus-KIRA agent with environment bootstrapping: before the agent loop starts, it gathers a snapshot of the sandbox environment (working directory, file listing, available languages/tools, package managers, memory) and injects it into the initial prompt. This saves the 2-5 early exploration turns that the agent normally spends running commands such as ls and which python3.
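
A minimal sketch of what such a bootstrap step might look like, using only the Python standard library. The function names, the list of probed tools, and the prompt layout are assumptions made for illustration, not Meta-Harness's actual implementation; memory is left out because the standard library has no portable call for it.

```python
import os
import platform
import shutil


def snapshot_environment(max_entries: int = 20) -> dict:
    """Collect a small, read-only snapshot of the current sandbox."""
    cwd = os.getcwd()
    files = sorted(os.listdir(cwd))[:max_entries]
    # Hypothetical probe list; a real harness would tune this per benchmark.
    candidates = ["python3", "pip", "node", "npm", "gcc", "go", "cargo", "apt-get"]
    found = {name: shutil.which(name) for name in candidates}
    return {
        "cwd": cwd,
        "files": files,
        "platform": platform.platform(),
        "tools": {name: path for name, path in found.items() if path},
    }


def render_prompt_preamble(snapshot: dict) -> str:
    """Format the snapshot as text to prepend to the agent's first prompt."""
    return "\n".join([
        f"Working directory: {snapshot['cwd']}",
        f"Platform: {snapshot['platform']}",
        "Files: " + (", ".join(snapshot["files"]) or "(empty)"),
        "Tools on PATH: " + (", ".join(sorted(snapshot["tools"])) or "(none found)"),
    ])


if __name__ == "__main__":
    print(render_prompt_preamble(snapshot_environment()))
```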

The agent was discovered through automated harness evolution. More details coming soon.

https://github.com/stanford-iris-lab/meta-harness-tbench2-artifact

Terminus-KIRA

Terminus-KIRA is an agent harness for Terminal-Bench, built on top of Terminus-2. It boosts frontier model performance on Terminal-Bench through a set of minimal but effective harness-level improvements: native tool calling, multimodal support, execution optimization, and smarter completion verification.

https://github.com/krafton-ai/KIRA

Cersei

The Rust SDK for building coding agents. Tool execution, LLM streaming, graph memory, sub-agent orchestration, MCP — as composable library functions.

https://cersei.pacifio.dev/docs

Abstract CLI

Abstract is a complete CLI coding agent built on the Cersei SDK. One binary, zero runtime dependencies, graph memory by default.

cargo install --git https://github.com/pacifio/cersei abstract-cli

autoresearch

The idea: give an AI agent a small but real LLM training setup and let it experiment autonomously overnight. It modifies the code, trains for 5 minutes, checks if the result improved, keeps or discards, and repeats. You wake up in the morning to a log of experiments and (hopefully) a better model. The training code here is a simplified single-GPU implementation of nanochat.

The core idea is that you're not touching any of the Python files like you normally would as a researcher. Instead, you are programming the program.md Markdown files that provide context to the AI agents and set up your autonomous research org. The default program.md in this repo is intentionally kept as a bare-bones baseline, though it's obvious how one would iterate on it over time to find the "research org code" that achieves the fastest research progress, how you'd add more agents to the mix, etc. A bit more context on this project is here in this tweet and this tweet.
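
The modify, train, compare, keep-or-discard cycle described above is essentially a greedy search. The sketch below shows that control flow on a toy problem; `experiment_loop`, `propose`, and `evaluate` are hypothetical names, and the "training run" is replaced by a one-line loss function, so this is not how the autoresearch repo itself is structured.

```python
import random


def experiment_loop(baseline, propose, evaluate, rounds):
    """Greedy keep-or-discard search; lower scores are better (e.g. a loss)."""
    best, best_score = baseline, evaluate(baseline)
    log = []
    for step in range(rounds):
        candidate = propose(best)      # stand-in for "the agent modifies the code"
        score = evaluate(candidate)    # stand-in for "train for 5 minutes and measure"
        kept = score < best_score
        log.append({"step": step, "score": score, "kept": kept})
        if kept:                       # keep improvements, discard everything else
            best, best_score = candidate, score
    return best, best_score, log


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy setup: the "model" is a single number and the loss is its distance from 0.3.
    loss = lambda x: (x - 0.3) ** 2
    tweak = lambda x: x + rng.uniform(-0.1, 0.1)
    best, score, log = experiment_loop(1.0, tweak, loss, rounds=50)
    print(f"best={best:.3f} loss={score:.5f} kept={sum(e['kept'] for e in log)}")
```

The returned log plays the role of the morning report: one entry per experiment, with a flag for whether it was kept.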

https://github.com/karpathy/autoresearch

Similar ideas

https://github.com/krafton-ai/Prompt2Policy

Mux - Coding Agent Multiplexer

Mux is a desktop & browser application for parallel agentic development. It enables developers to plan and execute tasks with multiple AI agents on local or remote compute.

Features

  • Isolated workspaces with central view on git divergence
    • Local: run directly in your project directory
    • Worktree: git worktrees on your local machine
    • SSH: remote execution on a server over SSH
  • Multi-model (sonnet-4-*, grok-*, gpt-5-*, opus-4-*)
    • Ollama supported for local LLMs
    • OpenRouter supported for the long tail of LLMs
  • VS Code Extension: Jump into Mux workspaces directly from VS Code
  • Supporting UI and keybinds for efficiently managing a suite of agents
  • Rich markdown outputs (mermaid diagrams, LaTeX, etc.)

Mux has a custom agent loop, but much of the core UX is inspired by Claude Code. You'll find familiar features like Plan/Exec mode, vim inputs, and /compact, plus new ones like opportunistic compaction and mode prompts.

https://github.com/coder/mux

https://mux.coder.com/

VoltAgent

https://github.com/VoltAgent/voltagent

pochi

Pochi is an AI agent designed for software development. It operates within your IDE, using a toolkit of commands to execute complex tasks, from code generation to project-wide refactoring.

https://github.com/TabbyML/pochi

Tabby

Works with pochi.

Tabby is a self-hosted AI coding assistant, offering an open-source and on-premises alternative to GitHub Copilot. It boasts several key features:

  • Self-contained, with no need for a DBMS or cloud service.
  • OpenAPI interface, easy to integrate with existing infrastructure (e.g., Cloud IDE).
  • Supports consumer-grade GPUs.

https://github.com/TabbyML/tabby

pi-coding-agent

Pi is a minimal terminal coding harness. Adapt pi to your workflows, not the other way around. Extend it with TypeScript extensions, skills, prompt templates, and themes. Bundle them as pi packages and share via npm or git.

Pi ships with powerful defaults but skips features like sub-agents and plan mode. Ask pi to build what you want, or install a package that does it your way.

Four modes: interactive, print/JSON, RPC, and SDK. See clawdbot for a real-world integration.

Feynman

Reads papers, searches the web, writes drafts, runs experiments, and cites every claim. All locally on your computer.

Has skills that could probably be adapted to fit any agent.

goose

goose is your on-machine AI agent, capable of automating complex development tasks from start to finish. More than just code suggestions, goose can build entire projects from scratch, write and execute code, debug failures, orchestrate workflows, and interact with external APIs - autonomously.

https://github.com/block/goose

Claude Code

  • Original
  • Free Code: Maybe not exactly above water, but it is the result of the source map being accidentally shipped to npm.
  • Claw Code: The fastest repo in history to surpass 100K stars ⭐. Better harness tools that get real things done. Built in Rust using oh-my-codex.

OmoiOS

  • kivo360/OmoiOS: Open-source orchestration runtime that turns specs into PRs using parallel agent swarms in isolated sandboxes.

opencode

The open source coding agent.

Plandex

💻 Plandex is a terminal-based AI development tool that can plan and execute large coding tasks that span many steps and touch dozens of files. It can handle up to 2M tokens of context directly (~100k per file), and can index directories with 20M tokens or more using tree-sitter project maps.

🔬 A cumulative diff review sandbox keeps AI-generated changes separate from your project files until they are ready to go. Command execution is controlled so you can easily roll back and debug. Plandex helps you get the most out of AI without leaving behind a mess in your project.

🧠 Combine the best models from Anthropic, OpenAI, Google, and open source providers to build entire features and apps with a robust terminal-based workflow.

🚀 Plandex is capable of full autonomy—it can load relevant files, plan and implement changes, execute commands, and automatically debug—but it's also highly flexible and configurable, giving developers fine-grained control and a step-by-step review process when needed.

💪 Plandex is designed to be resilient to large projects and files. If you've found that other tools struggle once your project gets past a certain size or the changes are too complex, give Plandex a shot.

Live-SWE-agent

Live-SWE-agent is the first live, runtime self-evolving software engineering agent that expands and revises its own capabilities on the fly while working on a real-world issue. Our key insight is that software agents are themselves software systems, and modern LLM-based agents already possess the intrinsic capability to extend or modify their own behavior at runtime.

https://github.com/OpenAutoCoder/live-swe-agent

Agentless

Agentless is an agentless approach to automatically solve software development problems. To solve each issue, Agentless follows a simple three phase process: localization, repair, and patch validation.

  • 🙀 Localization: Agentless employs a hierarchical process to first localize the fault to specific files, then to relevant classes or functions, and finally to fine-grained edit locations
  • 😼 Repair: Agentless takes the edit locations and samples multiple candidate patches per bug in a simple diff format
  • 😸 Patch Validation: Agentless selects the regression tests to run and generates an additional reproduction test to reproduce the original error. Using the test results, Agentless re-ranks all remaining patches to select one to submit
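
The three phases above form a fixed pipeline rather than an agent loop. The sketch below shows only that shape: `solve_issue` and its three callables (`localize`, `sample_patches`, `count_passing`) are hypothetical placeholders rather than Agentless's real interfaces, and ranking purely by the number of passing tests is a simplifying assumption.

```python
def solve_issue(issue, repo, localize, sample_patches, count_passing, n_samples=4):
    """Run localization, repair, and patch validation once, in that order.

    All three callables are supplied by the caller and are purely illustrative.
    """
    # Phase 1: localization narrows the repo down to candidate edit locations.
    locations = localize(issue, repo)
    # Phase 2: repair samples several candidate patches for those locations.
    candidates = sample_patches(issue, locations, n_samples)
    if not candidates:
        return None
    # Phase 3: validation re-ranks candidates by how many tests they pass;
    # on a tie, max() keeps the candidate that was sampled first.
    return max(candidates, key=count_passing)
```

Because every phase runs exactly once, the cost per issue is bounded up front, which is the main contrast with the self-evolving agents listed above.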

https://github.com/OpenAutoCoder/Agentless

aider

The OG of terminal agents.

  • aider: AI pair programming in your terminal

Skills

Repos that could be adapted to work with all agents

Other Tools

  • wingthing: Sandboxed AI agents, reachable from anywhere
  • rtk: CLI proxy that reduces LLM token consumption by 60-90% on common dev commands. Single Rust binary, zero dependencies