This architecture defines an autonomous software development system that can take bounded engineering work from request intake to reviewed pull request while balancing two operating modes:
- Right Hand / Exploitation: reliable production execution using known workflows, strict gates, and minimal-risk changes.
- Left Hand / Exploration: sandboxed discovery of better solutions, tools, prompts, tests, and architectural alternatives.
The system should behave like a disciplined engineering team: inspect the code, plan the work, make scoped changes, validate them, explain the result, and escalate when uncertainty or risk is high.
- Pull requests are the unit of delivery. Autonomous agents do not merge directly to protected branches.
- Production and experimentation are separated. Exploratory agents may propose patches, but production agents normalize and submit final changes.
- Evidence beats confidence. Tests, static analysis, benchmarks, diffs, and logs are preferred over agent assertions.
- Human control remains explicit. High-risk changes require approval before execution, PR creation, merge, or deployment.
- Learning is versioned. Prompts, playbooks, tools, and routing policies are tracked as operational artifacts.
- Autonomy expands only after measurement. More autonomous behavior is enabled only when success rates, review churn, and defect rates justify it.
```mermaid
flowchart TD
A[Task Sources<br/>Issue, Ticket, Chat, API] --> B[Intake Layer]
B --> C[Meta-Orchestrator]
C --> D[Repository Context Service]
C --> E[Right Hand Production Track]
C --> F[Left Hand Exploration Track]
D --> E
D --> F
E --> G[Validation and CI Layer]
F --> H[Experiment Evaluation Layer]
H --> I[Synthesis Agent]
I --> E
G --> J[Review Layer]
J --> K[Pull Request Generator]
K --> L[Human Review and Merge]
G --> M[State, Logs, Metrics, Memory]
H --> M
J --> M
M --> C
```
The intake layer accepts work from GitHub issues, ticketing systems, chat, or API calls and converts requests into a structured task.
Each task includes:
- Objective
- Repository and target branch
- Acceptance criteria
- Constraints and out-of-scope areas
- Risk level
- Required approvals
- Validation expectations
The intake layer rejects or escalates tasks that are vague, unsafe, too broad, or missing required context.
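The structured task might be modeled as a simple record; the field names below mirror the list above but are illustrative, not a fixed schema:

```python
from dataclasses import dataclass, field

# Illustrative task record; field names are assumptions, not a fixed schema.
@dataclass
class Task:
    objective: str
    repository: str
    target_branch: str
    acceptance_criteria: list[str]
    out_of_scope: list[str] = field(default_factory=list)
    risk_level: str = "low"  # low | medium | high
    required_approvals: list[str] = field(default_factory=list)
    validation_commands: list[str] = field(default_factory=list)

    def is_actionable(self) -> bool:
        """Intake rejects or escalates tasks missing an objective,
        a target branch, or acceptance criteria."""
        return bool(self.objective and self.target_branch
                    and self.acceptance_criteria)
```

A task that fails `is_actionable` never reaches the orchestrator; it is bounced back to the requester with the missing fields named.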
The meta-orchestrator decides how work should be handled.
Responsibilities:
- Classify task type: bug fix, feature, refactor, test work, CI repair, dependency update, security change, migration, documentation.
- Score task risk and novelty.
- Route routine work to the Right Hand production track.
- Route ambiguous, novel, repeatedly failing, or high-impact work to the Left Hand exploration track.
- Set budgets for time, tokens, file changes, retries, and tool calls.
- Enforce approval checkpoints.
- Promote validated exploratory strategies into the production playbook.
Routing signals that should trigger exploration:
- Missing or unstable acceptance criteria
- Repeated test failures after standard fixes
- Low confidence from production agents
- Large unexplained diffs
- Circular edit behavior
- Security-sensitive files touched
- Dependency or API uncertainty
- Changes to high-churn or high-blast-radius modules
- Contradictory agent conclusions
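One way to implement this routing is to treat each signal as a boolean flag and divert on any match or on a high risk score. The flag names and threshold below are hypothetical:

```python
# Hypothetical routing sketch: any exploration signal, or a risk score
# above the threshold, diverts the task to the Left Hand track.
EXPLORATION_SIGNALS = {
    "unstable_acceptance_criteria",
    "repeated_test_failures",
    "low_agent_confidence",
    "large_unexplained_diff",
    "circular_edits",
    "security_sensitive_files",
    "dependency_uncertainty",
    "high_blast_radius_module",
    "contradictory_conclusions",
}

def route(signals: set[str], risk_score: float,
          risk_threshold: float = 0.7) -> str:
    """Return which track should handle the task."""
    if signals & EXPLORATION_SIGNALS or risk_score >= risk_threshold:
        return "left_hand_exploration"
    return "right_hand_production"
```

The point of keeping routing this mechanical is auditability: every diversion to exploration can be traced to a named signal or score.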
The repository context service gives agents grounded, current codebase context.
Inputs:
- Source files
- Dependency manifests
- Test structure
- Documentation
- CI configuration
- Recent commits
- Open issues and pull requests
- Known flaky tests and repository conventions
Preferred tools:
- Text search with rg
- AST and tree-sitter parsing
- Language servers
- Test discovery
- Static analysis
- Dependency graph inspection
LLMs may summarize context, but they should not replace direct source inspection.
The Right Hand track handles known engineering workflows with low variance.
Primary agents:
- Planner Agent: creates a scoped implementation plan, expected files, validation commands, and rollback notes.
- Implementation Agent: edits code in an isolated workspace while following local conventions.
- Test Agent: adds or updates focused tests and selects the smallest useful validation set.
- CI Agent: runs formatters, linters, type checks, tests, builds, and scans.
- Review Agent: audits the diff for correctness, regressions, scope creep, security risk, and missing tests.
- PR Agent: creates a draft or ready pull request with summary, test results, risks, and linked task.
The Right Hand path is optimized for:
- Small bug fixes
- Focused features
- Test additions
- Documentation updates
- CI repair
- Low-risk dependency updates
The Left Hand track explores alternatives in isolated sandboxes. Its output is advisory until promoted.
Primary agents:
- Architect Explorer: proposes alternative designs and decomposition strategies.
- Patch Explorer: creates candidate implementations in sandbox branches or worktrees.
- Eval Designer: writes task-specific tests, edge cases, and adversarial checks.
- Failure Analyst: searches for security flaws, rollback risks, and hidden regressions.
- Tool Scout: evaluates new libraries, static analyzers, test tools, and code intelligence tools.
- Synthesis Agent: compresses exploratory output into production-ready recommendations.
Exploration techniques:
- Competing implementation branches
- Test-first experiments
- Security-first review passes
- Performance-focused variants
- Dependency-minimizing variants
- Shadow execution against historical bugs or recorded traces
- Differential testing between candidate solutions
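Differential testing between candidates can be as simple as running two implementations over the same inputs and flagging divergence; the helper below is a sketch under that assumption:

```python
from typing import Any, Callable, Iterable

def differential_test(candidate_a: Callable, candidate_b: Callable,
                      inputs: Iterable[Any]) -> list[Any]:
    """Return inputs on which the two candidate implementations disagree.

    A raised exception is treated as a distinct outcome, so an exception
    on one side but not the other also counts as divergence.
    """
    divergent = []
    for x in inputs:
        try:
            a = ("ok", candidate_a(x))
        except Exception as e:
            a = ("err", type(e).__name__)
        try:
            b = ("ok", candidate_b(x))
        except Exception as e:
            b = ("err", type(e).__name__)
        if a != b:
            divergent.append(x)
    return divergent
```

Divergent inputs are exactly the evidence the evaluation layer wants: concrete cases where candidate behavior differs, independent of any agent's claims.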
The evaluation layer compares candidate plans and patches using evidence.
Evaluation types:
- Unit and integration tests
- Regression tests
- Golden-path and edge-case checks
- Mutation testing
- Static analysis
- Type checks
- Security scans
- Dependency vulnerability scans
- Performance benchmarks
- Contract tests
- Diff-size and complexity scoring
- Human-review friction scoring
For long-term quality, the system should maintain a bug museum: historical production defects, flaky areas, incident-triggering inputs, and regression cases. New strategies must pass relevant historical failures before promotion.
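The bug museum gate might look like the following: a strategy is promotable only if it passes every historical case tagged with its task class. The record shape is illustrative:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class MuseumCase:
    """One historical defect: a recorded input that once triggered a bug."""
    task_class: str  # e.g. "bugfix", "dependency_update"
    description: str
    check: Callable[[], bool]  # True if the strategy handles this case

def passes_bug_museum(strategy_task_class: str,
                      museum: list[MuseumCase]) -> bool:
    """A strategy must clear every museum case in its own task class."""
    relevant = [c for c in museum if c.task_class == strategy_task_class]
    return all(c.check() for c in relevant)
```

Note the gate is vacuously true for a task class with no recorded cases, which is itself a signal that the museum needs seeding before that class is trusted.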
Agents access tools through a controlled tool broker.
Tool categories:
- Filesystem and workspace tools
- Git and hosted Git provider APIs
- Package managers
- Test runners
- Linters and formatters
- Type checkers
- Build systems
- Static analysis tools
- Security and secret scanners
- Artifact storage
- Secrets vault
- Observability and log search
Every tool call is logged with:
- Agent identity
- Command or API call
- Working directory
- Inputs and sanitized outputs
- Exit status
- Timestamp
- Related task and workspace
The system stores three kinds of state.
Task state:
- Current plan
- Agent assignments
- Files touched
- Test results
- Errors
- Decisions
- Open risks
Repository memory:
- Setup commands
- Project conventions
- Known flaky tests
- Common failure patterns
- Useful validation commands
- Ownership and review rules
Agent playbook registry:
- Prompt versions
- Workflow versions
- Tool configurations
- Routing policies
- Eval recipes
- Promotion history
- Retirement history
Long-term memory must be reviewable, editable, and periodically pruned to avoid stale assumptions.
- A user, issue, ticket, or API submits a task.
- Intake normalizes the task and checks for missing information.
- The meta-orchestrator classifies task type, risk, and novelty.
- The repository context service gathers relevant code, tests, docs, and history.
- Routine work goes to the Right Hand production track.
- Ambiguous or high-uncertainty work also triggers Left Hand exploration.
- Production agents produce a minimal implementation and tests.
- Exploration agents produce candidate alternatives, evals, and risk notes.
- The evaluation layer compares evidence from tests, scans, benchmarks, and reviews.
- The synthesis agent recommends whether any exploratory result should influence the production patch.
- The production track creates the final normalized diff.
- CI and review agents validate the final diff.
- The PR agent opens a pull request with summary, tests, risks, and logs.
- Humans review, request changes, approve, or merge through the normal repository workflow.
- Outcomes update metrics, memory, and playbook versions.
Required safeguards:
- Isolated workspace per task
- No direct commits to protected branches
- No autonomous production deployment by default
- Explicit approval for database migrations, auth, billing, data deletion, infrastructure permissions, and secrets changes
- Command allowlists for autonomous execution
- Secret redaction in prompts, logs, and artifacts
- Runtime, token, retry, and diff-size budgets
- Escalation after repeated ambiguous failures
- Mandatory tests for behavior changes
- Full audit trail of plans, commands, diffs, validations, and decisions
- Rollback plan for migrations, infrastructure, and high-blast-radius changes
Circuit breakers:
- Stop agents that repeatedly edit the same files without improving eval results.
- Pause tasks with contradictory high-confidence claims.
- Escalate large diffs that exceed the planned scope.
- Require approval when generated tests are weak or mostly assert implementation details.
- Block PR creation when validation is skipped without a documented reason.
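The first breaker, stopping agents that re-edit the same files without improving eval results, might be tracked like this (the threshold is a hypothetical default):

```python
from collections import defaultdict

class EditCircuitBreaker:
    """Trip when the same file set is re-edited N times with no eval gain."""

    def __init__(self, max_stale_edits: int = 3):
        self.max_stale_edits = max_stale_edits
        self.stale_counts: dict[frozenset, int] = defaultdict(int)
        self.best_score = float("-inf")

    def record(self, files_touched: set[str], eval_score: float) -> bool:
        """Record one edit round; return True if the agent should stop."""
        key = frozenset(files_touched)
        if eval_score > self.best_score:
            self.best_score = eval_score
            self.stale_counts[key] = 0  # progress resets the counter
            return False
        self.stale_counts[key] += 1
        return self.stale_counts[key] >= self.max_stale_edits
```

A tripped breaker is exactly the "circular edit behavior" routing signal from earlier: the task escalates to exploration or to a human instead of burning budget.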
Use separate production and exploration tracks.
- Production agents submit final PRs.
- Exploration agents work in sandboxes.
- Synthesis agents translate validated exploration into production guidance.
Individual agents can switch mode when confidence drops.
Examples:
- A production agent hits repeated test failures and asks for exploratory diagnosis.
- A review agent detects a security-sensitive diff and triggers failure analysis.
- A CI agent sees flaky behavior and requests test intelligence support.
Use periodic research cycles.
Examples:
- Weekly review of failed tasks and slow tasks.
- Monthly promotion or retirement of playbook strategies.
- Quarterly evaluation of new tools and model workflows.
Recommended implementation stack:
- Orchestrator: Python or TypeScript service
- Agent runtime: queue-based workers
- Workspace isolation: containers, ephemeral VMs, or isolated worktrees
- State store: Postgres
- Artifact store: object storage for logs, patches, screenshots, and reports
- Code intelligence: rg, tree-sitter, language servers, dependency graph tools
- Git provider integration: GitHub or GitLab API plus local git
- Validation: repository-native CI commands and package scripts
- Dashboard: task status, plans, logs, diffs, approvals, metrics
- Secrets: managed vault with scoped temporary credentials
- Observability: structured logs, traces, task metrics, agent metrics
Start with a conservative GitHub pull request agent for low-risk tasks.
MVP agents:
- Production Planner
- Production Implementer
- Test Agent
- Review Agent
- Exploration Challenger
- Eval Designer
- PR Agent
MVP capabilities:
- Accept a GitHub issue or manual prompt.
- Create an isolated branch or worktree.
- Inspect the repository and produce a short plan.
- Make scoped code changes for low-risk tasks.
- Add or update focused tests.
- Run configured validation commands.
- Allow one exploratory challenger to propose risks or alternatives.
- Open a draft PR with summary, tests, risks, and logs.
- Require human review before merge.
MVP exclusions:
- No automatic merge
- No autonomous deploy
- No database migrations without approval
- No broad rewrites
- No secret management changes
- No multi-repository changes without approval
- No automatic promotion of exploratory strategies
Delivery metrics:
- Task completion rate
- Time from task intake to PR
- CI pass rate on first PR submission
- Review cycle count
- Human intervention rate
Quality metrics:
- Escaped defect rate
- Rollback rate
- Reopened issue rate
- Test coverage on changed behavior
- Static analysis and security finding rate
Exploration metrics:
- Percentage of tasks where exploration improved outcome
- False promotion rate
- Cost per useful exploratory finding
- Strategy win rate by task class
- Time saved after playbook promotion
Governance metrics:
- Approval policy violations
- Secret exposure incidents
- Sandbox escape incidents
- Unreviewed memory updates
- Budget overruns
Exploratory strategies can move into the production playbook only when they show repeated value.
Promotion criteria:
- Passes relevant automated evals
- Improves correctness, speed, test quality, review quality, or risk detection
- Works across multiple tasks of the same class
- Does not increase defect rate or review burden
- Has a rollback path
- Is versioned with clear ownership
Retirement criteria:
- Strategy becomes slower than baseline
- Produces stale or misleading recommendations
- Increases human review burden
- Fails against bug museum cases
- Depends on deprecated tools or APIs
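Promotion can be gated mechanically: require every criterion, and count wins across multiple tasks of the same class. The record fields and minimum-win count below are illustrative:

```python
from dataclasses import dataclass

@dataclass
class StrategyRecord:
    passes_evals: bool
    wins_by_task: list[bool]   # one outcome per task of the same class
    defect_rate_delta: float   # vs. baseline; <= 0 means no regression
    review_burden_delta: float # vs. baseline; <= 0 means no extra burden
    has_rollback_path: bool
    versioned_with_owner: bool

def should_promote(r: StrategyRecord, min_wins: int = 3) -> bool:
    """Every criterion must hold; repeated value means >= min_wins tasks."""
    return (r.passes_evals
            and sum(r.wins_by_task) >= min_wins
            and r.defect_rate_delta <= 0
            and r.review_burden_delta <= 0
            and r.has_rollback_path
            and r.versioned_with_owner)
```

Making the gate a pure function of recorded evidence keeps promotion out of any single agent's hands and leaves a reviewable trail for each decision.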
The mature system is a controlled autonomous development organization:
- The Right Hand reliably delivers scoped pull requests.
- The Left Hand continuously searches for better methods.
- The meta-orchestrator routes work based on risk, uncertainty, and evidence.
- Evaluation gates prevent novelty from bypassing engineering discipline.
- Human engineers retain approval authority over high-impact changes.
- The system improves through measured, versioned promotion of successful strategies.