This architecture defines an autonomous software development system that can take bounded engineering work from request intake to reviewed pull request while balancing two operating modes:
- Right Hand / Exploitation: reliable production execution using known workflows, strict gates, and minimal-risk changes.
- Left Hand / Exploration: sandboxed discovery of better solutions, tools, prompts, tests, and architectural alternatives.
The system should behave like a disciplined engineering team: inspect the code, plan the work, make scoped changes, validate them, explain the result, and escalate when uncertainty or risk is high.
- Pull requests are the unit of delivery. Autonomous agents do not merge directly to protected branches.
- Production and experimentation are separated. Exploratory agents may propose patches, but production agents normalize and submit final changes.
- Evidence beats confidence. Tests, static analysis, benchmarks, diffs, and logs are preferred over agent assertions.
- Human control remains explicit. High-risk changes require approval before execution, PR creation, merge, or deployment.
- Learning is versioned. Prompts, playbooks, tools, and routing policies are tracked as operational artifacts.
- Autonomy expands only after measurement. More autonomous behavior is enabled only when success rates, review churn, and defect rates justify it.
```mermaid
flowchart TD
A[Task Sources<br/>Issue, Ticket, Chat, API] --> B[Intake Layer]
B --> C[Meta-Orchestrator]
C --> D[Repository Context Service]
C --> E[Right Hand Production Track]
C --> F[Left Hand Exploration Track]
D --> E
D --> F
E --> G[Validation and CI Layer]
F --> H[Experiment Evaluation Layer]
H --> I[Synthesis Agent]
I --> E
G --> J[Review Layer]
J --> K[Pull Request Generator]
K --> L[Human Review and Merge]
G --> M[State, Logs, Metrics, Memory]
H --> M
J --> M
M --> C
```
The intake layer accepts work from GitHub issues, ticketing systems, chat, or API calls and converts requests into a structured task.
Each task includes:
- Objective
- Repository and target branch
- Acceptance criteria
- Constraints and out-of-scope areas
- Risk level
- Required approvals
- Validation expectations
The intake layer rejects or escalates tasks that are vague, unsafe, too broad, or missing required context.
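The structured task might be modeled as a simple record; the field names below mirror the list above but are illustrative, not a fixed schema:

```python
from dataclasses import dataclass, field

# Illustrative task record; field names are assumptions, not a fixed schema.
@dataclass
class Task:
    objective: str
    repository: str
    target_branch: str
    acceptance_criteria: list[str]
    out_of_scope: list[str] = field(default_factory=list)
    risk_level: str = "low"  # low | medium | high
    required_approvals: list[str] = field(default_factory=list)
    validation_commands: list[str] = field(default_factory=list)

    def is_actionable(self) -> bool:
        """Intake rejects or escalates tasks missing an objective,
        a target branch, or acceptance criteria."""
        return bool(self.objective and self.target_branch
                    and self.acceptance_criteria)
```

A task that fails `is_actionable` never reaches the orchestrator; it is bounced back to the requester with the missing fields named.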
The meta-orchestrator decides how work should be handled.
Responsibilities:
- Classify task type: bug fix, feature, refactor, test work, CI repair, dependency update, security change, migration, documentation.
- Score task risk and novelty.
- Route routine work to the Right Hand production track.
- Route ambiguous, novel, repeatedly failing, or high-impact work to the Left Hand exploration track.
- Set budgets for time, tokens, file changes, retries, and tool calls.
- Enforce approval checkpoints.
- Promote validated exploratory strategies into the production playbook.
Routing signals that should trigger exploration:
- Missing or unstable acceptance criteria
- Repeated test failures after standard fixes
- Low confidence from production agents
- Large unexplained diffs
- Circular edit behavior
- Security-sensitive files touched
- Dependency or API uncertainty
- Changes to high-churn or high-blast-radius modules
- Contradictory agent conclusions
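One way to implement this routing is to treat each signal as a boolean flag and divert on any match or on a high risk score. The flag names and threshold below are hypothetical:

```python
# Hypothetical routing sketch: any exploration signal, or a risk score
# above the threshold, diverts the task to the Left Hand track.
EXPLORATION_SIGNALS = {
    "unstable_acceptance_criteria",
    "repeated_test_failures",
    "low_agent_confidence",
    "large_unexplained_diff",
    "circular_edits",
    "security_sensitive_files",
    "dependency_uncertainty",
    "high_blast_radius_module",
    "contradictory_conclusions",
}

def route(signals: set[str], risk_score: float,
          risk_threshold: float = 0.7) -> str:
    """Return which track should handle the task."""
    if signals & EXPLORATION_SIGNALS or risk_score >= risk_threshold:
        return "left_hand_exploration"
    return "right_hand_production"
```

The point of keeping routing this mechanical is auditability: every diversion to exploration can be traced to a named signal or score.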
The repository context service gives agents grounded, current codebase context.
Inputs:
- Source files
- Dependency manifests
- Test structure
- Documentation
- CI configuration
- Recent commits
- Open issues and pull requests
- Known flaky tests and repository conventions
Preferred tools:
- Text search with rg
- AST and tree-sitter parsing
- Language servers
- Test discovery
- Static analysis
- Dependency graph inspection
LLMs may summarize context, but they should not replace direct source inspection.
The Right Hand track handles known engineering workflows with low variance.
Primary agents:
- Planner Agent: creates a scoped implementation plan, expected files, validation commands, and rollback notes.
- Implementation Agent: edits code in an isolated workspace while following local conventions.
- Test Agent: adds or updates focused tests and selects the smallest useful validation set.
- CI Agent: runs formatters, linters, type checks, tests, builds, and scans.
- Review Agent: audits the diff for correctness, regressions, scope creep, security risk, and missing tests.
- PR Agent: creates a draft or ready pull request with summary, test results, risks, and linked task.
The Right Hand path is optimized for:
- Small bug fixes
- Focused features
- Test additions
- Documentation updates
- CI repair
- Low-risk dependency updates
The Left Hand track explores alternatives in isolated sandboxes. Its output is advisory until promoted.
Primary agents:
- Architect Explorer: proposes alternative designs and decomposition strategies.
- Patch Explorer: creates candidate implementations in sandbox branches or worktrees.
- Eval Designer: writes task-specific tests, edge cases, and adversarial checks.
- Failure Analyst: searches for security flaws, rollback risks, and hidden regressions.
- Tool Scout: evaluates new libraries, static analyzers, test tools, and code intelligence tools.
- Synthesis Agent: compresses exploratory output into production-ready recommendations.
Exploration techniques:
- Competing implementation branches
- Test-first experiments
- Security-first review passes
- Performance-focused variants
- Dependency-minimizing variants
- Shadow execution against historical bugs or recorded traces
- Differential testing between candidate solutions
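Differential testing between candidates can be as simple as running two implementations over the same inputs and flagging divergence; the helper below is a sketch under that assumption:

```python
from typing import Any, Callable, Iterable

def differential_test(candidate_a: Callable, candidate_b: Callable,
                      inputs: Iterable[Any]) -> list[Any]:
    """Return inputs on which the two candidate implementations disagree.

    A raised exception is treated as a distinct outcome, so an exception
    on one side but not the other also counts as divergence.
    """
    divergent = []
    for x in inputs:
        try:
            a = ("ok", candidate_a(x))
        except Exception as e:
            a = ("err", type(e).__name__)
        try:
            b = ("ok", candidate_b(x))
        except Exception as e:
            b = ("err", type(e).__name__)
        if a != b:
            divergent.append(x)
    return divergent
```

Divergent inputs are exactly the evidence the evaluation layer wants: concrete cases where candidate behavior differs, independent of any agent's claims.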
The evaluation layer compares candidate plans and patches using evidence.
Evaluation types:
- Unit and integration tests
- Regression tests
- Golden-path and edge-case checks
- Mutation testing
- Static analysis
- Type checks
- Security scans
- Dependency vulnerability scans
- Performance benchmarks
- Contract tests
- Diff-size and complexity scoring
- Human-review friction scoring
For long-term quality, the system should maintain a bug museum: historical production defects, flaky areas, incident-triggering inputs, and regression cases. New strategies must pass relevant historical failures before promotion.
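The bug museum gate might look like the following: a strategy is promotable only if it passes every historical case tagged with its task class. The record shape is illustrative:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class MuseumCase:
    """One historical defect: a recorded input that once triggered a bug."""
    task_class: str  # e.g. "bugfix", "dependency_update"
    description: str
    check: Callable[[], bool]  # True if the strategy handles this case

def passes_bug_museum(strategy_task_class: str,
                      museum: list[MuseumCase]) -> bool:
    """A strategy must clear every museum case in its own task class."""
    relevant = [c for c in museum if c.task_class == strategy_task_class]
    return all(c.check() for c in relevant)
```

Note the gate is vacuously true for a task class with no recorded cases, which is itself a signal that the museum needs seeding before that class is trusted.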
Agents access tools through a controlled tool broker.
Tool categories:
- Filesystem and workspace tools
- Git and hosted Git provider APIs
- Package managers
- Test runners
- Linters and formatters
- Type checkers
- Build systems
- Static analysis tools
- Security and secret scanners
- Artifact storage
- Secrets vault
- Observability and log search
Every tool call is logged with:
- Agent identity
- Command or API call
- Working directory
- Inputs and sanitized outputs
- Exit status
- Timestamp
- Related task and workspace
The system stores three kinds of state.
Task state:
- Current plan
- Agent assignments
- Files touched
- Test results
- Errors
- Decisions
- Open risks
Repository memory:
- Setup commands
- Project conventions
- Known flaky tests
- Common failure patterns
- Useful validation commands
- Ownership and review rules
Agent playbook registry:
- Prompt versions
- Workflow versions
- Tool configurations
- Routing policies
- Eval recipes
- Promotion history
- Retirement history
Long-term memory must be reviewable, editable, and periodically pruned to avoid stale assumptions.
- A user, issue, ticket, or API submits a task.
- Intake normalizes the task and checks for missing information.
- The meta-orchestrator classifies task type, risk, and novelty.
- The repository context service gathers relevant code, tests, docs, and history.
- Routine work goes to the Right Hand production track.
- Ambiguous or high-uncertainty work also triggers Left Hand exploration.
- Production agents produce a minimal implementation and tests.
- Exploration agents produce candidate alternatives, evals, and risk notes.
- The evaluation layer compares evidence from tests, scans, benchmarks, and reviews.
- The synthesis agent recommends whether any exploratory result should influence the production patch.
- The production track creates the final normalized diff.
- CI and review agents validate the final diff.
- The PR agent opens a pull request with summary, tests, risks, and logs.
- Humans review, request changes, approve, or merge through the normal repository workflow.
- Outcomes update metrics, memory, and playbook versions.
Required safeguards:
- Isolated workspace per task
- No direct commits to protected branches
- No autonomous production deployment by default
- Explicit approval for database migrations, auth, billing, data deletion, infrastructure permissions, and secrets changes
- Command allowlists for autonomous execution
- Secret redaction in prompts, logs, and artifacts
- Runtime, token, retry, and diff-size budgets
- Escalation after repeated ambiguous failures
- Mandatory tests for behavior changes
- Full audit trail of plans, commands, diffs, validations, and decisions
- Rollback plan for migrations, infrastructure, and high-blast-radius changes
Circuit breakers:
- Stop agents that repeatedly edit the same files without improving eval results.
- Pause tasks with contradictory high-confidence claims.
- Escalate large diffs that exceed the planned scope.
- Require approval when generated tests are weak or mostly assert implementation details.
- Block PR creation when validation is skipped without a documented reason.
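The first breaker, stopping agents that re-edit the same files without improving eval results, might be tracked like this (the threshold is a hypothetical default):

```python
from collections import defaultdict

class EditCircuitBreaker:
    """Trip when the same file set is re-edited N times with no eval gain."""

    def __init__(self, max_stale_edits: int = 3):
        self.max_stale_edits = max_stale_edits
        self.stale_counts: dict[frozenset, int] = defaultdict(int)
        self.best_score = float("-inf")

    def record(self, files_touched: set[str], eval_score: float) -> bool:
        """Record one edit round; return True if the agent should stop."""
        key = frozenset(files_touched)
        if eval_score > self.best_score:
            self.best_score = eval_score
            self.stale_counts[key] = 0  # progress resets the counter
            return False
        self.stale_counts[key] += 1
        return self.stale_counts[key] >= self.max_stale_edits
```

A tripped breaker is exactly the "circular edit behavior" routing signal from earlier: the task escalates to exploration or to a human instead of burning budget.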
Use separate production and exploration tracks.
- Production agents submit final PRs.
- Exploration agents work in sandboxes.
- Synthesis agents translate validated exploration into production guidance.
Individual agents can switch mode when confidence drops.
Examples:
- A production agent hits repeated test failures and asks for exploratory diagnosis.
- A review agent detects a security-sensitive diff and triggers failure analysis.
- A CI agent sees flaky behavior and requests test intelligence support.
Use periodic research cycles.
Examples:
- Weekly review of failed tasks and slow tasks.
- Monthly promotion or retirement of playbook strategies.
- Quarterly evaluation of new tools and model workflows.
Recommended implementation stack:
- Orchestrator: Python or TypeScript service
- Agent runtime: queue-based workers
- Workspace isolation: containers, ephemeral VMs, or isolated worktrees
- State store: Postgres
- Artifact store: object storage for logs, patches, screenshots, and reports
- Code intelligence: rg, tree-sitter, language servers, dependency graph tools
- Git provider integration: GitHub or GitLab API plus local git
- Validation: repository-native CI commands and package scripts
- Dashboard: task status, plans, logs, diffs, approvals, metrics
- Secrets: managed vault with scoped temporary credentials
- Observability: structured logs, traces, task metrics, agent metrics
Start with a conservative GitHub pull request agent for low-risk tasks.
MVP agents:
- Production Planner
- Production Implementer
- Test Agent
- Review Agent
- Exploration Challenger
- Eval Designer
- PR Agent
MVP capabilities:
- Accept a GitHub issue or manual prompt.
- Create an isolated branch or worktree.
- Inspect the repository and produce a short plan.
- Make scoped code changes for low-risk tasks.
- Add or update focused tests.
- Run configured validation commands.
- Allow one exploratory challenger to propose risks or alternatives.
- Open a draft PR with summary, tests, risks, and logs.
- Require human review before merge.
MVP exclusions:
- No automatic merge
- No autonomous deploy
- No database migrations without approval
- No broad rewrites
- No secret management changes
- No multi-repository changes without approval
- No automatic promotion of exploratory strategies
Delivery metrics:
- Task completion rate
- Time from task intake to PR
- CI pass rate on first PR submission
- Review cycle count
- Human intervention rate
Quality metrics:
- Escaped defect rate
- Rollback rate
- Reopened issue rate
- Test coverage on changed behavior
- Static analysis and security finding rate
Exploration metrics:
- Percentage of tasks where exploration improved outcome
- False promotion rate
- Cost per useful exploratory finding
- Strategy win rate by task class
- Time saved after playbook promotion
Governance metrics:
- Approval policy violations
- Secret exposure incidents
- Sandbox escape incidents
- Unreviewed memory updates
- Budget overruns
Exploratory strategies can move into the production playbook only when they show repeated value.
Promotion criteria:
- Passes relevant automated evals
- Improves correctness, speed, test quality, review quality, or risk detection
- Works across multiple tasks of the same class
- Does not increase defect rate or review burden
- Has a rollback path
- Is versioned with clear ownership
Retirement criteria:
- Strategy becomes slower than baseline
- Produces stale or misleading recommendations
- Increases human review burden
- Fails against bug museum cases
- Depends on deprecated tools or APIs
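Promotion can be gated mechanically: require every criterion, and count wins across multiple tasks of the same class. The record fields and minimum-win count below are illustrative:

```python
from dataclasses import dataclass

@dataclass
class StrategyRecord:
    passes_evals: bool
    wins_by_task: list[bool]   # one outcome per task of the same class
    defect_rate_delta: float   # vs. baseline; <= 0 means no regression
    review_burden_delta: float # vs. baseline; <= 0 means no extra burden
    has_rollback_path: bool
    versioned_with_owner: bool

def should_promote(r: StrategyRecord, min_wins: int = 3) -> bool:
    """Every criterion must hold; repeated value means >= min_wins tasks."""
    return (r.passes_evals
            and sum(r.wins_by_task) >= min_wins
            and r.defect_rate_delta <= 0
            and r.review_burden_delta <= 0
            and r.has_rollback_path
            and r.versioned_with_owner)
```

Making the gate a pure function of recorded evidence keeps promotion out of any single agent's hands and leaves a reviewable trail for each decision.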
The mature system is a controlled autonomous development organization:
- The Right Hand reliably delivers scoped pull requests.
- The Left Hand continuously searches for better methods.
- The meta-orchestrator routes work based on risk, uncertainty, and evidence.
- Evaluation gates prevent novelty from bypassing engineering discipline.
- Human engineers retain approval authority over high-impact changes.
- The system improves through measured, versioned promotion of successful strategies.