@egorsmkv
Created April 24, 2026 07:02
Architecture for Autonomous Software Development with Ambidextrous AI Agents

Purpose

This architecture defines an autonomous software development system that can take bounded engineering work from request intake to reviewed pull request while balancing two operating modes:

  • Right Hand / Exploitation: reliable production execution using known workflows, strict gates, and minimal-risk changes.
  • Left Hand / Exploration: sandboxed discovery of better solutions, tools, prompts, tests, and architectural alternatives.

The system should behave like a disciplined engineering team: inspect the code, plan the work, make scoped changes, validate them, explain the result, and escalate when uncertainty or risk is high.

Design Principles

  1. Pull requests are the unit of delivery. Autonomous agents do not merge directly to protected branches.
  2. Production and experimentation are separated. Exploratory agents may propose patches, but production agents normalize and submit final changes.
  3. Evidence beats confidence. Tests, static analysis, benchmarks, diffs, and logs are preferred over agent assertions.
  4. Human control remains explicit. High-risk changes require approval before execution, PR creation, merge, or deployment.
  5. Learning is versioned. Prompts, playbooks, tools, and routing policies are tracked as operational artifacts.
  6. Autonomy expands only after measurement. More autonomous behavior is enabled only when success rates, review churn, and defect rates justify it.

High-Level System

```mermaid
flowchart TD
    A[Task Sources<br/>Issue, Ticket, Chat, API] --> B[Intake Layer]
    B --> C[Meta-Orchestrator]
    C --> D[Repository Context Service]
    C --> E[Right Hand Production Track]
    C --> F[Left Hand Exploration Track]
    D --> E
    D --> F
    E --> G[Validation and CI Layer]
    F --> H[Experiment Evaluation Layer]
    H --> I[Synthesis Agent]
    I --> E
    G --> J[Review Layer]
    J --> K[Pull Request Generator]
    K --> L[Human Review and Merge]
    G --> M[State, Logs, Metrics, Memory]
    H --> M
    J --> M
    M --> C
```

Core Components

1. Intake Layer

The intake layer accepts work from GitHub issues, ticketing systems, chat, or API calls and converts requests into a structured task.

Each task includes:

  • Objective
  • Repository and target branch
  • Acceptance criteria
  • Constraints and out-of-scope areas
  • Risk level
  • Required approvals
  • Validation expectations

The intake layer rejects or escalates tasks that are vague, unsafe, too broad, or missing required context.
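The task fields and the reject/escalate behavior above can be sketched as follows. This is a minimal illustrative schema, not a prescribed implementation; the field names and the `intake_check` thresholds are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    """Normalized task record produced by the intake layer (illustrative schema)."""
    objective: str
    repository: str
    target_branch: str
    acceptance_criteria: list[str]
    out_of_scope: list[str] = field(default_factory=list)
    risk_level: str = "low"  # "low" | "medium" | "high"
    required_approvals: list[str] = field(default_factory=list)
    validation_commands: list[str] = field(default_factory=list)

def intake_check(task: Task) -> str:
    """Return "accept", "escalate", or "reject" for an incoming task."""
    if not task.objective.strip():
        return "reject"        # no objective at all
    if not task.acceptance_criteria:
        return "escalate"      # vague: no way to verify completion
    if task.risk_level == "high" and not task.required_approvals:
        return "escalate"      # high-risk work must name an approver
    return "accept"
```

Escalation (rather than rejection) preserves the task for a human to refine, which matches the intake layer's role as a gatekeeper rather than a filter.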

2. Meta-Orchestrator

The meta-orchestrator decides how work should be handled.

Responsibilities:

  • Classify task type: bug fix, feature, refactor, test work, CI repair, dependency update, security change, migration, documentation.
  • Score task risk and novelty.
  • Route routine work to the Right Hand production track.
  • Route ambiguous, novel, repeatedly failing, or high-impact work to the Left Hand exploration track.
  • Set budgets for time, tokens, file changes, retries, and tool calls.
  • Enforce approval checkpoints.
  • Promote validated exploratory strategies into the production playbook.

Routing signals that should trigger exploration:

  • Missing or unstable acceptance criteria
  • Repeated test failures after standard fixes
  • Low confidence from production agents
  • Large unexplained diffs
  • Circular edit behavior
  • Security-sensitive files touched
  • Dependency or API uncertainty
  • Changes to high-churn or high-blast-radius modules
  • Contradictory agent conclusions
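A routing decision combining the signals above might look like the sketch below. The signal names, weights, and the 0.7 threshold are illustrative assumptions; a real policy would be tuned against measured outcomes.

```python
def route(signals: set[str], risk: float, novelty: float) -> str:
    """Route a task to the production or exploration track.

    `signals` holds names of triggered exploration signals, e.g.
    "repeated_test_failures" (names are illustrative, not a fixed vocabulary).
    `risk` and `novelty` are scores in [0, 1] from the meta-orchestrator.
    """
    EXPLORATION_SIGNALS = {
        "missing_acceptance_criteria", "repeated_test_failures", "low_confidence",
        "large_unexplained_diff", "circular_edits", "security_sensitive_files",
        "dependency_uncertainty", "high_blast_radius", "contradictory_conclusions",
    }
    # Any hard trigger, or a combined risk/novelty score above threshold,
    # sends the task through Left Hand exploration.
    if signals & EXPLORATION_SIGNALS or (0.6 * risk + 0.4 * novelty) > 0.7:
        return "left_hand_exploration"
    return "right_hand_production"
```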

3. Repository Context Service

The repository context service gives agents grounded, current codebase context.

Inputs:

  • Source files
  • Dependency manifests
  • Test structure
  • Documentation
  • CI configuration
  • Recent commits
  • Open issues and pull requests
  • Known flaky tests and repository conventions

Preferred tools:

  • Text search with rg
  • AST and tree-sitter parsing
  • Language servers
  • Test discovery
  • Static analysis
  • Dependency graph inspection

LLMs may summarize context, but they should not replace direct source inspection.
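The "direct source inspection" principle can be sketched as a plain search over the working tree. A real implementation would shell out to `rg --json` or a language server; this pure-Python version is an assumption-laden stand-in that keeps the sketch self-contained.

```python
from pathlib import Path

def gather_context(repo: Path, query: str, max_hits: int = 20) -> list[tuple[str, int, str]]:
    """Return (relative path, line number, line) hits for `query` in Python files.

    Grounded context comes from reading the actual source, never from a
    model's recollection of it.
    """
    hits: list[tuple[str, int, str]] = []
    for path in sorted(repo.rglob("*.py")):
        for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
            if query in line:
                hits.append((str(path.relative_to(repo)), lineno, line.strip()))
                if len(hits) >= max_hits:
                    return hits
    return hits
```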

4. Right Hand Production Track

The Right Hand track handles known engineering workflows with low variance.

Primary agents:

  • Planner Agent: creates a scoped implementation plan, expected files, validation commands, and rollback notes.
  • Implementation Agent: edits code in an isolated workspace while following local conventions.
  • Test Agent: adds or updates focused tests and selects the smallest useful validation set.
  • CI Agent: runs formatters, linters, type checks, tests, builds, and scans.
  • Review Agent: audits the diff for correctness, regressions, scope creep, security risk, and missing tests.
  • PR Agent: creates a draft or ready pull request with summary, test results, risks, and linked task.

The Right Hand path is optimized for:

  • Small bug fixes
  • Focused features
  • Test additions
  • Documentation updates
  • CI repair
  • Low-risk dependency updates

5. Left Hand Exploration Track

The Left Hand track explores alternatives in isolated sandboxes. Its output is advisory until promoted.

Primary agents:

  • Architect Explorer: proposes alternative designs and decomposition strategies.
  • Patch Explorer: creates candidate implementations in sandbox branches or worktrees.
  • Eval Designer: writes task-specific tests, edge cases, and adversarial checks.
  • Failure Analyst: searches for security flaws, rollback risks, and hidden regressions.
  • Tool Scout: evaluates new libraries, static analyzers, test tools, and code intelligence tools.
  • Synthesis Agent: compresses exploratory output into production-ready recommendations.

Exploration techniques:

  • Competing implementation branches
  • Test-first experiments
  • Security-first review passes
  • Performance-focused variants
  • Dependency-minimizing variants
  • Shadow execution against historical bugs or recorded traces
  • Differential testing between candidate solutions
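Differential testing between candidate solutions, the last technique above, reduces to running the same inputs through each candidate and flagging any input where outputs diverge. A minimal sketch, assuming candidates share a single-argument interface:

```python
from typing import Any, Callable

def differential_test(candidates: dict[str, Callable], inputs: list) -> dict[str, Any]:
    """Flag inputs on which candidate implementations disagree."""
    disagreements = []
    for x in inputs:
        outputs = {name: fn(x) for name, fn in candidates.items()}
        if len(set(outputs.values())) > 1:       # at least two distinct outputs
            disagreements.append((x, outputs))
    return {"inputs_checked": len(inputs), "disagreements": disagreements}
```

Disagreements do not say which candidate is wrong, only where the Eval Designer should concentrate adversarial checks.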

6. Evaluation Layer

The evaluation layer compares candidate plans and patches using evidence.

Evaluation types:

  • Unit and integration tests
  • Regression tests
  • Golden-path and edge-case checks
  • Mutation testing
  • Static analysis
  • Type checks
  • Security scans
  • Dependency vulnerability scans
  • Performance benchmarks
  • Contract tests
  • Diff-size and complexity scoring
  • Human-review friction scoring

For long-term quality, the system should maintain a bug museum: historical production defects, flaky areas, incident-triggering inputs, and regression cases. New strategies must pass relevant historical failures before promotion.
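Combining the evaluation types above into a single gate might look like the sketch below. The weights, threshold, and field names are illustrative assumptions; the hard gates (bug museum, security findings) reflect the principle that evidence beats confidence.

```python
def promote_candidate(evidence: dict) -> bool:
    """Gate a candidate patch on evidence, not on agent assertions."""
    # Hard gates: relevant bug-museum cases and security scans are non-negotiable.
    if not evidence.get("bug_museum_passed", False):
        return False
    if evidence.get("security_findings", 0) > 0:
        return False
    # Soft score: test results, diff size, and expected human-review friction,
    # each normalized to [0, 1]. Weights are illustrative.
    score = (
        0.5 * evidence.get("test_pass_rate", 0.0)
        + 0.3 * (1.0 - evidence.get("normalized_diff_size", 1.0))
        + 0.2 * (1.0 - evidence.get("review_friction", 1.0))
    )
    return score >= 0.6
```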

7. Tooling Layer

Agents access tools through a controlled tool broker.

Tool categories:

  • Filesystem and workspace tools
  • Git and hosted Git provider APIs
  • Package managers
  • Test runners
  • Linters and formatters
  • Type checkers
  • Build systems
  • Static analysis tools
  • Security and secret scanners
  • Artifact storage
  • Secrets vault
  • Observability and log search

Every tool call is logged with:

  • Agent identity
  • Command or API call
  • Working directory
  • Inputs and sanitized outputs
  • Exit status
  • Timestamp
  • Related task and workspace
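A tool broker enforcing the allowlist and the log fields above can be sketched as a thin wrapper around process execution. The allowlist contents and log structure are illustrative; output sanitization here is simple truncation, where a real broker would also redact secrets.

```python
import datetime
import shlex
import subprocess

ALLOWLIST = {"git", "rg", "pytest", "ls", "echo"}  # illustrative command allowlist

def brokered_call(agent: str, command: str, cwd: str, task_id: str, log: list) -> int:
    """Run a tool call through the broker: enforce the allowlist, log every field."""
    argv = shlex.split(command)
    if not argv:
        raise PermissionError("empty command")
    if argv[0] not in ALLOWLIST:
        raise PermissionError(f"{argv[0]!r} is not allowlisted")
    result = subprocess.run(argv, cwd=cwd, capture_output=True, text=True)
    log.append({
        "agent": agent,
        "command": command,
        "cwd": cwd,
        "output": result.stdout[:1000],  # truncated stand-in for sanitization
        "exit_status": result.returncode,
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "task_id": task_id,
    })
    return result.returncode
```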

8. State, Memory, and Playbooks

The system stores three kinds of state.

Task state:

  • Current plan
  • Agent assignments
  • Files touched
  • Test results
  • Errors
  • Decisions
  • Open risks

Repository memory:

  • Setup commands
  • Project conventions
  • Known flaky tests
  • Common failure patterns
  • Useful validation commands
  • Ownership and review rules

Agent playbook registry:

  • Prompt versions
  • Workflow versions
  • Tool configurations
  • Routing policies
  • Eval recipes
  • Promotion history
  • Retirement history

Long-term memory must be reviewable, editable, and periodically pruned to avoid stale assumptions.

End-to-End Workflow

  1. A task arrives from a user, a GitHub issue, a ticket, or an API call.
  2. Intake normalizes the task and checks for missing information.
  3. The meta-orchestrator classifies task type, risk, and novelty.
  4. The repository context service gathers relevant code, tests, docs, and history.
  5. Routine work goes to the Right Hand production track.
  6. Ambiguous or high-uncertainty work also triggers Left Hand exploration.
  7. Production agents produce a minimal implementation and tests.
  8. Exploration agents produce candidate alternatives, evals, and risk notes.
  9. The evaluation layer compares evidence from tests, scans, benchmarks, and reviews.
  10. The synthesis agent recommends whether any exploratory result should influence the production patch.
  11. The production track creates the final normalized diff.
  12. CI and review agents validate the final diff.
  13. The PR agent opens a pull request with summary, tests, risks, and logs.
  14. Humans review, request changes, approve, or merge through the normal repository workflow.
  15. Outcomes update metrics, memory, and playbook versions.

Safety and Governance

Required safeguards:

  • Isolated workspace per task
  • No direct commits to protected branches
  • No autonomous production deployment by default
  • Explicit approval for database migrations, auth, billing, data deletion, infrastructure permissions, and secrets changes
  • Command allowlists for autonomous execution
  • Secret redaction in prompts, logs, and artifacts
  • Runtime, token, retry, and diff-size budgets
  • Escalation after repeated ambiguous failures
  • Mandatory tests for behavior changes
  • Full audit trail of plans, commands, diffs, validations, and decisions
  • Rollback plan for migrations, infrastructure, and high-blast-radius changes

Circuit breakers:

  • Stop agents that repeatedly edit the same files without improving eval results.
  • Pause tasks with contradictory high-confidence claims.
  • Escalate large diffs that exceed the planned scope.
  • Require approval when generated tests are weak or mostly assert implementation details.
  • Block PR creation when validation is skipped without a documented reason.
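The circuit-breaker rules above can be expressed as a pure check over a task's running state. The state keys, the retry count of 3, and the 400-line diff budget are illustrative assumptions.

```python
def check_circuit_breakers(state: dict) -> list[str]:
    """Evaluate circuit-breaker rules against a task's running state."""
    actions = []
    # Same files edited repeatedly with no eval improvement -> stop the agent.
    if state.get("repeat_edits_same_files", 0) >= 3 and not state.get("eval_improved", False):
        actions.append("stop_agent")
    # Contradictory high-confidence claims from different agents -> pause for a human.
    if state.get("contradictory_high_confidence", False):
        actions.append("pause_task")
    # Diff grew past the planned scope -> escalate.
    if state.get("diff_lines", 0) > state.get("planned_diff_budget", 400):
        actions.append("escalate_large_diff")
    # Validation skipped without a documented reason -> block PR creation.
    if state.get("validation_skipped", False) and not state.get("skip_reason"):
        actions.append("block_pr")
    return actions
```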

Operating Model

Structural Ambidexterity

Use separate production and exploration tracks.

  • Production agents submit final PRs.
  • Exploration agents work in sandboxes.
  • Synthesis agents translate validated exploration into production guidance.

Contextual Ambidexterity

Individual agents can switch mode when confidence drops.

Examples:

  • A production agent hits repeated test failures and asks for exploratory diagnosis.
  • A review agent detects a security-sensitive diff and triggers failure analysis.
  • A CI agent sees flaky behavior and requests test intelligence support.

Sequential Ambidexterity

Use periodic research cycles.

Examples:

  • Weekly review of failed tasks and slow tasks.
  • Monthly promotion or retirement of playbook strategies.
  • Quarterly evaluation of new tools and model workflows.

Deployment Architecture

Recommended implementation stack:

  • Orchestrator: Python or TypeScript service
  • Agent runtime: queue-based workers
  • Workspace isolation: containers, ephemeral VMs, or isolated worktrees
  • State store: Postgres
  • Artifact store: object storage for logs, patches, screenshots, and reports
  • Code intelligence: rg, tree-sitter, language servers, dependency graph tools
  • Git provider integration: GitHub or GitLab API plus local git
  • Validation: repository-native CI commands and package scripts
  • Dashboard: task status, plans, logs, diffs, approvals, metrics
  • Secrets: managed vault with scoped temporary credentials
  • Observability: structured logs, traces, task metrics, agent metrics

MVP

Start with a conservative GitHub pull request agent for low-risk tasks.

MVP agents:

  • Production Planner
  • Production Implementer
  • Test Agent
  • Review Agent
  • Exploration Challenger
  • Eval Designer
  • PR Agent

MVP capabilities:

  • Accept a GitHub issue or manual prompt.
  • Create an isolated branch or worktree.
  • Inspect the repository and produce a short plan.
  • Make scoped code changes for low-risk tasks.
  • Add or update focused tests.
  • Run configured validation commands.
  • Allow one exploratory challenger to propose risks or alternatives.
  • Open a draft PR with summary, tests, risks, and logs.
  • Require human review before merge.

MVP exclusions:

  • No automatic merge
  • No autonomous deploy
  • No database migrations without approval
  • No broad rewrites
  • No secret management changes
  • No multi-repository changes without approval
  • No automatic promotion of exploratory strategies

Success Metrics

Delivery metrics:

  • Task completion rate
  • Time from task intake to PR
  • CI pass rate on first PR submission
  • Review cycle count
  • Human intervention rate

Quality metrics:

  • Escaped defect rate
  • Rollback rate
  • Reopened issue rate
  • Test coverage on changed behavior
  • Static analysis and security finding rate

Exploration metrics:

  • Percentage of tasks where exploration improved outcome
  • False promotion rate
  • Cost per useful exploratory finding
  • Strategy win rate by task class
  • Time saved after playbook promotion

Governance metrics:

  • Approval policy violations
  • Secret exposure incidents
  • Sandbox escape incidents
  • Unreviewed memory updates
  • Budget overruns

Promotion Policy

Exploratory strategies can move into the production playbook only when they show repeated value.

Promotion criteria:

  • Passes relevant automated evals
  • Improves correctness, speed, test quality, review quality, or risk detection
  • Works across multiple tasks of the same class
  • Does not increase defect rate or review burden
  • Has a rollback path
  • Is versioned with clear ownership

Retirement criteria:

  • Strategy becomes slower than baseline
  • Produces stale or misleading recommendations
  • Increases human review burden
  • Fails against bug museum cases
  • Depends on deprecated tools or APIs
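The promotion and retirement criteria above can be combined into one decision function. The field names and the three-win requirement are illustrative; the key design choice is that retirement checks run first, so a failing strategy can never remain promoted on the strength of old wins.

```python
def playbook_decision(stats: dict) -> str:
    """Return "promote", "retire", or "hold" for a strategy's track record."""
    # Retirement checks first: any regression disqualifies the strategy.
    if (stats.get("slower_than_baseline", False)
            or stats.get("bug_museum_failures", 0) > 0
            or stats.get("review_burden_delta", 0.0) > 0):
        return "retire"
    # Promotion requires repeated value across a task class plus a rollback path.
    if (stats.get("eval_pass", False)
            and stats.get("wins_across_task_class", 0) >= 3
            and stats.get("defect_rate_delta", 0.0) <= 0
            and stats.get("has_rollback_path", False)):
        return "promote"
    return "hold"   # not enough evidence either way
```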

Target End State

The mature system is a controlled autonomous development organization:

  • The Right Hand reliably delivers scoped pull requests.
  • The Left Hand continuously searches for better methods.
  • The meta-orchestrator routes work based on risk, uncertainty, and evidence.
  • Evaluation gates prevent novelty from bypassing engineering discipline.
  • Human engineers retain approval authority over high-impact changes.
  • The system improves through measured, versioned promotion of successful strategies.