A unified framework drawing from graph theory, category theory, finite state machines, and compiler theory — applied to the design of agent harnesses, orchestrators, and quality systems.
Today's agent orchestration is predominantly ad hoc: markdown skills, prompt chains, and custom harness code. This creates several failure modes:
- No formal verification — you can't prove a workflow terminates, doesn't deadlock, or satisfies safety properties
- No composability — quality gates, eval patterns, and workflow stages are one-off implementations
- No visual comprehensibility — workflows live in code, not in inspectable graph structures
- No scaling theory — adding more LLM calls for "safety" can actually degrade quality (Chen et al. 2024)
- No separation between orchestration logic and agent logic — the control flow is entangled with the prompts
The CS formalisms below address each of these problems with well-established theory and tooling.
A task decomposition is a directed acyclic graph D = (V, S):
- V = {v_1, ..., v_n}: sub-task nodes
- S ⊆ V × V: dependency edges
- Acyclicity: no path v_i → ... → v_i
From Allegrini et al. (2025):
D := LLM.Build_Task_DAG(I_U, {EEinfo})
The LLM can construct the DAG, but the structure itself is a formal object with provable properties.
1. Topological scheduling. Compute the execution order automatically:
for each node v in topological_sort(D):
    if all pred(v) ∈ COMPLETED:
        schedule(v)  # can run in parallel with other ready nodes
2. Deadlock-freedom by construction. Acyclic graphs cannot produce circular wait conditions. This is a theorem, not a hope.
3. Parallelism detection. Nodes with no edges between them can execute concurrently. The DAG reveals this automatically — no manual parallelism annotations needed.
4. Visual comprehensibility. A DAG is a picture. You can render it, inspect it, trace execution through it. This is the "n8n-style workflow" insight — the graph IS the observability surface.
5. Dependency validation. Before execution, check: are all required inputs available? Are all required capabilities registered? Is the graph well-formed? These are static checks on the graph structure.
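The scheduling loop and the parallelism detection above can be combined into one small routine. A minimal sketch using Kahn's algorithm; the graph shape and node names are illustrative, not a real orchestrator API:

```typescript
// Minimal DAG scheduler sketch using Kahn's algorithm. Each returned batch
// contains nodes whose dependencies are all complete, so a batch could be
// dispatched in parallel. Throws if the graph is not actually acyclic.
type Dag = Record<string, string[]>; // node -> list of dependency nodes

function readyBatches(dag: Dag): string[][] {
  const remaining = new Map<string, Set<string>>();
  for (const [node, deps] of Object.entries(dag)) remaining.set(node, new Set(deps));

  const batches: string[][] = [];
  while (remaining.size > 0) {
    // Every node whose dependencies are all complete is ready now.
    const batch = [...remaining.keys()].filter(n => remaining.get(n)!.size === 0);
    if (batch.length === 0) throw new Error("cycle detected: not a DAG");
    batches.push(batch);
    for (const n of batch) remaining.delete(n);
    for (const deps of remaining.values()) for (const n of batch) deps.delete(n);
  }
  return batches;
}

// research and lint share no edges, so they land in the same (parallel) batch.
const plan = readyBatches({
  research: [], lint: [], evaluate: ["research"], write: ["evaluate", "lint"],
});
```

Note that the parallelism falls out of the structure: nothing in the input marks research and lint as concurrent, yet the first batch contains both.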
The DAG structure enables a powerful hybrid:
- LLM constructs the DAG — decomposes intent into sub-tasks with dependencies
- Code enforces the DAG — the orchestrator follows topological order, enforces dependencies
- Quality gates are DAG nodes — eval checkpoints are first-class nodes with edges to downstream tasks
This is exactly the "mix of both" described in the notes: code that follows a strict FSM gating transitions, with LLMs handling the creative/analytical work within each node.
AgentSeam's Layer 4 (Session) already tracks turns and events. The DAG formalism extends this: a workflow is a DAG of sessions (or turns within a session). Each node is a unit of agent work. Edges are data dependencies. The orchestrator is a separate concern that schedules according to the DAG.
From Allegrini et al. (2025), the task lifecycle is:
L = (S_t, s_0, E_t, δ)
S_t = {CREATED, AWAITING_DEPENDENCY, READY, DISPATCHING,
IN_PROGRESS, COMPLETED, FAILED, RETRY_SCHEDULED,
FALLBACK_SELECTED, CANCELED, ERROR}
s_0 = CREATED
δ: S_t × E_t → S_t (deterministic transition function)
1. Formal verification. Express properties in CTL temporal logic:
- Safety: AG(state=DISPATCHING → previous_state=READY) — "you can never dispatch without being ready"
- Liveness: AG(state=CREATED → AF(COMPLETED ∨ ERROR ∨ CANCELED)) — "every task eventually terminates"
- Fairness: AG(state=AWAITING_DEPENDENCY → AF(state ≠ AWAITING_DEPENDENCY)) — "nothing waits forever"
These aren't aspirational — they're checkable by automated model checkers (SPIN, NuSMV, TLA+).
2. Guard conditions. Transitions have preconditions:
- DISPATCHING only from READY (TL₅)
- COMPLETED only from IN_PROGRESS (TL₆)
- Terminal states are absorbing (TL₇, TL₉)
This prevents the "session looks stuck after abort" failure mode from claude-session-platform v1. The FSM makes illegal transitions unrepresentable.
3. Explicit recovery paths. The FAILED → RETRY_SCHEDULED → DISPATCHING cycle and FAILED → FALLBACK_SELECTED → DISPATCHING path are first-class transitions, not error-handling afterthoughts.
4. Observable state. Every task is in exactly one state at any time. Observability is trivial — just read the state. No need to infer "what's happening" from a stream of events.
The key insight from the notes: code controls the FSM, LLMs do the work within states.
[CREATED] --code checks dependencies--> [READY]
[READY] --code dispatches agent--> [DISPATCHING]
[DISPATCHING] --agent runtime--> [IN_PROGRESS]
[IN_PROGRESS] --LLM produces output--> [AWAITING_EVAL]
[AWAITING_EVAL] --eval runs--> [COMPLETED] or [FAILED]
[FAILED] --code checks retry policy--> [RETRY_SCHEDULED]
The transitions between states are deterministic code. The work within each state is LLM-driven. Quality gates are transitions guarded by eval results. This gives you the control of compiled code with the flexibility of LLM agents.
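A sketch of the code side of this split, using a reduced state set (with the AWAITING_EVAL state from the diagram above). The transition table, not the caller, decides what is legal, so illegal transitions fail loudly instead of producing a stuck session:

```typescript
// Code-enforced FSM sketch: the `legal` table is the single source of truth
// for transitions. State names follow the lifecycle above; this is a reduced,
// illustrative subset, not the full Allegrini state set.
type TaskState =
  | "CREATED" | "READY" | "DISPATCHING" | "IN_PROGRESS"
  | "AWAITING_EVAL" | "COMPLETED" | "FAILED" | "RETRY_SCHEDULED";

const legal: Record<TaskState, TaskState[]> = {
  CREATED: ["READY"],
  READY: ["DISPATCHING"],
  DISPATCHING: ["IN_PROGRESS"],
  IN_PROGRESS: ["AWAITING_EVAL"],
  AWAITING_EVAL: ["COMPLETED", "FAILED"],
  FAILED: ["RETRY_SCHEDULED"],
  RETRY_SCHEDULED: ["DISPATCHING"],
  COMPLETED: [], // terminal states are absorbing
};

function transition(from: TaskState, to: TaskState): TaskState {
  if (!legal[from].includes(to)) {
    throw new Error(`illegal transition ${from} -> ${to}`);
  }
  return to;
}
```

The LLM never touches this table; it only does work inside IN_PROGRESS, and an eval result selects between the two AWAITING_EVAL successors.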
AgentSeam already has a 10-state session model with 5 flags. The Allegrini model suggests extending this with:
- Guard conditions on transitions (formalized, not just convention)
- Temporal logic properties that can be verified
- Recovery paths as first-class transitions (not exception handling)
A category C consists of:
- Objects: Types (input/output schemas of workflow stages)
- Morphisms (arrows): Transformations between types (workflow stages)
- Composition: If f: A → B and g: B → C, then g ∘ f: A → C
- Identity: For each object A, there exists id_A: A → A
- Associativity: h ∘ (g ∘ f) = (h ∘ g) ∘ f
1. Principled composability. If you can define the type of each workflow stage (its input and output), composition is automatic. You don't need to know how a stage works internally — just its type signature.
extract: RawData → StructuredData
transform: StructuredData → NormalizedData
evaluate: NormalizedData → QualityReport
// Compose:
pipeline = evaluate ∘ transform ∘ extract
// Type: RawData → QualityReport
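In TypeScript the same composition is ordinary function composition, with the compiler checking that adjacent types line up. The domain types below are placeholder aliases, not a real schema:

```typescript
// Typed pipeline composition sketch. The stage types are illustrative
// placeholders; the point is that composition type-checks automatically.
type RawData = { text: string };
type StructuredData = { fields: string[] };
type NormalizedData = { fields: string[] };
type QualityReport = { score: number };

const extract = (r: RawData): StructuredData => ({ fields: r.text.split(" ") });
const transform = (s: StructuredData): NormalizedData =>
  ({ fields: s.fields.map(f => f.toLowerCase()) });
const evaluate = (n: NormalizedData): QualityReport =>
  ({ score: n.fields.length });

// pipeline = evaluate ∘ transform ∘ extract : RawData → QualityReport
const pipeline = (r: RawData): QualityReport => evaluate(transform(extract(r)));
```

Swapping in a different `transform` with the same type signature requires no change to `pipeline`'s callers, which is the composability claim in concrete form.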
2. Functors for domain adaptation. A functor F: C → D maps objects and morphisms from one category to another, preserving composition. This is the answer to "how to have composability across domains":
// Generic quality gate pattern (Category C):
gate: WorkProduct → QualityReport
// Functor F maps to code review domain (Category D):
F(gate): PullRequest → CodeReviewReport
// Functor G maps to content domain (Category E):
G(gate): Article → ContentQualityReport
The gate pattern is defined once. Functors adapt it to specific domains. The composition laws guarantee the adapted version still works correctly.
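In TypeScript a generic is a serviceable approximation of this: the gate is written once over a type parameter, and each domain instantiates it with its own scorer. The domain types and scoring rules below are illustrative placeholders:

```typescript
// Functor-flavored sketch: one generic gate definition, instantiated per
// domain. Generics only approximate functors, but the pattern-reuse is real.
type QualityReport = { pass: boolean; score: number };

// Defined once, for any work product, given a domain-specific scorer.
const makeGate = <W>(score: (w: W) => number, threshold: number) =>
  (work: W): QualityReport => {
    const s = score(work);
    return { pass: s >= threshold, score: s };
  };

type PullRequest = { linesChanged: number; hasTests: boolean };
type Article = { words: number };

// Two "domain adaptations" of the same gate pattern (criteria are toy examples).
const codeGate = makeGate<PullRequest>(pr => (pr.hasTests ? 1 : 0), 1);
const contentGate = makeGate<Article>(a => Math.min(a.words / 500, 1), 0.8);
```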
3. Natural transformations for workflow migration. A natural transformation η: F ⇒ G converts between two functors — i.e., between two domain adaptations of the same pattern. This enables migrating workflows between domains while preserving structure.
4. Monads for sequencing with effects. In functional programming, monads handle sequencing of operations with side effects. In agentic workflows, the "effects" are: LLM calls (non-deterministic), tool execution (side-effecting), quality gates (potentially failing). A monad captures the pattern:
type AgentStep<A> = {
  run: (context: Context) => Promise<Result<A>>
}

// Composition via bind/flatMap:
extractStep.flatMap(data =>
  transformStep(data).flatMap(normalized =>
    evaluateStep(normalized)
  )
)
The notes describe the vision: "smaller building blocks that allow you to state certain principles, and then you can mix and match." Category theory formalizes this:
Building blocks are morphisms:
- qualityGate: WorkProduct → EvalResult
- llmJudge: Content → Verdict
- humanReview: Verdict → Decision
- retry: FailedResult → WorkProduct (with retry policy)
Composition gives you workflows:
fullReview = humanReview ∘ llmJudge ∘ qualityGate
autoReview = retry ∘ llmJudge ∘ qualityGate
Functors give you domain adaptation:
- Same qualityGate pattern, adapted to code review, content review, data validation
- The functor specifies the domain-specific criteria
- The composition structure is preserved
AgentSeam's event bus and enrichment consumer pattern (CAL) is a natural transformation: it watches one category of events (raw session events) and produces another (semantic annotations). The bus itself is a functor from the "session event" category to the "enrichment event" category.
The Layer 3 normalizer is already a functor: it maps from the "Claude SDK message" category to the "AgentMessage" category, preserving composition (message sequences map to message sequences).
A compiler transforms source code through a series of intermediate representations:
Source Code → [Frontend] → AST → [Middle-end] → IR → [Backend] → Machine Code
↓
Pass 1: Type checking
Pass 2: Optimization
Pass 3: Dead code elimination
Pass N: ...
An agent pipeline transforms user intent through a series of intermediate representations:
User Intent → [Decomposition] → Task DAG → [Execution] → Results → [Synthesis] → Output
↓
Pass 1: Dependency resolution
Pass 2: Capability matching
Pass 3: Quality gate evaluation
Pass N: ...
1. Intermediate Representations (IRs). Each stage of the pipeline operates on a well-defined IR. The IR is the contract between stages. As long as the IR is respected, stages can be swapped independently.
For agent pipelines:
- Task DAG is the IR between decomposition and execution
- AgentMessage stream is the IR between runtime and observation (AgentSeam Layer 3)
- SessionEvent log is the IR between execution and attention derivation (AgentSeam Layer 4→5)
- Eval report is the IR between quality gates and retry logic
2. Multi-pass optimization. Compilers run multiple passes over the same IR, each improving it. Agent pipelines can do the same:
Pass 1: Static analysis — check types, schemas, dependencies
Pass 2: Cost estimation — estimate token cost per sub-task
Pass 3: Parallelism detection — identify independent sub-tasks
Pass 4: Quality gate insertion — add eval nodes at critical junctions
Pass 5: Resource allocation — assign models/providers to sub-tasks
These passes transform the Task DAG before execution begins. Each pass is independent and composable.
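Because every pass shares the Task DAG as its IR, passes are just functions from DAG to DAG and compose freely. A minimal sketch; the DagIR shape and the two passes are illustrative, not a proposed schema:

```typescript
// Pass pipeline sketch: every pass has type DagIR -> DagIR, so any subset of
// passes can run in any order that makes sense. Shapes are illustrative.
type DagIR = { nodes: string[]; edges: [string, string][]; notes: string[] };
type Pass = (d: DagIR) => DagIR;

// Static-analysis pass: every edge must reference declared nodes.
const checkDeps: Pass = d => {
  const known = new Set(d.nodes);
  for (const [from, to] of d.edges)
    if (!known.has(from) || !known.has(to))
      throw new Error(`unknown node in edge ${from} -> ${to}`);
  return d;
};

// Analysis pass: nodes with no incoming edges can start immediately, in parallel.
const detectParallelism: Pass = d => {
  const hasIncoming = new Set(d.edges.map(([, to]) => to));
  const roots = d.nodes.filter(n => !hasIncoming.has(n));
  return { ...d, notes: [...d.notes, `initially parallel: ${roots.join(", ")}`] };
};

const runPasses = (d: DagIR, passes: Pass[]): DagIR =>
  passes.reduce((ir, pass) => pass(ir), d);
```

Cost estimation, quality-gate insertion, and resource allocation would slot in as further `Pass` values without touching the runner.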
3. Compilation as verification. LangGraph already uses this pattern: after building a StateGraph with nodes and edges, you call .compile() which performs structural checks (no orphaned nodes, valid edge targets). This is the compiler frontend verifying syntax before generating code.
Extended to formal verification:
compile(graph) → {
check: no orphaned nodes
check: all edges target valid nodes
check: no cycles (DAG property)
check: all required inputs satisfied
check: safety properties hold (CTL model checking)
check: liveness properties hold (termination guaranteed)
optimize: detect parallelizable stages
optimize: compute optimal LLM call counts per node (Chen et al.)
}
4. Abstract interpretation. Compilers use abstract interpretation to reason about program behavior without executing it. For agent pipelines: simulate the workflow with abstract "types" instead of actual data to detect type mismatches, missing dependencies, or unreachable states before running expensive LLM calls.
5. SSA form and data flow. Static Single Assignment form tracks where each value is defined and used. For agent pipelines: track where each artifact is produced and consumed. Detect unused outputs, missing inputs, and redundant computations.
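A def-use check in this spirit reduces to set arithmetic over artifact names. The stage shape below is hypothetical; a real implementation would also whitelist final deliverables so they don't count as unused:

```typescript
// Def-use analysis sketch: find artifacts consumed but never produced
// (missing inputs) and artifacts produced but never consumed (dead outputs).
type Stage = { name: string; produces: string[]; consumes: string[] };

function defUseIssues(stages: Stage[], pipelineInputs: string[]) {
  const produced = new Set([...pipelineInputs, ...stages.flatMap(s => s.produces)]);
  const consumed = new Set(stages.flatMap(s => s.consumes));
  return {
    missing: [...consumed].filter(a => !produced.has(a)), // no producer anywhere
    unused: stages.flatMap(s => s.produces).filter(a => !consumed.has(a)),
  };
}
```

Running this before execution catches wiring mistakes (a stage consuming an artifact nothing emits) without spending a single LLM token.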
Instead of imperative harness code, define workflows declaratively and compile them:
const workflow = defineWorkflow({
nodes: {
research: { agent: "researcher", input: TaskSpec, output: ResearchReport },
evaluate: { eval: "quality-gate", input: ResearchReport, output: EvalResult },
write: { agent: "writer", input: ResearchReport, output: Article },
review: { eval: "content-review", input: Article, output: ReviewResult },
},
  edges: [
    "research -> evaluate",
    "evaluate[pass] -> write",
    "evaluate[fail] -> research", // retry
    "write -> review",
    "review[pass] -> output",
    "review[fail] -> write",     // revise
  ],
constraints: {
maxRetries: { research: 3, write: 2 },
timeout: { research: "5m", write: "10m" },
}
})
const compiled = compile(workflow)
// Verified: no deadlocks, all paths terminate, type safety holds
// Optimized: research and write can't run in parallel (dependency)
// Estimated: ~45k tokens, ~$0.15 per execution

AgentSeam's layer architecture IS a compiler pipeline:
- Layer 2 (Runtime) = Frontend — produces raw events from source (agent runtime)
- Layer 3 (Normalization) = Middle-end — transforms to canonical IR (AgentMessage)
- Layer 4 (Session) = Optimizer — maintains state, enforces invariants
- Layer 5 (Attention) = Analysis pass — derives semantic information
- Layer 6 (Server) = Backend — produces output for consumers
- Layer 7 (View) = Linker — assembles final deliverable for the user
Each layer transforms an IR to the next. Each layer can be independently tested. The boundaries are well-defined contracts.
From Chen et al. (2024): adding more LLM calls to a quality gate doesn't always improve quality. When the eval task has a mix of easy and hard cases:
- Easy cases benefit from voting (more calls → higher accuracy)
- Hard cases suffer from voting (wrong answers dominate majority)
- There exists an optimal K* that maximizes aggregate performance
For a quality gate using K parallel LLM judges:
K* = 2·log(α/(1-α))·(2p₁-1)/(1-2p₂) / log[p₂(1-p₂)/(p₁(1-p₁))]
where:
α = fraction of "easy" eval cases
p₁ = judge accuracy on easy cases
p₂ = judge accuracy on hard cases
- Don't just "add more judges." There's a mathematically optimal number.
- Estimate difficulty distribution first. Run a small sample to determine what fraction of cases are easy vs. hard for your eval.
- Filter-Vote can outperform Vote. Adding a pre-filter stage can improve hard-case performance by removing obvious bad answers before voting.
- Different gates need different K. A code correctness gate (mostly deterministic) needs different K than a "is this persuasive" gate (highly subjective).
This connects to the category theory composability vision: a quality gate is a morphism WorkProduct → EvalResult. The scaling law tells you how to parameterize that morphism:
qualityGate(K=3, filter=true): WorkProduct → EvalResult // for easy domains
qualityGate(K=7, filter=false): WorkProduct → EvalResult // for hard domains
qualityGate(K=1, filter=false): WorkProduct → EvalResult // for deterministic checks
The gate's type signature is the same. The parameters come from the scaling law analysis. The composition still works.
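Transcribing the K* formula above directly (assuming natural logarithms and the grouping as written; α, p₁, p₂ would come from a pilot sample of the eval task):

```typescript
// K* sketch, a direct transcription of the formula above.
// alpha: fraction of easy cases; p1/p2: judge accuracy on easy/hard cases.
// Grouping and log base are assumptions about the formula as printed.
function optimalK(alpha: number, p1: number, p2: number): number {
  const numerator = 2 * Math.log(alpha / (1 - alpha)) * ((2 * p1 - 1) / (1 - 2 * p2));
  const denominator = Math.log((p2 * (1 - p2)) / (p1 * (1 - p1)));
  return numerator / denominator;
}
```

In practice the result would be rounded to a nearby odd integer (to avoid tied votes) and sanity-checked against a held-out sample before being baked into a gate.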
Putting it all together:
┌─────────────────────────────────────┐
│ COMPILER LAYER │
│ Parse → Verify → Optimize → Emit │
└──────────────┬──────────────────────┘
│ compiled workflow
┌──────────────▼──────────────────────┐
│ DAG ORCHESTRATOR │
│ Topological scheduling │
│ Parallel execution │
│ Dependency tracking │
└──────────────┬──────────────────────┘
│ per-node dispatch
┌───────────────────┼───────────────────┐
▼ ▼ ▼
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│ FSM-GOVERNED │ │ FSM-GOVERNED │ │ FSM-GOVERNED │
│ TASK NODE │ │ EVAL NODE │ │ AGENT NODE │
│ │ │ │ │ │
│ CREATED→READY→ │ │ Scaling law K* │ │ LLM does work │
│ IN_PROGRESS→ │ │ determines how │ │ within FSM │
│ COMPLETED/FAILED │ │ many judges run │ │ state bounds │
└──────────────────┘ └──────────────────┘ └──────────────────┘
│ │ │
└───────────────────┼───────────────────┘
│ results flow back
┌──────────────▼──────────────────────┐
│ COMPOSABLE BUILDING BLOCKS │
│ Category-theoretic composition │
│ Functors for domain adaptation │
│ Quality gates as morphisms │
└─────────────────────────────────────┘
| Formal Method | Role in Architecture | Concern |
|---|---|---|
| DAG | Workflow topology | What runs when |
| FSM | Per-node lifecycle | How each step executes |
| Category Theory | Building block composition | How pieces fit together |
| Compiler Theory | Workflow validation & optimization | Is it correct, can it be better |
| Scaling Laws | Eval node parameterization | How many judges per gate |
- Deadlock-free — DAG acyclicity guarantees no circular waits
- Termination-guaranteed — FSM liveness properties ensure every task completes
- Type-safe — Compiler verification ensures inputs/outputs match
- Optimized — Scaling laws determine resource allocation per node
- Composable — Category-theoretic composition enables mix-and-match building blocks
- Observable — DAG + FSM state is inherently visual and inspectable
- Verifiable — CTL/LTL properties can be model-checked before execution
- Formalize the session state machine. AgentSeam's 10-state model should have explicit guard conditions and temporal logic properties, following the Allegrini pattern. This enables proving that sessions always terminate, never deadlock, and always recover from failures.
- Add a compilation step to workflow definitions. Before executing a multi-step workflow, validate the graph structure: no cycles, all dependencies satisfiable, all capabilities available. This is LangGraph's .compile() pattern, extended with formal checks.
- Use scaling laws for eval design. When building quality gates, empirically estimate the difficulty distribution of the eval task and compute the optimal K using Chen et al.'s formula. Don't default to "3 judges."
- Define workflow stages as typed morphisms. Each stage has an input type, output type, and transformation function. Composition is automatic when types match. This enables the "composable building blocks" vision.
- Build a DAG orchestrator as a Layer 4+ consumer. The orchestrator sits above AgentSeam's session layer and schedules work across sessions according to a DAG. Each node in the DAG maps to a session (or turn within a session).
- Implement compiler passes for workflow optimization. Before executing a workflow DAG, run analysis passes: cost estimation, parallelism detection, quality gate insertion, resource allocation.
- Visual workflow builder. The DAG structure naturally supports visual editing (nodes and edges). This is the "n8n-style" vision: build workflows by connecting blocks rather than writing code.
- Formal verification integration. Express desired properties in CTL/LTL and use model checkers (SPIN, NuSMV, or custom lightweight checkers) to verify workflows before execution. This catches deadlocks, infinite loops, and safety violations at design time.
- Category-theoretic workflow library. Build a library of composable building blocks (quality gates, transforms, evals, retries) with formal composition rules. Domain adaptation via functors. This is the "accessible and lower-friction" vision for custom harnesses.
The agent orchestration space is converging on graph-based execution models. LangGraph (400 companies, 90M monthly downloads) already uses StateGraph with .compile(). The Allegrini paper provides the formal theory LangGraph currently lacks. Chen et al. provide the scaling theory that quality gate design currently lacks. Category theory provides the composability theory that workflow builders currently lack.
The opportunity is to build an orchestration layer that combines all four:
- Graph structure from LangGraph's practical success
- Formal verification from Allegrini's temporal logic properties
- Scaling optimization from Chen's compound inference theory
- Composable building blocks from category-theoretic composition
This is not theoretical — each of these has existing implementations or mathematical frameworks. The synthesis is new.
- Allegrini et al. — Formalizing Safety, Security, and Functional Properties of Agentic AI Systems
- Hong et al. — MetaGPT: Meta Programming for Multi-Agent Collaborative Framework
- Chen et al. — Scaling Laws of Compound Inference Systems
- Agents Are Workflows — FSM/MDP formalization of agent workflows
- LangGraph State Machines — Production state machine patterns
- LangGraph Multi-Agent Orchestration Guide — Graph-based architecture analysis
- The 2026 Guide to Agentic Workflow Architectures — Composable architecture patterns
- Agentic Workflows in 2026 — Actor/critic quality gate patterns
- Building AI Agents with Composable Patterns — Reusable building block patterns
- Compiler-R1: Agentic Compiler Auto-tuning with RL — Compiler pass analogy for agent pipelines
- Agentic AI Infrastructure Landscape 2025-2026 — Seven-layer agentic stack analysis
- Category Theory for Programmers — Foundational reference
- LangGraph Graph API — StateGraph compilation model