| title | XState Nested and Parallel Patterns for Workflow Orchestration |
|---|---|
| date | 2026-04-12 |
| type | research |
| status | complete |
| parent | kampus/xstate-nested-parallel-patterns |
| cross-references | |

The operator's current workflow.json schema -- a flat map of independent task state machines, each running its own do-work -> qa -> passed/tripped cycle -- is a correct but limited model. It works because every task is modeled as an island. No task knows about any other task. The state-ledger CLI iterates over them one at a time. The operator agent picks one, does it, transitions it, picks another. This is sequential execution with extra bookkeeping.
The limitation is architectural, not functional. The flat map cannot express "these three tasks are independent and should run concurrently" versus "this task depends on that task and must wait." It cannot express "all research subtasks form a parallel phase, and the consolidation task is a sequential successor gated on their joint completion." The operator agent encodes these relationships in its own prompt instructions -- in natural language, in its head -- rather than in the state machine that is supposed to be the source of truth. Every time the operator makes a sequencing decision, it is doing work that the machine should be doing for it.
XState's parallel and compound state primitives solve this directly. A parallel state node activates all child regions simultaneously and gates its onDone transition on their joint completion -- fork-join concurrency as a declarative primitive. Compound (nested) states decompose a single task's lifecycle into substates that share context and event handling. Invoked actors provide hard boundaries when tasks need independent contexts and lifecycle management. The pattern that maps to the operator's needs is straightforward: the top-level machine is sequential (phase1 -> phase2 -> phase3), specific phases are parallel (all research subtasks run concurrently within phase1), and individual tasks within parallel regions are either compound states or invoked submachines depending on whether they need isolation.
The migration is viable but demands respect for XState's sharp edges. The onDone transition on parallel states has been the single most bug-reported feature in XState's history, with at least four correctness violations shipped across versions. History state persistence is broken when round-tripping through JSON. Schema evolution for persisted snapshots is completely unaddressed -- no versioning, no migration path, no compatibility guarantees. The recommendation is clear: use parallel states for the top-level concurrency concern, keep nesting shallow (two levels maximum), avoid history states in parallel regions, pin your XState version, and test onDone completion aggressively. The state-ledger API must evolve from returning flat string state values to returning compound state value objects, and the CLI must learn to display nested state trees.
The prize is worth the cost. A single XState machine replaces the flat task map plus the operator's implicit sequencing logic. The machine becomes the authority not just for individual task state but for workflow topology. Persistence comes free via getPersistedSnapshot(). The operator agent becomes simpler because the machine handles the "what can run next" question, and the agent only handles the "how to do the work" question. The separation of concerns is exactly right.
The foundational insight is that XState's parallel state value is not a flat bag of flags. It is a recursively nested object that mirrors the machine's hierarchical structure, and that structure IS the runtime representation of concurrent state.
A parallel state node (type: "parallel") activates every direct child region simultaneously upon entry. There is no initial property -- every region starts. Events broadcast to all regions. Context is shared across all regions. And the onDone transition fires if and only if every child region has reached a type: "final" state. This is fork-join concurrency modeled declaratively, and it is the primitive the operator needs for concurrent task groups.
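A minimal sketch of such a node, written as a plain TypeScript object in XState's config shape (the task names, event names, and the `consolidation` target are hypothetical illustrations, not the operator's real schema):

```typescript
// Hypothetical fork-join phase: no `initial` property on the parallel
// node (every region starts on entry), and onDone fires only once both
// regions sit in a `type: "final"` state.
const researchPhase = {
  type: "parallel",
  states: {
    taskA: {
      initial: "working",
      states: {
        working: { on: { "TASK_A.DONE": "done" } },
        done: { type: "final" },
      },
    },
    taskB: {
      initial: "working",
      states: {
        working: { on: { "TASK_B.DONE": "done" } },
        done: { type: "final" },
      },
    },
  },
  onDone: { target: "consolidation" }, // the join: all regions final
};
```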
The state value for a machine with a parallel region embedded in a sequential parent looks like this:
```json
{
  "running": {
    "trackA": "task1",
    "trackB": "task3"
  }
}
```

Leaf values are always strings. The nesting is fully recursive -- parallel inside sequential inside parallel produces deeper objects. This maps directly onto the operator's task structure: each task's current status string becomes a leaf in the compound state value.
For querying, snapshot.hasTag() is the most resilient primitive. Tag each task state with its semantic status ('pending', 'running', 'complete', 'failed') and query by tag. This decouples consuming code from the machine's structural hierarchy entirely. If you refactor the nesting, tags still work. snapshot.matches() is the workhorse for specific state checks, supporting partial matching: snapshot.matches({ running: { trackA: 'task1' } }) returns true regardless of trackB's state.
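To make the partial-matching semantics concrete, here is a small standalone re-implementation of the `matches()` comparison over plain `StateValue` objects. This illustrates the documented semantics only; it is not XState's internal code:

```typescript
type StateValue = string | { [key: string]: StateValue };

// True if `pattern` is a (possibly partial) prefix of `value`:
// a string pattern matches an equal string leaf or the top-level key
// of a nested object; an object pattern must match recursively per key.
function matchesValue(value: StateValue, pattern: StateValue): boolean {
  if (typeof pattern === "string") {
    if (typeof value === "string") return value === pattern;
    // e.g. pattern "running" matches { running: { ... } }
    return pattern in value;
  }
  if (typeof value === "string") return false;
  return Object.keys(pattern).every(
    (key) => key in value && matchesValue(value[key], pattern[key])
  );
}
```

Note how the pattern `{ running: { trackA: "task1" } }` matches regardless of `trackB`'s state, because only the keys present in the pattern are checked.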
Persistence is clean. getPersistedSnapshot() produces a five-field JSON object: status, value, historyValue, context, and children. The persisted snapshot is a minimal diff against the machine definition -- you store only leaf state strings and context, not the entire state graph. Restoration via createActor(machine, { snapshot: persisted }) works correctly and starts processing events from the persisted position. For the operator, this means: persist the snapshot, reconstruct everything else from the machine definition at startup.
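The five-field shape can be captured in a type, with a save/load round trip over JSON. A sketch under the assumptions above (the `StateValue` alias is repeated so the block stands alone; `children` is left loosely typed):

```typescript
type StateValue = string | { [key: string]: StateValue };

// The five fields getPersistedSnapshot() produces, per the docs.
interface PersistedSnapshot {
  status: "active" | "done" | "error" | "stopped";
  value: StateValue;
  historyValue: unknown;
  context: Record<string, unknown>;
  children: Record<string, unknown>;
}

// Round-trip through JSON once before writing, so the persisted form
// and any later JSON read of it are structurally identical.
function serializeSnapshot(snap: PersistedSnapshot): string {
  return JSON.stringify(JSON.parse(JSON.stringify(snap)), null, 2);
}

function deserializeSnapshot(json: string): PersistedSnapshot {
  return JSON.parse(json) as PersistedSnapshot;
}
```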
XState gives you two fundamentally different ways to nest behavior, and choosing wrong will cost you.
Compound states are hierarchy without boundaries. Child states share the parent's context, event bus, and lifecycle. Event bubbling means a CANCEL handler on the parent applies to every child for free. The state value nests accordingly: { preparation: 'grinding' }. This is the right tool when nested states are logically part of the same process -- a task's do-work -> qa -> passed substeps are one thing viewed at different zoom levels. The litmus test: can the child make sense without the parent? If no, use a compound state.
Invoked actors are real boundaries with real independence. An invoked actor has its own context, its own event processing, and its own lifecycle tied to the invoking state. The parent only sees "done" or "error" -- the child's internal states are opaque. Communication happens via explicit message passing with sendTo(), replacing v4's implicit sendParent(). The litmus test: does the parent need to observe the child's internal states? If no, invoke it.
For the operator's workflow, the decision maps cleanly. The overall workflow phases (research, implementation, review) are compound states -- they are one process with substeps. Individual tasks within a parallel research phase could be either compound states (if the operator needs to observe their internal substates for status display) or invoked actors (if the operator only cares about done/failed). The recommendation is compound states for tasks within parallel regions, because the operator's status display explicitly shows task substates (do-work, qa, passed, tripped), which requires the parent machine to see into the child's state.
The setup() pattern is the correct way to declare actors in v5. It registers actors by name, giving you type safety, centralized configuration, and testability. The inline approach works but loses type inference and is a code smell in anything non-trivial.
Here is the key architectural insight most tutorials miss: you do not add sequencing to a parallel machine. You build a sequential machine and embed parallel regions where concurrency is needed.
Sequential is the skeleton; parallel is the organ. The top-level states define the pipeline order. Parallelism is nested inside specific steps. The canonical pattern for "tasks 1+2 in parallel, then task 3, then tasks 4+5 in parallel" is three sequential phases where phases 1 and 3 happen to be type: "parallel":
phase1 (parallel: task1 + task2) -> phase2 (sequential: task3) -> phase3 (parallel: task4 + task5) -> complete
The onDone transition on each parallel phase is the barrier. It fires when all regions reach their final states. There is no race condition, no "what if both finish at the same time" edge case. The statechart formalism handles simultaneous completion by definition. This is Promise.all with structure, visibility, and inspectability.
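The barrier rule itself is simple to state in code. A standalone sketch of the predicate `onDone` implements (not XState internals; `RegionStatus` is an invented shape for illustration):

```typescript
// Each region reports its current leaf state plus whether that leaf
// is marked `type: "final"` in the machine definition.
interface RegionStatus {
  region: string;
  state: string;
  isFinal: boolean;
}

// The join barrier: a parallel state's onDone may fire only when every
// region has reached some final state. Simultaneous completion is not
// a race -- the predicate is evaluated over the whole set at once.
function joinComplete(regions: RegionStatus[]): boolean {
  return regions.length > 0 && regions.every((r) => r.isFinal);
}
```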
Each region within a parallel state is a full state machine. It can have its own sequential substates, its own nested parallel states (though exercise restraint: beyond two levels of nesting, debugging becomes impractical), and its own invocations. The nesting is recursive, and so is the onDone bubbling.
Error handling in parallel phases requires an explicit design decision. If a track's failed state has type: "final", the track is "done" (albeit with an error) and onDone fires once all tracks complete. If failed is NOT final, the workflow deadlocks until something retries. Leaving failed as a non-final state without a recovery path is the single most common bug in parallel XState workflows. For the operator, the right choice is: failed/tripped states are final, and the phase's onDone transition uses a guard to check context for errors before proceeding to the next phase. This preserves the circuit breaker pattern -- a tripped task terminates its track but does not block the entire phase.
Cross-region coordination uses three mechanisms: shared context (all regions read/write the same context -- use distinct keys per track to avoid conflicts), event broadcasting (every event goes to all regions -- this is the primary coordination primitive), and stateIn() guards (check a sibling region's state -- use sparingly, as tight coupling between regions defeats the purpose of parallelism).
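The distinct-keys convention for shared context can be sketched as an assign-style updater that is scoped to a single track's key, so concurrent tracks cannot clobber each other's results (the field names are illustrative):

```typescript
interface WorkflowContext {
  results: Record<string, unknown>; // one key per track, e.g. results.task_1
  errors: { task: string; message: string }[];
}

// Returns a new context where only `results[task]` changes; every
// other key (including other tracks' results) is untouched.
function recordResult(
  ctx: WorkflowContext,
  task: string,
  result: unknown
): WorkflowContext {
  return { ...ctx, results: { ...ctx.results, [task]: result } };
}
```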
XState's parallel and nested primitives are production-ready for shallow, well-separated concerns with disjoint event vocabularies. They degrade along every axis simultaneously as complexity increases. The failure modes that matter for the operator:
onDone has been broken repeatedly. At least four correctness bugs shipped: premature firing when the first (not all) regions completed (#326), history nodes counting as regions that never finish (#3170), nested parallel states not bubbling done events (#2349), and various cases of onDone simply not firing (#1111). The nested parallel completion bug was only fixed in PR #4358. If the operator's architecture depends on parallel onDone for correctness -- and it will -- this must be tested aggressively against the pinned XState version.
History state persistence is broken. Issue #5178: getPersistedSnapshot() serializes historyValue as plain objects, but deserialization does not correctly reconstruct StateNode references. History transitions silently route to initial states instead of remembered states. The failure is silent -- no error, just wrong behavior. The operator's current workflow.json uses type: "history" for the blocked -> unblocked flow. This is a direct conflict. Either remove history states from the parallel machine design, or avoid the JSON round-trip and pass snapshot objects directly.
Schema evolution is unaddressed. There is no versioning, migration, or compatibility story for persisted snapshots. If the machine definition changes -- add a state, remove a state, rename a region -- and a snapshot from the old definition is restored, behavior is undefined. Discussion #4828 explicitly calls out the absence of documentation on this. For the operator, which persists workflow-state.json to disk and may need to evolve the schema across sessions, this is the primary operational risk. The mitigation is a version field in workflow.json and explicit snapshot invalidation on version bump.
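A minimal version-gate sketch for that mitigation (the function and field names are hypothetical): refuse to restore a snapshot whose recorded schema version does not match the current workflow.json, forcing a fresh start instead of undefined behavior.

```typescript
interface StoredState {
  version: number;   // bumped on every machine-definition change
  snapshot: unknown; // opaque persisted XState snapshot
}

// Returns the snapshot only when versions agree; otherwise signals
// the caller to discard it and start the machine from its initial state.
function restoreIfCompatible(
  stored: StoredState,
  currentVersion: number
): unknown {
  return stored.version === currentVersion ? stored.snapshot : null;
}
```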
Event broadcasting is a footgun. All parallel regions receive all events. If two regions handle the same event type differently, you must disambiguate with guards or distinct event types. Neither scales. The operator's event vocabulary (DONE, FAIL, PASS, BLOCKED) is generic -- sending DONE to a parallel state would transition every region that handles DONE. The fix is namespaced events: TASK_1.DONE, TASK_2.DONE. This is manual routing with extra steps, but it is the only correct approach.
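Namespacing can be a pair of tiny helpers (hypothetical names; the operator's CLI could use them to keep the old `--task task_1 --event DONE` ergonomics while emitting routed events):

```typescript
// Build "TASK_1.DONE" from a task id and a bare event name.
function namespaceEvent(taskId: string, event: string): string {
  return `${taskId.toUpperCase()}.${event.toUpperCase()}`;
}

// Split "TASK_1.DONE" back into its routing parts. Returns null for
// un-namespaced events so callers can reject them explicitly.
function parseEvent(type: string): { task: string; event: string } | null {
  const dot = type.indexOf(".");
  if (dot <= 0 || dot === type.length - 1) return null;
  return { task: type.slice(0, dot), event: type.slice(dot + 1) };
}
```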
Tooling gives out around 3 levels of nesting. The Stately visualizer breaks on compound states with mixed parallel/non-parallel children. The inspector chokes at ~100KB of stringified context. TypeScript type inference slows dramatically with deep nesting (microsoft/TypeScript#39826). The practical ceiling is 2-3 levels before you lose the ability to inspect, visualize, or get fast IDE feedback.
State explosion is implicit. Three parallel regions with 3 states each = 27 possible combinations. XState provides no way to declare that certain combinations are invalid. Guards are runtime checks, not compile-time guarantees. Invalid states must be prevented by careful event handling, not by the type system.
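The combinatorics are easy to see by enumeration. A sketch that counts the configuration space of a parallel state and filters it by a runtime validity predicate (the predicate here is an arbitrary example, not an operator rule):

```typescript
// Cartesian product of each region's possible leaf states.
function configurations(regions: string[][]): string[][] {
  return regions.reduce<string[][]>(
    (acc, states) => acc.flatMap((combo) => states.map((s) => [...combo, s])),
    [[]]
  );
}

const regionStates = [
  ["do-work", "qa", "passed"],
  ["do-work", "qa", "passed"],
  ["do-work", "qa", "passed"],
];
const all = configurations(regionStates); // 3^3 = 27 combinations

// Example invalid-state rule, enforceable only at runtime:
// "no task may be in qa while another is still in do-work".
const valid = all.filter(
  (c) => !(c.includes("qa") && c.includes("do-work"))
);
```

Nothing in the type system rules out the other twelve combinations; only event handling keeps the machine off them.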
The current flat schema:
```json
{
  "id": "feature-name",
  "version": 1,
  "tasks": {
    "task_1": { "initial": "do-work", "states": { ... } },
    "task_2": { "initial": "do-work", "states": { ... } }
  }
}
```

The proposed parallel-aware schema:
```json
{
  "id": "feature-name",
  "version": 2,
  "generated": "2026-04-12",
  "machine": {
    "id": "feature-name",
    "initial": "phase1",
    "context": {
      "results": {},
      "errors": []
    },
    "states": {
      "phase1": {
        "type": "parallel",
        "states": {
          "task_1": {
            "initial": "do-research",
            "context": { "retries": 0, "maxRetries": 3 },
            "states": {
              "do-research": {
                "on": {
                  "TASK_1.DONE": "qa-research",
                  "TASK_1.BLOCKED": "blocked",
                  "TASK_1.TRIPPED": "tripped"
                }
              },
              "qa-research": {
                "on": {
                  "TASK_1.PASS": "passed",
                  "TASK_1.FAIL": [
                    { "target": "do-research", "guard": "retriesRemaining", "actions": "incrementRetries" },
                    { "target": "tripped" }
                  ]
                }
              },
              "blocked": {
                "on": { "TASK_1.UNBLOCKED": "do-research" }
              },
              "passed": { "type": "final" },
              "tripped": { "type": "final" }
            }
          },
          "task_2": {
            "initial": "do-research",
            "states": {
              "do-research": {
                "on": {
                  "TASK_2.DONE": "qa-research",
                  "TASK_2.TRIPPED": "tripped"
                }
              },
              "qa-research": {
                "on": {
                  "TASK_2.PASS": "passed",
                  "TASK_2.FAIL": [
                    { "target": "do-research", "guard": "retriesRemaining", "actions": "incrementRetries" },
                    { "target": "tripped" }
                  ]
                }
              },
              "passed": { "type": "final" },
              "tripped": { "type": "final" }
            }
          }
        },
        "onDone": [
          { "target": "phase2", "guard": "noErrors" },
          { "target": "tripped" }
        ]
      },
      "phase2": {
        "initial": "do-consolidation",
        "states": {
          "do-consolidation": {
            "on": {
              "CONSOLIDATION.DONE": "qa-consolidation",
              "CONSOLIDATION.TRIPPED": "tripped"
            }
          },
          "qa-consolidation": {
            "on": {
              "CONSOLIDATION.PASS": "passed",
              "CONSOLIDATION.FAIL": [
                { "target": "do-consolidation", "guard": "retriesRemaining", "actions": "incrementRetries" },
                { "target": "tripped" }
              ]
            }
          },
          "passed": { "type": "final" },
          "tripped": { "type": "final" }
        },
        "onDone": "complete"
      },
      "complete": { "type": "final" },
      "tripped": { "type": "final" }
    }
  }
}
```

Key design decisions:
- **Single machine, not a flat map.** The entire workflow is one XState machine definition under `machine`. The state-ledger builds and runs this machine directly.
- **Namespaced events.** `TASK_1.DONE` instead of `DONE`. This prevents event broadcasting from transitioning unintended regions.
- **History states removed.** The `blocked -> hist` pattern from the current schema is replaced with `blocked -> do-research` (explicit target). This avoids the history state serialization bug (#5178) entirely.
- **`tripped` is final.** Both at the task level (within a parallel region) and at the workflow level. A tripped task completes its region, allowing the parallel phase's `onDone` to fire. The guarded `onDone` transition checks for errors before proceeding.
- **Phase-level circuit breaker.** The `onDone` guard on phase1 checks `context.errors`. If any task tripped, the workflow itself trips rather than proceeding to phase2 with incomplete inputs.
- **Version field bumped to 2.** This signals the schema change and triggers snapshot invalidation in the state-ledger.
The current Ledger interface assumes flat string state values:
```typescript
interface TaskStatus {
  state: string; // flat: "do-work", "qa", "passed"
  retries: number;
  maxRetries: number;
  final: boolean;
}
```

The new interface must handle compound state values:
```typescript
// The state value is now the XState StateValue type --
// a string for atomic states, a nested object for compound/parallel states
type StateValue = string | { [key: string]: StateValue };

interface WorkflowStatus {
  value: StateValue; // e.g., { phase1: { task_1: "do-research", task_2: "qa-research" } }
  status: 'active' | 'done' | 'error' | 'stopped';
  context: Record<string, unknown>;
  tasks: Record<string, { // derived: flatten the parallel regions into per-task status
    state: string; // leaf state string for this task
    final: boolean;
  }>;
}

interface TransitionResult {
  previous: StateValue;
  event: string;
  current: StateValue;
  taskAffected: string; // which task actually transitioned (derived from event namespace)
}

interface Ledger {
  status(): Promise<WorkflowStatus>;
  transition(event: string): Promise<TransitionResult>; // no taskId needed -- event namespace routes it
  history(): Promise<TransitionHistoryEntry[]>; // workflow-level history, not per-task
  tasks(): Promise<Record<string, { state: string; final: boolean }>>; // convenience: flat task view
}
```

The critical changes:
- **`transition()` takes an event, not a taskId + event.** The event namespace (`TASK_1.DONE`) routes to the correct region. The state-ledger no longer needs to know which task to target -- the machine handles routing.
- **`status()` returns compound state values.** The caller must understand nested objects. A convenience `tasks()` method provides the flat view by walking the state value tree.
- **The `stateValue()` helper must handle objects.** The current implementation `JSON.stringify`s non-string values. The new implementation should return the raw `StateValue` and let callers decide how to display it.
- **History becomes workflow-level.** With a single machine, transition history is a single ordered log. Per-task history is derivable by filtering on event namespace.
- **The `--task` flag on the CLI becomes optional.** `state-ledger transition --dir <path> --event TASK_1.DONE` replaces `state-ledger transition --dir <path> --task task_1 --event DONE`. The old form can be sugar that prepends the task namespace.
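The `tasks()` flat view can be derived by walking the compound state value. A standalone sketch (the `finalStates` set is an assumption standing in for what the real implementation would read from the machine definition):

```typescript
type StateValue = string | { [key: string]: StateValue };

// Walk the nested state value and record each string leaf under its
// immediate parent key -- i.e., the task name directly above the leaf.
function flattenTasks(value: StateValue): Record<string, string> {
  const out: Record<string, string> = {};
  const walk = (v: StateValue) => {
    if (typeof v === "string") return;
    for (const [key, child] of Object.entries(v)) {
      if (typeof child === "string") out[key] = child;
      else walk(child);
    }
  };
  walk(value);
  return out;
}

// Hypothetical: in the real ledger this would come from the machine.
const finalStates = new Set(["passed", "tripped", "complete"]);

function taskView(value: StateValue) {
  return Object.fromEntries(
    Object.entries(flattenTasks(value)).map(([task, state]) => [
      task,
      { state, final: finalStates.has(state) },
    ])
  );
}
```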
These are not theoretical concerns. Each rule is derived from a documented bug, a silent failure mode, or an empirical degradation threshold.
- **Pin your XState version.** The `onDone` + parallel state combination has been broken, re-broken, and partially fixed across at least four major issues. Do not assume an upgrade will not regress this. Lock the version in package.json with an exact version, not a range.
- **Do not use history states in parallel regions.** History state persistence is broken (Issue #5178). The JSON round-trip corrupts `historyValue`, causing silent fallback to initial states. Use explicit transition targets instead of history nodes.
- **Namespace all events in parallel regions.** Events broadcast to all regions. `DONE` sent to a parallel state transitions every region that handles `DONE`. Use `TASK_1.DONE`, `TASK_2.DONE` to route events to specific tracks.
- **Keep nesting to 2 levels maximum.** XState imposes no depth limit, but tooling (visualizer, inspector, TypeScript inference) degrades at 3+ levels. Parallel inside sequential is one level. Sequential inside parallel inside sequential is two. Stop there.
- **Make `failed`/`tripped` states final.** If a failed state is not `type: "final"`, the parallel parent's `onDone` never fires for that region. The workflow deadlocks. Always make error terminal states final, and use guarded `onDone` transitions to check for errors before proceeding.
- **Give each parallel track its own context key.** All regions share one context object. If two regions `assign` to the same key, the last one wins. Use `context.results.task_1`, `context.results.task_2` -- no overlap, no surprises.
- **Do not rely on `raise` for cross-region communication.** `raise` in a parallel state transition does not always propagate to sibling regions in the same microstep (Discussion #4456). Use `sendTo` targeting the actor itself for reliable delivery.
- **Do not target sibling region states.** Cross-region transitions (from one parallel region to a state in a sibling region) are not reliably supported (Issue #518). The exit action fires but the state change does not occur. Each region must be self-contained.
- **Version your workflow.json.** There is no schema evolution story for persisted snapshots. If the machine definition changes and an old snapshot is restored, behavior is undefined. Bump the version field on any machine change and invalidate stale snapshots.
- **Double-serialize for JSON safety.** `getPersistedSnapshot()` sets `output` and `error` to `undefined`, which `JSON.stringify` silently drops. Use `JSON.parse(JSON.stringify(snapshot))` before persisting to convert `undefined` to absent keys.
- **Test `onDone` completion for every parallel phase.** Write explicit tests that advance all regions to their final states and verify the `onDone` transition fires. Write tests where some regions fail and verify the guarded `onDone` routes correctly. Do not trust that this works from documentation alone.
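The JSON-safety rule above is easy to demonstrate in isolation: `undefined`-valued keys vanish under `JSON.stringify`, so normalizing the object before persisting makes the stored and in-memory shapes identical. A sketch with a snapshot-like object (the field names mirror the ones discussed earlier; this is not a real XState snapshot):

```typescript
// A snapshot-like object with the undefined-valued fields that
// getPersistedSnapshot() can produce on an active actor.
const snapshotLike: Record<string, unknown> = {
  status: "active",
  value: { phase1: { task_1: "do-research" } },
  output: undefined,
  error: undefined,
};

// Round-trip: undefined-valued keys become absent keys, so the
// normalized object matches what any later JSON read will see.
const normalized = JSON.parse(JSON.stringify(snapshotLike));
```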
- **History state replacement:** the current `blocked -> hist` pattern resumes at the pre-blocked state. Removing history states means hard-coding the resume target (e.g., `blocked -> do-work`). Is this acceptable, or does the operator need a more sophisticated "resume where you left off" mechanism?
- **Context isolation:** in the proposed schema, all tasks in a parallel phase share one context. The operator's circuit breaker uses per-task `retries` and `maxRetries`. Should these be namespaced in shared context (`context.task_1.retries`), or should tasks be invoked actors with their own contexts? Invoked actors solve the isolation problem but make task substates opaque to the parent.
- **Event generation:** who constructs the namespaced event string? The operator agent currently sends bare `DONE`/`FAIL` events. With namespaced events, the agent must know the task's namespace. Should the state-ledger CLI accept `--task task_1 --event DONE` and internally construct `TASK_1.DONE`, preserving the current DX?
- **Schema migration:** when the workflow.json version bumps, what happens to in-flight workflows? The simplest answer is "invalidate and restart." Is there a case where partial progress must be preserved across schema changes?
- **Consolidation task dependency:** the proposed schema gates phase2 (consolidation) on phase1 (all subtasks) via sequential ordering. But the operator's research workflow sometimes needs a consolidation task that reads the outputs of all subtasks. How should subtask outputs flow into the consolidation task's input? Context accumulation in the shared `results` object? Explicit input mapping in the schema?
- **`onDone` guard reliability:** the guarded `onDone` pattern (`[{ target: "phase2", guard: "noErrors" }, { target: "tripped" }]`) relies on context being updated before the guard evaluates. Is this guaranteed by XState's event-processing semantics, or is there a timing risk where `assign` actions in child regions haven't flushed before the parent's `onDone` guard runs?
- Stately Docs: Parallel States -- Parallel state semantics, event routing, `onDone` join pattern
- Stately Docs: Persistence -- `getPersistedSnapshot()`, `createActor` restoration
- Stately Docs: States -- `StateValue` representation, `matches()`, `hasTag()` APIs
- Stately Docs: Parent States -- Compound states, event bubbling, `initial`, `onDone`
- Stately Docs: Invoke -- Invoke API, `onDone`/`onError`, parent-child communication
- Stately Docs: Actors -- Actor types, invoke vs spawn, capabilities matrix
- Stately Docs: Setup -- `setup()` for named actors with type safety
- Stately Docs: Final States -- `type: "final"` semantics
- Stately Docs: Guards -- Guard semantics, `stateIn()` guard
- Stately Blog: Persisting and Restoring State -- Full persistence lifecycle
- Baptiste Devessier: Parallel States and Events -- Event broadcasting across parallel regions
- Tim Deschryver: Building Incremental Views -- Parallel states for progressive data loading
- DEV Community: Improve child to parent communication with XState 5 -- `sendTo` + `parentRef` pattern
- Sandro Maglione: State machines and Actors in XState v5 -- Actor model architecture, root-level invoke
- This Dot Labs: Using XState Actors to Model Async Workflows Safely -- Async workflow patterns
- DEV Community: Nested and Parallel States Using Statecharts -- Nesting patterns introduction
- DeepWiki: State Snapshots and Context -- `MachineSnapshot` fields, `StateValue` type
- DeepWiki: Persistence and Rehydration -- Persisted snapshot structure, restoration pipeline
- Statecharts.dev: Parallel State Glossary -- Formalism reference
- Issue #5178: Restoring state breaks history behaviour -- History state serialization bug
- Issue #4383: stateIn guard ID resolution -- `stateIn` guard broken for parallel state IDs
- Issue #3170: History states break parallel onDone -- History node counted as unfinished region
- Issue #2349: Nested parallel states don't bubble done events -- SCXML deviation, fixed in PR #4358
- Issue #1341: onDone not triggered -- Various `onDone` failures
- Issue #518: Cross-region transitions -- Sibling region targeting fails silently
- Issue #452: Visualizer crashes on nested parallel/compound -- Tooling fragility
- Issue #2048: Inspector chokes at ~100KB context -- Serialization performance
- Discussion #4828: Schema evolution for persisted snapshots -- No migration story
- Discussion #4456: raise in parallel state transitions -- Cross-region raise unreliable
- Discussion #1829: Transition type defaults vs SCXML -- Internal vs external transition semantics
- Discussion #4716: getPersistedSnapshot null vs undefined -- Serialization wart
- Discussion #4697: TypeScript and extracting compound states -- TS type inference limitations
- Discussion #2181: Parallel state machine patterns -- Community discussion
- Microsoft/TypeScript #39826: XState types cause slow type checking -- Deep parameterized State types