The implementation is strong in architectural intent: declarative coordinator composition, clean orchestration abstractions, strong journaling model, and good test scaffolding. The main weaknesses are production-hardening concerns: timeout/cancellation behavior, lifecycle cleanup, and brittle reflection in local transport.
@Coordinator, @Fleet, and @AgentRef provide a readable topology contract. Class-based refs (type) balance compile safety with IDE navigation; name-based refs (value) handle remote/cross-module cases. The design scales naturally from 2 agents to 20.
- modules/coordinator/src/main/java/org/atmosphere/coordinator/annotation/Coordinator.java
- modules/coordinator/src/main/java/org/atmosphere/coordinator/annotation/Fleet.java
- modules/coordinator/src/main/java/org/atmosphere/coordinator/annotation/AgentRef.java
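A minimal sketch of what that topology contract might look like in use. The annotation declarations below are illustrative stand-ins so the example compiles on its own; only the element names (type for class-based refs, value for name-based refs) come from the review, and ResearchAgent / ResearchCoordinator are hypothetical names.

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;

// Stand-in declarations; the real annotations live in
// org.atmosphere.coordinator.annotation.
@Retention(RetentionPolicy.RUNTIME) @interface Coordinator { String name(); }
@Retention(RetentionPolicy.RUNTIME) @interface Fleet { AgentRef[] value(); }
@Retention(RetentionPolicy.RUNTIME) @interface AgentRef {
    Class<?> type() default Void.class; // class-based ref: compile-checked
    String value() default "";          // name-based ref: remote/cross-module
}

class ResearchAgent {}

// Hypothetical topology: one local, compile-checked ref and one remote,
// name-based ref.
@Coordinator(name = "research-coordinator")
@Fleet({
    @AgentRef(type = ResearchAgent.class),
    @AgentRef("remote-summarizer")
})
class ResearchCoordinator {}

public class TopologySketch {
    public static String coordinatorName(Class<?> c) {
        return c.getAnnotation(Coordinator.class).name();
    }
    public static void main(String[] args) {
        System.out.println(coordinatorName(ResearchCoordinator.class));
    }
}
```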
AgentFleet owns orchestration patterns (parallel/pipeline). AgentProxy encapsulates agent-level behavior (call/stream/availability). AgentTransport isolates local vs remote communication. AgentCall as a pure spec record cleanly separates what from when.
CoordinationEvent uses sealed interface + records — immutable, exhaustive switch, and clean serialization. toLogLine() per variant is a nice debugging touch. The JournalFormat pluggable rendering (StandardLog, Markdown) is well thought out.
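The sealed-interface-plus-records pattern the review describes can be sketched as follows. The variant names and log format here are illustrative, not the project's actual ones; the point is that the compiler enforces exhaustiveness, so adding a variant forces every switch to be updated.

```java
// Illustrative event hierarchy in the style the review describes.
sealed interface CoordinationEvent permits CallStarted, CallCompleted, CallFailed {
    String coordinationId();

    // Per-variant log rendering; exhaustive switch, no default branch needed.
    default String toLogLine() {
        return switch (this) {
            case CallStarted s   -> "[%s] start %s".formatted(s.coordinationId(), s.agent());
            case CallCompleted c -> "[%s] done %s".formatted(c.coordinationId(), c.agent());
            case CallFailed f    -> "[%s] fail %s: %s".formatted(f.coordinationId(), f.agent(), f.reason());
        };
    }
}
record CallStarted(String coordinationId, String agent) implements CoordinationEvent {}
record CallCompleted(String coordinationId, String agent) implements CoordinationEvent {}
record CallFailed(String coordinationId, String agent, String reason) implements CoordinationEvent {}

public class EventSketch {
    public static void main(String[] args) {
        CoordinationEvent e = new CallFailed("c-1", "summarizer", "timeout");
        System.out.println(e.toLogLine()); // prints "[c-1] fail summarizer: timeout"
    }
}
```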
DefaultAgentFleet.parallel() uses newVirtualThreadPerTaskExecutor(), DefaultAgentProxy.callAsync() uses Thread.startVirtualThread(). Exactly right for IO-bound agent calls.
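A minimal sketch of that fan-out style, one virtual thread per IO-bound agent call. The method shape is illustrative, not DefaultAgentFleet's actual signature; note that ExecutorService.close() (via try-with-resources) waits for all submitted tasks before returning.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelSketch {
    // Run all calls concurrently on virtual threads; collect results by name.
    public static Map<String, String> parallel(Map<String, Callable<String>> calls)
            throws Exception {
        Map<String, Future<String>> futures = new LinkedHashMap<>();
        try (ExecutorService vt = Executors.newVirtualThreadPerTaskExecutor()) {
            calls.forEach((name, call) -> futures.put(name, vt.submit(call)));
        } // close() blocks until every submitted task has finished
        Map<String, String> results = new LinkedHashMap<>();
        for (var e : futures.entrySet()) {
            results.put(e.getKey(), e.getValue().get());
        }
        return results;
    }

    public static void main(String[] args) throws Exception {
        var out = parallel(Map.of("a", () -> "A", "b", () -> "B"));
        System.out.println(out.get("a") + out.get("b")); // prints "AB"
    }
}
```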
JournalingAgentFleet wraps AgentFleet transparently — opt-in via ServiceLoader, records all parallel()/pipeline()/agent().call() paths. Auto-evaluation is async and non-blocking.
StubAgentFleet, StubAgentTransport, StubAgentRuntime, CoordinatorAssertions — a real testing toolkit. Tests verify parallel execution, pipeline abort, coordination ID sharing, cross-thread isolation, and async evaluation timing.
End-to-end coordinator + WebSocket + worker interactions are tested via CoordinatorWebSocketIntegrationTest.
LocalAgentTransport.java:64-70,120-127: Uses getDeclaredField("protocolHandler") + setAccessible(true) + getMethod("handleMessage"). This silently breaks if A2aHandler renames the field or A2aProtocolHandler changes method signatures. A proper in-process dispatch SPI (e.g. LocalDispatchable in protocol-common) would replace fragile reflection with a compile-checked contract.
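A sketch of the suggested SPI. LocalDispatchable is the hypothetical interface the review proposes for protocol-common; the handler and transport below are stand-ins showing how an instanceof check replaces the reflective field/method lookup, so a rename becomes a compile error instead of a silent runtime break.

```java
public class DispatchSpiSketch {
    // Hypothetical compile-checked contract for in-process dispatch.
    interface LocalDispatchable {
        String dispatch(String jsonRpcMessage);
    }

    // Stand-in for a protocol handler implementing the SPI directly.
    static class StubHandler implements LocalDispatchable {
        @Override public String dispatch(String msg) { return "handled:" + msg; }
    }

    // The local transport no longer needs reflection at all.
    static String send(Object handler, String jsonRpcMessage) {
        if (handler instanceof LocalDispatchable d) {
            return d.dispatch(jsonRpcMessage);
        }
        throw new IllegalStateException("handler does not support in-process dispatch");
    }

    public static void main(String[] args) {
        System.out.println(send(new StubHandler(), "ping")); // prints "handled:ping"
    }
}
```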
DefaultAgentFleet.java:101: CompletableFuture.join() with no timeout. A hung agent stalls the entire parallel() call — and therefore the coordinator's @Prompt method — forever. orTimeout() or get(timeout, unit) would bound the blast radius.
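The bounded-join pattern can be sketched as below. The 30-second budget and fallback value are illustrative choices, not the project's; the key point is that orTimeout completes the future exceptionally with TimeoutException instead of letting join() block forever.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeoutSketch {
    // Join an agent call with a bounded wait instead of an unbounded join().
    public static String callWithBudget(CompletableFuture<String> agentCall) {
        try {
            return agentCall.orTimeout(30, TimeUnit.SECONDS).join();
        } catch (CompletionException e) {
            if (e.getCause() instanceof TimeoutException) {
                return "agent timed out"; // illustrative fallback; real code
            }                             // would journal and propagate
            throw e;
        }
    }

    public static void main(String[] args) {
        System.out.println(callWithBudget(CompletableFuture.completedFuture("ok")));
    }
}
```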
DefaultAgentFleet.java:84,110: The VirtualThreadPerTaskExecutor is created at line 84 and closed at line 110. Today vtExecutor.close() runs because the join loop catches exceptions per-future, but try-with-resources is the right hardening move: a no-cost safety net.
JournalingAgentFleet.java:51: evalExecutor is created as a field but never shut down. No close()/shutdown() lifecycle method exists.
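A sketch of the missing lifecycle hook: the journaling wrapper owns its evaluation executor, so it should implement AutoCloseable and drain it on shutdown. The class name and 5-second grace period are illustrative; only the evalExecutor field name comes from the review.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class JournalingLifecycleSketch implements AutoCloseable {
    private final ExecutorService evalExecutor =
            Executors.newVirtualThreadPerTaskExecutor();

    public Future<String> evaluateAsync(Callable<String> eval) {
        return evalExecutor.submit(eval);
    }

    @Override public void close() {
        evalExecutor.shutdown(); // stop accepting work, let in-flight evals finish
        try {
            if (!evalExecutor.awaitTermination(5, TimeUnit.SECONDS)) {
                evalExecutor.shutdownNow(); // grace period exceeded: force it
            }
        } catch (InterruptedException e) {
            evalExecutor.shutdownNow();
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) throws Exception {
        try (var fleet = new JournalingLifecycleSketch()) {
            System.out.println(fleet.evaluateAsync(() -> "evaluated").get());
        }
    }
}
```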
JournalingAgentFleet.java:52,156-163: activeCoordinationId (ThreadLocal) is set once and never remove()'d. A long-lived thread reuses the same coordination ID across unrelated requests. Current tests enforce same-thread reuse semantics (JournalingAgentFleetTest:176-214), but the lack of clearing is a latent bug for production workloads with thread pooling.
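Deterministic clearing can be sketched with a try/finally scope, so a pooled thread never carries an ID into an unrelated request. The scoping method is hypothetical; only the activeCoordinationId field name comes from the review.

```java
import java.util.function.Supplier;

public class CoordinationIdSketch {
    private static final ThreadLocal<String> activeCoordinationId = new ThreadLocal<>();

    // Bind the coordination ID for the duration of one request, then clear it.
    public static String runScoped(String coordinationId, Supplier<String> body) {
        activeCoordinationId.set(coordinationId);
        try {
            return body.get();
        } finally {
            activeCoordinationId.remove(); // cleared even if body throws
        }
    }

    public static void main(String[] args) {
        String r = runScoped("c-42", () -> "id=" + activeCoordinationId.get());
        // prints "id=c-42, after=null"
        System.out.println(r + ", after=" + activeCoordinationId.get());
    }
}
```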
CoordinatorProcessor.java:248-265: resolveAgentName() errors say "@AgentRef type com.foo.Bar has neither @Agent nor @Coordinator" but don't mention which coordinator the ref belongs to. When multiple coordinators are wired, this makes debugging harder. Including the coordinator name and ref position would help.
CoordinatorProcessor.java: ~750 lines handling instance creation, skill file loading, command scanning, prompt resolution, AI infrastructure, fleet resolution, cycle detection, journal wiring, A2A/MCP/AG-UI bridge registration, channel wiring, and topology logging, plus multiple inner bridge classes. This increases maintenance cost and drift risk against AgentProcessor. It should be decomposed into collaborators.
AgentCall and transport APIs use Map<String, String>, limiting structured/nested payloads. Map<String, Object> would match what most agent protocols actually support.
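One backward-compatible shape for that change, sketched below. The AgentCall record here is a hypothetical stand-in, not the project's actual spec record: new callers pass structured values, while a bridge factory keeps existing Map<String, String> call sites working.

```java
import java.util.Map;

public class PayloadSketch {
    // Hypothetical new-style call spec accepting structured payloads.
    public record AgentCall(String agent, Map<String, Object> payload) {
        // Bridge for existing flat Map<String, String> callers.
        public static AgentCall ofStrings(String agent, Map<String, String> flat) {
            return new AgentCall(agent, Map.copyOf(flat));
        }
    }

    public static void main(String[] args) {
        // Nested payloads become possible without breaking flat ones.
        AgentCall nested = new AgentCall("planner",
                Map.of("task", "summarize", "options", Map.of("maxTokens", 512)));
        AgentCall legacy = AgentCall.ofStrings("planner", Map.of("task", "summarize"));
        System.out.println(nested.payload().get("options"));
        System.out.println(legacy.payload().get("task")); // prints "summarize"
    }
}
```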
The weight field propagates into proxies and topology logs, but no selection/load-balancing policy consumes it. It's dead metadata until a routing strategy is implemented.
LocalAgentTransport.buildJsonRpc() and A2aAgentTransport.buildJsonRpc() are near-identical. Same for extractArtifactText(). Should be extracted to a shared utility (possibly in protocol-common).
No built-in retry for transient failures. Combined with the unused weight field, there's no failover story.
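A minimal retry sketch for transient transport failures, under assumed defaults (attempt count and linear backoff are illustrative; a weight-aware failover policy would layer on top of something like this).

```java
import java.util.concurrent.Callable;

public class RetrySketch {
    // Retry a call up to maxAttempts times, sleeping briefly between tries.
    public static <T> T withRetry(Callable<T> call, int maxAttempts) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                last = e;
                Thread.sleep(100L * attempt); // linear backoff between attempts
            }
        }
        throw last; // all attempts exhausted
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        String result = withRetry(() -> {
            if (++calls[0] < 3) throw new RuntimeException("transient failure");
            return "ok after " + calls[0] + " attempts";
        }, 5);
        System.out.println(result); // prints "ok after 3 attempts"
    }
}
```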
Each pipeline step executes independently — the output of step N isn't passed as input to step N+1. The current pipeline is just "sequential execution with abort," not a data pipeline.
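The difference can be sketched in a few lines: a data pipeline threads each step's output into the next step while preserving the existing abort semantics. The signatures are illustrative, not the fleet API's.

```java
import java.util.List;
import java.util.function.UnaryOperator;

public class PipelineSketch {
    // Each step consumes the previous step's output; null aborts the chain.
    public static String pipeline(String input, List<UnaryOperator<String>> steps) {
        String current = input;
        for (UnaryOperator<String> step : steps) {
            current = step.apply(current); // step N+1 sees step N's result
            if (current == null) break;    // abort semantics preserved
        }
        return current;
    }

    public static void main(String[] args) {
        String out = pipeline("raw",
                List.of(s -> s + "->cleaned", s -> s + "->summarized"));
        System.out.println(out); // prints "raw->cleaned->summarized"
    }
}
```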
- Fleet behavior (parallel/pipeline, duplicate key handling): DefaultAgentFleetTest
- Journaling behavior and async evaluation: JournalingAgentFleetTest
- Transport edge cases (fallback, error parsing): A2aAgentTransportTest, LocalAgentTransportTest
- Processor annotation resolution: CoordinatorProcessorTest
- E2E path: CoordinatorWebSocketIntegrationTest
- Timeout/cancellation tests for fan-out and transport stalls
- Reflection failure-path tests in local transport (field rename, method signature change)
- Lifecycle shutdown tests for journaling executor
- High-concurrency stress tests for dependency graph and journaling throughput
- Add timeout-aware parallel execution (orTimeout) and use try-with-resources in DefaultAgentFleet.parallel().
- Replace reflective local dispatch with an explicit in-process dispatch SPI.
- Define and enforce coordination ID lifecycle semantics (per-request vs thread scope), with deterministic clearing.
- Split CoordinatorProcessor into collaborators shared with AgentProcessor for runtime/pipeline/protocol wiring.
- Introduce a richer payload model for agent calls (Map<String, Object> or typed argument object) with backward compatibility.
| Aspect | Rating |
|---|---|
| API / Architecture | A- |
| Modern Java idioms | A |
| Test coverage | B+ |
| Documentation (README) | A |
| Transport layer | B- |
| Processor complexity | B |
| Production hardening (timeouts, retries, cleanup) | B- |
| Maintainability trajectory | B (→ A after lifecycle/timeout/reflection fixes) |