Two-Worker Agent Harness -- Spec Sheet

A chat-agent harness that splits conversation from execution to solve the multi-message-burst problem and the tone/competence tradeoff.

Problem Statement

Single-loop chat agents fail two ways:

Burst interruption -- user sends multiple messages in quick succession; agent replies to the first before the user is done, or queues messages and replies in confusing order.
Tone vs competence tradeoff -- a system prompt tuned for warm conversational flow degrades tool-calling and code rigor; a prompt tuned for agentic execution writes stilted prose.

A single model+prompt can only compromise. Splitting the roles lets each be uncompromised.

Roles

Front Worker -- Conversation Layer

Model: small, fast, conversational (Haiku-class, ~1B-8B fine-tunes acceptable)
System prompt: tone, persona, memory, social judgment, triage rules
Job: read incoming messages, decide what to do with them, talk to the user; never execute tools beyond chat APIs
Owns: user-facing voice, continuity, mood reading, status updates, reply chunking, typing indicators, reactions

Back Worker -- Execution Layer

Model: large, capable (Sonnet/Opus-class or specialized code/agent model)
System prompt: terse, tool-disciplined, no persona, no chat
Job: execute task specs handed down by front, return structured results
Owns: tool calls, code, multi-step plans, verification

Asymmetry is the point. Front never executes domain work. Back never speaks directly to user without a translation pass.

Triggers

Front fires on every incoming message. It then decides between:

Trigger Class	Front Action	Back Action
Trivial chat (`hey`, `ty`, `ok`)	Reply or react	None
New task	Acknowledge, spawn back with task spec	Start fresh task
Append (additive context to in-flight task)	Acknowledge	Receive context update, expand scope
Redirect (correction, "actually do X instead")	Acknowledge correction	Abandon current path, pivot
Branch (parallel tangent)	Acknowledge	Spawn second back worker
Status check (`how's it going?`)	Report from real signals only	None
Cancel (`nvm`, `stop`)	Confirm	Hard-cancel

Hard rule: front may only report back's state from real signals (started, progress, completed, failed). Never infer.
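
The table and the hard rule can be sketched as a lookup plus a guard. This is a minimal illustration, not the harness's actual classifier: `Trigger`, `Plan`, `DISPATCH`, `plan_for`, and `status_report` are hypothetical names, and the action strings are placeholders.

```python
from enum import Enum
from typing import NamedTuple, Optional


class Trigger(Enum):
    TRIVIAL = "trivial"
    NEW_TASK = "new_task"
    APPEND = "append"
    REDIRECT = "redirect"
    BRANCH = "branch"
    STATUS = "status"
    CANCEL = "cancel"


class Plan(NamedTuple):
    front_action: str
    back_command: Optional[str]  # None means back is not involved


# One row per trigger class, mirroring the table above.
DISPATCH = {
    Trigger.TRIVIAL: Plan("reply_or_react", None),
    Trigger.NEW_TASK: Plan("acknowledge", "spawn"),
    Trigger.APPEND: Plan("acknowledge", "append"),
    Trigger.REDIRECT: Plan("acknowledge_correction", "redirect"),
    Trigger.BRANCH: Plan("acknowledge", "branch"),
    Trigger.STATUS: Plan("report_from_signals", None),
    Trigger.CANCEL: Plan("confirm", "cancel"),
}

# The only back states front may report; anything else would be inference.
REAL_SIGNALS = {"started", "progress", "completed", "failed"}


def plan_for(trigger: Trigger) -> Plan:
    """Look up what front and back should do for a classified message."""
    return DISPATCH[trigger]


def status_report(last_signal: Optional[str]) -> str:
    """Report back's state only when a real signal has been received."""
    if last_signal in REAL_SIGNALS:
        return f"task {last_signal}"
    return "no update yet"
```

Keeping the dispatch as data makes the "no back action" rows explicit, and the guard in `status_report` refuses anything that is not one of the four named signals.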

Input-Side Mechanics

Debounce

Two-tier:

Primary: typing-stopped signal (MTProto updateUserTyping if userbot present)
Fallback: blind timer, default 2.5s, reset on every incoming message

Soft timeout when a burst goes silent mid-thread (~30-60s): front asks "still on the first thing or moving on?" rather than guessing.
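
A minimal asyncio sketch of the fallback tier, assuming messages are plain strings and that a typing signal, when available, simply restarts the same timer. `Debouncer`, `push`, and `touch` are hypothetical names; the soft timeout is not modelled here.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional


class Debouncer:
    """Buffers messages and flushes them after `window` seconds of quiet."""

    def __init__(self, window: float, on_flush: Callable[[List[str]], Awaitable[None]]):
        self.window = window
        self.on_flush = on_flush
        self._buffer: List[str] = []
        self._timer: Optional[asyncio.Task] = None

    def push(self, message: str) -> None:
        """Buffer an incoming message and restart the quiet timer."""
        self._buffer.append(message)
        self.touch()

    def touch(self) -> None:
        """Restart the timer without buffering, e.g. on a typing signal."""
        if self._timer is not None:
            self._timer.cancel()
        self._timer = asyncio.create_task(self._fire_after_quiet())

    async def _fire_after_quiet(self) -> None:
        await asyncio.sleep(self.window)
        self._timer = None  # a push during the flush starts a fresh timer
        batch, self._buffer = self._buffer, []
        if batch:
            await self.on_flush(batch)
```

With the default from above this would be constructed as `Debouncer(2.5, handle_turn)`, where `handle_turn` is whatever coroutine hands the batch to front.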

Attachment grouping

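Telegram delivers each item of an album (media group) as its own update, ~100-300ms apart; only the caption attached to a single photo arrives in the same update as the photo. The debounce window absorbs these clusters. Treat any incoming-message cluster within the window as one logical turn.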
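
The clustering rule can be shown offline on timestamped messages. A sketch, assuming each message is reduced to an `(arrival_seconds, payload)` pair; `group_into_turns` is a hypothetical helper, and the 2.5s default is the blind-timer default from the Debounce section.

```python
from typing import List, Tuple

Message = Tuple[float, str]  # (arrival time in seconds, payload)


def group_into_turns(messages: List[Message], window: float = 2.5) -> List[List[str]]:
    """Group messages so that any gap <= window stays within one logical turn."""
    turns: List[List[str]] = []
    last_ts = None
    for ts, payload in sorted(messages, key=lambda m: m[0]):
        if last_ts is not None and ts - last_ts <= window:
            turns[-1].append(payload)  # still inside the burst
        else:
            turns.append([payload])  # quiet gap: a new turn starts
        last_ts = ts
    return turns
```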

Reply-quote resolution

Always inspect reply_to_message. If the user quote-replied to an older message, that's the referent -- not the most recent context.
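
A sketch of the lookup, assuming messages are Bot API `Message` objects already parsed into dicts and that `history` holds earlier messages oldest-first. `resolve_referent` is a hypothetical helper.

```python
from typing import Any, Dict, List, Optional


def resolve_referent(
    message: Dict[str, Any], history: List[Dict[str, Any]]
) -> Optional[Dict[str, Any]]:
    """Return the message the user is talking about.

    A quote-reply wins; otherwise fall back to the most recent earlier message.
    """
    quoted = message.get("reply_to_message")
    if quoted is not None:
        return quoted
    return history[-1] if history else None
```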

Output-Side Mechanics

Reply latency model

Reply type	Target latency	Typing indicator
Trivial / react	1-4s	optional
Substantive chat	8-20s	yes
Tool-backed answer	15-60s	yes, with periodic refresh
Long task	minutes, with check-ins	refresh + status messages

Jitter the latency. Same delay every time is its own tell.
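
One way to jitter, sketched under the assumption that a uniform draw across the target range is good enough; the spec does not prescribe a distribution. `LATENCY_RANGES` and `pick_delay` are hypothetical names, and long tasks are left out because their latency is task-driven.

```python
import random
from typing import Optional

# Target ranges in seconds, taken from the table above.
LATENCY_RANGES = {
    "trivial": (1.0, 4.0),
    "substantive": (8.0, 20.0),
    "tool_backed": (15.0, 60.0),
}


def pick_delay(reply_type: str, rng: Optional[random.Random] = None) -> float:
    """Draw a delay from the target range so consecutive replies do not match."""
    low, high = LATENCY_RANGES[reply_type]
    return (rng or random).uniform(low, high)
```

Passing a seeded `random.Random` keeps the draw reproducible in tests; production code would omit it.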

Reply chunking

Long replies split into 2-3 messages of natural length, ~5-10s typing pause between sends. Punchline last. No 400-word walls.
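
A sketch of sentence-boundary chunking. The 200-character threshold and the even-length target are illustrative assumptions, and `chunk_reply` is a hypothetical helper; order is preserved, so whatever closes the reply stays in the last chunk.

```python
import re
from typing import List


def chunk_reply(text: str, max_chunks: int = 3, min_chars: int = 200) -> List[str]:
    """Split a long reply at sentence boundaries into at most max_chunks messages."""
    text = text.strip()
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    if len(text) <= min_chars or len(sentences) < 2:
        return [text]  # short replies go out as a single message
    n_chunks = min(max_chunks, len(sentences))
    target = len(text) / n_chunks  # aim for chunks of similar length
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        # Close the current chunk once it reaches the target, unless only
        # the final slot remains.
        if current and len(current) >= target and len(chunks) < n_chunks - 1:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    chunks.append(current)
    return chunks
```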

Reactions over replies

Front uses emoji reactions for lightweight acks. 👍, ❤️, 👀 on the user's message often beats a prose ack.

Async resumption

If user returns after a long gap and back finished work during the gap, front reintroduces context before delivering: "oh hey, finished the X thing about an hour ago -- want me to walk through it or just send the file?"

Inter-Worker Protocol

The bus between front and back is the actual engineering problem. Spec:

Task object

{
 "task_id": "uuid",
 "spec": "natural-language task description",
 "user_context_slice": "redacted/relevant subset of conversation",
 "state": "pending|running|paused|completed|failed|cancelled",
 "progress_events": [{"ts": "...", "kind": "tool_call|thought|milestone", "data": "..."}],
 "result": null,
 "cancellation_token": "..."
}
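
The same object as a typed structure, sketched in Python. Treating `completed`, `failed`, and `cancelled` as final is an assumption of this sketch; the spec lists the states but not the allowed transitions. `Task` and `transition` are hypothetical names.

```python
import uuid
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

STATES = {"pending", "running", "paused", "completed", "failed", "cancelled"}
TERMINAL = {"completed", "failed", "cancelled"}


@dataclass
class Task:
    spec: str
    user_context_slice: str = ""
    task_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    state: str = "pending"
    progress_events: List[Dict[str, Any]] = field(default_factory=list)
    result: Optional[Any] = None
    cancellation_token: str = field(default_factory=lambda: uuid.uuid4().hex)

    def transition(self, new_state: str) -> None:
        """Move to new_state; terminal states are final (sketch assumption)."""
        if new_state not in STATES:
            raise ValueError(f"unknown state: {new_state}")
        if self.state in TERMINAL:
            raise ValueError(f"task already {self.state}")
        self.state = new_state
```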

Events front subscribes to

task.started
task.progress (every meaningful step, throttled to roughly one event per 5s for UI)
task.completed (with result payload)
task.failed (with error class + human-readable summary)
task.cancelled
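
A sketch of the `task.progress` throttle, reading the rate as roughly one forwarded event per five seconds. `ProgressThrottle` is a hypothetical helper; time is passed in explicitly so the behaviour is easy to test.

```python
from typing import Optional


class ProgressThrottle:
    """Lets through at most one progress event per `interval` seconds."""

    def __init__(self, interval: float = 5.0) -> None:
        self.interval = interval
        self._last_sent: Optional[float] = None

    def should_forward(self, now: float) -> bool:
        """Return True if an event arriving at `now` should reach the UI."""
        if self._last_sent is None or now - self._last_sent >= self.interval:
            self._last_sent = now
            return True
        return False
```

Only `task.progress` would pass through this; started, completed, failed, and cancelled events are rare and should always be delivered.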

Commands front can issue

spawn(spec, context_slice) -> returns task_id
append(task_id, additional_context) -- back receives context update mid-execution
redirect(task_id, new_spec) -- soft pivot, back checkpoints current state and switches
cancel(task_id) -- hard stop with cancellation token
branch(task_id, fork_spec) -> returns new_task_id -- spawn parallel back
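
An in-memory stand-in for the command surface, to make the semantics concrete. It assumes tasks are plain dicts and does no real execution; `TaskBus` and the `previous_specs` field are hypothetical.

```python
import uuid
from typing import Dict


class TaskBus:
    """In-memory stand-in for the front-to-back command channel."""

    def __init__(self) -> None:
        self.tasks: Dict[str, dict] = {}

    def spawn(self, spec: str, context_slice: str) -> str:
        task_id = str(uuid.uuid4())
        self.tasks[task_id] = {
            "spec": spec,
            "context": [context_slice],
            "state": "pending",
            "previous_specs": [],
        }
        return task_id

    def append(self, task_id: str, additional_context: str) -> None:
        # Additive: the spec stays, the context grows.
        self.tasks[task_id]["context"].append(additional_context)

    def redirect(self, task_id: str, new_spec: str) -> None:
        task = self.tasks[task_id]
        # Soft pivot: remember what was being done, then switch.
        task["previous_specs"].append(task["spec"])
        task["spec"] = new_spec

    def cancel(self, task_id: str) -> None:
        self.tasks[task_id]["state"] = "cancelled"

    def branch(self, task_id: str, fork_spec: str) -> str:
        parent = self.tasks[task_id]
        # The fork inherits the parent's context but runs as its own task.
        return self.spawn(fork_spec, " ".join(parent["context"]))
```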

Checkpointing

Back must support mid-task checkpoint on append and redirect. Without checkpoints, append/redirect lose work and degenerate to cancel+restart.
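
What a checkpoint might hold, sketched under the assumption that back's progress can be reduced to a list of completed steps plus a dict of artifacts. `Checkpoint` and `BackTask` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Checkpoint:
    spec: str
    completed_steps: List[str]
    artifacts: Dict[str, Any]


@dataclass
class BackTask:
    spec: str
    completed_steps: List[str] = field(default_factory=list)
    artifacts: Dict[str, Any] = field(default_factory=dict)
    checkpoints: List[Checkpoint] = field(default_factory=list)

    def checkpoint(self) -> Checkpoint:
        """Snapshot current progress; copies so later work cannot mutate it."""
        snap = Checkpoint(self.spec, list(self.completed_steps), dict(self.artifacts))
        self.checkpoints.append(snap)
        return snap

    def redirect(self, new_spec: str) -> None:
        """Pivot to a new spec while keeping the artifacts gathered so far."""
        self.checkpoint()
        self.spec = new_spec
        self.completed_steps = []  # the plan restarts; artifacts carry over
```

The point of the snapshot is that a redirect from "all logs" to "auth-service logs only" can reuse the already-fetched logs instead of starting over.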

Result Delivery

Two delivery paths:

Result class	Path
Short factual answer	Back -> tone-shim mini-pass -> user (low latency)
Long / complex deliverable	Back -> front (full rewrite + framing) -> user (better voice, +latency)
File / artifact	Back direct attach, front sends one-line framing message

Tone shim = small model with prompt "rephrase this in the agent's established voice, no content changes". Cheap, fast, voice-consistent.
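
The routing decision can be sketched as a small function. The 280-character cutoff for "short" is purely an illustrative assumption, as are the names `DeliveryPath` and `choose_path`; a real harness would more likely let back tag the result class.

```python
from enum import Enum


class DeliveryPath(Enum):
    TONE_SHIM = "back -> tone shim -> user"
    FRONT_REWRITE = "back -> front -> user"
    DIRECT_ATTACH = "back attaches file, front frames"


def choose_path(result_text: str, has_artifact: bool = False, short_limit: int = 280) -> DeliveryPath:
    """Pick a delivery path; the 280-character cutoff is an assumption."""
    if has_artifact:
        return DeliveryPath.DIRECT_ATTACH
    if len(result_text) <= short_limit:
        return DeliveryPath.TONE_SHIM
    return DeliveryPath.FRONT_REWRITE
```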

State

Per-chat state (front-owned)

conversation history (full)
user mood/energy read (lightweight)
open task threads (active task_ids + last-known state)
pending follow-ups (things front promised to circle back on)
typing indicator state machine
last delivery timestamps (for latency calibration)

Per-task state (back-owned, front-readable)

task object (above)
tool-call history
intermediate artifacts
checkpoint snapshots

Cross-session persistence

Conversation history -> long-term store (mem0 / equivalent)
Open tasks at session end -> resume on next session start
User preferences learned by front -> durable memory
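
Open-task resume, sketched with a plain JSON file standing in for the durable store. This deliberately avoids any particular memory library's API; `save_open_tasks` and `load_open_tasks` are hypothetical helpers, and tasks are assumed to be JSON-serializable dicts.

```python
import json
from typing import Dict, List

TERMINAL_STATES = {"completed", "failed", "cancelled"}


def save_open_tasks(path: str, tasks: List[Dict]) -> None:
    """Persist only the tasks that are still in flight at session end."""
    open_tasks = [t for t in tasks if t["state"] not in TERMINAL_STATES]
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(open_tasks, fh)


def load_open_tasks(path: str) -> List[Dict]:
    """Return tasks to resume on session start; an absent file means none."""
    try:
        with open(path, encoding="utf-8") as fh:
            return json.load(fh)
    except FileNotFoundError:
        return []
```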

Group-Chat Behavior

Different ruleset. Front must:

Distinguish addressed-to-bot vs overheard
Not interrupt human-to-human exchanges
Per-user debounce (not per-chat)
Suppress own typing indicator unless about to send
Respect threading / topics if platform supports
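
Two of these rules reduce to small checks on the incoming Bot API `Message` (parsed into a dict). A sketch: mention-or-reply is only one possible definition of "addressed", and `is_addressed_to_bot` and `debounce_key` are hypothetical helpers.

```python
from typing import Any, Dict, Tuple


def is_addressed_to_bot(message: Dict[str, Any], bot_username: str, bot_user_id: int) -> bool:
    """True if the message @-mentions the bot or replies to one of its messages."""
    text = message.get("text") or ""
    if f"@{bot_username}".lower() in text.lower():
        return True
    replied = message.get("reply_to_message") or {}
    return (replied.get("from") or {}).get("id") == bot_user_id


def debounce_key(message: Dict[str, Any]) -> Tuple[int, int]:
    """Key debounce timers by (chat, user) so one talker never resets another's."""
    return (message["chat"]["id"], message["from"]["id"])
```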

Failure Modes & Mitigations

Failure	Mitigation
Back crashes / times out	Front translates ("hit a snag, retrying -- ~30s"); never exposes the raw error
Front hallucinates back progress	Strict signal-only reporting; front cannot fabricate state
Append/redirect loses work	Mandatory checkpoint support in back
Tone shim drifts from voice	Periodic eval against reference transcripts; retrain shim if drift > threshold
Latency stacking (front+shim+back)	Trivial-path bypass: front replies alone when no back work needed
Cost blow-up on simple turns	Front-only path for ack/chat; back fires only on real work
User sends correction mid-back-call	Detected by front classifier, issued as `redirect` not new task
Long silence after back delivery	Front follows up softly after threshold; not pushy

Telegram-Specific APIs

Capability	API	Notes
Send message	`sendMessage`	Bot API
Send typing indicator	`sendChatAction(typing)`	Outbound only, expires ~5s
Reaction	`setMessageReaction`	Bot API, limited emoji set
Edit existing	`editMessageText`	For progress updates without notification
Reply quote	`reply_to_message_id` param	Bot API; superseded by `reply_parameters` since Bot API 7.0
Receive user typing	`updateUserTyping`	MTProto only, requires userbot -- Bot API does not expose
Forwarded message metadata	`forward_origin`	Inspect for provenance
Attachments	`sendDocument` / `sendPhoto` / etc.	Document cap 50MB; prefer plik upload for larger

The user-typing signal is the single API capability that determines whether you go MTProto. Worth it for the seamless feel. Otherwise blind debounce works.

Presence Sidecar (MTProto Listener)

The bot stays a bot. Add a second process logged in as the user account that listens for presence/typing updates and forwards them to the front worker's event bus. Bot API + user-session sidecar = full coverage without compromising either.

Architecture

plugins/
  telegram-bot/              existing Bot API plugin (messaging)
  telegram-presence/         new sidecar
    telethon_listener.py     MTProto session, raw event handler
    bridge.ts                forward to Hermes event bus
    schema.ts                typed event payloads

Implementation -- Telethon (recommended first cut)

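import os

from telethon import TelegramClient, events
from telethon.tl.types import UpdateUserTyping, UpdateChatUserTyping

# Credentials from my.telegram.org; the environment variable names are illustrative.
API_ID = int(os.environ["TELEGRAM_API_ID"])
API_HASH = os.environ["TELEGRAM_API_HASH"]

client = TelegramClient("hermes_telegram_user_session", API_ID, API_HASH)


async def emit_presence_event(update):
    # Placeholder: replace with a POST to the Hermes event bus.
    print(update.to_dict())


@client.on(events.Raw)
async def handler(update):
    # Forward only typing updates; every other raw update is ignored.
    if isinstance(update, (UpdateUserTyping, UpdateChatUserTyping)):
        await emit_presence_event(update)


client.start()  # interactive on first run: phone, login code, 2FA password
client.run_until_disconnected()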

Requires API_ID / API_HASH from my.telegram.org. First login needs phone number + login code, plus the 2FA password if one is set.

Event payload

type TelegramTypingEvent = {
  source: "telegram";
  kind:
    | "typing"
    | "recording_voice"
    | "uploading_photo"
    | "uploading_document"
    | "choosing_sticker"
    | "game";
  chatId: string;
  chatTitle?: string;
  userId?: string;
  userName?: string;
  expiresAt: string; // typing updates are short-lived (~6s)
  receivedAt: string;
};

Front-worker consumption -- ephemeral UI state

Typing event received -> mark chat "user typing"; reset debounce timer
Refresh within window -> extend
No refresh past expiresAt -> clear flag, fire debounce
Recording voice / uploading photo -> same gate, different UI hint
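
The four rules above amount to a tiny state machine. A sketch, assuming timestamps have already been converted to epoch seconds (the payload carries them as strings) and that front polls the gate on a timer; `TypingGate` is a hypothetical name.

```python
from typing import Optional


class TypingGate:
    """Tracks whether a user is typing, based on expiring typing events."""

    def __init__(self) -> None:
        self.expires_at: Optional[float] = None
        self.kind: Optional[str] = None  # e.g. "typing", "recording_voice"

    def on_event(self, kind: str, expires_at: float) -> None:
        """A new or refreshed event (re)opens the gate until expires_at."""
        self.kind = kind
        self.expires_at = expires_at

    def poll(self, now: float) -> str:
        """Return 'typing', 'fire_debounce' (once, on expiry), or 'idle'."""
        if self.expires_at is None:
            return "idle"
        if now < self.expires_at:
            return "typing"
        # No refresh arrived before expiry: clear the flag and release the turn.
        self.expires_at = None
        self.kind = None
        return "fire_debounce"
```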

Why Telethon over TDLib first

	Telethon	TDLib
Speed to working prototype	afternoon	days
Install footprint	`pip install telethon`	native `tdjson` build + bindings
Language	Python (matches Hermes sidecars)	C++ core, awkward bindings
Coverage	typing/presence is enough	richer client model, overkill here
Deployment	one process, one session file	heavier

Start with Telethon. Move to TDLib only if you need richer client behavior beyond typing/presence (read receipts, online state, full message history, secret chats, etc.).

Security caveats

The sidecar is logged in as your user account, not a bot. Different threat model.
Read-only by default. Hard-disable any send_message capability in the sidecar; messaging stays on the bot.
Session file is "this machine is logged into Telegram." Protect it:

chmod 600 ~/.hermes/secrets/telegram-user.session

Don't commit it. Don't sync it. Restore via re-login if the machine is lost.
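
The sidecar can refuse to start if the session file is readable by anyone else. A POSIX-only sketch using the standard library; `session_file_is_private` is a hypothetical helper.

```python
import os
import stat


def session_file_is_private(path: str) -> bool:
    """True if the file grants no permissions to group or others (e.g. mode 600)."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & (stat.S_IRWXG | stat.S_IRWXO) == 0
```

Calling this before constructing the client and exiting on `False` turns the permission rule into a startup check rather than a convention.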

What this does not give you

The bot still cannot receive typing via Bot API. The sidecar is the only path.
Group chats: presence updates fire per-user, can be noisy. Filter at the sidecar.

Build Order

1. Single-worker baseline with debounce + cancel -- proves the input-side fix alone
2. Task object + event bus -- even with single-worker, formalize state
3. Add tone shim -- small model rewrites raw outputs in voice
4. Split into two workers -- front classifier + back executor, simple protocol
5. Add append/redirect/branch -- checkpointing in back
6. Telethon presence sidecar -- replaces blind debounce with real typing signal
7. Group-chat ruleset -- separate front prompt
8. Cross-session persistence -- durable open-task resume

Steps 1-5 are bot-only and stand alone. Step 6 is the sidecar add-on once the rest is solid.

Ship each step in isolation. Don't jump to step 5 before step 1 works.

Killer Demo

User fires three messages in five seconds:

"can you pull yesterday's logs and grep for errors"
"actually scratch that -- just the auth service"
"and also btw can you check if the deploy went through"

Correct system behavior:

Debounce holds reply through all three
Front classifies: msg 1 = new task A, msg 2 = redirect on task A, msg 3 = branch (new parallel task B)
Back A pivots to auth-service logs only
Back B spawns for deploy check
Front replies once: "on it -- auth-service logs coming up, checking the deploy too"
Both tasks deliver when ready, framed naturally

If a harness does this, it's working.

ildunari/agent-harness-spec.md