@ildunari
Created May 23, 2026 00:23
Two-Worker Agent Harness — spec sheet for solving multi-message burst + tone/competence tradeoff in chat agents

Two-Worker Agent Harness -- Spec Sheet

A chat-agent harness that splits conversation from execution to solve the multi-message-burst problem and the tone/competence tradeoff.


Problem Statement

Single-loop chat agents fail two ways:

  1. Burst interruption -- user sends multiple messages in quick succession; agent replies to the first before the user is done, or queues messages and replies in confusing order.
  2. Tone vs competence tradeoff -- a system prompt tuned for warm conversational flow degrades tool-calling and code rigor; a prompt tuned for agentic execution writes stilted prose.

A single model+prompt can only compromise. Splitting the roles lets each be uncompromised.


Roles

Front Worker -- Conversation Layer

  • Model: small, fast, conversational (Haiku-class, ~1B-8B fine-tunes acceptable)
  • System prompt: tone, persona, memory, social judgment, triage rules
  • Job: read incoming messages, decide what to do with them, talk to the user; never execute tools beyond chat APIs
  • Owns: user-facing voice, continuity, mood reading, status updates, reply chunking, typing indicators, reactions

Back Worker -- Execution Layer

  • Model: large, capable (Sonnet/Opus-class or specialized code/agent model)
  • System prompt: terse, tool-disciplined, no persona, no chat
  • Job: execute task specs handed down by front, return structured results
  • Owns: tool calls, code, multi-step plans, verification

Asymmetry is the point. Front never executes domain work. Back never speaks directly to user without a translation pass.


Triggers

Front fires on every incoming message. It then decides between:

| Trigger Class | Front Action | Back Action |
|---|---|---|
| Trivial chat (hey, ty, ok) | Reply or react | None |
| New task | Acknowledge, spawn back with task spec | Start fresh task |
| Append (additive context to in-flight task) | Acknowledge | Receive context update, expand scope |
| Redirect (correction, "actually do X instead") | Acknowledge correction | Abandon current path, pivot |
| Branch (parallel tangent) | Acknowledge | Spawn second back worker |
| Status check (how's it going?) | Report from real signals only | None |
| Cancel (nvm, stop) | Confirm | Hard-cancel |

Hard rule: front may only report back's state from real signals (started, progress, completed, failed). Never infer.
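The trigger table can be read as a lookup from a classified message to a pair of actions. The sketch below is illustrative only: the `Trigger` enum, the action strings, and `status_report` are hypothetical names, and the classification itself (the front model's job) is assumed to have already happened.

```python
from enum import Enum
from typing import Optional, Tuple


class Trigger(Enum):
    TRIVIAL = "trivial"
    NEW_TASK = "new_task"
    APPEND = "append"
    REDIRECT = "redirect"
    BRANCH = "branch"
    STATUS = "status"
    CANCEL = "cancel"


# (front action, back command or None), mirroring the trigger table.
# The action names are placeholders, not part of the spec.
DISPATCH = {
    Trigger.TRIVIAL: ("reply_or_react", None),
    Trigger.NEW_TASK: ("acknowledge", "spawn"),
    Trigger.APPEND: ("acknowledge", "append"),
    Trigger.REDIRECT: ("acknowledge_correction", "redirect"),
    Trigger.BRANCH: ("acknowledge", "branch"),
    Trigger.STATUS: ("report_from_signals", None),
    Trigger.CANCEL: ("confirm", "cancel"),
}


def plan(trigger: Trigger) -> Tuple[str, Optional[str]]:
    """Return the (front action, back command) pair for a classified message."""
    return DISPATCH[trigger]


def status_report(last_event: Optional[str]) -> str:
    """Report only what the bus actually emitted; never infer progress."""
    if last_event is None:
        return "no signal from back yet"
    return f"last signal: {last_event}"
```

Keeping the mapping in data rather than in prompt text makes the hard rule checkable: a status check has no back command, and the report is built only from the last received event.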


Input-Side Mechanics

Debounce

Two-tier:

  1. Primary: typing-stopped signal (MTProto updateUserTyping if userbot present)
  2. Fallback: blind timer, default 2.5s, reset on every incoming message

Soft timeout: when a burst goes silent mid-thread (~30-60s), front asks "still on the first thing or moving on?" rather than guessing.
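The fallback tier (blind timer, reset on every message) might look like the following asyncio sketch. `Debouncer`, `push`, and `on_flush` are hypothetical names; a typing-stopped signal, where available, could call the same reset path. The soft timeout is not modeled here.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional


class Debouncer:
    """Buffers messages and flushes them as one turn after `window` quiet seconds."""

    def __init__(self, on_flush: Callable[[List[str]], Awaitable[None]],
                 window: float = 2.5):
        self.on_flush = on_flush
        self.window = window
        self._buffer: List[str] = []
        self._timer: Optional[asyncio.Task] = None

    def push(self, message: str) -> None:
        """Buffer a message and restart the timer (call from inside the event loop)."""
        self._buffer.append(message)
        if self._timer is not None:
            self._timer.cancel()  # every new message resets the quiet period
        self._timer = asyncio.get_running_loop().create_task(self._wait_then_flush())

    async def _wait_then_flush(self) -> None:
        await asyncio.sleep(self.window)
        burst, self._buffer = self._buffer, []
        self._timer = None
        await self.on_flush(burst)


async def debounce_demo() -> List[List[str]]:
    turns: List[List[str]] = []

    async def on_flush(burst: List[str]) -> None:
        turns.append(burst)

    debouncer = Debouncer(on_flush, window=0.05)  # short window for the demo only
    debouncer.push("first")
    await asyncio.sleep(0.01)
    debouncer.push("second")  # arrives inside the window, so the timer restarts
    await asyncio.sleep(0.2)
    return turns
```

Running `asyncio.run(debounce_demo())` yields a single turn containing both messages, which is the behavior the burst fix depends on.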

Attachment grouping

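Telegram can deliver a photo and its accompanying text as separate updates (~100-300ms apart), and each item of an album arrives as its own update sharing a media_group_id. Debounce window absorbs these. Treat any incoming-message cluster within the window as one logical turn.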

Reply-quote resolution

Always inspect reply_to_message. If the user quote-replied to an older message, that's the referent -- not the most recent context.
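A minimal sketch of that rule, assuming messages are plain dicts shaped like Bot API `Message` objects and that `history` is a hypothetical list of earlier messages, oldest first:

```python
from typing import List, Optional


def resolve_referent(message: dict, history: List[dict]) -> Optional[dict]:
    """Pick the message the user is talking about."""
    quoted = message.get("reply_to_message")
    if quoted is not None:
        return quoted  # an explicit quote-reply wins over recency
    return history[-1] if history else None
```

The recency fallback is only a default; the front worker can still override it from conversational context.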


Output-Side Mechanics

Reply latency model

| Reply type | Target latency | Typing indicator |
|---|---|---|
| Trivial / react | 1-4s | optional |
| Substantive chat | 8-20s | yes |
| Tool-backed answer | 15-60s | yes, with periodic refresh |
| Long task | minutes, with check-ins | refresh + status messages |

Jitter the latency. Same delay every time is its own tell.
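One simple way to jitter is to sample uniformly inside each target range. This is a sketch under that assumption; the spec does not prescribe a distribution, and the dictionary keys are hypothetical labels for the rows of the latency table.

```python
import random

# Target latency ranges in seconds, taken from the latency table.
LATENCY_RANGES = {
    "trivial": (1.0, 4.0),
    "substantive": (8.0, 20.0),
    "tool_backed": (15.0, 60.0),
}


def reply_delay(reply_type: str, rng: random.Random) -> float:
    """Sample a delay inside the target range so no two replies wait the same time."""
    low, high = LATENCY_RANGES[reply_type]
    return rng.uniform(low, high)
```

Passing the `random.Random` instance in keeps the function deterministic under test while staying random in production.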

Reply chunking

Long replies split into 2-3 messages of natural length, ~5-10s typing pause between sends. Punchline last. No 400-word walls.
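A rough sketch of the splitting step, assuming sentence boundaries are good enough cut points. The `min_words` threshold and the regex are assumptions, not part of the spec, and the typing pause between sends is left to the caller.

```python
import re
from typing import List


def chunk_reply(text: str, max_chunks: int = 3, min_words: int = 40) -> List[str]:
    """Split a long reply into up to `max_chunks` roughly equal, sentence-aligned parts."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    total_words = len(text.split())
    # Short replies stay whole; longer ones get one chunk per `min_words` words.
    n = min(max_chunks, max(1, total_words // min_words), len(sentences))
    target = total_words / n
    chunks: List[str] = []
    current: List[str] = []
    count = 0
    for sentence in sentences:
        current.append(sentence)
        count += len(sentence.split())
        # Close a chunk once it reaches the target, but leave the tail for the last one.
        if count >= target and len(chunks) < n - 1:
            chunks.append(" ".join(current))
            current, count = [], 0
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Because the final chunk collects whatever remains, the closing sentence of the reply always lands in the last message sent.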

Reactions over replies

Front uses emoji reactions for lightweight acks. 👍, ❤️, 👀 on the user's message often beats a prose ack.

Async resumption

If user returns after a long gap and back finished work during the gap, front reintroduces context before delivering: "oh hey, finished the X thing about an hour ago -- want me to walk through it or just send the file?"


Inter-Worker Protocol

The bus between front and back is the actual engineering problem. Spec:

Task object

{
  "task_id": "uuid",
  "spec": "natural-language task description",
  "user_context_slice": "redacted/relevant subset of conversation",
  "state": "pending|running|paused|completed|failed|cancelled",
  "progress_events": [{"ts": "...", "kind": "tool_call|thought|milestone", "data": "..."}],
  "result": null,
  "cancellation_token": "..."
}
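The task object could be held in code as a small dataclass with guarded state changes. The field names follow the task object; the `TRANSITIONS` map is an assumption, since the spec lists the states but not which moves between them are legal.

```python
import uuid
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

# Assumed legal transitions; the spec only enumerates the states.
TRANSITIONS = {
    "pending": {"running", "cancelled"},
    "running": {"paused", "completed", "failed", "cancelled"},
    "paused": {"running", "cancelled"},
    "completed": set(),
    "failed": set(),
    "cancelled": set(),
}


@dataclass
class Task:
    spec: str
    user_context_slice: str = ""
    task_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    state: str = "pending"
    progress_events: List[Dict[str, Any]] = field(default_factory=list)
    result: Optional[Any] = None
    cancellation_token: str = field(default_factory=lambda: str(uuid.uuid4()))

    def transition(self, new_state: str) -> None:
        """Move to `new_state`, rejecting moves the transition map does not allow."""
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state
```

Rejecting illegal transitions at the object level is one way to keep front from ever seeing a task that is, say, both completed and running.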

Events front subscribes to

  • task.started
  • task.progress (every meaningful step, throttled to about one event per 5s for UI)
  • task.completed (with result payload)
  • task.failed (with error class + human-readable summary)
  • task.cancelled
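The progress throttle can be sketched as a small gate that forwards at most one event per interval. `ProgressThrottle` is a hypothetical name; timestamps are passed in explicitly so the logic is easy to test.

```python
from typing import Optional


class ProgressThrottle:
    """Forwards at most one progress event per `min_interval` seconds."""

    def __init__(self, min_interval: float = 5.0):
        self.min_interval = min_interval
        self._last_sent: Optional[float] = None

    def should_forward(self, now: float) -> bool:
        """Return True if an event arriving at time `now` should reach the UI."""
        if self._last_sent is None or now - self._last_sent >= self.min_interval:
            self._last_sent = now
            return True
        return False
```

Dropped events are still worth recording in the task's progress_events list; the throttle only limits what the user-facing layer sees.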

Commands front can issue

  • spawn(spec, context_slice) -> returns task_id
  • append(task_id, additional_context) -- back receives context update mid-execution
  • redirect(task_id, new_spec) -- soft pivot, back checkpoints current state and switches
  • cancel(task_id) -- hard stop with cancellation token
  • branch(task_id, fork_spec) -> returns new_task_id -- spawn parallel back

Checkpointing

Back must support mid-task checkpoint on append and redirect. Without checkpoints, append/redirect lose work and degenerate to cancel+restart.
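An in-memory stand-in for the bus shows how the five commands and the checkpoint rule fit together. `TaskBus` and its dict layout are hypothetical; a real back worker would snapshot tool state and artifacts, not just the spec and context.

```python
import uuid
from typing import Dict


class TaskBus:
    """In-memory sketch of the front/back command surface."""

    def __init__(self) -> None:
        self.tasks: Dict[str, dict] = {}

    def spawn(self, spec: str, context_slice: str) -> str:
        task_id = str(uuid.uuid4())
        self.tasks[task_id] = {"spec": spec, "context": [context_slice],
                               "state": "running", "checkpoints": []}
        return task_id

    def _checkpoint(self, task_id: str) -> None:
        # Snapshot before mutating so append/redirect never lose prior work.
        task = self.tasks[task_id]
        task["checkpoints"].append({"spec": task["spec"],
                                    "context": list(task["context"])})

    def append(self, task_id: str, additional_context: str) -> None:
        self._checkpoint(task_id)
        self.tasks[task_id]["context"].append(additional_context)

    def redirect(self, task_id: str, new_spec: str) -> None:
        self._checkpoint(task_id)
        self.tasks[task_id]["spec"] = new_spec

    def cancel(self, task_id: str) -> None:
        self.tasks[task_id]["state"] = "cancelled"

    def branch(self, task_id: str, fork_spec: str) -> str:
        new_id = self.spawn(fork_spec, "")
        # The fork starts from a copy of the parent's context.
        self.tasks[new_id]["context"] = list(self.tasks[task_id]["context"])
        return new_id
```

Replaying the three-message burst from the demo scenario against this sketch: `spawn` the log-grep task, `redirect` it to the auth service, then `branch` a deploy check. The original spec survives in the first checkpoint, so the redirect does not degenerate into cancel plus restart.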


Result Delivery

Two delivery paths:

| Result class | Path |
|---|---|
| Short factual answer | Back -> tone-shim mini-pass -> user (low latency) |
| Long / complex deliverable | Back -> front (full rewrite + framing) -> user (better voice, +latency) |
| File / artifact | Back direct attach, front sends one-line framing message |

Tone shim = small model with prompt "rephrase this in the agent's established voice, no content changes". Cheap, fast, voice-consistent.
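Choosing a delivery path could be as simple as the function below. The result dict keys (`artifact_path`, `text`) and the `short_limit` cutoff are assumptions for illustration; the spec does not define where "short" ends.

```python
def delivery_path(result: dict, short_limit: int = 280) -> str:
    """Route a back-worker result to one of the delivery paths."""
    if result.get("artifact_path"):
        return "attach_with_framing"  # file goes out directly, front adds one line
    if len(result.get("text", "")) <= short_limit:
        return "tone_shim"            # cheap rephrase, low latency
    return "front_rewrite"            # full rewrite and framing by front
```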


State

Per-chat state (front-owned)

  • conversation history (full)
  • user mood/energy read (lightweight)
  • open task threads (active task_ids + last-known state)
  • pending follow-ups (things front promised to circle back on)
  • typing indicator state machine
  • last delivery timestamps (for latency calibration)

Per-task state (back-owned, front-readable)

  • task object (above)
  • tool-call history
  • intermediate artifacts
  • checkpoint snapshots

Cross-session persistence

  • Conversation history -> long-term store (mem0 / equivalent)
  • Open tasks at session end -> resume on next session start
  • User preferences learned by front -> durable memory
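For the open-task part of persistence, a minimal sketch is to write unfinished tasks to disk at session end and read them back at start. Storing them as a JSON file is an assumption made here for brevity; the spec names mem0 (or equivalent) only for conversation history and leaves the task store open.

```python
import json
from pathlib import Path
from typing import List

OPEN_STATES = {"pending", "running", "paused"}


def save_open_tasks(tasks: List[dict], path: Path) -> None:
    """Persist only tasks that still need work."""
    open_tasks = [t for t in tasks if t["state"] in OPEN_STATES]
    path.write_text(json.dumps(open_tasks))


def load_open_tasks(path: Path) -> List[dict]:
    """Return tasks to resume, or an empty list on first run."""
    if not path.exists():
        return []
    return json.loads(path.read_text())
```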

Group-Chat Behavior

Different ruleset. Front must:

  • Distinguish addressed-to-bot vs overheard
  • Not interrupt human-to-human exchanges
  • Debounce per user, not per chat
  • Suppress its own typing indicator unless about to send
  • Respect threading / topics if the platform supports them
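Two of these rules lend themselves to a short sketch: deciding whether a message is addressed to the bot, and keying the debounce per user. Messages are assumed to be dicts shaped like Bot API `Message` objects; the substring mention check is a simplification (a fuller version would inspect message entities), and both function names are hypothetical.

```python
from typing import Tuple


def is_addressed_to_bot(message: dict, bot_username: str, bot_user_id: int) -> bool:
    """Heuristic: private chat, reply to the bot, or an @mention of the bot."""
    if message["chat"]["type"] == "private":
        return True
    reply = message.get("reply_to_message")
    if reply and reply.get("from", {}).get("id") == bot_user_id:
        return True
    return f"@{bot_username}".lower() in message.get("text", "").lower()


def debounce_key(message: dict) -> Tuple[int, int]:
    """One debounce timer per (chat, user) so two people typing do not merge."""
    return (message["chat"]["id"], message["from"]["id"])
```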

Failure Modes & Mitigations

| Failure | Mitigation |
|---|---|
| Back crashes / times out | Front translates ("hit a snag, retrying -- ~30s"); never exposes the raw error |
| Front hallucinates back progress | Strict signal-only reporting; front cannot fabricate state |
| Append/redirect loses work | Mandatory checkpoint support in back |
| Tone shim drifts from voice | Periodic eval against reference transcripts; retrain shim if drift > threshold |
| Latency stacking (front+shim+back) | Trivial-path bypass: front replies alone when no back work needed |
| Cost blow-up on simple turns | Front-only path for ack/chat; back fires only on real work |
| User sends correction mid-back-call | Detected by front classifier, issued as redirect not new task |
| Long silence after back delivery | Front follows up softly after threshold; not pushy |

Telegram-Specific APIs

| Capability | API | Notes |
|---|---|---|
| Send message | sendMessage | Bot API |
| Send typing indicator | sendChatAction(typing) | Outbound only, expires ~5s |
| Reaction | setMessageReaction | Bot API, limited emoji set |
| Edit existing | editMessageText | For progress updates without notification |
| Reply quote | reply_parameters | Send-method param; replaces reply_to_message_id as of Bot API 7.0 |
| Receive user typing | updateUserTyping | MTProto only, requires userbot -- Bot API does not expose it |
| Forwarded message metadata | forward_origin | Inspect for provenance |
| Attachments | sendDocument / sendPhoto / etc. | Document cap 50MB; prefer plik upload for larger |

The user-typing signal is the single API capability that determines whether you go MTProto. Worth it for the seamless feel. Otherwise blind debounce works.
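Because the outbound typing indicator expires after about 5s, tool-backed replies need it re-sent while back is working. A sketch of that refresh loop, with the actual sendChatAction call injected as `send_action` (a hypothetical callable, so no HTTP client is assumed) and a 4s refresh chosen here to stay under the expiry:

```python
import asyncio
from typing import Any, Awaitable, Callable, Tuple


async def keep_typing(send_action: Callable[[], Awaitable[None]],
                      work: Awaitable[Any], refresh: float = 4.0) -> Any:
    """Await `work` while re-sending the typing action every `refresh` seconds."""

    async def pulse() -> None:
        while True:
            await send_action()
            await asyncio.sleep(refresh)

    pulser = asyncio.get_running_loop().create_task(pulse())
    try:
        return await work
    finally:
        pulser.cancel()  # stop the indicator as soon as the work ends or fails


async def typing_demo() -> Tuple[str, int]:
    sent = 0

    async def fake_send() -> None:
        nonlocal sent
        sent += 1

    async def fake_work() -> str:
        await asyncio.sleep(0.05)
        return "done"

    result = await keep_typing(fake_send, fake_work(), refresh=0.02)
    return result, sent
```

The `finally` clause matters for the group-chat rule as well: the indicator never outlives the work it is covering for.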


Presence Sidecar (MTProto Listener)

The bot stays a bot. Add a second process logged in as the user account that listens for presence/typing updates and forwards them to the front worker's event bus. Bot API + user-session sidecar = full coverage without compromising either.

Architecture

plugins/
  telegram-bot/              existing Bot API plugin (messaging)
  telegram-presence/         new sidecar
    telethon_listener.py     MTProto session, raw event handler
    bridge.ts                forward to Hermes event bus
    schema.ts                typed event payloads

Implementation -- Telethon (recommended first cut)

from telethon import TelegramClient, events
from telethon.tl.types import UpdateUserTyping, UpdateChatUserTyping

# API_ID, API_HASH, and emit_presence_event are defined elsewhere.
client = TelegramClient("hermes_telegram_user_session", API_ID, API_HASH)

@client.on(events.Raw)
async def handler(update):
    # events.Raw delivers every update type; keep only the typing updates.
    if isinstance(update, (UpdateUserTyping, UpdateChatUserTyping)):
        # POST to Hermes event bus
        await emit_presence_event(update)

client.start()  # first run prompts interactively for phone, code, and 2FA
client.run_until_disconnected()

Requires API_ID / API_HASH from my.telegram.org. First login needs phone + code + 2FA.

Event payload

type TelegramTypingEvent = {
  source: "telegram";
  kind: "typing" | "recording_voice" | "uploading_photo"
    | "uploading_document" | "choosing_sticker" | "game";
  chatId: string;
  chatTitle?: string;
  userId?: string;
  userName?: string;
  expiresAt: string; // typing updates are short-lived (~6s)
  receivedAt: string;
};

Front-worker consumption -- ephemeral UI state

  • Typing event received -> mark chat "user typing"; reset debounce timer
  • Refresh within window -> extend
  • No refresh past expiresAt -> clear flag, fire debounce
  • Recording voice / uploading photo -> same gate, different UI hint
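The consumption rules above amount to a per-chat expiry map. In this sketch `TypingGate` is a hypothetical name, and expiry times are plain floats (epoch seconds) rather than the string `expiresAt` of the payload, which would need parsing first.

```python
from typing import Dict, List


class TypingGate:
    """Tracks which chats currently have a live 'user typing' flag."""

    def __init__(self) -> None:
        self._expires: Dict[str, float] = {}  # chat_id -> expiry time

    def on_typing_event(self, chat_id: str, expires_at: float) -> None:
        """New or refreshed typing event: set or extend the window."""
        self._expires[chat_id] = expires_at

    def is_typing(self, chat_id: str, now: float) -> bool:
        expiry = self._expires.get(chat_id)
        return expiry is not None and now < expiry

    def expired(self, now: float) -> List[str]:
        """Clear lapsed flags and return their chats so debounce can fire."""
        done = [chat for chat, expiry in self._expires.items() if now >= expiry]
        for chat in done:
            del self._expires[chat]
        return done
```

A periodic tick that calls `expired(now)` and fires the debounce for each returned chat covers the "no refresh past expiresAt" case.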

Why Telethon over TDLib first

| | Telethon | TDLib |
|---|---|---|
| Speed to working prototype | afternoon | days |
| Install footprint | pip install telethon | native tdjson build + bindings |
| Language | Python (matches Hermes sidecars) | C++ core, awkward bindings |
| Coverage | typing/presence is enough | richer client model, overkill here |
| Deployment | one process, one session file | heavier |

Start Telethon. Move to TDLib only if you need richer client behavior beyond typing/presence (read receipts, online state, full message history, secret chats, etc.).

Security caveats

  • The sidecar is logged in as your user account, not a bot. Different threat model.
  • Read-only by default. Hard-disable any send_message capability in the sidecar; messaging stays on the bot.
  • Session file is "this machine is logged into Telegram." Protect it: chmod 600 ~/.hermes/secrets/telegram-user.session
  • Don't commit it. Don't sync it. Restore via re-login if the machine is lost.
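The sidecar could enforce the file-mode caveat itself at startup. A small sketch, assuming a POSIX filesystem (mode bits behave differently on Windows); the function names are hypothetical.

```python
import os
import stat
from pathlib import Path


def ensure_private(path: Path) -> None:
    """Restrict the session file to owner read/write (mode 600)."""
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)


def is_private(path: Path) -> bool:
    """True if the permission bits are exactly 600."""
    return stat.S_IMODE(path.stat().st_mode) == 0o600
```

Refusing to start when `is_private` is false is a stricter option than silently fixing the mode, since a loose mode may mean the file was already copied.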

What this does not give you

  • The bot still cannot receive typing via Bot API. The sidecar is the only path.
  • Group chats: presence updates fire per-user, can be noisy. Filter at the sidecar.

Build Order

  1. Single-worker baseline with debounce + cancel -- proves the input-side fix alone
  2. Task object + event bus -- even with single-worker, formalize state
  3. Add tone shim -- small model rewrites raw outputs in voice
  4. Split into two workers -- front classifier + back executor, simple protocol
  5. Add append/redirect/branch -- checkpointing in back
  6. Telethon presence sidecar -- replaces blind debounce with real typing signal
  7. Group-chat ruleset -- separate front prompt
  8. Cross-session persistence -- durable open-task resume

Steps 1-5 are bot-only and stand alone. Step 6 is the sidecar add-on once the rest is solid.

Ship each step in isolation. Don't jump to step 5 before step 1 works.


Killer Demo

User fires three messages in five seconds:

  1. "can you pull yesterday's logs and grep for errors"
  2. "actually scratch that -- just the auth service"
  3. "and also btw can you check if the deploy went through"

Correct system behavior:

  • Debounce holds reply through all three
  • Front classifies: msg 2 = redirect on task A, msg 3 = branch (new parallel task B)
  • Back A pivots to auth-service logs only
  • Back B spawns for deploy check
  • Front replies once: "on it -- auth-service logs coming up, checking the deploy too"
  • Both tasks deliver when ready, framed naturally

If a harness does this, it's working.
