A chat-agent harness that splits conversation from execution to solve the multi-message-burst problem and the tone/competence tradeoff.
Single-loop chat agents fail two ways:
- Burst interruption -- user sends multiple messages in quick succession; agent replies to the first before the user is done, or queues messages and replies in confusing order.
- Tone vs competence tradeoff -- a system prompt tuned for warm conversational flow degrades tool-calling and code rigor; a prompt tuned for agentic execution writes stilted prose.
A single model+prompt can only compromise. Splitting the roles lets each be uncompromised.
Front worker:
- Model: small, fast, conversational (Haiku-class, ~1B-8B fine-tunes acceptable)
- System prompt: tone, persona, memory, social judgment, triage rules
- Job: read incoming messages, decide what to do with them, talk to the user; never execute tools beyond chat APIs
- Owns: user-facing voice, continuity, mood reading, status updates, reply chunking, typing indicators, reactions
Back worker:
- Model: large, capable (Sonnet/Opus-class or specialized code/agent model)
- System prompt: terse, tool-disciplined, no persona, no chat
- Job: execute task specs handed down by front, return structured results
- Owns: tool calls, code, multi-step plans, verification
Asymmetry is the point. Front never executes domain work. Back never speaks directly to user without a translation pass.
Front fires on every incoming message. It then decides between:
| Trigger Class | Front Action | Back Action |
|---|---|---|
| Trivial chat (hey, ty, ok) | Reply or react | None |
| New task | Acknowledge, spawn back with task spec | Start fresh task |
| Append (additive context to in-flight task) | Acknowledge | Receive context update, expand scope |
| Redirect (correction, "actually do X instead") | Acknowledge correction | Abandon current path, pivot |
| Branch (parallel tangent) | Acknowledge | Spawn second back worker |
| Status check (how's it going?) | Report from real signals only | None |
| Cancel (nvm, stop) | Confirm | Hard-cancel |
Hard rule: front may only report back's state from real signals (started, progress, completed, failed). Never infer.
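A minimal sketch of the routing table and the signal-only rule encoded as data, so neither can drift in prompt text. All names here (`Trigger`, `Route`, `ROUTES`, `status_reply`) are hypothetical, and the classifier that assigns a trigger class to a message is assumed to exist elsewhere.

```python
from enum import Enum
from typing import NamedTuple, Optional


class Trigger(Enum):
    TRIVIAL = "trivial"
    NEW_TASK = "new_task"
    APPEND = "append"
    REDIRECT = "redirect"
    BRANCH = "branch"
    STATUS = "status"
    CANCEL = "cancel"


class Route(NamedTuple):
    front_action: str        # what front says or does
    back_op: Optional[str]   # operation sent to back, or None for front-only turns


# One entry per row of the trigger table.
ROUTES = {
    Trigger.TRIVIAL: Route("reply_or_react", None),
    Trigger.NEW_TASK: Route("acknowledge", "spawn"),
    Trigger.APPEND: Route("acknowledge", "append"),
    Trigger.REDIRECT: Route("acknowledge_correction", "redirect"),
    Trigger.BRANCH: Route("acknowledge", "branch"),
    Trigger.STATUS: Route("report_from_signals", None),
    Trigger.CANCEL: Route("confirm", "cancel"),
}

# The only states front is allowed to repeat to the user.
REAL_SIGNALS = {"started", "progress", "completed", "failed"}


def status_reply(last_signal: Optional[str]) -> str:
    """Report only what back actually emitted; anything else is 'no update'."""
    if last_signal not in REAL_SIGNALS:
        return "no update from the worker yet"
    return f"worker state: {last_signal}"
```

Trivial chat and status checks carry no back operation, which is also what keeps simple turns cheap.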
Debounce is two-tier:
- Primary: typing-stopped signal (MTProto `updateUserTyping` if userbot present)
- Fallback: blind timer, default 2.5s, reset on every incoming message
Soft timeout when burst goes silent mid-thread (~30-60s): front offers a "still on the first thing or moving on?" rather than guessing.
Telegram delivers photo + caption as separate updates (~100-300ms apart). Debounce window absorbs these. Treat any incoming-message cluster within window as one logical turn.
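The blind-timer fallback can be sketched with `asyncio` as below. This is an illustration only: `Debouncer`, `feed`, and `on_turn` are made-up names, messages are plain strings, and the typing-stopped signal is not modeled. Every incoming message restarts the timer; when the window passes quietly, the buffered cluster is handed over as one logical turn.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional


class Debouncer:
    """Collect messages until the sender has been quiet for `window` seconds."""

    def __init__(
        self,
        on_turn: Callable[[List[str]], Awaitable[None]],
        window: float = 2.5,
    ) -> None:
        self.on_turn = on_turn
        self.window = window
        self._buffer: List[str] = []
        self._timer: Optional[asyncio.Task] = None

    def feed(self, text: str) -> None:
        """Buffer a message and restart the quiet-period timer."""
        self._buffer.append(text)
        if self._timer is not None:
            self._timer.cancel()
        self._timer = asyncio.create_task(self._fire_after_quiet())

    async def _fire_after_quiet(self) -> None:
        await asyncio.sleep(self.window)
        # Swap the buffer out first so messages that arrive while
        # on_turn runs start a fresh turn instead of being lost.
        turn, self._buffer = self._buffer, []
        self._timer = None
        await self.on_turn(turn)
```

`feed` must be called from inside a running event loop, since it schedules a task. A photo and its caption arriving ~100-300ms apart land in the same buffer and reach `on_turn` together.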
Always inspect reply_to_message. If the user quote-replied to an older message, that's the referent -- not the most recent context.
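A small sketch of that lookup, assuming the message arrives as the Bot API's JSON `Message` object already parsed into a dict. `referent_text` is a hypothetical helper name.

```python
from typing import Optional


def referent_text(message: dict) -> Optional[str]:
    """Return the text the user quote-replied to, or None if this is not a reply."""
    quoted = message.get("reply_to_message")
    if not quoted:
        return None
    # Media messages carry their text in `caption` rather than `text`.
    return quoted.get("text") or quoted.get("caption")
```

When this returns a value, front should resolve "this", "that one", and similar references against it before falling back to recent context.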
| Reply type | Target latency | Typing indicator |
|---|---|---|
| Trivial / react | 1-4s | optional |
| Substantive chat | 8-20s | yes |
| Tool-backed answer | 15-60s | yes, with periodic refresh |
| Long task | minutes, with check-ins | refresh + status messages |
Jitter the latency. Same delay every time is its own tell.
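One simple way to jitter is to draw each delay uniformly from the target range for the reply type. The sketch below assumes that approach; `LATENCY_RANGES` and `pick_delay` are illustrative names, and long tasks are left out because they are paced by check-ins rather than by a single delay.

```python
import random
from typing import Optional

# Target latency ranges in seconds, taken from the table above.
LATENCY_RANGES = {
    "trivial": (1.0, 4.0),
    "substantive": (8.0, 20.0),
    "tool_backed": (15.0, 60.0),
}


def pick_delay(reply_type: str, rng: Optional[random.Random] = None) -> float:
    """Draw a delay uniformly from the target range for this reply type."""
    low, high = LATENCY_RANGES[reply_type]
    return (rng or random).uniform(low, high)
```

Passing a seeded `random.Random` makes the delays reproducible in tests while production calls stay unpredictable.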
Long replies split into 2-3 messages of natural length, ~5-10s typing pause between sends. Punchline last. No 400-word walls.
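A rough sketch of the chunking step, assuming sentence boundaries are good enough split points. `chunk_reply` and its `target_len` default are assumptions, not part of the spec. Sentences are packed in order, so whatever the reply ends on stays in the last message.

```python
import re
from typing import List


def chunk_reply(text: str, max_chunks: int = 3, target_len: int = 280) -> List[str]:
    """Split a reply into at most `max_chunks` messages on sentence boundaries."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    for sentence in sentences:
        fits = bool(chunks) and len(chunks[-1]) + 1 + len(sentence) <= target_len
        at_cap = len(chunks) == max_chunks
        if chunks and (fits or at_cap):
            # Keep filling the current message; once the cap is reached,
            # the last message absorbs everything that remains.
            chunks[-1] += " " + sentence
        else:
            chunks.append(sentence)
    return chunks
```

The sender would then sleep a jittered 5-10s, with the typing indicator on, between consecutive chunks.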
Front uses emoji reactions for lightweight acks. 👍, ❤️, 👀 on the user's message often beats a prose ack.
If user returns after a long gap and back finished work during the gap, front reintroduces context before delivering: "oh hey, finished the X thing about an hour ago -- want me to walk through it or just send the file?"
The bus between front and back is the actual engineering problem. Spec:

    {
      "task_id": "uuid",
      "spec": "natural-language task description",
      "user_context_slice": "redacted/relevant subset of conversation",
      "state": "pending|running|paused|completed|failed|cancelled",
      "progress_events": [{"ts": "...", "kind": "tool_call|thought|milestone", "data": "..."}],
      "result": null,
      "cancellation_token": "..."
    }

Events (back -> front):
- `task.started`
- `task.progress` (every meaningful step, throttled to ~1/5s for UI)
- `task.completed` (with result payload)
- `task.failed` (with error class + human-readable summary)
- `task.cancelled`

Operations (front -> back):
- `spawn(spec, context_slice)` -> returns task_id
- `append(task_id, additional_context)` -- back receives context update mid-execution
- `redirect(task_id, new_spec)` -- soft pivot, back checkpoints current state and switches
- `cancel(task_id)` -- hard stop with cancellation token
- `branch(task_id, fork_spec)` -> returns new_task_id -- spawn parallel back

Back must support mid-task checkpoint on append and redirect. Without checkpoints, append/redirect lose work and degenerate to cancel+restart.
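To make the operation semantics concrete, here is an in-memory sketch. It assumes the bus is just a dict of task records; a real bus would sit on a queue or database and drive an actual worker. `TaskBus`, `Task`, and the checkpoint layout are hypothetical. The point it illustrates is that `append` and `redirect` snapshot state before mutating it, so a pivot never discards what back already had.

```python
import uuid
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Task:
    spec: str
    context: List[str]
    task_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    state: str = "pending"
    parent_id: Optional[str] = None
    checkpoints: List[Dict[str, Any]] = field(default_factory=list)


class TaskBus:
    """In-memory stand-in for the front/back bus; not a real transport."""

    def __init__(self) -> None:
        self.tasks: Dict[str, Task] = {}

    def _register(self, task: Task) -> str:
        self.tasks[task.task_id] = task
        return task.task_id

    def _checkpoint(self, task: Task) -> None:
        # Snapshot what back was working from before it changes.
        task.checkpoints.append({"spec": task.spec, "context": list(task.context)})

    def spawn(self, spec: str, context_slice: str) -> str:
        return self._register(Task(spec=spec, context=[context_slice]))

    def append(self, task_id: str, additional_context: str) -> None:
        task = self.tasks[task_id]
        self._checkpoint(task)
        task.context.append(additional_context)

    def redirect(self, task_id: str, new_spec: str) -> None:
        task = self.tasks[task_id]
        self._checkpoint(task)
        task.spec = new_spec

    def cancel(self, task_id: str) -> None:
        self.tasks[task_id].state = "cancelled"

    def branch(self, task_id: str, fork_spec: str) -> str:
        parent = self.tasks[task_id]
        fork = Task(spec=fork_spec, context=list(parent.context), parent_id=task_id)
        return self._register(fork)


bus = TaskBus()
task_a = bus.spawn("pull yesterday's logs and grep for errors", "user is debugging")
bus.redirect(task_a, "grep yesterday's auth-service logs for errors")
task_b = bus.branch(task_a, "check whether the deploy went through")
```

After these three calls task A holds the new spec plus one checkpoint of the old one, and task B runs in parallel with A recorded as its parent.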
Two delivery paths:
| Result class | Path |
|---|---|
| Short factual answer | Back -> tone-shim mini-pass -> user (low latency) |
| Long / complex deliverable | Back -> front (full rewrite + framing) -> user (better voice, +latency) |
| File / artifact | Back direct attach, front sends one-line framing message |
Tone shim = small model with prompt "rephrase this in the agent's established voice, no content changes". Cheap, fast, voice-consistent.
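The path choice can be sketched as a small router. The length cutoff (`short_limit`) is an assumption for illustration; the table does not define where "short" ends. `ResultClass`, `PATHS`, and both function names are hypothetical too.

```python
from enum import Enum
from typing import List


class ResultClass(Enum):
    SHORT_ANSWER = "short_answer"
    LONG_DELIVERABLE = "long_deliverable"
    ARTIFACT = "artifact"


# Hops each result takes on its way to the user.
PATHS = {
    ResultClass.SHORT_ANSWER: ["back", "tone_shim", "user"],
    ResultClass.LONG_DELIVERABLE: ["back", "front", "user"],
    # Back attaches the file directly; front adds a one-line framing message.
    ResultClass.ARTIFACT: ["back", "user"],
}


def classify_result(text: str, has_file: bool, short_limit: int = 400) -> ResultClass:
    """Pick a result class; `short_limit` (characters) is an assumed cutoff."""
    if has_file:
        return ResultClass.ARTIFACT
    if len(text) <= short_limit:
        return ResultClass.SHORT_ANSWER
    return ResultClass.LONG_DELIVERABLE


def delivery_path(text: str, has_file: bool = False) -> List[str]:
    return PATHS[classify_result(text, has_file)]
```

In practice back could set the result class itself in the result payload, which avoids guessing from length.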
Front holds:
- conversation history (full)
- user mood/energy read (lightweight)
- open task threads (active task_ids + last-known state)
- pending follow-ups (things front promised to circle back on)
- typing indicator state machine
- last delivery timestamps (for latency calibration)
Back holds:
- task object (above)
- tool-call history
- intermediate artifacts
- checkpoint snapshots
Persisted across sessions:
- Conversation history -> long-term store (mem0 / equivalent)
- Open tasks at session end -> resume on next session start
- User preferences learned by front -> durable memory
Group chats need a different ruleset. Front must:
- Distinguish addressed-to-bot vs overheard
- Not interrupt human-to-human exchanges
- Per-user debounce (not per-chat)
- Suppress own typing indicator unless about to send
- Respect threading / topics if platform supports
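Two of these rules reduce to small checks on the incoming message. The sketch below assumes Bot API `Message` JSON parsed into a dict. `is_addressed_to_bot` and `debounce_key` are hypothetical helpers, and a mention or a reply to the bot is only a first-pass signal, not full social judgment.

```python
from typing import Tuple


def is_addressed_to_bot(message: dict, bot_username: str, bot_user_id: int) -> bool:
    """True if the message @-mentions the bot or replies to one of its messages."""
    text = (message.get("text") or "").lower()
    if f"@{bot_username.lower()}" in text:
        return True
    quoted = message.get("reply_to_message") or {}
    return quoted.get("from", {}).get("id") == bot_user_id


def debounce_key(message: dict) -> Tuple[int, int]:
    """Key debounce state per (chat, user) so one talker never resets another's timer."""
    return (message["chat"]["id"], message["from"]["id"])
```

Messages that fail the first check are "overheard": front can keep them as context but should not reply.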
| Failure | Mitigation |
|---|---|
| Back crashes / times out | Front translates ("hit a snag, retrying -- ~30s"); never exposes the raw error |
| Front hallucinates back progress | Strict signal-only reporting; front cannot fabricate state |
| Append/redirect loses work | Mandatory checkpoint support in back |
| Tone shim drifts from voice | Periodic eval against reference transcripts; retrain shim if drift > threshold |
| Latency stacking (front+shim+back) | Trivial-path bypass: front replies alone when no back work needed |
| Cost blow-up on simple turns | Front-only path for ack/chat; back fires only on real work |
| User sends correction mid-back-call | Detected by front classifier, issued as redirect not new task |
| Long silence after back delivery | Front follows up softly after threshold; not pushy |
| Capability | API | Notes |
|---|---|---|
| Send message | `sendMessage` | Bot API |
| Send typing indicator | `sendChatAction(typing)` | Outbound only, expires ~5s |
| Reaction | `setMessageReaction` | Bot API, limited emoji set |
| Edit existing | `editMessageText` | For progress updates without notification |
| Reply quote | `reply_to_message_id` param | Superseded by `reply_parameters` in Bot API 7.0 |
| Receive user typing | `updateUserTyping` | MTProto only, requires userbot -- Bot API does not expose it |
| Forwarded message metadata | `forward_origin` | Inspect for provenance |
| Attachments | `sendDocument` / `sendPhoto` / etc. | Document cap 50MB; prefer plik upload for larger |
The user-typing signal is the single API capability that determines whether you go MTProto. Worth it for the seamless feel. Otherwise blind debounce works.
The bot stays a bot. Add a second process logged in as the user account that listens for presence/typing updates and forwards them to the front worker's event bus. Bot API + user-session sidecar = full coverage without compromising either.

    plugins/
      telegram-bot/              existing Bot API plugin (messaging)
      telegram-presence/         new sidecar
        telethon_listener.py     MTProto session, raw event handler
        bridge.ts                forward to Hermes event bus
        schema.ts                typed event payloads

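`telethon_listener.py`:

    import os

    from telethon import TelegramClient, events
    from telethon.tl.types import UpdateUserTyping, UpdateChatUserTyping

    # Credentials from my.telegram.org; the env var names are placeholders.
    API_ID = int(os.environ["TELEGRAM_API_ID"])
    API_HASH = os.environ["TELEGRAM_API_HASH"]

    client = TelegramClient("hermes_telegram_user_session", API_ID, API_HASH)

    async def emit_presence_event(update):
        # Placeholder: POST the update to the Hermes event bus.
        print(update)

    @client.on(events.Raw)
    async def handler(update):
        # Forward typing updates only; ignore everything else.
        if isinstance(update, (UpdateUserTyping, UpdateChatUserTyping)):
            await emit_presence_event(update)

    client.start()  # first run prompts for phone, login code, and 2FA password
    client.run_until_disconnected()

Requires `API_ID` / `API_HASH` from my.telegram.org. First login needs phone + code + 2FA.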

    type TelegramTypingEvent = {
      source: "telegram";
      kind: "typing" | "recording_voice" | "uploading_photo"
        | "uploading_document" | "choosing_sticker" | "game";
      chatId: string;
      chatTitle?: string;
      userId?: string;
      userName?: string;
      expiresAt: string;   // typing updates are short-lived (~6s)
      receivedAt: string;
    };

- Typing event received -> mark chat "user typing"; reset debounce timer
- Refresh within window -> extend
- No refresh past `expiresAt` -> clear flag, fire debounce
- Recording voice / uploading photo -> same gate, different UI hint
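The gate rules above amount to a tiny state machine. The sketch below is one possible shape, assuming timestamps are epoch seconds and that something polls the gate periodically. `TypingGate`, `on_typing_event`, and `poll` are illustrative names, not part of the Hermes event bus.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TypingGate:
    """Tracks whether a chat's user is still typing, based on event expiry."""

    expires_at: Optional[float] = None  # expiry of the latest typing event

    @property
    def user_typing(self) -> bool:
        return self.expires_at is not None

    def on_typing_event(self, expires_at: float) -> None:
        # A refresh inside the window just pushes the expiry further out.
        self.expires_at = max(self.expires_at or 0.0, expires_at)

    def poll(self, now: float) -> bool:
        """Return True exactly once, when typing has lapsed and debounce should fire."""
        if self.expires_at is not None and now >= self.expires_at:
            self.expires_at = None
            return True
        return False
```

Voice-recording and upload events can feed the same gate; only the UI hint differs.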
| | Telethon | TDLib |
|---|---|---|
| Speed to working prototype | afternoon | days |
| Install footprint | `pip install telethon` | native tdjson build + bindings |
| Language | Python (matches Hermes sidecars) | C++ core, awkward bindings |
| Coverage | typing/presence is enough | richer client model, overkill here |
| Deployment | one process, one session file | heavier |
Start Telethon. Move to TDLib only if you need richer client behavior beyond typing/presence (read receipts, online state, full message history, secret chats, etc.).
- The sidecar is logged in as your user account, not a bot. Different threat model.
- Read-only by default. Hard-disable any `send_message` capability in the sidecar; messaging stays on the bot.
- Session file is "this machine is logged into Telegram." Protect it: `chmod 600 ~/.hermes/secrets/telegram-user.session`
- Don't commit it. Don't sync it. Restore via re-login if the machine is lost.
- The bot still cannot receive typing via Bot API. The sidecar is the only path.
- Group chats: presence updates fire per-user, can be noisy. Filter at the sidecar.
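Two of these guardrails are easy to enforce in code at sidecar startup. The sketch below is a suggestion rather than part of the design: `session_file_is_private` and `should_forward` are hypothetical helpers, the allowlist of chat ids is an assumed config value, and the permission check only makes sense on POSIX systems.

```python
import os
import stat
from typing import Iterable


def session_file_is_private(path: str) -> bool:
    """True when group and others have no access (mode 600 or stricter)."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return (mode & 0o077) == 0


def should_forward(chat_id: int, allowed_chat_ids: Iterable[int]) -> bool:
    """Drop presence updates from chats the front worker does not serve."""
    return chat_id in set(allowed_chat_ids)
```

Refusing to start when the first check fails turns the `chmod 600` advice into something the sidecar enforces rather than something operators have to remember.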
1. Single-worker baseline with debounce + cancel -- proves the input-side fix alone
2. Task object + event bus -- even with single-worker, formalize state
3. Add tone shim -- small model rewrites raw outputs in voice
4. Split into two workers -- front classifier + back executor, simple protocol
5. Add append/redirect/branch -- checkpointing in back
6. Telethon presence sidecar -- replaces blind debounce with real typing signal
7. Group-chat ruleset -- separate front prompt
8. Cross-session persistence -- durable open-task resume
Steps 1-5 are bot-only and stand alone. Step 6 is the sidecar add-on once the rest is solid.
Ship each step in isolation. Don't jump to step 5 before step 1 works.
User fires three messages in five seconds:
1. "can you pull yesterday's logs and grep for errors"
2. "actually scratch that -- just the auth service"
3. "and also btw can you check if the deploy went through"
Correct system behavior:
- Debounce holds reply through all three
- Front classifies: msg 2 = redirect on task A, msg 3 = branch (new parallel task B)
- Back A pivots to auth-service logs only
- Back B spawns for deploy check
- Front replies once: "on it -- auth-service logs coming up, checking the deploy too"
- Both tasks deliver when ready, framed naturally
If a harness does this, it's working.