@TosinAF
Last active April 2, 2026 11:03

TTS Implementation Plan — Dual Approach

Use Case & Requirements (from team discussion)

Primary use case: Playing Harvey assistant responses out loud. Responses can be long.

Control Priorities (from Anna, corrected)

| Priority | Control | Notes |
| --- | --- | --- |
| P0 | Pause | |
| P0 | Resume | |
| P0 | +/- 15 sec | This IS seeking — `player.currentTime += 15` requires seekable audio. Makes seeking a P0 concern. |
| P1 | Scrub bar (drag to arbitrary position) | Full seek UI. Mechanically the same as +/- 15s, just a UI difference. |
| P2 | Text highlighting synced with audio | Highlight the response text currently being spoken (Joey's idea). |

Critical Implication: Seeking Is P0

+/- 15 sec skip is a seek operation under the hood. On iOS, AVAudioPlayer.currentTime += 15 requires a complete audio file or at minimum a seekable buffer. On Web, audio.currentTime += 15 requires Content-Length or MediaSource. This means any approach that doesn't support seeking is incompatible with P0 requirements. Pure streaming (Approach B from the seeking doc) is ruled out for V1.

Jin's Sync Concern: Two Streaming Paths

Jin raised the question: if the assistant response is being streamed AND we're streaming audio, there's a synchronization problem.

For V1, this doesn't apply. V1 is TTS on complete responses — the full text is already on screen, user taps "play", audio generates and plays. One stream (audio), no sync needed. The growing file approach works fine here.

For the future real-time path (TTS while LLM generates), Jin is right. Two independent streams (text + audio) need coordination so users don't hear words before they appear on screen. This requires explicit segment mapping (text chunk X → audio chunk Y) and is a separate problem from audio seeking. The growing file helps playback but doesn't solve sync. Out of scope for V1, but worth noting as future work.


Decision

Build both delivery approaches as separate API routes so we can test and compare the feel on real devices.

Monorepo: harveyai/app. Order: Backend API → Web client → iOS client.


The Two Routes

Route 1: /api/tts/convert — Full Audio File

POST /api/tts/convert
Body: { text, voice_id?, model? }
Response: complete audio file (audio/mpeg)
Headers: Content-Length, Accept-Ranges: bytes, Content-Type: audio/mpeg

Flow:

  1. Client sends text
  2. Backend calls ElevenLabs, buffers entire response
  3. Backend responds with complete MP3 file + proper headers
  4. Client plays — all controls work (pause, resume, +/- 15s, scrub)
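The flow above can be sketched as a handler. This is a minimal sketch: the ElevenLabs URL path and `xi-api-key` header follow their public HTTP API, but the handler shape, names, and error handling are assumptions to adapt to whatever framework harveyai/app uses.

```typescript
// Assumed ElevenLabs synchronous TTS endpoint (voice id appended per request).
const ELEVENLABS_URL = "https://api.elevenlabs.io/v1/text-to-speech";

// Headers that make the response fully seekable on the client: a known
// Content-Length plus Accept-Ranges lets <audio>/AVAudioPlayer treat it
// as a complete file.
function ttsHeaders(byteLength: number): Record<string, string> {
  return {
    "Content-Type": "audio/mpeg",
    "Content-Length": String(byteLength),
    "Accept-Ranges": "bytes",
  };
}

// Buffer the entire ElevenLabs response, then reply with the complete MP3.
async function convertTts(
  text: string,
  voiceId: string,
  apiKey: string
): Promise<{ headers: Record<string, string>; body: Uint8Array }> {
  const res = await fetch(`${ELEVENLABS_URL}/${voiceId}`, {
    method: "POST",
    headers: { "xi-api-key": apiKey, "Content-Type": "application/json" },
    body: JSON.stringify({ text }),
  });
  if (!res.ok) throw new Error(`TTS generation failed: ${res.status}`);
  const body = new Uint8Array(await res.arrayBuffer()); // full buffering: simple and seekable
  return { headers: ttsHeaders(body.byteLength), body };
}
```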

What works:

  • All P0 controls trivially — currentTime += 15 just works
  • P1 scrub bar works perfectly — full file, known duration
  • Simplest client code on both iOS (AVAudioPlayer) and Web (<audio>)

What doesn't:

  • User waits for full generation (~1s short text, ~3-5s summary, ~30s+ full document)
  • Loading state needed ("Generating audio...")

Backend concerns:

  • Memory: ~500KB per request for long text. Manageable with a cap.
  • Timeout: long text can take 30s+ on ElevenLabs side. Need generous timeout.

Route 2: /api/tts/stream — Progressive Chunks (Growing File)

POST /api/tts/stream
Body: { text, voice_id?, model? }
Response: chunked audio stream

Flow:

  1. Client sends text
  2. Backend calls ElevenLabs HTTP Stream endpoint
  3. Backend receives chunks, ensures MP3 frame alignment
  4. Backend streams assembled chunks to client
  5. Client builds a growing audio buffer
  6. +/- 15s works within received range from the start
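Step 3 (frame alignment) is the subtle part: chunk boundaries from ElevenLabs don't line up with MP3 frames, so the proxy should hold back a trailing partial frame and prepend it to the next chunk. A minimal sketch that only scans for the MP3 sync word (0xFF followed by a byte with its top three bits set) rather than parsing real frame lengths:

```typescript
// Split an incoming chunk at the last MP3 sync word: everything before it
// is complete frames and safe to flush; everything from it onward may be
// a partial frame and should be prepended to the next chunk. (This also
// holds back the final complete frame; flush `hold` as-is at end of stream.)
function splitAtLastSync(buf: Uint8Array): { send: Uint8Array; hold: Uint8Array } {
  for (let i = buf.length - 2; i >= 0; i--) {
    // sync word = 11 set bits: 0xFF then top three bits of the next byte
    if (buf[i] === 0xff && (buf[i + 1] & 0xe0) === 0xe0) {
      return { send: buf.subarray(0, i), hold: buf.subarray(i) };
    }
  }
  return { send: new Uint8Array(0), hold: buf }; // no sync word yet: keep buffering
}
```

A production version would read the frame header's bitrate/sample-rate fields to compute exact frame lengths instead of trusting sync-word scanning, which can false-positive inside audio data.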

What works:

  • P0 controls: pause/resume native, +/- 15s works within received audio
  • If user skips forward beyond what's received, queue the skip and execute when enough audio arrives. Backward skip always works.
  • Fast time-to-first-audio (~1-2s)
  • Natural foundation for future real-time LLM-to-speech

What's harder:

  • P1 scrub bar — duration unknown until complete, progress bar grows over time
  • MP3 frame alignment — must send complete frames

Client concerns (Web):

  • MediaSource API — append chunks, seeking within buffered range works natively
  • Safari MediaSource + MP3 requires Safari 17.1+ (Oct 2023). Decision: do we support Safari < 17.1?
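The MediaSource wiring can be sketched as follows. Browser-only; the route shape comes from this doc, everything else is illustrative — DOM globals are pulled off `globalThis` so the file also typechecks outside a browser, and the Safari question above becomes a single runtime probe.

```typescript
const MP3_MIME = "audio/mpeg";

// Testable guard: accepts anything exposing MediaSource's static check.
function canStreamMp3(ms: { isTypeSupported(mime: string): boolean }): boolean {
  return ms.isTypeSupported(MP3_MIME);
}

// Browser-only: POST the text, append audio chunks as they arrive.
// Seeking within the buffered range then works natively via currentTime.
async function playStream(audio: { src: string }, url: string, payload: unknown): Promise<void> {
  const g = globalThis as any; // MediaSource, URL, fetch exist in the browser
  const mediaSource = new g.MediaSource();
  audio.src = g.URL.createObjectURL(mediaSource);
  await new Promise<void>((ok) =>
    mediaSource.addEventListener("sourceopen", () => ok(), { once: true })
  );
  const sourceBuffer = mediaSource.addSourceBuffer(MP3_MIME);
  const res = await g.fetch(url, { method: "POST", body: JSON.stringify(payload) });
  const reader = res.body.getReader();
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    sourceBuffer.appendBuffer(value); // grows the seekable range
    await new Promise<void>((ok) =>
      sourceBuffer.addEventListener("updateend", () => ok(), { once: true })
    );
  }
  mediaSource.endOfStream(); // duration becomes known; the scrub bar can finalize
}
```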

Client concerns (iOS):

  • AVAudioPlayer with growing Data — reload with updated data as chunks arrive
  • Must track currentTime across reloads to maintain position
  • +/- 15s: check if target position is within received range. If yes, seek. If no (forward skip), wait for more data.

+/- 15 Sec Skip: How It Works Per Route

Since this is P0 and the reason seeking matters at all, it's worth being explicit:

| Scenario | Route 1 (Full File) | Route 2 (Growing File) |
| --- | --- | --- |
| Skip back 15s (always within range) | `currentTime -= 15` | `currentTime -= 15` (always works) |
| Skip forward 15s (within received audio) | `currentTime += 15` | `currentTime += 15` (works) |
| Skip forward 15s (beyond received audio) | Always works (full file) | Queue skip, execute when audio arrives. Show "buffering..." |
| Skip forward 15s (beyond total audio) | Clamp to end | Clamp to end of received range |

The only UX difference: Route 2 may occasionally show "buffering..." on forward skip if the user jumps ahead of what's been received. This only happens early in playback of long text.
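The whole matrix above reduces to one pure function both clients can share (names are hypothetical): `bufferedEnd` is how many seconds of audio have been received — for Route 1 it equals the duration from the start — and `duration` is `NaN` for Route 2 while the total length is still unknown.

```typescript
interface SkipResult {
  time: number;           // position to seek to now
  pending: number | null; // target to retry once more audio arrives ("buffering...")
}

function applySkip(
  current: number,
  delta: number,       // +15 or -15
  bufferedEnd: number, // seconds of audio received so far
  duration: number     // total length; NaN while unknown (Route 2 mid-stream)
): SkipResult {
  const target = Math.max(0, current + delta);
  if (target <= bufferedEnd) {
    // within received audio — covers every backward skip
    return { time: target, pending: null };
  }
  if (!Number.isNaN(duration) && target >= duration) {
    // beyond the total audio: clamp to the end of what exists
    return { time: Math.min(duration, bufferedEnd), pending: null };
  }
  // forward skip beyond received audio: queue it, show "buffering..."
  return { time: current, pending: target };
}
```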


Shared Backend Components

Both routes share:

  • ElevenLabs client — wrapper with auth, retry, error handling
  • Voice/model config — default voice ID, Flash v2.5 model
  • Text validation — length limits (40k char max), sanitization
  • Rate limiting — per-user TTS request limits
  • Logging/metrics — generation time, text length, error rates
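The validation item can be sketched as a small shared helper. The 40k cap comes from this doc; stripping control characters is an assumed sanitization choice, not a confirmed requirement.

```typescript
const MAX_TTS_CHARS = 40_000; // Flash v2.5 limit per this doc

// Shared by both routes: strip control chars (keeping \t, \n, \r),
// reject empty input, and enforce the length cap.
function validateTtsText(
  raw: string
): { ok: true; text: string } | { ok: false; error: string } {
  const text = raw.replace(/[\u0000-\u0008\u000b\u000c\u000e-\u001f]/g, "").trim();
  if (text.length === 0) return { ok: false, error: "empty text" };
  if (text.length > MAX_TTS_CHARS) {
    return { ok: false, error: `text exceeds ${MAX_TTS_CHARS} chars` };
  }
  return { ok: true, text };
}
```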

Comparison Protocol

Once both routes are live, test on real devices:

| Test | What to compare |
| --- | --- |
| Time to first audio | How long until the user hears something? |
| +/- 15s feel | Does skip feel responsive on both? Does Route 2 handle forward-skip-beyond-buffer gracefully? |
| Long response | Full assistant response (~5 min audio). Is Route 1's wait acceptable? Does Route 2 stay smooth? |
| Short response | A paragraph. Does Route 1's wait even matter? |
| Error recovery | Connection drop mid-stream? Route 1 is atomic; Route 2 needs graceful degradation. |

Implementation Order

Phase 1: Backend (both routes)

  1. ElevenLabs client wrapper (shared)
  2. /api/tts/convert endpoint (full file)
  3. /api/tts/stream endpoint (progressive chunks)
  4. Tests for both

Phase 2: Web client

  1. Basic audio player UI (pause, resume, +/- 15 sec)
  2. Wire up /api/tts/convert (simple — <audio src="blob:...">)
  3. Wire up /api/tts/stream (MediaSource API)
  4. Toggle between routes for comparison

Phase 3: iOS client

  1. Basic audio player UI (same controls)
  2. Wire up /api/tts/convert (AVAudioPlayer with Data)
  3. Wire up /api/tts/stream (AVAudioPlayer reload with growing Data)
  4. Toggle for comparison

Future: Real-Time LLM-to-Speech (Out of Scope for V1)

When we want TTS to start while the assistant is still generating:

LLM streams tokens → Backend batches into sentences
    → Each sentence sent to ElevenLabs
    → Audio chunks stream to client per-sentence
    → Client plays sentences sequentially

Jin's sync concern applies here: Two independent streams (text appearing on screen + audio playing) need explicit coordination. This requires segment mapping (text chunk X → audio chunk Y) so audio never gets ahead of text. The growing file approach helps playback/seeking but does NOT solve this sync problem — that needs separate architecture work.

Joey's approach: Split on periods/newlines, send each chunk to TTS independently. Map each chunk to its text segment for sync.
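Joey's splitter and the segment mapping can be sketched together — a sketch, not a spec; the point is that recording character offsets per chunk is what makes Jin's sync and the P2 highlighting possible later.

```typescript
interface TextSegment {
  text: string;  // raw slice sent to TTS (whitespace preserved)
  start: number; // char offset in the original response text
  end: number;   // exclusive — audio for this chunk maps back to [start, end)
}

// Joey's approach: split on sentence enders and newlines, keeping the
// delimiter, and record where each chunk came from for text↔audio mapping.
function splitForTts(text: string): TextSegment[] {
  const segments: TextSegment[] = [];
  let start = 0;
  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (ch === "." || ch === "!" || ch === "?" || ch === "\n") {
      const raw = text.slice(start, i + 1);
      if (raw.trim().length > 0) segments.push({ text: raw, start, end: i + 1 });
      start = i + 1;
    }
  }
  if (start < text.length && text.slice(start).trim().length > 0) {
    segments.push({ text: text.slice(start), start, end: text.length });
  }
  return segments;
}
```

A production splitter would also guard against abbreviations and decimals ("Dr.", "3.5") producing tiny chunks, e.g. by enforcing a minimum chunk length before flushing.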

Route 2's progressive chunks infrastructure transfers directly to this use case.


Open Questions

  1. Safari MediaSource + MP3 — Do we support Safari < 17.1? If yes, Route 2 needs AAC/fMP4 on backend.
  2. Audio caching — Cache generated audio (keyed on text hash + voice + model)? Saves cost on repeat listens.
  3. Max text length — Flash v2.5 supports 40k chars. Do we set a lower app limit?
  4. Voice selection — Single default voice for V1, or expose voice picker?
  5. harveyai/app repo structure — Where does TTS backend code live?
  6. Text-to-audio segment mapping — Even for V1, if we track which text produced which audio chunk, it enables P2 (text highlighting) and future sync. Worth building into the data model early?