@TosinAF
Last active April 2, 2026 11:03

TTS Implementation Plan — Dual Approach

Use Case & Requirements (from team discussion)

Primary use case: Playing Harvey assistant responses out loud. Responses can be long.

Control Priorities (from Anna, corrected)

| Priority | Control | Notes |
| --- | --- | --- |
| P0 | Pause | |
| P0 | Resume | |
| P0 | +/- 15 sec | This IS seeking — `player.currentTime += 15` requires seekable audio. Makes seeking a P0 concern. |
| P1 | Scrub bar (drag to arbitrary position) | Full seek UI. Mechanically the same as +/- 15s, just a UI difference. |
| P2 | Text highlighting synced with audio | Highlight the response text currently being spoken (Joey's idea). |

Critical Implication: Seeking Is P0

+/- 15 sec skip is a seek operation under the hood. On iOS, AVAudioPlayer.currentTime += 15 requires a complete audio file or at minimum a seekable buffer. On Web, audio.currentTime += 15 requires Content-Length or MediaSource. This means any approach that doesn't support seeking is incompatible with P0 requirements. Pure streaming (Approach B from the seeking doc) is ruled out for V1.

Jin's Sync Concern: Two Streaming Paths

Jin raised the question: if the assistant response is being streamed AND we're streaming audio, there's a synchronization problem.

For V1, this doesn't apply. V1 is TTS on complete responses — the full text is already on screen, user taps "play", audio generates and plays. One stream (audio), no sync needed. The growing file approach works fine here.

For the future real-time path (TTS while LLM generates), Jin is right. Two independent streams (text + audio) need coordination so users don't hear words before they appear on screen. This requires explicit segment mapping (text chunk X → audio chunk Y) and is a separate problem from audio seeking. The growing file helps playback but doesn't solve sync. Out of scope for V1, but worth noting as future work.


Decision

Build both delivery approaches as separate API routes so we can test and compare the feel on real devices.

Monorepo: harveyai/app. Order: Backend API → Web client → iOS client.


The Two Routes

Route 1: /api/tts/convert — Full Audio File

POST /api/tts/convert
Body: { text, voice_id?, model? }
Response: complete audio file (audio/mpeg)
Headers: Content-Length, Accept-Ranges: bytes, Content-Type: audio/mpeg

Flow:

  1. Client sends text
  2. Backend calls ElevenLabs, buffers entire response
  3. Backend responds with complete MP3 file + proper headers
  4. Client plays — all controls work (pause, resume, +/- 15s, scrub)
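The flow above can be sketched as a handler. This is a minimal sketch: the ElevenLabs URL path and `xi-api-key` header follow their public HTTP API, but the handler shape, names, and error handling are assumptions to adapt to whatever framework harveyai/app uses.

```typescript
// Assumed ElevenLabs synchronous TTS endpoint (voice id appended per request).
const ELEVENLABS_URL = "https://api.elevenlabs.io/v1/text-to-speech";

// Headers that make the response fully seekable on the client: a known
// Content-Length plus Accept-Ranges lets <audio>/AVAudioPlayer treat it
// as a complete file.
function ttsHeaders(byteLength: number): Record<string, string> {
  return {
    "Content-Type": "audio/mpeg",
    "Content-Length": String(byteLength),
    "Accept-Ranges": "bytes",
  };
}

// Buffer the entire ElevenLabs response, then reply with the complete MP3.
async function convertTts(
  text: string,
  voiceId: string,
  apiKey: string
): Promise<{ headers: Record<string, string>; body: Uint8Array }> {
  const res = await fetch(`${ELEVENLABS_URL}/${voiceId}`, {
    method: "POST",
    headers: { "xi-api-key": apiKey, "Content-Type": "application/json" },
    body: JSON.stringify({ text }),
  });
  if (!res.ok) throw new Error(`TTS generation failed: ${res.status}`);
  const body = new Uint8Array(await res.arrayBuffer()); // full buffering: simple and seekable
  return { headers: ttsHeaders(body.byteLength), body };
}
```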

What works:

  • All P0 controls trivially — currentTime += 15 just works
  • P1 scrub bar works perfectly — full file, known duration
  • Simplest client code on both iOS (AVAudioPlayer) and Web (<audio>)

What doesn't:

  • User waits for full generation (~1s short text, ~3-5s summary, ~30s+ full document)
  • Loading state needed ("Generating audio...")

Backend concerns:

  • Memory: ~500KB per request for long text. Manageable with a cap.
  • Timeout: long text can take 30s+ on ElevenLabs side. Need generous timeout.

Route 2: /api/tts/stream — Progressive Chunks (Growing File)

POST /api/tts/stream
Body: { text, voice_id?, model? }
Response: chunked audio stream

Flow:

  1. Client sends text
  2. Backend calls ElevenLabs HTTP Stream endpoint
  3. Backend receives chunks, ensures MP3 frame alignment
  4. Backend streams assembled chunks to client
  5. Client builds a growing audio buffer
  6. +/- 15s works within received range from the start
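Step 3 (frame alignment) is the subtle part: chunk boundaries from ElevenLabs don't line up with MP3 frames, so the proxy should hold back a trailing partial frame and prepend it to the next chunk. A minimal sketch that only scans for the MP3 sync word (0xFF followed by a byte with its top three bits set) rather than parsing real frame lengths:

```typescript
// Split an incoming chunk at the last MP3 sync word: everything before it
// is complete frames and safe to flush; everything from it onward may be
// a partial frame and should be prepended to the next chunk. (This also
// holds back the final complete frame; flush `hold` as-is at end of stream.)
function splitAtLastSync(buf: Uint8Array): { send: Uint8Array; hold: Uint8Array } {
  for (let i = buf.length - 2; i >= 0; i--) {
    // sync word = 11 set bits: 0xFF then top three bits of the next byte
    if (buf[i] === 0xff && (buf[i + 1] & 0xe0) === 0xe0) {
      return { send: buf.subarray(0, i), hold: buf.subarray(i) };
    }
  }
  return { send: new Uint8Array(0), hold: buf }; // no sync word yet: keep buffering
}
```

A production version would read the frame header's bitrate/sample-rate fields to compute exact frame lengths instead of trusting sync-word scanning, which can false-positive inside audio data.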

What works:

  • P0 controls: pause/resume native, +/- 15s works within received audio
  • If user skips forward beyond what's received, queue the skip and execute when enough audio arrives. Backward skip always works.
  • Fast time-to-first-audio (~1-2s)
  • Natural foundation for future real-time LLM-to-speech

What's harder:

  • P1 scrub bar — duration unknown until complete, progress bar grows over time
  • MP3 frame alignment — must send complete frames

Client concerns (Web):

  • MediaSource API — append chunks, seeking within buffered range works natively
  • Safari MediaSource + MP3 requires Safari 17.1+ (Oct 2023). Decision: do we support Safari < 17.1?
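The MediaSource wiring can be sketched as follows. Browser-only; the route shape comes from this doc, everything else is illustrative — DOM globals are pulled off `globalThis` so the file also typechecks outside a browser, and the Safari question above becomes a single runtime probe.

```typescript
const MP3_MIME = "audio/mpeg";

// Testable guard: accepts anything exposing MediaSource's static check.
function canStreamMp3(ms: { isTypeSupported(mime: string): boolean }): boolean {
  return ms.isTypeSupported(MP3_MIME);
}

// Browser-only: POST the text, append audio chunks as they arrive.
// Seeking within the buffered range then works natively via currentTime.
async function playStream(audio: { src: string }, url: string, payload: unknown): Promise<void> {
  const g = globalThis as any; // MediaSource, URL, fetch exist in the browser
  const mediaSource = new g.MediaSource();
  audio.src = g.URL.createObjectURL(mediaSource);
  await new Promise<void>((ok) =>
    mediaSource.addEventListener("sourceopen", () => ok(), { once: true })
  );
  const sourceBuffer = mediaSource.addSourceBuffer(MP3_MIME);
  const res = await g.fetch(url, { method: "POST", body: JSON.stringify(payload) });
  const reader = res.body.getReader();
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    sourceBuffer.appendBuffer(value); // grows the seekable range
    await new Promise<void>((ok) =>
      sourceBuffer.addEventListener("updateend", () => ok(), { once: true })
    );
  }
  mediaSource.endOfStream(); // duration becomes known; the scrub bar can finalize
}
```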

Client concerns (iOS):

  • AVAudioPlayer with growing Data — reload with updated data as chunks arrive
  • Must track currentTime across reloads to maintain position
  • +/- 15s: check if target position is within received range. If yes, seek. If no (forward skip), wait for more data.

+/- 15 Sec Skip: How It Works Per Route

Since this is P0 and the reason seeking matters at all, it's worth being explicit:

| Scenario | Route 1 (Full File) | Route 2 (Growing File) |
| --- | --- | --- |
| Skip back 15s (always within range) | `currentTime -= 15` | `currentTime -= 15` (always works) |
| Skip forward 15s (within received audio) | `currentTime += 15` | `currentTime += 15` (works) |
| Skip forward 15s (beyond received audio) | Always works (full file) | Queue skip, execute when audio arrives. Show "buffering..." |
| Skip forward 15s (beyond total audio) | Clamp to end | Clamp to end of received range |

The only UX difference: Route 2 may occasionally show "buffering..." on forward skip if the user jumps ahead of what's been received. This only happens early in playback of long text.
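The whole matrix above reduces to one pure function both clients can share (names are hypothetical): `bufferedEnd` is how many seconds of audio have been received — for Route 1 it equals the duration from the start — and `duration` is `NaN` for Route 2 while the total length is still unknown.

```typescript
interface SkipResult {
  time: number;           // position to seek to now
  pending: number | null; // target to retry once more audio arrives ("buffering...")
}

function applySkip(
  current: number,
  delta: number,       // +15 or -15
  bufferedEnd: number, // seconds of audio received so far
  duration: number     // total length; NaN while unknown (Route 2 mid-stream)
): SkipResult {
  const target = Math.max(0, current + delta);
  if (target <= bufferedEnd) {
    // within received audio — covers every backward skip
    return { time: target, pending: null };
  }
  if (!Number.isNaN(duration) && target >= duration) {
    // beyond the total audio: clamp to the end of what exists
    return { time: Math.min(duration, bufferedEnd), pending: null };
  }
  // forward skip beyond received audio: queue it, show "buffering..."
  return { time: current, pending: target };
}
```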


Shared Backend Components

Both routes share:

  • ElevenLabs client — wrapper with auth, retry, error handling
  • Voice/model config — default voice ID, Flash v2.5 model
  • Text validation — length limits (40k char max), sanitization
  • Rate limiting — per-user TTS request limits
  • Logging/metrics — generation time, text length, error rates
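The validation item can be sketched as a small shared helper. The 40k cap comes from this doc; stripping control characters is an assumed sanitization choice, not a confirmed requirement.

```typescript
const MAX_TTS_CHARS = 40_000; // Flash v2.5 limit per this doc

// Shared by both routes: strip control chars (keeping \t, \n, \r),
// reject empty input, and enforce the length cap.
function validateTtsText(
  raw: string
): { ok: true; text: string } | { ok: false; error: string } {
  const text = raw.replace(/[\u0000-\u0008\u000b\u000c\u000e-\u001f]/g, "").trim();
  if (text.length === 0) return { ok: false, error: "empty text" };
  if (text.length > MAX_TTS_CHARS) {
    return { ok: false, error: `text exceeds ${MAX_TTS_CHARS} chars` };
  }
  return { ok: true, text };
}
```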

Comparison Protocol

Once both routes are live, test on real devices:

| Test | What to compare |
| --- | --- |
| Time to first audio | How long until the user hears something? |
| +/- 15s feel | Does skip feel responsive on both? Does Route 2 handle forward-skip-beyond-buffer gracefully? |
| Long response | Full assistant response (~5 min audio). Is Route 1's wait acceptable? Does Route 2 stay smooth? |
| Short response | A paragraph. Does Route 1's wait even matter? |
| Error recovery | Connection drop mid-stream? Route 1 is atomic; Route 2 needs graceful degradation. |

Implementation Order

Phase 1: Backend (both routes)

  1. ElevenLabs client wrapper (shared)
  2. /api/tts/convert endpoint (full file)
  3. /api/tts/stream endpoint (progressive chunks)
  4. Tests for both

Phase 2: Web client

  1. Basic audio player UI (pause, resume, +/- 15 sec)
  2. Wire up /api/tts/convert (simple — <audio src="blob:...">)
  3. Wire up /api/tts/stream (MediaSource API)
  4. Toggle between routes for comparison

Phase 3: iOS client

  1. Basic audio player UI (same controls)
  2. Wire up /api/tts/convert (AVAudioPlayer with Data)
  3. Wire up /api/tts/stream (AVAudioPlayer reload with growing Data)
  4. Toggle for comparison

Future: Real-Time LLM-to-Speech (Out of Scope for V1)

When we want TTS to start while the assistant is still generating:

LLM streams tokens → Backend batches into sentences
    → Each sentence sent to ElevenLabs
    → Audio chunks stream to client per-sentence
    → Client plays sentences sequentially

Jin's sync concern applies here: Two independent streams (text appearing on screen + audio playing) need explicit coordination. This requires segment mapping (text chunk X → audio chunk Y) so audio never gets ahead of text. The growing file approach helps playback/seeking but does NOT solve this sync problem — that needs separate architecture work.

Joey's approach: Split on periods/newlines, send each chunk to TTS independently. Map each chunk to its text segment for sync.
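Joey's splitter and the segment mapping can be sketched together — a sketch, not a spec; the point is that recording character offsets per chunk is what makes Jin's sync and the P2 highlighting possible later.

```typescript
interface TextSegment {
  text: string;  // raw slice sent to TTS (whitespace preserved)
  start: number; // char offset in the original response text
  end: number;   // exclusive — audio for this chunk maps back to [start, end)
}

// Joey's approach: split on sentence enders and newlines, keeping the
// delimiter, and record where each chunk came from for text↔audio mapping.
function splitForTts(text: string): TextSegment[] {
  const segments: TextSegment[] = [];
  let start = 0;
  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (ch === "." || ch === "!" || ch === "?" || ch === "\n") {
      const raw = text.slice(start, i + 1);
      if (raw.trim().length > 0) segments.push({ text: raw, start, end: i + 1 });
      start = i + 1;
    }
  }
  if (start < text.length && text.slice(start).trim().length > 0) {
    segments.push({ text: text.slice(start), start, end: text.length });
  }
  return segments;
}
```

A production splitter would also guard against abbreviations and decimals ("Dr.", "3.5") producing tiny chunks, e.g. by enforcing a minimum chunk length before flushing.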

Route 2's progressive chunks infrastructure transfers directly to this use case.


Open Questions

  1. Safari MediaSource + MP3 — Do we support Safari < 17.1? If yes, Route 2 needs AAC/fMP4 on backend.
  2. Audio caching — Cache generated audio (keyed on text hash + voice + model)? Saves cost on repeat listens.
  3. Max text length — Flash v2.5 supports 40k chars. Do we set a lower app limit?
  4. Voice selection — Single default voice for V1, or expose voice picker?
  5. harveyai/app repo structure — Where does TTS backend code live?
  6. Text-to-audio segment mapping — Even for V1, if we track which text produced which audio chunk, it enables P2 (text highlighting) and future sync. Worth building into the data model early?