Primary use case: Playing Harvey assistant responses out loud. Responses can be long.
| Priority | Control | Notes |
|---|---|---|
| P0 | Pause | |
| P0 | Resume | |
| P0 | +/- 15 sec | This IS seeking — `player.currentTime += 15` requires seekable audio. Makes seeking a P0 concern. |
| P1 | Scrub bar (drag to arbitrary position) | Full seek UI. Mechanically the same as +/- 15s, just a UI difference. |
| P2 | Text highlighting synced with audio | Highlight the response text currently being spoken (Joey's idea) |
+/- 15 sec skip is a seek operation under the hood. On iOS, `AVAudioPlayer.currentTime += 15` requires a complete audio file or at minimum a seekable buffer. On Web, `audio.currentTime += 15` requires `Content-Length` or `MediaSource`. This means any approach that doesn't support seeking is incompatible with P0 requirements. Pure streaming (Approach B from the seeking doc) is ruled out for V1.
Jin raised the question: if the assistant response is being streamed AND we're streaming audio, there's a synchronization problem.
For V1, this doesn't apply. V1 is TTS on complete responses — the full text is already on screen, user taps "play", audio generates and plays. One stream (audio), no sync needed. The growing file approach works fine here.
For the future real-time path (TTS while LLM generates), Jin is right. Two independent streams (text + audio) need coordination so users don't hear words before they appear on screen. This requires explicit segment mapping (text chunk X → audio chunk Y) and is a separate problem from audio seeking. The growing file helps playback but doesn't solve sync. Out of scope for V1, but worth noting as future work.
Build both delivery approaches as separate API routes so we can test and compare the feel on real devices.
Monorepo: harveyai/app
Order: Backend API → Web client → iOS client
POST /api/tts/convert
Body: { text, voice_id?, model? }
Response: complete audio file (audio/mpeg)
Headers: Content-Length, Accept-Ranges: bytes, Content-Type: audio/mpeg
Flow:
- Client sends text
- Backend calls ElevenLabs, buffers entire response
- Backend responds with complete MP3 file + proper headers
- Client plays — all controls work (pause, resume, +/- 15s, scrub)
What works:
- All P0 controls trivially — `currentTime += 15` just works
- P1 scrub bar works perfectly — full file, known duration
- Simplest client code on both iOS (`AVAudioPlayer`) and Web (`<audio>`)
What doesn't:
- User waits for full generation (~1s short text, ~3-5s summary, ~30s+ full document)
- Loading state needed ("Generating audio...")
Backend concerns:
- Memory: ~500KB per request for long text. Manageable with a cap.
- Timeout: long text can take 30s+ on ElevenLabs side. Need generous timeout.
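The convert route's core is small enough to sketch. This is a hedged illustration, not the real implementation: `synthesize` is a hypothetical stand-in for the shared ElevenLabs client wrapper, and the 40k cap mirrors the Flash v2.5 limit noted under shared infrastructure.

```typescript
// Hypothetical dependency standing in for the ElevenLabs client wrapper.
type Synthesize = (text: string) => Promise<Uint8Array>;

const MAX_CHARS = 40_000; // Flash v2.5 limit; the app may cap lower

interface FullFileResponse {
  status: number;
  headers: Record<string, string>;
  body: Uint8Array;
}

// Buffers the entire generated file, then returns it with the headers
// that make the client-side player fully seekable (known length + ranges).
async function convertHandler(
  text: string,
  synthesize: Synthesize,
): Promise<FullFileResponse> {
  if (text.length === 0 || text.length > MAX_CHARS) {
    return { status: 400, headers: {}, body: new Uint8Array() };
  }
  const audio = await synthesize(text); // buffer the whole MP3 in memory
  return {
    status: 200,
    headers: {
      "Content-Type": "audio/mpeg",
      "Content-Length": String(audio.byteLength),
      "Accept-Ranges": "bytes", // enables range requests / native seeking
    },
    body: audio,
  };
}
```

Framing it as a pure function keeps the memory and timeout concerns visible: the whole file sits in `audio` before anything is sent.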
POST /api/tts/stream
Body: { text, voice_id?, model? }
Response: chunked audio stream
Flow:
- Client sends text
- Backend calls ElevenLabs HTTP Stream endpoint
- Backend receives chunks, ensures MP3 frame alignment
- Backend streams assembled chunks to client
- Client builds a growing audio buffer
- +/- 15s works within received range from the start
What works:
- P0 controls: pause/resume native, +/- 15s works within received audio
- If user skips forward beyond what's received, queue the skip and execute when enough audio arrives. Backward skip always works.
- Fast time-to-first-audio (~1-2s)
- Natural foundation for future real-time LLM-to-speech
What's harder:
- P1 scrub bar — duration unknown until complete, progress bar grows over time
- MP3 frame alignment — must send complete frames
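The frame-alignment step above can be sketched as a splitter that forwards only whole frames and carries the trailing partial frame into the next chunk. A minimal sketch, assuming MPEG-1 Layer III (which covers the usual 44.1 kHz MP3 presets); all function names are ours, and a production version would also handle MPEG-2 and free-format streams:

```typescript
// Bitrate (kbps) and sample-rate tables for MPEG-1 Layer III headers.
const BITRATES_KBPS = [0, 32, 40, 48, 56, 64, 80, 96, 112, 128, 160, 192, 224, 256, 320, 0];
const SAMPLE_RATES = [44100, 48000, 32000, 0];

// Frame length in bytes at offset i, or null if no valid frame header there.
function frameLengthAt(buf: Uint8Array, i: number): number | null {
  if (i + 4 > buf.length) return null;
  // 11 sync bits, version = MPEG-1, layer = III (CRC bit may be 0 or 1)
  if (buf[i] !== 0xff || (buf[i + 1] & 0xfe) !== 0xfa) return null;
  const bitrate = BITRATES_KBPS[buf[i + 2] >> 4] * 1000;
  const sampleRate = SAMPLE_RATES[(buf[i + 2] >> 2) & 0x3];
  const padding = (buf[i + 2] >> 1) & 0x1;
  if (bitrate === 0 || sampleRate === 0) return null;
  return Math.floor((144 * bitrate) / sampleRate) + padding;
}

// `complete` ends on a frame boundary and is safe to send to the client;
// `remainder` is prepended to the next chunk received from ElevenLabs.
function splitCompleteFrames(buf: Uint8Array): { complete: Uint8Array; remainder: Uint8Array } {
  let end = 0;
  while (true) {
    const len = frameLengthAt(buf, end);
    if (len === null || end + len > buf.length) break;
    end += len;
  }
  return { complete: buf.subarray(0, end), remainder: buf.subarray(end) };
}
```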
Client concerns (Web):
- `MediaSource` API — append chunks, seeking within buffered range works natively
- Safari `MediaSource` + MP3 requires Safari 17.1+ (Oct 2023). Decision: do we support Safari < 17.1?
Client concerns (iOS):
- `AVAudioPlayer` with growing `Data` — reload with updated data as chunks arrive
- Must track `currentTime` across reloads to maintain position
- +/- 15s: check if target position is within received range. If yes, seek. If no (forward skip), wait for more data.
Since this is P0 and is the reason seeking matters, worth being explicit:
| Scenario | Route 1 (Full File) | Route 2 (Growing File) |
|---|---|---|
| Skip back 15s (always within range) | `currentTime -= 15` | `currentTime -= 15` (always works) |
| Skip forward 15s (within received audio) | `currentTime += 15` | `currentTime += 15` (works) |
| Skip forward 15s (beyond received audio) | Always works (full file) | Queue skip, execute when audio arrives. Show "buffering..." |
| Skip forward 15s (beyond total audio) | Clamp to end | Clamp to end of received range |
The only UX difference: Route 2 may occasionally show "buffering..." on forward skip if the user jumps ahead of what's been received. This only happens early in playback of long text.
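The decision rules above reduce to one pure function both clients could mirror. A sketch under our own naming, not any existing player API:

```typescript
type SkipAction =
  | { kind: "seek"; to: number }    // target is playable now
  | { kind: "queue"; to: number };  // show "buffering...", seek when audio arrives

function decideSkip(
  currentTime: number,          // seconds
  delta: number,                // +15 or -15
  receivedDuration: number,     // seconds of audio received so far
  totalDuration: number | null, // null until the stream completes
): SkipAction {
  let target = Math.max(0, currentTime + delta);
  // Once the stream is complete (Route 1 always is), clamp to the end.
  if (totalDuration !== null) target = Math.min(target, totalDuration);
  if (target <= receivedDuration) return { kind: "seek", to: target };
  // Forward skip beyond received audio: queue it. When the stream
  // completes, a still-queued target gets clamped to the final duration.
  return { kind: "queue", to: target };
}
```

Backward skips always land inside the received range, so they are always `seek`; only forward skips past the buffer ever queue.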
Both routes share:
- ElevenLabs client — wrapper with auth, retry, error handling
- Voice/model config — default voice ID, Flash v2.5 model
- Text validation — length limits (40k char max), sanitization
- Rate limiting — per-user TTS request limits
- Logging/metrics — generation time, text length, error rates
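The text-validation item could look like the following. Only the 40k-char cap comes from this doc; the sanitization rules (strip control characters, collapse whitespace) are assumptions to be confirmed:

```typescript
const MAX_TTS_CHARS = 40_000; // Flash v2.5 limit

type Validation = { ok: true; text: string } | { ok: false; error: string };

function validateTtsText(raw: string): Validation {
  // Drop control characters (keep \n and \t), then collapse runs of
  // spaces/tabs so TTS doesn't choke on formatting artifacts.
  const text = raw
    .replace(/[\u0000-\u0008\u000b-\u001f\u007f]/g, "")
    .replace(/[ \t]+/g, " ")
    .trim();
  if (text.length === 0) return { ok: false, error: "empty text" };
  if (text.length > MAX_TTS_CHARS) return { ok: false, error: "text exceeds 40k chars" };
  return { ok: true, text };
}
```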
Once both routes are live, test on real devices:
| Test | What to compare |
|---|---|
| Time to first audio | How long until user hears something? |
| +/- 15s feel | Does skip feel responsive on both? Does Route 2 handle forward-skip-beyond-buffer gracefully? |
| Long response | Full assistant response (~5 min audio). Is Route 1's wait acceptable? Does Route 2 stay smooth? |
| Short response | A paragraph. Does Route 1's wait even matter? |
| Error recovery | Connection drop mid-stream? Route 1 is atomic. Route 2 needs graceful degradation. |
- ElevenLabs client wrapper (shared)
- `/api/tts/convert` endpoint (full file)
- `/api/tts/stream` endpoint (progressive chunks)
- Tests for both
- Basic audio player UI (pause, resume, +/- 15 sec)
- Wire up `/api/tts/convert` (simple — `<audio src="blob:...">`)
- Wire up `/api/tts/stream` (`MediaSource` API)
- Toggle between routes for comparison
- Basic audio player UI (same controls)
- Wire up `/api/tts/convert` (`AVAudioPlayer` with `Data`)
- Wire up `/api/tts/stream` (`AVAudioPlayer` reload with growing `Data`)
- Toggle for comparison
When we want TTS to start while the assistant is still generating:
LLM streams tokens → Backend batches into sentences
→ Each sentence sent to ElevenLabs
→ Audio chunks stream to client per-sentence
→ Client plays sentences sequentially
Jin's sync concern applies here: Two independent streams (text appearing on screen + audio playing) need explicit coordination. This requires segment mapping (text chunk X → audio chunk Y) so audio never gets ahead of text. The growing file approach helps playback/seeking but does NOT solve this sync problem — that needs separate architecture work.
Joey's approach: Split on periods/newlines, send each chunk to TTS independently. Map each chunk to its text segment for sync.
Route 2's progressive chunks infrastructure transfers directly to this use case.
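Joey's splitting idea can be sketched with the per-segment text mapping that sync and P2 highlighting would need. A sketch only; the boundary rules (period followed by a space, or a newline) and all names are assumptions:

```typescript
interface TextSegment {
  id: number;
  text: string;
  start: number; // char offset into the full response (for highlight sync)
  end: number;   // exclusive
}

function splitIntoSegments(full: string): TextSegment[] {
  const segments: TextSegment[] = [];
  let start = 0;
  for (let i = 0; i < full.length; i++) {
    const boundary =
      full[i] === "\n" ||
      (full[i] === "." && (i + 1 === full.length || full[i + 1] === " "));
    if (boundary) {
      const text = full.slice(start, i + 1).trim();
      if (text.length > 0) segments.push({ id: segments.length, text, start, end: i + 1 });
      start = i + 1;
    }
  }
  const tail = full.slice(start).trim();
  if (tail.length > 0) segments.push({ id: segments.length, text: tail, start, end: full.length });
  return segments;
}
```

Each segment's `id` would travel with its audio chunks, giving the client the text-chunk-to-audio-chunk mapping Jin's sync concern calls for.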
- Safari `MediaSource` + MP3 — Do we support Safari < 17.1? If yes, Route 2 needs AAC/fMP4 on backend.
- Audio caching — Cache generated audio (keyed on text hash + voice + model)? Saves cost on repeat listens.
- Max text length — Flash v2.5 supports 40k chars. Do we set a lower app limit?
- Voice selection — Single default voice for V1, or expose voice picker?
- `harveyai/app` repo structure — Where does TTS backend code live?
- Text-to-audio segment mapping — Even for V1, if we track which text produced which audio chunk, it enables P2 (text highlighting) and future sync. Worth building into the data model early?