Date: 2024-12-07
Model: ResembleAI/chatterbox
Local (AMD ROCm):

| Metric | Value |
|---|---|
| Hardware | AMD Radeon RX 7700 XT (12GB VRAM, gfx1101) |
| Test audio length | 2.5 seconds |
| Generation time | ~28 seconds |
| Real-time factor (RTF) | 11.2x |
Modal (NVIDIA A10G):

| Metric | Value |
|---|---|
| Cold start | ~43 seconds (model loading) |
| Test audio length | 5.3 seconds |
| Generation time (warm) | 5.2 seconds |
| Real-time factor (RTF) | ~1x (real-time!) |
Projection for a 30-minute episode:

| Platform | RTF | Estimated Time (30-min episode) |
|---|---|---|
| Local (AMD ROCm) | 11.2x | ~5.6 hours |
| Modal (warm) | ~1x | ~30-35 minutes |
Modal is ~11x faster than local AMD ROCm.
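The arithmetic behind these numbers, with RTF defined as generation time divided by audio length (values above 1x are slower than real time), can be checked directly:

```python
def rtf(generation_s: float, audio_s: float) -> float:
    """Real-time factor: generation time / audio length."""
    return generation_s / audio_s

# Local AMD ROCm: 28 s to generate 2.5 s of audio
local_rtf = rtf(28, 2.5)       # 11.2x
# Modal A10G (warm): 5.2 s to generate 5.3 s of audio
modal_rtf = rtf(5.2, 5.3)      # ~0.98x, effectively real time

# Projected wall-clock time for a 30-minute episode
episode_min = 30
local_hours = episode_min * local_rtf / 60   # ~5.6 hours
```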
Local ROCm notes:
- ROCm/HIP works, but emits workspace memory warnings
- Suboptimal execution paths due to memory constraints
- ROCm is typically 30-50% slower than NVIDIA CUDA

Modal notes:
- The NVIDIA A10G provides much faster inference
- The memory snapshot feature reduces cold starts
- ~$0.76/hr for an A10G, billed per second
- Scales to 10 concurrent requests per container
For a 30-minute episode requiring ~35 minutes of compute:
- A10G cost: ~$0.76/hr
- Estimated cost per episode: ~$0.45
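As a sanity check on the per-episode figure (~35 minutes of A10G time at $0.76/hr, billed per second):

```python
def episode_cost(compute_minutes: float, hourly_rate: float) -> float:
    """Cost of per-second-billed GPU time for one episode."""
    return compute_minutes / 60 * hourly_rate

cost = episode_cost(35, 0.76)
print(round(cost, 2))  # 0.44, in line with the ~$0.45 estimate
```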
| Option | Speed (RTF) | Quality | Cost/Episode (30min) |
|---|---|---|---|
| Chatterbox (Local ROCm) | ~11x | Excellent | Free (5.6 hrs time) |
| Chatterbox (Modal) | ~1x | Excellent | ~$0.45 |
| Edge-TTS | <0.1x | Good | Free |
| OpenAI TTS | ~0.2x | Excellent | ~$15-20 |
| ElevenLabs | ~0.3x | Best | ~$5-22/mo |
Modal is the clear winner for Chatterbox TTS:
- 11x faster than local AMD ROCm
- Reasonable cost (~$0.45/episode)
- Serverless (no infrastructure to maintain)
- Scales automatically
Local ROCm is only viable for:
- Very short clips
- Overnight batch processing
- Zero-cost requirements
```shell
# Deploy
modal deploy chatterbox_tts.py
```

Endpoints:
- POST /generate - Single TTS segment
- POST /episode - Full episode with multiple segments

Local ROCm container:
- Container: chatterbox-tts
- Image: rocm/pytorch:latest
- GPU Override: HSA_OVERRIDE_GFX_VERSION=11.0.1

Modal deployment:
- GPU: a10g
- Concurrency: 10 requests/container
- Scaledown: 5 minutes
- Memory Snapshot: enabled
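The Modal-side settings above could be wired up roughly as follows. This is a minimal sketch, not the deployed code: the model-loading and generation bodies are placeholders, the `chatterbox-tts` pip package name is an assumption, and parameter names such as `container_idle_timeout`, `allow_concurrent_inputs`, and `enable_memory_snapshot` reflect the Modal SDK around this date and may differ in other versions.

```python
import modal

app = modal.App("chatterbox-tts")
# Hypothetical image; the real one would pin chatterbox and its deps.
image = modal.Image.debian_slim().pip_install("chatterbox-tts", "torch")

@app.cls(
    image=image,
    gpu="a10g",
    enable_memory_snapshot=True,   # cuts the ~43 s cold start
    container_idle_timeout=300,    # scaledown after 5 minutes idle
    allow_concurrent_inputs=10,    # 10 requests per container
)
class Chatterbox:
    @modal.enter(snap=True)
    def load(self):
        # Load ResembleAI/chatterbox weights once per container;
        # the memory snapshot captures this state for fast restarts.
        self.model = ...  # placeholder

    @modal.web_endpoint(method="POST")
    def generate(self, body: dict):
        # Single TTS segment: text in, audio bytes out (placeholder).
        ...
```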