| title | Speaker Diarization for Meeting Note App — Best Options May 2026 |
|---|---|
| date | 2026-05-09 |
| type | research |
| status | inbox |
| project | meeting-note-app |
| related | soniox-stt-service.md, soniox-context-injection.md |
Your Soniox STT is excellent (Vietnamese WER 5.4% — best in class). Don't replace it. Add a separate diarization stage.
| Recommendation | Why | Cost | Effort |
|---|---|---|---|
| #1 — Soniox (STT) + Pyannote AI Precision-2 (diarization), batch | Pyannote AI is independently #1 on DIHARD (11.2% DER). Soniox keeps your Vietnamese edge. Hybrid is ~30 lines of merge code. | Soniox + €0.112/hr (~$0.12) | 1–2 days |
| #2 — Soniox + Pyannote community-1 (open source, self-host) | CC-BY-4.0 license, commercial use OK. AMI 17.0% / VoxConverse 11.2% DER. Free. | Soniox + GPU server | 2–3 days |
| #3 — Switch entirely to AssemblyAI Universal-3 Pro Streaming | One vendor. Real-time diarization sub-300ms. 71% DER improvement on reverb in their July 2025 update. Vietnamese supported. | $0.57/hr streaming (STT+diarization) | 3–5 days migration |
Avoid: Google Cloud Speech-to-Text (50.2% DER on VoxConverse — broken). Rev AI (no Vietnamese). DiariZen open source (CC-BY-NC, blocks commercial use).
From Soniox docs (speaker-diarization):
- 15-speaker max per session (low ceiling for boardrooms)
- No speaker count hint — can't tell the API "I have 3 speakers"
- No channel separation flag
- Architecture not disclosed (likely cluster-based, not modern EEND)
- Soniox v4 async (29 Jan 2026) cites "improved diarization" but no DER number published
- In Pyannote's own DIHARD benchmark, Soniox `stt-async-preview-v1` ranked below Pyannote Precision-2 (no absolute number disclosed)
You're not imagining inconsistency — Soniox's diarization stack is legitimately a tier behind specialists.
Cited from the Picovoice 2026 benchmark and Benchmarking Diarization Models (arXiv 2509.26177, Sep 2025):
| Provider | DIHARD avg DER | VoxConverse DER | Mode | Vietnamese | License |
|---|---|---|---|---|---|
| Pyannote AI Precision-2 | 11.2% | 9.0% | Batch | Lang-agnostic | Commercial €0.112/hr |
| Picovoice Falcon | — | 10.3% | RT + batch | Yes | Commercial / on-device |
| AWS Transcribe | — | 11.1% | Batch + streaming | Yes | $1.44–2.00/hr |
| AssemblyAI Universal-3 | self-reported 4.4–14.4% (internal sets) | — | RT + batch | Yes | $0.57/hr (incl. STT) |
| Pyannote community-1 (OSS) | AMI 17.0% / VoxConv 11.2% | 11.2% | Batch | Adequate | CC-BY-4.0 ✅ commercial |
| DiariZen-Large-s80 (OSS) | DIHARD 14.5% | 9.2% | Batch | English-biased | CC-BY-NC ❌ blocks commercial |
| Speechmatics | self-reported 3.9% (no third-party verify) | — | RT + batch | Unconfirmed | $1.04–1.35/hr |
| Soniox v4 | not disclosed (below Pyannote) | — | RT + batch | Yes (5.4% WER) | $0.10/hr |
| Deepgram Nova-3 | not disclosed | — | RT + batch | Unconfirmed | $0.46–0.82/hr |
| Google Chirp 3 | — | 50.2% (broken) | Batch only | Yes | $0.96/hr |
| Rev AI | — | — | Batch + streaming | No | $0.18/hr |
Key: Speechmatics' "3.9% DER" is self-reported on their own test sets — not in any independent benchmark. Pyannote AI's number is from third-party academic work.
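For reading the table: DER is conventionally the sum of missed speech, false-alarm speech, and speaker-confusion time, divided by total reference speech time. A minimal sketch of that arithmetic follows; the durations are made-up illustration values, not taken from any benchmark cited here.

```python
def diarization_error_rate(missed_s, false_alarm_s, confusion_s, total_speech_s):
    """Return DER as a fraction of total reference speech time.

    All arguments are durations in seconds:
    - missed_s: speech the system labeled as silence
    - false_alarm_s: silence the system labeled as speech
    - confusion_s: speech attributed to the wrong speaker
    """
    if total_speech_s <= 0:
        raise ValueError("total_speech_s must be positive")
    return (missed_s + false_alarm_s + confusion_s) / total_speech_s


# Hypothetical 1-hour recording (3600 s of reference speech).
der = diarization_error_rate(120.0, 90.0, 193.2, 3600.0)
print(f"DER = {der:.1%}")
```

Because the three error types are summed, two systems with the same DER can fail in different ways; for a meeting transcript, speaker confusion is usually the component that hurts readability most.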
Audio stream
├──→ Soniox WebSocket (RT) ─→ Word tokens with start_ms/end_ms
└──→ Buffered audio file ─→ Pyannote AI batch ─→ Speaker segments [(start,end,speaker_id)]
↓
Merge layer: assign each Soniox token
to the speaker segment containing its midpoint
Mechanics:
- Soniox already returns `start_ms`/`end_ms` per token (Soniox timestamps docs)
- Pyannote AI returns `[{start, end, speaker}]` segments
- Merge step is interval-tree lookup → ~30 lines of Python
- Run diarization after the call ends → zero latency impact on the live transcript; the speaker-labeled transcript is ready 5–30s after the meeting
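A minimal sketch of the merge layer, under stated assumptions: tokens are dicts carrying `start_ms`/`end_ms` in milliseconds, segments are dicts with `start`/`end` in seconds plus `speaker`, and segments do not overlap. The function name `assign_speakers` and the `UNKNOWN` label are hypothetical. It uses a sorted list with binary search rather than a true interval tree, which is enough when segments are non-overlapping.

```python
from bisect import bisect_right


def assign_speakers(tokens, segments, unknown="UNKNOWN"):
    """Label each token with the speaker whose segment contains its midpoint.

    tokens:   list of dicts with "start_ms" and "end_ms" (milliseconds).
    segments: list of dicts with "start", "end" (seconds) and "speaker".
              Assumed non-overlapping; overlapping speech needs an interval tree.
    """
    segs = sorted(segments, key=lambda s: s["start"])
    starts = [s["start"] for s in segs]

    labeled = []
    for tok in tokens:
        # Token midpoint, converted from milliseconds to seconds.
        mid = (tok["start_ms"] + tok["end_ms"]) / 2000.0
        # Index of the last segment that starts at or before the midpoint.
        i = bisect_right(starts, mid) - 1
        speaker = unknown
        if i >= 0 and mid <= segs[i]["end"]:
            speaker = segs[i]["speaker"]
        labeled.append({**tok, "speaker": speaker})
    return labeled


tokens = [
    {"text": "xin", "start_ms": 0, "end_ms": 400},
    {"text": "chào", "start_ms": 400, "end_ms": 900},
    {"text": "hello", "start_ms": 1200, "end_ms": 1600},
]
segments = [
    {"start": 0.0, "end": 1.0, "speaker": "SPEAKER_00"},
    {"start": 1.0, "end": 2.0, "speaker": "SPEAKER_01"},
]
for tok in assign_speakers(tokens, segments):
    print(tok["speaker"], tok["text"])
```

Tokens whose midpoint falls in a gap between segments stay `UNKNOWN` here; a production version might instead snap them to the nearest segment.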
Cost on a 1-hour meeting:
- Soniox real-time: ~$0.12
- Pyannote AI Precision-2: ~$0.12
- Total: ~$0.24/hr (cheaper than AssemblyAI standalone at $0.57/hr)
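The per-hour figures above scale linearly with usage. The sketch below projects them onto a monthly volume; the 100 hours/month figure is a hypothetical workload, and the service keys are illustrative names rather than vendor identifiers.

```python
# Per-hour rates in USD, as quoted in this note.
RATES_USD_PER_HOUR = {
    "soniox_realtime": 0.12,
    "pyannote_precision2": 0.12,
    "assemblyai_streaming": 0.57,
}


def monthly_cost(hours, *services):
    """Return the cost in USD of running the given services for `hours` of audio."""
    per_hour = sum(RATES_USD_PER_HOUR[name] for name in services)
    return round(hours * per_hour, 2)


hours = 100  # hypothetical monthly meeting volume
hybrid = monthly_cost(hours, "soniox_realtime", "pyannote_precision2")
single_vendor = monthly_cost(hours, "assemblyai_streaming")
print(f"Hybrid: ${hybrid:.2f}  AssemblyAI: ${single_vendor:.2f}")
```

This ignores idle connection time on streaming billing, so real streaming costs can run higher than the audio-hour estimate.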
Effort: 1–2 days. The merge logic is the same WhisperX uses (m-bain/whisperX).
Free trial: Pyannote AI gives 150 hours free on signup — enough to validate on your real meeting recordings before committing.
Pyannote AI pricing | Community-1 announcement, 4 May 2026
Same architecture as Option A, but run Pyannote community-1 on your own GPU server.
- License: CC-BY-4.0 — commercial use allowed with attribution
- DER: AMI 17.0% / VoxConverse 11.2% (behind Precision-2's 9.0% on VoxConverse in the table above)
- Compute: ~10–30x faster than real-time on a T4 GPU (1h audio = 2–6 min processing)
- Cost: just your GPU hosting (a single T4 on Oracle ARM or a Lambda spot is ~$0.50/hr active)
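The compute and cost bullets combine into a per-audio-hour figure. A rough sketch, assuming the GPU is billed only while it is actively processing (function names are illustrative):

```python
def processing_minutes(audio_hours, speedup):
    """Minutes of GPU time needed for `audio_hours` of audio at `speedup` x real-time."""
    return audio_hours * 60 / speedup


def gpu_cost_per_audio_hour(gpu_usd_per_hour, speedup):
    """GPU cost in USD to diarize one hour of audio, billed only while active."""
    return gpu_usd_per_hour / speedup


for speedup in (10, 30):
    minutes = processing_minutes(1, speedup)
    cost = gpu_cost_per_audio_hour(0.50, speedup)
    print(f"{speedup}x real-time: {minutes:.0f} min, ${cost:.3f} per audio hour")
```

At these rates the compute itself is a few cents per audio hour; an always-on GPU that sits idle between meetings changes the picture, since the full hourly rate then applies regardless of load.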
Effort: 2–3 days (deployment + queue management).
Use this if: You want to scale without paying per-hour Pyannote AI fees, OR you need on-prem for data privacy.
pyannote/speaker-diarization-community-1 on HuggingFace
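A hedged sketch of running community-1 locally and producing the same `{start, end, speaker}` shape used elsewhere in this note. It follows the usage pattern published for pyannote.audio 4.x as best understood here: treat the `token=` keyword and the `speaker_diarization` attribute on the pipeline output as assumptions to verify against your installed version. `to_segments` and `diarize_file` are hypothetical helper names, and a Hugging Face access token may be required to download the model.

```python
def to_segments(diarization):
    """Convert an iterable of (turn, speaker) pairs into plain dicts.

    Each `turn` must expose `.start` and `.end` in seconds.
    """
    return [
        {"start": round(turn.start, 3), "end": round(turn.end, 3), "speaker": speaker}
        for turn, speaker in diarization
    ]


def diarize_file(path, hf_token):
    """Run the community-1 pipeline on one audio file and return segment dicts."""
    # Imported lazily so to_segments() stays usable without pyannote installed.
    import torch
    from pyannote.audio import Pipeline

    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-community-1",
        token=hf_token,  # assumption: 4.x keyword; older releases used a different name
    )
    if torch.cuda.is_available():
        pipeline.to(torch.device("cuda"))

    output = pipeline(path)
    # Assumption: the 4.x output object exposes the annotation under this attribute.
    return to_segments(output.speaker_diarization)
```

Calling `diarize_file("meeting.wav", hf_token="...")` from a post-call job would yield segments ready for the merge step; the file name is a placeholder.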
Replace Soniox entirely with AssemblyAI.
Pros:
- One API, one bill, one error surface
- Real-time diarization at sub-300ms (Soniox's RT diarization is weak; this is genuinely live)
- 71% DER reduction on reverberant audio in their July 2025 update
- 99 languages including Vietnamese
- Built-in summaries, sentiment, topic detection — meeting-app-native
Cons:
- Vietnamese WER not independently benchmarked vs Soniox's 5.4% — you may lose transcription quality
- $0.57/hr: over 5x Soniox's $0.10/hr base rate, and about 2.5x the Soniox + Pyannote hybrid (~$0.22–0.24/hr)
- Streaming billing is connection-time based (idle time counts)
Use this if: You decide that diarization quality during the meeting (not post-call) is critical AND single-vendor simplicity beats cost.
AssemblyAI Universal-3 Pro Streaming | Pricing
If you need live speaker labels (not post-call), the only open-source streaming option is NVIDIA Sortformer.
- Released 18 Aug 2025 (NVIDIA blog)
- 0.32s chunk latency
- Hard cap: 4 speakers in current `4spk-v1` model, disqualifying for boardrooms
- Requires NVIDIA GPU; production deployment via Riva (paid)
- Trained English-dominant, Vietnamese untested
Verdict: Skip unless your meetings are always ≤4 people. Wait for next Sortformer release with higher speaker cap.
Three questions determine your choice:
- Do you need diarization labels DURING the meeting (live), or AFTER (within 30s of call ending)?
  - DURING → AssemblyAI (Option C) or Sortformer (D, if ≤4 speakers)
  - AFTER → Hybrid Soniox + Pyannote (Options A/B): best quality + cheapest
- Do you keep Soniox's Vietnamese 5.4% WER, or accept untested Vietnamese WER on AssemblyAI?
  - Keep Soniox → Option A/B
  - Trust AssemblyAI Vietnamese (test first!) → Option C
- Will you self-host or use a cloud API?
  - Cloud, simplest → Option A (Soniox + Pyannote AI)
  - Self-host for cost / privacy → Option B (Soniox + community-1 OSS)
My recommendation: Option A. Lowest risk, lowest cost, highest quality, and you keep your Soniox investment. The 150-hour free trial on Pyannote AI lets you validate before paying anything.
- Sign up for Pyannote AI free trial (150 hours): https://www.pyannote.ai/
- Take 3 of your existing problem recordings (the ones where Soniox diarization failed)
- Run them through Pyannote AI Precision-2 batch endpoint — diarization output only
- Manually compare: Soniox-only diarization vs Pyannote-on-the-same-audio
- Decide: if Pyannote clearly wins on your real Vietnamese meeting room audio, build the merge layer. If not, test Option B with community-1 self-hosted, or test AssemblyAI's Vietnamese in parallel.
Independent benchmarks:
- State of Speaker Diarization 2026, Picovoice (updated 11 Mar 2026)
- Benchmarking Diarization Models, arXiv 2509.26177, Sep 2025
- SDBench, arXiv 2507.16136, Aug 2025
Vendor pages:
- Pyannote AI Pricing | Community-1 blog, 4 May 2026
- Soniox Speaker Diarization docs | v4 Async release, 29 Jan 2026
- AssemblyAI Speaker Tracking Update, 16 Jul 2025 | Universal-3 Pro Streaming
- Speechmatics Diarization docs | Pricing
- Deepgram Next-Gen Diarization
- NVIDIA Streaming Sortformer, Aug 2025
Open source:
- pyannote/speaker-diarization-community-1, HuggingFace
- WhisperX (merge pattern reference)
- 3D-Speaker (Apache 2.0 alternative)
Comparisons: