title	Speaker Diarization for Meeting Note App — Best Options May 2026
date	2026-05-09
type	research
status	inbox
project	meeting-note-app
related	soniox-stt-service.md, soniox-context-injection.md

Speaker Diarization for Meeting Note App — May 2026

TL;DR

Your Soniox STT is excellent (Vietnamese WER 5.4% — best in class). Don't replace it. Add a separate diarization stage.

Recommendation	Why	Cost	Effort
#1 — Soniox (STT) + Pyannote AI Precision-2 (diarization), batch	Pyannote AI is independently #1 on DIHARD (11.2% DER). Soniox keeps your Vietnamese edge. Hybrid is ~30 lines of merge code.	Soniox + €0.112/hr (~$0.12)	1–2 days
#2 — Soniox + Pyannote community-1 (open source, self-host)	CC-BY-4.0 license, commercial use OK. AMI 17.0% / VoxConverse 11.2% DER. Free.	Soniox + GPU server	2–3 days
#3 — Switch entirely to AssemblyAI Universal-3 Pro Streaming	One vendor. Real-time diarization sub-300ms. 71% DER improvement on reverb in their July 2025 update. Vietnamese supported.	$0.57/hr streaming (STT+diarization)	3–5 days migration

Avoid: Google Cloud Speech-to-Text (50.2% DER on VoxConverse — broken). Rev AI (no Vietnamese). DiariZen open source (CC-BY-NC, blocks commercial use).

Why Soniox diarization is weak

From Soniox docs (speaker-diarization):

15-speaker max per session (low ceiling for boardrooms)
No speaker count hint — can't tell the API "I have 3 speakers"
No channel separation flag
Architecture not disclosed (likely cluster-based, not modern EEND)
Soniox v4 async (29 Jan 2026) cites "improved diarization" but no DER number published
In Pyannote's own DIHARD benchmark, Soniox stt-async-preview-v1 ranked below Pyannote Precision-2 (no absolute number disclosed)

You're not imagining inconsistency — Soniox's diarization stack is legitimately a tier behind specialists.

The diarization quality leaderboard (independent benchmarks where available; self-reported figures are flagged)

Cited from Picovoice 2026 benchmark and SDBench arXiv 2509.26177, Sep 2025:

Provider	DIHARD avg DER	VoxConverse DER	Mode	Vietnamese	License
Pyannote AI Precision-2	11.2%	9.0%	Batch	Lang-agnostic	Commercial €0.112/hr
Picovoice Falcon	—	10.3%	RT + batch	Yes	Commercial / on-device
AWS Transcribe	—	11.1%	Batch + streaming	Yes	$1.44–2.00/hr
AssemblyAI Universal-3	self-reported 4.4–14.4% (internal sets)	—	RT + batch	Yes	$0.57/hr (incl. STT)
Pyannote community-1 (OSS)	AMI 17.0% (no DIHARD figure cited)	11.2%	Batch	Adequate	CC-BY-4.0 ✅ commercial
DiariZen-Large-s80 (OSS)	DIHARD 14.5%	9.2%	Batch	English-biased	CC-BY-NC ❌ blocks commercial
Speechmatics	self-reported 3.9% (no third-party verify)	—	RT + batch	Unconfirmed	$1.04–1.35/hr
Soniox v4	not disclosed (below Pyannote)	—	RT + batch	Yes (5.4% WER)	$0.10/hr
Deepgram Nova-3	not disclosed	—	RT + batch	Unconfirmed	$0.46–0.82/hr
Google Chirp 3	—	50.2% (broken)	Batch only	Yes	$0.96/hr
Rev AI	—	—	Batch + streaming	No	$0.18/hr

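Key: Speechmatics' "3.9% DER" is self-reported on their own test sets and does not appear in any independent benchmark. The same caveat applies to AssemblyAI's 4.4–14.4% range, which comes from internal sets. Pyannote AI's number is from third-party academic work.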

Architecture options for your Meeting Note app

Option A — Hybrid (RECOMMENDED): Soniox + Pyannote AI Precision-2

Audio stream
    ├──→ Soniox WebSocket (RT) ─→ Word tokens with start_ms/end_ms
    └──→ Buffered audio file ─→ Pyannote AI batch ─→ Speaker segments [(start,end,speaker_id)]
                                                        ↓
                                  Merge layer: assign each Soniox token
                                  to the speaker segment containing its midpoint

Mechanics:

Soniox already returns start_ms/end_ms per token (Soniox timestamps docs)
Pyannote AI returns [{start, end, speaker}] segments
Merge step is interval-tree lookup → ~30 lines of Python
Run diarization after call ends → zero latency impact, transcript ready 5–30s after meeting
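
The merge step described above can be sketched as follows. This is a minimal illustration, not the production merge layer: it assumes Soniox tokens carry `start_ms`/`end_ms` in milliseconds, that Pyannote segments carry `start`/`end` in seconds, and that segments do not overlap each other. The `Token` and `Segment` types and the `assign_speakers` name are hypothetical. A sorted list with binary search stands in for the interval tree, which is enough when segments are non-overlapping.

```python
from bisect import bisect_right
from typing import NamedTuple


class Token(NamedTuple):
    text: str
    start_ms: int  # assumed: milliseconds, as in the Soniox token fields
    end_ms: int


class Segment(NamedTuple):
    start: float  # assumed: seconds
    end: float    # assumed: seconds
    speaker: str


def assign_speakers(tokens, segments):
    """Label each token with the segment that contains its midpoint.

    If the midpoint falls in a gap between segments, the nearest
    segment is used. Returns a list of (token, speaker) pairs;
    speaker is None only when there are no segments at all.
    """
    segs = sorted(segments, key=lambda s: s.start)
    starts = [s.start for s in segs]
    labelled = []
    for tok in tokens:
        mid = (tok.start_ms + tok.end_ms) / 2000.0  # midpoint in seconds
        i = bisect_right(starts, mid) - 1  # last segment starting at or before mid
        speaker = None
        candidates = [segs[j] for j in (i, i + 1) if 0 <= j < len(segs)]
        if candidates:
            # Distance is 0.0 when the midpoint lies inside the segment.
            best = min(candidates, key=lambda s: max(s.start - mid, mid - s.end, 0.0))
            speaker = best.speaker
        labelled.append((tok, speaker))
    return labelled
```

If the diarizer reports overlapping speech, a midpoint lookup picks only one speaker per token; an overlap-based assignment would be the natural next refinement.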

Cost on a 1-hour meeting:

Soniox real-time: ~$0.12
Pyannote AI Precision-2: ~$0.12
Total: ~$0.24/hr (cheaper than AssemblyAI standalone at $0.57/hr)
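
The arithmetic behind that comparison, as a small sketch. The per-hour rates are the ones quoted in this note; the function names and the 100-hour volume in the usage line are hypothetical.

```python
SONIOX_RT_USD_PER_HR = 0.12    # real-time rate quoted in the cost breakdown
PYANNOTE_AI_USD_PER_HR = 0.12  # EUR 0.112/hr, rounded to USD as in this note
ASSEMBLYAI_USD_PER_HR = 0.57   # streaming STT + diarization


def hybrid_cost(hours):
    """Cost in USD of Soniox real-time plus Pyannote AI batch diarization."""
    return round(hours * (SONIOX_RT_USD_PER_HR + PYANNOTE_AI_USD_PER_HR), 2)


def assemblyai_cost(hours):
    """Cost in USD of AssemblyAI streaming for the same number of hours."""
    return round(hours * ASSEMBLYAI_USD_PER_HR, 2)


# Hypothetical volume: 100 meeting hours per month.
print(hybrid_cost(100), assemblyai_cost(100))
```

Note that AssemblyAI streaming is billed on connection time, so idle minutes in an open session would push its figure higher than this estimate.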

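Effort: 1–2 days. The merge logic is closely related to the word-to-speaker assignment in WhisperX (m-bain/whisperX), which picks the speaker with the largest time overlap rather than using the token midpoint.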

Free trial: Pyannote AI gives 150 hours free on signup — enough to validate on your real meeting recordings before committing.

Pyannote AI pricing | Community-1 announcement, 4 May 2026

Option B — Self-hosted: Soniox + Pyannote community-1

Same architecture as Option A, but run Pyannote community-1 on your own GPU server.

License: CC-BY-4.0 — commercial use allowed with attribution
DER: AMI 17.0% / VoxConverse 11.2% (behind Precision-2's 9.0% on VoxConverse, per the leaderboard above)
Compute: ~10–30x faster than real time on a T4 GPU (1h audio = 2–6 min processing)
Cost: just your GPU hosting (a single T4-class cloud GPU instance is ~$0.50/hr while active)
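
A quick check of those compute and cost figures, assuming the 10–30x speedup and the ~$0.50/hr GPU rate quoted above hold for your audio. The function names are hypothetical.

```python
def processing_minutes(audio_hours, speedup):
    """Wall-clock minutes to diarize the audio at a given real-time speedup."""
    return audio_hours * 60 / speedup


def gpu_cost_per_audio_hour(speedup, gpu_usd_per_hr=0.50):
    """GPU spend in USD per hour of audio, counting only active processing time."""
    return gpu_usd_per_hr / speedup


for speedup in (10, 30):
    print(speedup, processing_minutes(1, speedup), round(gpu_cost_per_audio_hour(speedup), 3))
```

At these rates the active GPU cost per audio hour is a few cents, so the real cost driver is how much idle time the instance accumulates between jobs.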

Effort: 2–3 days (deployment + queue management).

Use this if: You want to scale without paying per-hour Pyannote AI fees, OR you need on-prem for data privacy.

pyannote/speaker-diarization-community-1 on HuggingFace

Option C — Single vendor: AssemblyAI Universal-3 Pro Streaming

Replace Soniox entirely with AssemblyAI.

Pros:

One API, one bill, one error surface
Real-time diarization at sub-300ms (Soniox's RT diarization is weak; this is genuinely live)
71% DER reduction on reverberant audio in their July 2025 update
99 languages including Vietnamese
Built-in summaries, sentiment, topic detection — meeting-app-native

Cons:

Vietnamese WER not independently benchmarked vs Soniox's 5.4% — you may lose transcription quality
$0.57/hr (over 5x Soniox's $0.10/hr base rate, and roughly 2.4–2.6x the combined Soniox + Pyannote cost of ~$0.22–0.24/hr)
Streaming billing is connection-time based (idle time counts)

Use this if: Live speaker labels during the meeting (not post-call) are critical AND single-vendor simplicity outweighs the higher cost.

AssemblyAI Universal-3 Pro Streaming | Pricing

Option D — Streaming hybrid: Soniox + NVIDIA Sortformer

If you need live speaker labels (not post-call), the open-source streaming option considered here is NVIDIA Sortformer.

Released 18 Aug 2025 (NVIDIA blog)
0.32s chunk latency
Hard cap: 4 speakers in current 4spk-v1 model — disqualifying for boardrooms
Requires NVIDIA GPU; production deployment via Riva (paid)
Trained English-dominant, Vietnamese untested

Verdict: Skip unless your meetings are always ≤4 people. Wait for next Sortformer release with higher speaker cap.

Decision framework

Three questions answer your choice:

1. Do you need diarization labels DURING the meeting (live), or AFTER (within 30s of call ending)?
   - DURING → AssemblyAI (Option C) or Sortformer (D, if ≤4 speakers)
   - AFTER → Hybrid Soniox + Pyannote (Options A/B): best quality and cheapest
2. Do you keep Soniox's Vietnamese 5.4% WER, or accept untested Vietnamese WER on AssemblyAI?
   - Keep Soniox → Option A/B
   - Trust AssemblyAI Vietnamese (test first!) → Option C
3. Will you self-host or use a cloud API?
   - Cloud, simplest → Option A (Soniox + Pyannote AI)
   - Self-host for cost / privacy → Option B (Soniox + community-1 OSS)
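
The three questions can be read as a small decision function. This is one interpretation of the framework, not a rule stated in the sources: the ordering of the checks, the `choose_option` name, and its parameters are assumptions made for illustration.

```python
def choose_option(need_live_labels, keep_soniox, self_host, max_speakers):
    """Map the three decision questions onto Options A-D.

    need_live_labels: speaker labels are needed during the meeting
    keep_soniox:      keep Soniox for Vietnamese transcription
    self_host:        run diarization on your own GPU
    max_speakers:     largest expected number of speakers in a meeting
    """
    if need_live_labels:
        # Sortformer's 4-speaker cap makes Option D viable only for small meetings.
        if keep_soniox and max_speakers <= 4:
            return "D"
        return "C"
    if not keep_soniox:
        return "C"
    return "B" if self_host else "A"


# Post-call labels, keep Soniox, cloud API, up to 8 speakers.
print(choose_option(False, True, False, 8))
```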

My recommendation: Option A. Lowest risk, lowest cost, highest quality, and you keep your Soniox investment. The 150-hour free trial on Pyannote AI lets you validate before paying anything.

Action plan (this week)

1. Sign up for Pyannote AI free trial (150 hours): https://www.pyannote.ai/
2. Take 3 of your existing problem recordings (the ones where Soniox diarization failed)
3. Run them through Pyannote AI Precision-2 batch endpoint (diarization output only)
4. Manually compare: Soniox-only diarization vs Pyannote-on-the-same-audio
5. Decide: if Pyannote clearly wins on your real Vietnamese meeting room audio, build the merge layer. If not, test Option B with community-1 self-hosted, or test AssemblyAI's Vietnamese in parallel.
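
To support the manual comparison, a coarse summary of each diarization output can flag obvious failures (wrong speaker count, excessive turn switching) before you listen through the audio. The sketch below assumes both outputs have been converted to `(start_s, end_s, speaker)` tuples; the `summarize` name and that tuple layout are hypothetical. It is not a DER computation, which would need hand-labelled reference segments.

```python
from collections import defaultdict


def summarize(segments):
    """Summarize a diarization given as (start_s, end_s, speaker) tuples.

    Returns the number of distinct speakers, the number of speaker turns
    (consecutive segments by the same speaker count as one turn), and the
    total talk time per speaker in seconds.
    """
    talk = defaultdict(float)
    turns = 0
    previous = None
    for start, end, speaker in sorted(segments):
        talk[speaker] += end - start
        if speaker != previous:
            turns += 1
            previous = speaker
    return {"speakers": len(talk), "turns": turns, "talk_seconds": dict(talk)}


# Hypothetical outputs for the same 20-second clip with two real speakers.
good = [(0, 10, "A"), (10, 15, "B"), (15, 20, "A")]
over_split = [(0, 10, "S1"), (10, 15, "S2"), (15, 20, "S3")]
print(summarize(good)["speakers"], summarize(over_split)["speakers"])
```

A mismatch between the reported speaker count and the number of people you know were in the room is usually the fastest signal of which system to trust for a given recording.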

Sources

Independent benchmarks:

Vendor pages:

Open source:

Comparisons:

hungson175/Speaker-Diarization-API-Research-20260509.md
