Speaker Diarization API Research — May 2026 (for Meeting Note app)
title Speaker Diarization for Meeting Note App — Best Options May 2026
date 2026-05-09
type research
status inbox
project meeting-note-app
related soniox-stt-service.md, soniox-context-injection.md

Speaker Diarization for Meeting Note App — May 2026

TL;DR

Your Soniox STT is excellent (Vietnamese WER 5.4% — best in class). Don't replace it. Add a separate diarization stage.

| Recommendation | Why | Cost | Effort |
|---|---|---|---|
| #1: Soniox (STT) + Pyannote AI Precision-2 (diarization), batch | Pyannote AI is independently #1 on DIHARD (11.2% DER). Soniox keeps your Vietnamese edge. Hybrid is ~30 lines of merge code. | Soniox + €0.112/hr (~$0.12) | 1–2 days |
| #2: Soniox + Pyannote community-1 (open source, self-host) | CC-BY-4.0 license, commercial use OK. AMI 17.0% / VoxConverse 11.2% DER. Free. | Soniox + GPU server | 2–3 days |
| #3: Switch entirely to AssemblyAI Universal-3 Pro Streaming | One vendor. Real-time diarization sub-300ms. 71% DER improvement on reverb in their July 2025 update. Vietnamese supported. | $0.57/hr streaming (STT+diarization) | 3–5 days migration |

Avoid: Google Cloud Speech-to-Text (50.2% DER on VoxConverse — broken). Rev AI (no Vietnamese). DiariZen open source (CC-BY-NC, blocks commercial use).


Why Soniox diarization is weak

From Soniox docs (speaker-diarization):

  • 15-speaker max per session (low ceiling for boardrooms)
  • No speaker count hint — can't tell the API "I have 3 speakers"
  • No channel separation flag
  • Architecture not disclosed (likely cluster-based, not modern EEND)
  • Soniox v4 async (29 Jan 2026) cites "improved diarization" but no DER number published
  • In Pyannote's own DIHARD benchmark, Soniox stt-async-preview-v1 ranked below Pyannote Precision-2 (no absolute number disclosed)

You're not imagining inconsistency — Soniox's diarization stack is legitimately a tier behind specialists.


The diarization quality leaderboard (independent benchmarks only)

Cited from Picovoice 2026 benchmark and SDBench arXiv 2509.26177, Sep 2025:

| Provider | DIHARD avg DER / VoxConverse DER | Mode | Vietnamese | License / price |
|---|---|---|---|---|
| Pyannote AI Precision-2 | 11.2% / 9.0% | Batch | Lang-agnostic | Commercial, €0.112/hr |
| Picovoice Falcon | 10.3% | RT + batch | Yes | Commercial / on-device |
| AWS Transcribe | 11.1% | Batch + streaming | Yes | $1.44–2.00/hr |
| AssemblyAI Universal-3 | self-reported 4.4–14.4% (internal sets) | RT + batch | Yes | $0.57/hr (incl. STT) |
| Pyannote community-1 (OSS) | AMI 17.0%, VoxConv 11.2% / 11.2% | Batch | Adequate | CC-BY-4.0 ✅ commercial |
| DiariZen-Large-s80 (OSS) | DIHARD 14.5% / 9.2% | Batch | English-biased | CC-BY-NC ❌ blocks commercial |
| Speechmatics | self-reported 3.9% (no third-party verify) | RT + batch | Unconfirmed | $1.04–1.35/hr |
| Soniox v4 | not disclosed (below Pyannote) | RT + batch | Yes (5.4% WER) | $0.10/hr |
| Deepgram Nova-3 | not disclosed | RT + batch | Unconfirmed | $0.46–0.82/hr |
| Google Chirp 3 | 50.2% (broken) | Batch only | Yes | $0.96/hr |
| Rev AI | | Batch + streaming | No | $0.18/hr |

Key: Speechmatics' "3.9% DER" is self-reported on their own test sets — not in any independent benchmark. Pyannote AI's number is from third-party academic work.
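
For readers comparing the DER figures above, the standard definition of diarization error rate is reproduced here as a reminder. This is the usual textbook formula, not something taken from the cited benchmarks, which may differ in details such as the forgiveness collar or how overlapped speech is scored:

```latex
\mathrm{DER} =
  \frac{T_{\text{false alarm}} + T_{\text{missed speech}} + T_{\text{speaker confusion}}}
       {T_{\text{total reference speech}}}
```

As a rough illustration, in a meeting with 50 minutes of actual speech, an 11.2% DER corresponds to about 0.112 × 50 ≈ 5.6 minutes of speech that is missed, falsely detected, or attributed to the wrong speaker.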


Architecture options for your Meeting Note app

Option A — Hybrid (RECOMMENDED): Soniox + Pyannote AI Precision-2

Audio stream
    ├──→ Soniox WebSocket (RT) ─→ Word tokens with start_ms/end_ms
    └──→ Buffered audio file ─→ Pyannote AI batch ─→ Speaker segments [(start,end,speaker_id)]
                                                        ↓
                                  Merge layer: assign each Soniox token
                                  to the speaker segment containing its midpoint

Mechanics:

  • Soniox already returns start_ms/end_ms per token (Soniox timestamps docs)
  • Pyannote AI returns [{start, end, speaker}] segments
  • Merge step is interval-tree lookup → ~30 lines of Python
  • Run diarization after call ends → zero latency impact, transcript ready 5–30s after meeting
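
The merge step described above can be sketched as follows. This is a minimal illustration, not the final implementation: it assumes Soniox tokens arrive as dicts with `start_ms`/`end_ms` (per the note above) and that diarization segments use seconds for `start`/`end`, which is an assumption to verify against the actual Pyannote AI response. The `Segment` type and `assign_speakers` function are hypothetical names. A sorted list with binary search stands in for a full interval tree.

```python
from bisect import bisect_right
from typing import NamedTuple


class Segment(NamedTuple):
    start: float   # seconds (assumed unit for diarization output)
    end: float     # seconds
    speaker: str


def assign_speakers(tokens, segments):
    """Label each token with the speaker whose segment contains its midpoint.

    Tokens whose midpoint falls in a gap between segments get speaker=None.
    """
    # Sort by start time so we can binary-search for candidate segments.
    segs = sorted(segments, key=lambda s: s.start)
    starts = [s.start for s in segs]
    labeled = []
    for tok in tokens:
        # Token times are in milliseconds; convert the midpoint to seconds.
        mid = (tok["start_ms"] + tok["end_ms"]) / 2 / 1000.0
        # Index of the last segment that starts at or before the midpoint.
        i = bisect_right(starts, mid) - 1
        speaker = None
        # Walk backwards so overlapping segments (overlapped speech) are
        # handled: the latest-starting segment containing the midpoint wins.
        for j in range(i, -1, -1):
            if mid < segs[j].end:
                speaker = segs[j].speaker
                break
        labeled.append({**tok, "speaker": speaker})
    return labeled
```

The backward walk is linear in the worst case (a midpoint that lands in silence), which should be acceptable for a single meeting. A real interval tree only becomes worthwhile if recordings contain many thousands of segments.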

Cost on a 1-hour meeting:

  • Soniox real-time: ~$0.12
  • Pyannote AI Precision-2: ~$0.12
  • Total: ~$0.24/hr (cheaper than AssemblyAI standalone at $0.57/hr)
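
To project these per-hour figures onto a monthly volume, a small calculation like the one below may help. The rates are the approximate ones quoted in this note; the function names and the 200-hour volume in the usage note are purely illustrative assumptions.

```python
def hybrid_cost_usd(audio_hours, soniox_per_hr=0.12, pyannote_per_hr=0.12):
    """Approximate cost of Soniox real-time STT plus Pyannote AI diarization."""
    return round(audio_hours * (soniox_per_hr + pyannote_per_hr), 2)


def single_vendor_cost_usd(audio_hours, rate_per_hr=0.57):
    """Approximate cost of AssemblyAI streaming (STT + diarization)."""
    return round(audio_hours * rate_per_hr, 2)


def monthly_savings_usd(audio_hours):
    """How much the hybrid saves versus the single-vendor option."""
    return round(single_vendor_cost_usd(audio_hours) - hybrid_cost_usd(audio_hours), 2)
```

For example, `monthly_savings_usd(200)` compares the two options at a hypothetical 200 hours of meetings per month. Note that connection-time billing on streaming APIs can push the real figure above the per-audio-hour rate.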

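Effort: 1–2 days. The merge logic is close to what WhisperX does (m-bain/whisperX), although WhisperX assigns each word to the speaker with the largest time overlap rather than using the token midpoint.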

Free trial: Pyannote AI gives 150 hours free on signup — enough to validate on your real meeting recordings before committing.

Pyannote AI pricing | Community-1 announcement, 4 May 2026


Option B — Self-hosted: Soniox + Pyannote community-1

Same architecture as Option A, but run Pyannote community-1 on your own GPU server.

  • License: CC-BY-4.0 — commercial use allowed with attribution
  • DER: AMI 17.0% / VoxConverse 11.2% (behind Precision-2's 9.0% on VoxConverse in the leaderboard above; the matching 11.2% is Precision-2's DIHARD average, a different dataset)
  • Compute: ~10–30x real-time on a T4 GPU (1h audio = 2–6 min processing)
  • Cost: just your GPU hosting (a single T4-class instance from a cloud GPU provider is ~$0.50/hr while active; verify current pricing with your provider)
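
The throughput and hosting figures above translate into a per-audio-hour cost as sketched below. This is back-of-the-envelope arithmetic under the note's own numbers (10–30x real time, ~$0.50/hr GPU); it ignores idle GPU time, storage, and engineering effort, and the function names are illustrative.

```python
def processing_minutes(audio_hours, realtime_factor):
    """Wall-clock minutes needed to diarize the audio at a given speed-up."""
    return audio_hours * 60 / realtime_factor


def gpu_cost_per_audio_hour(gpu_usd_per_hr, realtime_factor):
    """GPU dollars spent per hour of audio, counting only active compute."""
    return gpu_usd_per_hr / realtime_factor


if __name__ == "__main__":
    for factor in (10, 30):
        minutes = processing_minutes(1, factor)
        cost = gpu_cost_per_audio_hour(0.50, factor)
        print(f"{factor}x real time: {minutes:.0f} min per audio hour, ${cost:.3f} GPU cost")
```

At these assumed rates the active-compute cost lands in the range of a few cents per audio hour, below the ~$0.12/hr managed price, but only if the GPU is not billed while sitting idle.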

Effort: 2–3 days (deployment + queue management).

Use this if: You want to scale without paying per-hour Pyannote AI fees, or you need on-prem processing for data privacy.

pyannote/speaker-diarization-community-1 on HuggingFace


Option C — Single vendor: AssemblyAI Universal-3 Pro Streaming

Replace Soniox entirely with AssemblyAI.

Pros:

  • One API, one bill, one error surface
  • Real-time diarization at sub-300ms (Soniox's RT diarization is weak; this is genuinely live)
  • 71% DER reduction on reverberant audio in their July 2025 update
  • 99 languages including Vietnamese
  • Built-in summaries, sentiment, topic detection — meeting-app-native

Cons:

  • Vietnamese WER not independently benchmarked vs Soniox's 5.4% — you may lose transcription quality
  • $0.57/hr: about 5.7x Soniox's $0.10/hr base price alone, and roughly 2.4–2.6x the combined Soniox + Pyannote hybrid (~$0.22–0.24/hr)
  • Streaming billing is connection-time based (idle time counts)

Use this if: Live speaker labels during the meeting (not post-call) are critical and single-vendor simplicity outweighs the higher cost.

AssemblyAI Universal-3 Pro Streaming | Pricing


Option D — Streaming hybrid: Soniox + NVIDIA Sortformer

If you need live speaker labels (not post-call), the only open-source streaming option is NVIDIA Sortformer.

  • Released 18 Aug 2025 (NVIDIA blog)
  • 0.32s chunk latency
  • Hard cap: 4 speakers in current 4spk-v1 model — disqualifying for boardrooms
  • Requires NVIDIA GPU; production deployment via Riva (paid)
  • Trained English-dominant, Vietnamese untested

Verdict: Skip unless your meetings are always ≤4 people. Wait for next Sortformer release with higher speaker cap.


Decision framework

Three questions determine your choice:

  1. Do you need diarization labels DURING the meeting (live), or AFTER (within 30s of call ending)?

    • DURING → AssemblyAI (Option C) or Sortformer (D, if ≤4 speakers)
    • AFTER → Hybrid Soniox + Pyannote (Options A/B) — best quality + cheapest
  2. Do you keep Soniox's Vietnamese 5.4% WER, or accept untested Vietnamese WER on AssemblyAI?

    • Keep Soniox → Option A/B
    • Trust AssemblyAI Vietnamese (test first!) → Option C
  3. Will you self-host or use a cloud API?

    • Cloud, simplest → Option A (Soniox + Pyannote AI)
    • Self-host for cost / privacy → Option B (Soniox + community-1 OSS)

My recommendation: Option A. Lowest risk, the cheapest managed option, the strongest independently benchmarked diarization on DIHARD, and you keep your Soniox investment. The 150-hour free trial on Pyannote AI lets you validate before paying anything.


Action plan (this week)

  1. Sign up for Pyannote AI free trial (150 hours): https://www.pyannote.ai/
  2. Take 3 of your existing problem recordings (the ones where Soniox diarization failed)
  3. Run them through Pyannote AI Precision-2 batch endpoint — diarization output only
  4. Manually compare: Soniox-only diarization vs Pyannote-on-the-same-audio
  5. Decide: if Pyannote clearly wins on your real Vietnamese meeting room audio, build the merge layer. If not, test Option B with community-1 self-hosted, or test AssemblyAI's Vietnamese in parallel.
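
For the manual comparison in step 4, it may help to collapse speaker-labeled tokens into readable turns so the two labelings can be read side by side. The sketch below assumes word-level token dicts with `text`, `start_ms`, `end_ms`, and `speaker` keys; that shape is an assumption, and sub-word tokens would need different joining. `to_turns` and `format_turn` are hypothetical helper names.

```python
from itertools import groupby


def to_turns(labeled_tokens):
    """Collapse consecutive tokens from the same speaker into one turn."""
    turns = []
    for speaker, group in groupby(labeled_tokens, key=lambda t: t["speaker"]):
        group = list(group)
        turns.append({
            "speaker": speaker,
            "start_ms": group[0]["start_ms"],
            "end_ms": group[-1]["end_ms"],
            # Assumes word-level tokens; join with spaces for readability.
            "text": " ".join(t["text"] for t in group),
        })
    return turns


def format_turn(turn):
    """Render a turn as '[MM:SS] speaker: text' for side-by-side review."""
    total_seconds = turn["start_ms"] // 1000
    minutes, seconds = divmod(total_seconds, 60)
    speaker = turn["speaker"] or "UNKNOWN"
    return f"[{minutes:02d}:{seconds:02d}] {speaker}: {turn['text']}"
```

Running this once on Soniox's own speaker labels and once on the Pyannote-derived labels for the same recording gives two transcripts that can be diffed by eye. Speaker IDs will not match between systems, so compare where turns change rather than the label names.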

