This article details NeMo’s two diarization pipelines, ClusteringDiarizer and NeuralDiarizer (MSDD). It covers the algorithmic flow, scale definitions, fusion strategies, speaker counting, long‑form handling, and the exposed configuration surface, with practical examples.
- ClusteringDiarizer (unsupervised):
  - VAD → cut speech into multi-scale windows ("scales") → extract speaker embeddings → fuse information across scales → estimate the number of speakers → spectral clustering → RTTM.
  - The output assigns a single dominant speaker to each time step; there is no learned overlap model.
- NeuralDiarizer (MSDD) (learned overlap model on top):
  - Reuses ClusteringDiarizer’s products (the discovered speakers and the multiscale embeddings) and predicts, at each time step, which of the discovered speakers are active. Several speakers can be active at once, which yields an overlap-aware RTTM.
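
The difference between the two output styles can be illustrated with a toy sketch. This is not NeMo code: the per-step scores, the 0.5 threshold, the 0.5 s step, and all function names are assumptions made for illustration. It only shows how an exclusive decision (one speaker per step) and an independent per-speaker decision (several speakers per step) turn into different RTTM segments.

```python
from typing import Dict, List, Tuple

Segment = Tuple[float, float, str]  # (start_sec, end_sec, speaker)


def exclusive_decision(scores: List[List[float]], names: List[str]) -> List[List[str]]:
    """Clustering-style output: exactly one speaker per step (the top score)."""
    return [[names[max(range(len(row)), key=row.__getitem__)]] for row in scores]


def overlap_decision(
    scores: List[List[float]], names: List[str], threshold: float = 0.5
) -> List[List[str]]:
    """MSDD-style output: every speaker is thresholded independently."""
    return [[n for n, s in zip(names, row) if s >= threshold] for row in scores]


def to_segments(active: List[List[str]], step: float) -> List[Segment]:
    """Merge consecutive steps in which a speaker stays active into one segment."""
    segments: List[Segment] = []
    open_since: Dict[str, float] = {}
    # The trailing empty step is a sentinel that closes any open segment.
    for i, speakers in enumerate(active + [[]]):
        t = i * step
        for spk in list(open_since):
            if spk not in speakers:
                segments.append((open_since.pop(spk), t, spk))
        for spk in speakers:
            open_since.setdefault(spk, t)
    return sorted(segments)


def to_rttm(segments: List[Segment], file_id: str = "toy") -> List[str]:
    """Render segments as RTTM SPEAKER lines (onset and duration in seconds)."""
    return [
        f"SPEAKER {file_id} 1 {start:.3f} {end - start:.3f} <NA> <NA> {spk} <NA> <NA>"
        for start, end, spk in segments
    ]


# Hypothetical per-step activity scores for two discovered speakers.
names = ["speaker_0", "speaker_1"]
scores = [[0.9, 0.1], [0.8, 0.7], [0.2, 0.9], [0.1, 0.8]]
step = 0.5  # seconds per step (illustrative)

exclusive = to_segments(exclusive_decision(scores, names), step)
overlapped = to_segments(overlap_decision(scores, names), step)

print("\n".join(to_rttm(exclusive)))
print("\n".join(to_rttm(overlapped)))
```

With these toy scores, the exclusive decision hands the second step entirely to `speaker_0`, so the two segments meet at 1.0 s without overlapping. The thresholded decision keeps both speakers active in that step, so `speaker_1` starts at 0.5 s while `speaker_0` is still talking.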