This article details NeMo’s diarization pipelines—ClusteringDiarizer and NeuralDiarizer (MSDD)—covering the algorithmic flow, scale definitions, fusion strategies, speaker counting, long‑form handling, and the exposed configuration surfaces, with practical examples.
- ClusteringDiarizer (unsupervised):
- VAD → cut speech into multi-scale windows ("scales") → extract speaker embeddings → fuse information across scales → estimate how many speakers → spectral clustering → RTTM.
- Output has one dominant speaker at each time (no learned overlap model).
- NeuralDiarizer (MSDD) (learned overlap model on top):
- Reuses ClusteringDiarizer’s products (speakers discovered, multiscale embeddings) and predicts, at each time step, which of the discovered speakers are active. Multiple can be active → overlap-aware RTTM.
We’ll first unpack ClusteringDiarizer step-by-step, then show how NeuralDiarizer builds on it.
- VAD (Voice Activity Detection) — “find speech time spans”
- Input: long audio (minutes to hours). To avoid OOM, NeMo splits it into big chunks (e.g., ~50s). Adjacent chunks overlap by one VAD window for context.
- Inside each chunk, the VAD model is fully convolutional: it slides a small analysis window (e.g., 0.20s) every shift (e.g., 0.01s) and outputs a speech probability per frame.
- To remove duplicated frames at chunk borders, NeMo trims half-a-window from the overlapping ends. The trimmed pieces from adjacent chunks fit together like puzzle pieces.
Worked example (intuitive):
- window = 0.20s, shift = 0.01s → time_unit = 20 frames per window.
- start chunk keeps everything except the last 10 frames; middle chunks drop 10 at both ends; end chunk drops the first 10. The concatenation has no duplicates.
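A minimal sketch of that trimming arithmetic in plain Python (the numbers mirror the worked example; `trim` and `stitch` are illustrative names, not NeMo functions):

```python
# Stitch overlapping VAD chunks by dropping the duplicated half-window at each border.
window, shift = 0.20, 0.01                      # seconds (example values from above)
half_window_frames = int(window / shift / 2)    # 10 frames

def trim(chunk_probs, is_first, is_last):
    """Drop the duplicated border frames from one chunk's per-frame probabilities."""
    start = 0 if is_first else half_window_frames
    end = len(chunk_probs) if is_last else len(chunk_probs) - half_window_frames
    return chunk_probs[start:end]

def stitch(chunks):
    """Concatenate trimmed chunks into one duplicate-free probability sequence."""
    out = []
    for i, chunk in enumerate(chunks):
        out.extend(trim(chunk, is_first=(i == 0), is_last=(i == len(chunks) - 1)))
    return out
```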
How the dense probability sequence is computed:
- Think of the model as a moving “magnifying glass” that checks every tiny step (`shift`) and asks “how speech‑like is the audio around here (inside a `window`)?”
- For a 50 s chunk with `window = 0.20 s` and `shift = 0.01 s`, the model produces about 5,000 values: one speech probability every 10 ms.
- Because the network is fully convolutional over time, it computes all those probabilities for the whole chunk in a single forward pass (no Python loop). The trimming described above only removes duplicated border frames; all remaining frames are kept and then concatenated across chunks into one long list.
- Optional: a smoothing step can reproject those probabilities onto a strict 10 ms grid and combine overlapping votes by mean/median. If smoothing is off, the grid is simply the configured `shift`.
From probabilities to segments (inside VAD):
- Still within the VAD stage, the dense probability sequence is converted into a binary speech timeline via hysteresis thresholds (onset/offset), optional boundary padding, and duration/gap rules (filter very short speech, fill tiny gaps). This yields contiguous speech runs.
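A simplified sketch of this binarization step (hypothetical helper; the threshold and duration values are example numbers, not NeMo defaults, and boundary padding is omitted):

```python
def probs_to_segments(probs, frame_shift=0.01, onset=0.8, offset=0.4,
                      min_duration_on=0.2, min_duration_off=0.2):
    """Hysteresis thresholding: switch speech on at `onset`, off at `offset`,
    then fill short gaps and drop short segments. Returns [start, end] pairs in seconds."""
    segments, start, in_speech = [], None, False
    for i, p in enumerate(probs):
        t = i * frame_shift
        if not in_speech and p >= onset:
            in_speech, start = True, t
        elif in_speech and p < offset:
            in_speech = False
            segments.append([start, t])
    if in_speech:
        segments.append([start, len(probs) * frame_shift])

    merged = []
    for seg in segments:
        # Fill gaps shorter than min_duration_off by merging with the previous segment.
        if merged and seg[0] - merged[-1][1] < min_duration_off:
            merged[-1][1] = seg[1]
        else:
            merged.append(seg)
    # Drop segments shorter than min_duration_on.
    return [s for s in merged if s[1] - s[0] >= min_duration_on]
```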
- Outputs (produced by VAD):
  - `<uniq>.frame`: per-frame probabilities (dense).
  - `<uniq>.txt`: merged speech segments — far fewer lines than `.frame`.
  - `vad_out.json`: JSON manifest of speech spans (audio path, offset, duration). This drives where embeddings will be extracted (no speaker IDs yet).
Note: VAD segments are speaker‑agnostic. A single segment marks a contiguous region that contains speech and may include multiple speakers at different times (and even overlapping speech). These segments only gate where embeddings are extracted; speaker attribution is decided later by clustering (and MSDD if enabled).
- Multiscale segmentation — “see speech through several lenses”
- A scale is just a slicing recipe: `(window, shift)`. Longer windows are stable/coarse; shorter windows are reactive/fine.
- NeMo slices only within VAD spans. For each scale `k` it creates `speaker_outputs/subsegments_scale{k}.json`, where each line is one subsegment: `{audio, offset, duration}`.
- Subsegments have a fixed maximum length: most have length `window`; the last one in a VAD span may be shorter (truncated at the span end). Very short tails are dropped by `min_subsegment_duration`.
- The shortest window is the “base scale.” Its subsegments form the master timeline where final labels will live.
Example scales: `window = [1.5, 1.0, 0.5, 0.25]`, `shift = [0.75, 0.5, 0.25, 0.125]` (descending). Scale 3 (0.25/0.125) is the base.
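A small sketch of the slicing rule for a single VAD span and one scale (illustrative only; NeMo's actual subsegmentation code also writes the per-line JSON manifest):

```python
def slice_span(span_start, span_end, window, shift, min_subsegment_duration=0.05):
    """Cut one VAD speech span into fixed-max-length subsegments for one scale."""
    subsegments, t = [], span_start
    while t < span_end:
        duration = min(window, span_end - t)          # last piece may be truncated
        if duration >= min_subsegment_duration:       # drop very short tails
            subsegments.append({"offset": round(t, 3), "duration": round(duration, 3)})
        t += shift
    return subsegments

# Example: a 2.0 s speech span sliced with the base scale (0.25 s window, 0.125 s shift).
print(slice_span(10.0, 12.0, window=0.25, shift=0.125))
```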
- Speaker embeddings
- For every subsegment in each scale, extract a speaker embedding using `EncDecSpeakerLabelModel` (e.g., `titanet_large`).
- Save per-scale embeddings and per-subsegment timestamps in memory; optionally save `.pkl` files under `speaker_outputs/embeddings/`.
- Internals: embedders collapse time with temporal pooling (e.g., statistics pooling or attentive statistics pooling) to produce one fixed‑dimensional vector per subsegment window. Longer windows pool more heterogeneous frames and can smooth away timbral cues; shorter windows are noisier. Mid‑length windows usually strike the best identity/robustness trade‑off.
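For orientation, a minimal sketch of a single embedding extraction with a pretrained model (the file name is hypothetical, the exact output shape depends on the model, and the real pipeline batches all subsegments per scale rather than calling `get_embedding` per file):

```python
from nemo.collections.asr.models import EncDecSpeakerLabelModel

# Load a pretrained speaker embedding model (here TitaNet-L).
model = EncDecSpeakerLabelModel.from_pretrained("titanet_large")
model.eval()

# One fixed-dimensional vector for one (pre-cut, hypothetical) subsegment file.
emb = model.get_embedding("subsegment_000.wav")
print(emb.shape)   # e.g., a 192-dimensional TitaNet-L embedding
```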
- Align scales to base scale
- Compute the center time of each subsegment. For each base‑scale subsegment, pick the closest longer‑scale subsegment (per scale) by absolute distance between centers.
- Intuition: “for this base slice, which long window sits at the same time?” Each long window is then treated as repeating for however many base slices map to it.
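A numpy sketch of that alignment for one longer scale versus the base scale (illustrative; NeMo builds such a mapping per session for every scale):

```python
import numpy as np

def align_to_base(long_scale_segs, base_segs):
    """For each base subsegment, return the index of the closest longer-scale
    subsegment by absolute distance between center times.
    Each *_segs entry is an (offset, duration) pair in seconds."""
    long_centers = np.array([o + d / 2 for o, d in long_scale_segs])
    base_centers = np.array([o + d / 2 for o, d in base_segs])
    # mapping[i] = longer-scale index assigned to base subsegment i
    return np.abs(base_centers[:, None] - long_centers[None, :]).argmin(axis=1)
```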
- Fuse information across scales: NeMo uses one of two strategies (a minimal sketch of affinity fusion follows the two strategies):
- Affinity fusion (short‑form) — “blend the relationship graphs”
- Per scale, compute a cosine‑similarity matrix and min–max normalize it.
- Expand that matrix to base‑scale size using the alignment mapping (long windows are repeated over the base indices they cover).
- Weighted sum across scales (`multiscale_weights`) → one fused affinity for all base subsegments.
- Why this is nice: you preserve each scale’s geometry after normalization; small scales contribute reactiveness; long scales contribute stability.
- Feature (embedding) fusion (long‑form) — “blend the descriptors first”
- For each base subsegment, compute a weighted sum of the aligned per‑scale embeddings to get a single fused embedding at the base scale.
- Within windows, build an affinity on these fused embeddings.
- Why here: avoids building one huge global N×N matrix when N is very large.
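A minimal numpy sketch of affinity fusion under the assumptions above (the mapping arrays are those produced by the alignment step, with the base scale mapping being the identity; NeMo's implementation is more involved):

```python
import numpy as np

def fuse_affinities(per_scale_embs, per_scale_maps, weights):
    """Affinity fusion: per-scale cosine affinity, min-max normalization,
    expansion to base-scale size via the alignment mapping, weighted sum."""
    fused = None
    for embs, mapping, w in zip(per_scale_embs, per_scale_maps, weights):
        e = embs / np.linalg.norm(embs, axis=1, keepdims=True)
        aff = e @ e.T                                              # cosine similarities
        aff = (aff - aff.min()) / (aff.max() - aff.min() + 1e-8)   # min-max normalize
        aff_base = aff[np.ix_(mapping, mapping)]                   # repeat long windows over base indices
        fused = w * aff_base if fused is None else fused + w * aff_base
    return fused
```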
- Estimate number of speakers (NME‑SC) and cluster
- Build a k‑NN graph from the affinity (keep the top‑p neighbors per row, symmetrize). The NME‑SC procedure (Normalized Maximum Eigengap Spectral Clustering) chooses `p` and the number of speakers `K` by looking at Laplacian eigen‑gaps.
- Intuition: if speakers are well separated, the graph has `K` well‑formed communities, and the eigenvalues show a big jump between `λ_K` and `λ_{K+1}`.
- After picking `K`, perform spectral clustering (compute spectral embeddings → k‑means++ with several random seeds → majority vote) to assign a speaker label to each base subsegment.
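A heavily simplified eigengap sketch (NME‑SC additionally normalizes the gap and scans several p values; this only illustrates the “count K at the largest Laplacian eigengap” idea):

```python
import numpy as np

def estimate_num_speakers(affinity, p, max_num_speakers=8):
    """Binarize the affinity to a top-p graph, symmetrize, build the graph
    Laplacian, and return K at the largest eigengap."""
    n = affinity.shape[0]
    binarized = np.zeros_like(affinity)
    for i in range(n):
        binarized[i, np.argsort(affinity[i])[-p:]] = 1.0   # keep p strongest neighbors per row
    sym = np.maximum(binarized, binarized.T)               # symmetrize
    laplacian = np.diag(sym.sum(axis=1)) - sym
    eigvals = np.sort(np.linalg.eigvalsh(laplacian))
    gaps = np.diff(eigvals[: max_num_speakers + 1])
    return int(np.argmax(gaps)) + 1                        # largest gap sits between λ_K and λ_{K+1}
```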
- Write outputs
- Merge contiguous same-speaker base subsegments into speaker turns and write RTTM under `pred_rttms/`.
- Save a single label file at the base scale: `speaker_outputs/subsegments_scale{base}_cluster.label` (includes raw per-subsegment `[start end speaker_k]` entries for all sessions). There is only one cluster-label file because clustering operates at the base scale.
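A sketch of the merge-and-write step (the merging rule is simplified; the RTTM field layout is the standard one):

```python
def labels_to_rttm(uniq_id, base_segments, labels):
    """Merge contiguous same-speaker base subsegments (sorted by start time)
    into speaker turns and format them as RTTM lines.
    base_segments: list of (start, end); labels: speaker index per subsegment."""
    turns = []
    for (start, end), spk in zip(base_segments, labels):
        if turns and turns[-1][2] == spk and start <= turns[-1][1]:
            turns[-1][1] = max(turns[-1][1], end)          # extend the current turn
        else:
            turns.append([start, end, spk])
    return [
        f"SPEAKER {uniq_id} 1 {s:.3f} {e - s:.3f} <NA> <NA> speaker_{spk} <NA> <NA>"
        for s, e, spk in turns
    ]
```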
- For very long audio (many base subsegments), building one global N×N affinity is expensive. NeMo switches to a long-form mode:
- Split the base fused embeddings into windows of size `embeddings_per_chunk`.
- Overcluster each window into `chunk_cluster_count` local groups (more than the expected number of speakers) using a local M×M affinity.
- Replace each local group by its centroid (and remember which original indices map to it).
- Concatenate all centroids and run one global clustering to get labels for centroids.
- Unpack: assign each original subsegment the label of its group’s centroid.
This reduces time/memory from O(N²) to roughly O(N·M + C²) with M=embeddings_per_chunk and C≪N (number of centroids).
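A rough sketch of the long-form flow, using scikit-learn as an illustrative stand-in for NeMo's spectral clustering (function and variable names are hypothetical):

```python
import numpy as np
from sklearn.cluster import KMeans, SpectralClustering

def longform_cluster(embs, embeddings_per_chunk=10000, chunk_cluster_count=80, num_speakers=8):
    """Overcluster each chunk locally, cluster the centroids globally,
    then propagate centroid labels back to the original subsegments."""
    centroids, owners = [], []          # owners[j] = original indices represented by centroid j
    for start in range(0, len(embs), embeddings_per_chunk):
        chunk = embs[start:start + embeddings_per_chunk]
        k = min(chunk_cluster_count, len(chunk))
        local = KMeans(n_clusters=k, n_init=10).fit(chunk)       # local overclustering
        for c in range(k):
            mask = local.labels_ == c
            centroids.append(chunk[mask].mean(axis=0))
            owners.append(np.where(mask)[0] + start)
    centroids = np.stack(centroids)
    centroid_labels = SpectralClustering(n_clusters=num_speakers,
                                         affinity="nearest_neighbors").fit_predict(centroids)
    labels = np.empty(len(embs), dtype=int)
    for label, idx in zip(centroid_labels, owners):              # unpack back to subsegments
        labels[idx] = label
    return labels
```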
Configuration surfaces exposed in the diarizer YAML:
- VAD
  - `diarizer.vad.model_path`
  - `diarizer.vad.parameters`: `window_length_in_sec`, `shift_length_in_sec`, `smoothing`, `overlap`, `onset`, `offset`, `min_duration_on`, `min_duration_off`, padding
- Multiscale segmentation / embeddings
  - `diarizer.speaker_embeddings.model_path` (e.g., `titanet_large`)
  - `diarizer.speaker_embeddings.parameters.window_length_in_sec` (list or float)
  - `diarizer.speaker_embeddings.parameters.shift_length_in_sec` (list or float)
  - `diarizer.speaker_embeddings.parameters.multiscale_weights`
- Clustering (counting and spectral)
  - `diarizer.clustering.parameters.max_num_speakers`
  - `diarizer.clustering.parameters.oracle_num_speakers` (boolean; uses `num_speakers` from the manifest when true)
  - `diarizer.clustering.parameters.max_rp_threshold` (upper bound for p/N)
  - `diarizer.clustering.parameters.sparse_search_volume` (how many p values to try when sparse search is used)
  - Long-form only: `embeddings_per_chunk`, `chunk_cluster_count`
Note on advanced NME options: the underlying clustering code supports additional knobs (`fixed_thres`, `sparse_search`, `maj_vote_spk_count`, `nme_mat_size`). In the stock ClusteringDiarizer call site these are not forwarded from YAML by default, so changing them in config may not take effect unless you patch the call to pass them through (see “Advanced: wiring fixed_thres”).
- With default NME‑SC (no fixed neighbor ratio):
- Scan candidate p values (a subset up to `max_rp_threshold`, controlled by `sparse_search_volume`).
- For each p, build the k‑NN graph and compute Laplacian eigenvalues.
- Pick the number of speakers as the argmax eigengap.
- Use that K for spectral clustering.
- With a fixed neighbor ratio (advanced): set `fixed_thres` to bypass the p search; the k‑NN density becomes `p = floor(N · fixed_thres)`. Smaller `fixed_thres` → sparser graph → typically more clusters (up to caps/graph connectivity).
- With oracle: set `oracle_num_speakers: true` and `num_speakers` in your manifest; clustering runs with that K exactly.
Concrete example (counting intuition):
- Suppose your base grid has N = 10,000 subsegments. If `fixed_thres = 0.12`, you keep ~1,200 neighbors per row. If speakers frequently switch, a lower neighbor ratio (e.g., 0.12) often separates communities better than a very dense graph (e.g., 0.35).
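The arithmetic behind that example, spelled out:

```python
import math

N = 10_000          # base-scale subsegments in the session
fixed_thres = 0.12  # fixed neighbor ratio (bypasses the NME p-search)
p = math.floor(N * fixed_thres)
print(p)            # 1200 neighbors kept per affinity row
```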
- Under‑counting (e.g., detects 7–8 vs the expected 11)
- Raise `max_num_speakers` well above the expected count (e.g., 32–64).
- Add a shorter base scale and slightly increase its weight.
- Use a stronger embedder (e.g., `titanet_large`).
- Broaden the NME search (`max_rp_threshold` up to 0.4–0.5, increase `sparse_search_volume`).
- Long-form: increase `chunk_cluster_count` (≥ 80–128) and set `embeddings_per_chunk` to ~8k–12k.
- Advanced: wire and sweep `fixed_thres` (e.g., 0.10–0.20) with `maj_vote_spk_count=false` and a higher `nme_mat_size`.
- Data with little/no overlap
- ClusteringDiarizer is sufficient; NeuralDiarizer won’t add speakers beyond K from clustering.
Files produced by a ClusteringDiarizer run:
- `vad_outputs/*.frame`: per-frame speech posteriors
- `vad_outputs/*/.../*.median` or `.mean`: smoothed posteriors (optional)
- `vad_outputs/seg_output-.../*.txt`: speech segments
- `speaker_outputs/subsegments_scale{k}.json`: per-scale subsegments
- `speaker_outputs/embeddings/subsegments_scale{k}_embeddings.pkl`: optional saved embeddings
- `pred_rttms/<uniq>.rttm`: diarization output
- `speaker_outputs/subsegments_scale{base}_cluster.label`: per-base-subsegment labels for all sessions
NeuralDiarizer (MSDD) — how it builds on ClusteringDiarizer:
- Uses the clustering result (speaker inventory K) and per-scale embeddings to build:
- A multi-scale embedding sequence aligned to the base scale.
- Cluster‑average embeddings per speaker per scale (references).
- Runs a learned multiscale diarization decoder (MSDD) that predicts, at each base time step, a probability per speaker of being active. Multiple speakers can be active → overlap-aware diarization.
- Prepare encodings via ClusteringDiarizer
- `ClusterEmbedding.prepare_cluster_embs_infer()` runs the clustering pipeline to obtain subsegments, embeddings, base-scale labels, and the scale mapping.
- Compute cluster‑average embeddings per speaker/scale using the base labels and cross-scale mapping.
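A rough numpy sketch of the cluster-average computation for one scale (the array layout is assumed for illustration; NeMo's ClusterEmbedding does this per session with its own bookkeeping):

```python
import numpy as np

def cluster_average_embeddings(scale_embs, scale_map, base_labels, num_speakers):
    """Average one scale's subsegment embeddings per speaker.
    scale_map[i]   = index of the scale-level subsegment aligned to base subsegment i
    base_labels[i] = clustering label of base subsegment i"""
    avg = np.zeros((num_speakers, scale_embs.shape[1]))
    for spk in range(num_speakers):
        idx = np.unique(scale_map[base_labels == spk])   # this speaker's subsegments at this scale
        if len(idx):
            avg[spk] = scale_embs[idx].mean(axis=0)
    return avg
```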
- Pairwise decoder
- MSDD is trained to predict activity for 2 speakers at a time (`num_spks_per_model=2`).
- For K clustered speakers, run the decoder for all speaker pairs, producing pairwise time‑series.
- Merge (average) the pairwise results into a single `[T, K]` matrix of per‑speaker probabilities.
Intuition: think of it as asking “of this pair, who’s speaking now?” for every pair, then combining the answers to get a per‑speaker probability track.
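A small sketch of that merge, averaging pairwise outputs into a per-speaker track (the dict layout is an assumption for illustration):

```python
import numpy as np

def merge_pairwise(pair_probs, num_speakers):
    """Average pairwise MSDD outputs into one [T, K] matrix.
    pair_probs: {(i, j): array of shape [T, 2]} for every speaker pair i < j."""
    T = next(iter(pair_probs.values())).shape[0]
    sums = np.zeros((T, num_speakers))
    counts = np.zeros(num_speakers)
    for (i, j), probs in pair_probs.items():
        sums[:, i] += probs[:, 0]
        sums[:, j] += probs[:, 1]
        counts[i] += 1
        counts[j] += 1
    return sums / counts          # each speaker appears in K - 1 pairs
```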
- Postprocessing and RTTM
- Threshold per speaker (optionally adaptive by K), optionally combine with the clustering labels (`use_clus_as_main`), limit the maximum number of overlapping speakers, and render overlap-aware RTTMs.
- MSDD does not create new speakers; it refines activity among the K speakers discovered by clustering. If K is under-estimated, fix clustering first (or use oracle K).
- Use NeuralDiarizer when overlap is frequent or you want learned smoothing of per‑speaker activity. For mostly non‑overlapping speech, ClusteringDiarizer typically suffices.
Key MSDD inference parameters:
- `infer_batch_size`, `seq_eval_mode`
- `sigmoid_threshold` (list to sweep), `use_adaptive_thres`
- `use_clus_as_main`, `max_overlap_spks`, `overlap_infer_spk_limit`
- `split_infer`, `diar_window_length` (split and stitch very long sessions)
Advanced: wiring fixed_thres
The clustering backend supports `fixed_thres` (bypass the p‑search) and other knobs (`sparse_search`, `maj_vote_spk_count`, `nme_mat_size`). In the stock `perform_clustering(...)` call these are not forwarded from the diarizer YAML. To use them, patch the call to `LongFormSpeakerClustering.forward_infer(...)` and pass, for example, `fixed_thres=float(clustering_params.get('fixed_thres', -1.0))`. Then add `fixed_thres` to your config under `diarizer.clustering.parameters`. Smaller values (~0.10–0.20) tend to produce more clusters (sparser graph), up to `max_num_speakers`. Increase `nme_mat_size` to reduce subsampling effects.
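One way the patched call could look (a sketch only: copy the surrounding argument list from your NeMo version's `perform_clustering`, since names can differ across releases; only the `fixed_thres` line is new):

```python
# Inside your copy of perform_clustering(...): forward the extra knob to
# LongFormSpeakerClustering.forward_infer. All arguments except fixed_thres
# mirror the existing call in your NeMo version.
cluster_labels = speaker_clustering.forward_infer(
    embeddings_in_scales=embeddings_in_scales,
    timestamps_in_scales=timestamps_in_scales,
    multiscale_segment_counts=multiscale_segment_counts,
    multiscale_weights=multiscale_weights,
    oracle_num_speakers=int(oracle_num_speakers),
    max_num_speakers=int(clustering_params.max_num_speakers),
    max_rp_threshold=float(clustering_params.max_rp_threshold),
    sparse_search_volume=int(clustering_params.sparse_search_volume),
    # New: read the knob from diarizer.clustering.parameters (-1.0 keeps the default p-search).
    fixed_thres=float(clustering_params.get('fixed_thres', -1.0)),
)
```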
- Ensure many base-scale subsegments: shorter windows and smaller shifts increase segment density, which helps counting and boundary accuracy.
- Prefer stronger embedding models (`titanet_large`) for mixed or noisy domains.
- For long recordings, raise `chunk_cluster_count` and choose an `embeddings_per_chunk` that fits memory.
- Always sanity-check with `oracle_num_speakers=true` once; if the quality is good at the correct K, focus on counting (NME) rather than separation.
Minimal multiscale setup (stable + reactive):

```yaml
diarizer:
  speaker_embeddings:
    model_path: titanet_large
    parameters:
      window_length_in_sec: [1.5, 1.0, 0.5, 0.25]
      shift_length_in_sec: [0.75, 0.5, 0.25, 0.125]
      multiscale_weights: [1.0, 1.0, 1.0, 1.2]
  clustering:
    parameters:
      max_num_speakers: 32
      max_rp_threshold: 0.25
      sparse_search_volume: 60
      embeddings_per_chunk: 10000
      chunk_cluster_count: 80
```

Counting underestimates K? (advanced; after wiring fixed_thres):
```yaml
diarizer:
  clustering:
    parameters:
      max_num_speakers: 64
      # Bypass NME search and set p ≈ fixed_thres · N
      fixed_thres: 0.14   # sweep 0.10–0.20
      # Optional, to make the fixed ratio more faithful:
      # nme_mat_size: 2000
      embeddings_per_chunk: 10000
      chunk_cluster_count: 128
```

Oracle K sanity check:
```yaml
diarizer:
  clustering:
    parameters:
      oracle_num_speakers: true
      # In your manifest, add per file: "num_speakers": 11
```

Glossary:
- Scale: subsegment slicing rule `(window, shift)` applied only on VAD spans. Shortest scale = base.
- Base scale: shortest window; the final label timeline.
- Affinity fusion: weighted sum of per-scale similarity matrices, expanded to base size.
- Feature fusion: weighted sum of per-scale embeddings → single fused base embedding sequence.
- NME‑SC: method that picks k‑NN density and speaker count by maximizing Laplacian eigengap.
- Overclustering: local “too many clusters” per window to produce pure centroids for cheap global clustering.
- MSDD: multiscale diarization decoder predicting per‑speaker activity over time; overlap capable; uses speakers discovered by clustering.
- In practice we observed that male voices are differentiated more easily, whereas multiple female speakers frequently collapse into a single cluster.
- Likely causes:
- Embedding space geometry: female voices occupy a tighter acoustic region (F0/formants), making cosine affinities look too similar; communities blur and NME‑SC undercounts.
- Graph density: denser k‑NN graphs (larger p) connect speakers across communities; eigengap then favors fewer clusters.
- Segmentation extremes: very short base windows emphasize transient similarities; very long windows average timbre and mask differences.
- Mitigations:
- Prefer strong embedders (e.g., `titanet_large`) or domain‑tuned models trained with balanced female data.
- Rebalance scale weights toward mid windows (e.g., 0.5–1.5 s) and down‑weight the extremes.
- In long‑form mode, raise `chunk_cluster_count` (e.g., ≥ 128) to avoid premature merges.
- If possible, expose and sweep `fixed_thres` to enforce sparser k‑NN graphs (≈ 0.10–0.16), which tends to separate communities.
- Applying the tips and the tuning playbook typically resulted in only small changes to the estimated number of speakers. In particular, introducing very short base scales (< 1.0 s) often decreased the estimated count and degraded overall diarization on my data.
- In practice, the estimated speaker count is largely governed by NME‑SC’s eigengap analysis. Usual tuning affects it marginally; it rarely jumps from, say, 7–8 to 11 on its own.
- To materially steer the count, it is often necessary to expose and use parameters like `fixed_thres` (and related knobs), i.e., modify the code path to bypass the standard NME scan and set the graph density directly.