Doc about the NeMo diarization pipeline

How NeMo Diarizer Works

This article details NeMo’s diarization pipelines—ClusteringDiarizer and NeuralDiarizer (MSDD)—covering the algorithmic flow, scale definitions, fusion strategies, speaker counting, long‑form handling, and the exposed configuration surfaces, with practical examples.

Two pipelines, one foundation

  • ClusteringDiarizer (unsupervised):
    • VAD → cut speech into multi-scale windows ("scales") → extract speaker embeddings → fuse information across scales → estimate how many speakers → spectral clustering → RTTM.
    • The output assigns one dominant speaker at each point in time (no learned overlap model).
  • NeuralDiarizer (MSDD) (learned overlap model on top):
    • Reuses ClusteringDiarizer’s products (speakers discovered, multiscale embeddings) and predicts, at each time step, which of the discovered speakers are active. Multiple can be active → overlap-aware RTTM.

We’ll first unpack ClusteringDiarizer step-by-step, then show how NeuralDiarizer builds on it.
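
Both pipelines are driven by the same diarizer config. Below is a minimal sketch of invoking them, assuming the standard NeMo entry points (ClusteringDiarizer, NeuralDiarizer) and a YAML laid out like the recipes later in this doc; the paths are placeholders.

# Minimal sketch: run either pipeline from a diarization YAML (paths are placeholders).
from omegaconf import OmegaConf
from nemo.collections.asr.models import ClusteringDiarizer
from nemo.collections.asr.models.msdd_models import NeuralDiarizer

cfg = OmegaConf.load("diar_infer.yaml")            # diarizer.* config, as in the recipes below
cfg.diarizer.manifest_filepath = "manifest.json"   # one JSON line per audio file
cfg.diarizer.out_dir = "./diar_outputs"

# Unsupervised pipeline: VAD -> multiscale embeddings -> NME-SC clustering -> RTTM
ClusteringDiarizer(cfg=cfg).diarize()

# Overlap-aware pipeline: reuses the clustering products, then runs the MSDD decoder
NeuralDiarizer(cfg=cfg).diarize()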


ClusteringDiarizer

Pipeline steps

  1. VAD (Voice Activity Detection) — “find speech time spans”

    • Input: long audio (minutes to hours). To avoid OOM, NeMo splits it into big chunks (e.g., ~50s). Adjacent chunks overlap by one VAD window for context.
    • Inside each chunk, the VAD model is fully convolutional: it slides a small analysis window (e.g., 0.20s) every shift (e.g., 0.01s) and outputs a speech probability per frame.
    • To remove duplicated frames at chunk borders, NeMo trims half a window from the overlapping ends; the trimmed pieces from adjacent chunks then fit together like puzzle pieces.

    Worked example (intuitive):

    • window = 0.20s, shift = 0.01s → time_unit = 20 frames per window.
    • The start chunk keeps everything except its last 10 frames; middle chunks drop 10 frames at both ends; the end chunk drops its first 10. The concatenation then has no duplicates.

    How the dense probability sequence is computed:

    • Think of the model as a moving “magnifying glass” that checks every tiny step (shift) and asks “how speech‑like is the audio around here (inside a window)?”
    • For a 50 s chunk with window=0.20 s and shift=0.01 s, the model produces about 5,000 values: one speech probability every 10 ms.
    • Because the network is fully convolutional over time, it computes all those probabilities for the whole chunk in a single forward pass (no Python loop). The trimming described above only removes duplicated border frames; all remaining frames are kept and then concatenated across chunks into one long list.
    • Optional: a smoothing step can reproject those probabilities onto a strict 10 ms grid and combine overlapping votes by mean/median. If smoothing is off, the grid is simply the configured shift.

    From probabilities to segments (inside VAD):

    • Still within the VAD stage, the dense probability sequence is converted into a binary speech timeline via hysteresis thresholds (onset/offset), optional boundary padding, and duration/gap rules (filter very short speech, fill tiny gaps). This yields contiguous speech runs.
    • Outputs (produced by VAD):
      • <uniq>.frame: per-frame probabilities (dense).
      • <uniq>.txt: merged speech segments — far fewer lines than .frame.
      • vad_out.json: JSON manifest of speech spans (audio path, offset, duration). This drives where embeddings will be extracted (no speaker IDs yet).

    Note: VAD segments are speaker‑agnostic. A single segment marks a contiguous region that contains speech and may include multiple speakers at different times (and even overlapping speech). These segments only gate where embeddings are extracted; speaker attribution is decided later by clustering (and MSDD if enabled).
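
    The probability-to-segment conversion can be pictured with a small self-contained sketch. This is illustrative only, not NeMo's exact code; the parameter names mirror the onset/offset/min_duration/padding knobs listed under "Exposed parameters".

    # Illustrative VAD postprocessing: hysteresis thresholds plus duration/gap rules.
    # frame_probs holds one speech probability every `shift` seconds.
    def probs_to_segments(frame_probs, shift=0.01, onset=0.8, offset=0.6,
                          min_duration_on=0.2, min_duration_off=0.2, padding=0.0):
        segments, start, in_speech = [], None, False
        for i, p in enumerate(frame_probs):
            t = i * shift
            if not in_speech and p >= onset:        # enter speech when prob rises above onset
                in_speech, start = True, t
            elif in_speech and p < offset:          # leave speech when prob falls below offset
                in_speech = False
                segments.append([start, t])
        if in_speech:
            segments.append([start, len(frame_probs) * shift])

        merged = []                                 # fill tiny gaps between speech runs
        for seg in segments:
            if merged and seg[0] - merged[-1][1] < min_duration_off:
                merged[-1][1] = seg[1]
            else:
                merged.append(seg)
        # drop very short speech runs, then pad the boundaries
        return [[max(0.0, s - padding), e + padding] for s, e in merged if e - s >= min_duration_on]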

  2. Multiscale segmentation — “see speech through several lenses”

    • A scale is just a slicing recipe: (window, shift). Longer windows are stable/coarse; shorter windows are reactive/fine.
    • NeMo slices only within VAD spans. For each scale k it creates speaker_outputs/subsegments_scale{k}.json, where each line is one subsegment: {audio, offset, duration}.
    • Subsegments have a fixed maximum length: most are exactly one window long, and the last one in a VAD span may be shorter (truncated at the span end). Very short tails are dropped via min_subsegment_duration.
    • The shortest window is the “base scale.” Its subsegments form the master timeline where final labels will live.

    Example scales:

    • window = [1.5, 1.0, 0.5, 0.25], shift = [0.75, 0.5, 0.25, 0.125] (descending). Scale 3 (0.25/0.125) is the base.
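
    The slicing recipe is simple enough to sketch. The snippet below is illustrative (not NeMo's exact routine) for one VAD span at one scale:

    # Illustrative: cut one VAD span (offset, duration) into subsegments for a (window, shift) scale.
    def slice_span(offset, duration, window, shift, min_subsegment_duration=0.05):
        subsegments, start = [], 0.0
        while start < duration:
            length = min(window, duration - start)   # the last piece is truncated at the span end
            if length >= min_subsegment_duration:    # drop very short tails
                subsegments.append({"offset": offset + start, "duration": length})
            start += shift
        return subsegments

    # Example: a 2.0 s speech span sliced at the base scale (0.25 s window, 0.125 s shift)
    print(len(slice_span(10.0, 2.0, window=0.25, shift=0.125)))   # -> 16 subsegments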
  3. Speaker embeddings

    • For every subsegment in each scale, extract a speaker embedding using EncDecSpeakerLabelModel (e.g., titanet_large).
    • Save per-scale embeddings and per-subsegment timestamps in memory; optionally save .pkl under speaker_outputs/embeddings/.
    • Internals: embedders collapse time with temporal pooling (e.g., statistics pooling or attentive statistics pooling) to produce one fixed‑dimensional vector per subsegment window. Longer windows pool more heterogeneous frames and can smooth away timbral cues; shorter windows are noisier. Mid‑length windows usually strike the best identity/robustness trade‑off.
  4. Align scales to base scale

    • Compute the center time of each subsegment. For each base‑scale subsegment, pick the closest subsegment at every longer scale by absolute distance between centers.
    • Intuition: “for this long window, which base slice is at the same time?” The long window is then treated as repeating for however many base slices map to it.
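
    A sketch of this center-based mapping (illustrative; NeMo builds an equivalent mapping for every scale):

    # Illustrative: for every base-scale slice, find the closest longer-scale subsegment by center distance.
    import numpy as np

    def align_to_base(scale_segments, base_segments):
        # each segment is (offset, duration); its center is offset + duration / 2
        scale_centers = np.array([o + d / 2 for o, d in scale_segments])
        base_centers = np.array([o + d / 2 for o, d in base_segments])
        # result[i] = index of the longer-scale subsegment covering base slice i
        return np.argmin(np.abs(base_centers[:, None] - scale_centers[None, :]), axis=1)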
  5. Fuse information across scales. NeMo uses one of two strategies:

    • Affinity fusion (short‑form) — “blend the relationship graphs”
      • Per scale, compute a cosine‑similarity matrix and min–max normalize it.
      • Expand that matrix to base‑scale size using the alignment mapping (long windows are repeated over the base indices they cover).
      • Weighted sum across scales (multiscale_weights) → one fused affinity for all base subsegments.
      • Why this is nice: you preserve each scale’s geometry after normalization; small scales contribute reactiveness; long scales contribute stability.
    • Feature (embedding) fusion (long‑form) — “blend the descriptors first”
      • For each base subsegment, compute a weighted sum of the aligned per‑scale embeddings to get a single fused embedding at the base scale.
      • Affinities are then built on these fused embeddings within local windows (see the long‑form handling below).
      • Why here: avoids building one huge global N×N matrix when N is very large.
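
    Both strategies reduce to a few lines once the per-scale embeddings are aligned to the base scale. The sketch below assumes exactly that: aligned_embs[k] is an [N_base, D] array for scale k and weights holds the multiscale_weights (illustrative, not NeMo's exact code).

    # Illustrative sketch of the two fusion strategies.
    import numpy as np

    def cosine_affinity(emb):
        e = emb / np.linalg.norm(emb, axis=1, keepdims=True)
        return e @ e.T

    def affinity_fusion(aligned_embs, weights):
        # blend the relationship graphs: weighted sum of min-max normalized similarity matrices
        fused = np.zeros((aligned_embs[0].shape[0],) * 2)
        for emb, w in zip(aligned_embs, weights):
            aff = cosine_affinity(emb)
            aff = (aff - aff.min()) / (aff.max() - aff.min() + 1e-8)
            fused += w * aff
        return fused / sum(weights)

    def feature_fusion(aligned_embs, weights):
        # blend the descriptors first: one fused embedding per base subsegment
        return sum(w * emb for emb, w in zip(aligned_embs, weights)) / sum(weights)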
  6. Estimate number of speakers (NME‑SC) and cluster

    • Build a k‑NN graph from the affinity (keep top‑p neighbors per row, symmetrize). The NME‑SC procedure (Normalized Maximum Eigengap Spectral Clustering) chooses p and the number of speakers K by looking at Laplacian eigen‑gaps.
    • Intuition: if speakers are well separated, the graph has K well‑formed communities, and the eigenvalues show a big jump between λ_K and λ_{K+1}.
    • After picking K, perform spectral clustering (compute spectral embeddings → k‑means++ with several random seeds → majority vote) to assign a speaker label to each base subsegment.
  7. Write outputs

    • Merge contiguous same-speaker base subsegments into speaker turns and write RTTM under pred_rttms/.
    • Save a single label file at the base scale: speaker_outputs/subsegments_scale{base}_cluster.label (includes raw per-subsegment [start end speaker_k] for all sessions). There is only one cluster-label file because clustering operates at the base scale.

Short-form vs long-form

  • For very long audio (many base subsegments), building one global N×N affinity is expensive. NeMo switches to a long-form mode:
    • Split the base fused embeddings into windows of size embeddings_per_chunk.
    • Overcluster each window into chunk_cluster_count local groups (more than expected speakers) using a local M×M affinity.
    • Replace each local group by its centroid (and remember which original indices map to it).
    • Concatenate all centroids and run one global clustering to get labels for centroids.
    • Unpack: assign each original subsegment the label of its group’s centroid.

This reduces time/memory from O(N²) to roughly O(N·M + C²) with M=embeddings_per_chunk and C≪N (number of centroids).
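
A sketch of this chunk-and-overcluster flow, assuming a generic spectral_cluster(affinity, n_clusters) helper (a hypothetical stand-in; NeMo uses its own NME-SC implementation internally):

# Illustrative long-form clustering: overcluster each chunk, cluster the centroids globally,
# then unpack the centroid labels back to the original subsegments.
import numpy as np

def cosine_affinity(e):
    e = e / np.linalg.norm(e, axis=1, keepdims=True)
    return e @ e.T

def longform_cluster(fused_embs, embeddings_per_chunk, chunk_cluster_count, spectral_cluster):
    centroids, owners = [], []                      # owners[i] = centroid index of subsegment i
    for start in range(0, len(fused_embs), embeddings_per_chunk):
        chunk = fused_embs[start:start + embeddings_per_chunk]
        k = min(chunk_cluster_count, len(chunk))    # overcluster: more groups than speakers
        local = spectral_cluster(cosine_affinity(chunk), k)
        uniq = np.unique(local)
        remap = {g: len(centroids) + i for i, g in enumerate(uniq)}
        for g in uniq:
            centroids.append(chunk[local == g].mean(axis=0))
        owners.extend(remap[g] for g in local)
    centroids = np.stack(centroids)
    # one global clustering over the (much smaller) centroid set; None = let NME-SC pick K
    global_labels = spectral_cluster(cosine_affinity(centroids), None)
    return np.array([global_labels[o] for o in owners])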

Exposed parameters (key ones)

  • VAD

    • diarizer.vad.model_path
    • diarizer.vad.parameters.window_length_in_sec, shift_length_in_sec, smoothing, overlap, onset, offset, min_duration_on, min_duration_off, padding
  • Multiscale segmentation / embeddings

    • diarizer.speaker_embeddings.model_path (e.g., titanet_large)
    • diarizer.speaker_embeddings.parameters.window_length_in_sec (list or float)
    • diarizer.speaker_embeddings.parameters.shift_length_in_sec (list or float)
    • diarizer.speaker_embeddings.parameters.multiscale_weights
  • Clustering (counting and spectral)

    • diarizer.clustering.parameters.max_num_speakers
    • diarizer.clustering.parameters.oracle_num_speakers (boolean; uses num_speakers from manifest when true)
    • diarizer.clustering.parameters.max_rp_threshold (upper bound for p/N)
    • diarizer.clustering.parameters.sparse_search_volume (how many p values to try when sparse search is used)
    • Long-form only: embeddings_per_chunk, chunk_cluster_count

Note on advanced NME options: the underlying clustering code supports additional knobs (fixed_thres, sparse_search, maj_vote_spk_count, nme_mat_size). In the stock ClusteringDiarizer call site these are not forwarded from YAML by default, so changing them in config may not take effect unless you patch the call to pass them through (see “Advanced: wiring fixed_thres”).

How the number of speakers is decided

  • With default NME‑SC (no fixed neighbor ratio):
    1. Scan p‑neighbors (or a subset of p up to max_rp_threshold, controlled by sparse_search_volume).
    2. For each p, build the k‑NN graph and compute Laplacian eigenvalues.
    3. Pick the number of speakers as the argmax eigengap.
    4. Use that K for spectral clustering.
  • With a fixed neighbor ratio (advanced): set fixed_thres to bypass the p search; the k‑NN density becomes p=floor(N·fixed_thres). Smaller fixed_thres → sparser graph → typically more clusters (up to caps/graph connectivity).
  • With oracle: set oracle_num_speakers: true and num_speakers in your manifest; clustering runs with that K exactly.

Concrete example (counting intuition):

  • Suppose your base grid has N=10,000 subsegments. If fixed_thres=0.12, you keep ~1,200 neighbors per row. If speakers frequently switch, a lower neighbor ratio (e.g., 0.12) often separates communities better than a very dense graph (e.g., 0.35).
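
The eigengap idea, and the fixed_thres bypass, fit in a short numpy sketch. This is illustrative; NeMo's NME-SC additionally scores each candidate p with a normalized eigengap ratio before committing to one.

# Illustrative eigengap-based speaker counting on a pruned k-NN affinity.
import numpy as np

def knn_affinity(affinity, p):
    # keep the top-p neighbors per row, then symmetrize
    pruned = np.zeros_like(affinity)
    for i, row in enumerate(affinity):
        keep = np.argsort(row)[-p:]
        pruned[i, keep] = row[keep]
    return np.maximum(pruned, pruned.T)

def estimate_num_speakers(affinity, p, max_num_speakers=8):
    A = knn_affinity(affinity, p)
    L = np.diag(A.sum(axis=1)) - A                  # graph Laplacian
    eigvals = np.sort(np.linalg.eigvalsh(L))
    gaps = np.diff(eigvals[: max_num_speakers + 1])
    return int(np.argmax(gaps)) + 1                 # K = position of the largest eigengap

# Default NME-SC scans several p values (bounded by max_rp_threshold * N, sampled
# sparse_search_volume times) and keeps the best one. With fixed_thres the scan is skipped:
# p = int(N * fixed_thres)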

Tuning playbook (common scenarios)

  • Under‑counting (e.g., detects 7–8 vs expected 11)

    1. Raise max_num_speakers well above expected (e.g., 32–64).
    2. Add a shorter base scale and slightly increase its weight.
    3. Use a stronger embedder (e.g., titanet_large).
    4. Broaden NME search (max_rp_threshold up to 0.4–0.5, increase sparse_search_volume).
    5. Long-form: increase chunk_cluster_count (≥80–128) and set embeddings_per_chunk ~8k–12k.
    6. Advanced: wire and sweep fixed_thres (e.g., 0.10–0.20) with maj_vote_spk_count=false, higher nme_mat_size.
  • Data with little/no overlap

    • ClusteringDiarizer is sufficient; NeuralDiarizer won’t add speakers beyond K from clustering.

Outputs quick reference

  • vad_outputs/*.frame: per-frame speech posterior
  • vad_outputs/*/.../*.median or .mean: smoothed posteriors (optional)
  • vad_outputs/seg_output-.../*.txt: speech segments
  • speaker_outputs/subsegments_scale{k}.json: per-scale subsegments
  • speaker_outputs/embeddings/subsegments_scale{k}_embeddings.pkl: optional saved embeddings
  • pred_rttms/<uniq>.rttm: diarization output
  • speaker_outputs/subsegments_scale{base}_cluster.label: per-base subsegment labels for all sessions

NeuralDiarizer (MSDD)

What it adds

  • Uses the clustering result (speaker inventory K) and per-scale embeddings to build:
    • A multi-scale embedding sequence aligned to the base scale.
    • Cluster‑average embeddings per speaker per scale (references).
  • Runs a learned multiscale diarization decoder (MSDD) that predicts, at each base time step, a probability per speaker of being active. Multiple speakers can be active → overlap-aware diarization.

How inference works

  1. Prepare encodings via ClusteringDiarizer

    • ClusterEmbedding.prepare_cluster_embs_infer() runs the clustering pipeline to obtain subsegments, embeddings, base-scale labels, and scale mapping.
    • Compute cluster‑average embeddings per speaker/scale using the base labels and cross-scale mapping.
  2. Pairwise decoder

    • MSDD is trained to predict activity for 2 speakers at a time (num_spks_per_model=2).
    • For K clustered speakers, run the decoder for all speaker pairs, producing pairwise time‑series.
    • Merge (average) the pairwise results into a single [T, K] matrix of per‑speaker probabilities.

    Intuition: think of it as asking “of this pair, who’s speaking now?” for every pair, then combining the answers to get a per‑speaker probability track.
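
    A sketch of the merge step (illustrative; it assumes K >= 2 and that every speaker pair has been decoded):

    # Illustrative: combine pairwise MSDD outputs into one [T, K] per-speaker activity matrix
    # by averaging, for each speaker, the probabilities from every pair it appears in.
    import numpy as np

    def merge_pairwise(pair_probs, num_speakers):
        # pair_probs[(i, j)] is a [T, 2] array: activity of speakers i and j over time
        T = next(iter(pair_probs.values())).shape[0]
        summed = np.zeros((T, num_speakers))
        counts = np.zeros(num_speakers)
        for (i, j), probs in pair_probs.items():
            summed[:, i] += probs[:, 0]
            summed[:, j] += probs[:, 1]
            counts[i] += 1
            counts[j] += 1
        return summed / counts                      # each speaker appears in K - 1 pairs

    # usage: decode every pair from itertools.combinations(range(K), 2), then merge_pairwise(...)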

  3. Postprocessing and RTTM

    • Threshold per speaker (optionally adaptive by K), optionally combine with clustering labels (use_clus_as_main), limit max overlap speakers, and render overlap-aware RTTMs.

Important notes

  • MSDD does not create new speakers; it refines activity among the K speakers discovered by clustering. If K is under-estimated, fix clustering first (or use oracle K).
  • Use NeuralDiarizer when overlap is frequent or you want learned smoothing of per‑speaker activity. For mostly non‑overlapping speech, ClusteringDiarizer typically suffices.

Key MSDD parameters (in diarizer.msdd_model.parameters)

  • infer_batch_size, seq_eval_mode
  • sigmoid_threshold (list to sweep), use_adaptive_thres
  • use_clus_as_main, max_overlap_spks, overlap_infer_spk_limit
  • split_infer, diar_window_length (split and stitch very long sessions)

Advanced: wiring fixed_thres (optional)

The clustering backend supports fixed_thres (bypass p‑search) and other knobs (sparse_search, maj_vote_spk_count, nme_mat_size). In the stock perform_clustering(...) call these are not forwarded from the diarizer YAML. To use them, patch the call to LongFormSpeakerClustering.forward_infer(...) and pass, for example:

fixed_thres=float(clustering_params.get('fixed_thres', -1.0))

Then add fixed_thres to your config under diarizer.clustering.parameters. Smaller values (~0.10–0.20) tend to produce more clusters (sparser graph), up to max_num_speakers. Increase nme_mat_size to reduce subsampling effects.
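
For orientation, the patch amounts to forwarding the extra keyword arguments from the YAML-derived clustering_params into the backend call. A hedged sketch follows; the argument names mirror the knobs discussed above, so verify them against your NeMo version before relying on this.

# Sketch only: inside perform_clustering(...), extend the existing
# LongFormSpeakerClustering.forward_infer(...) call with the extra knobs.
cluster_labels = speaker_clustering.forward_infer(
    # ... existing arguments stay as they are ...
    fixed_thres=float(clustering_params.get('fixed_thres', -1.0)),     # -1.0 keeps the NME p-search
    # other backend knobs mentioned above, if your NeMo version accepts them:
    # sparse_search=bool(clustering_params.get('sparse_search', True)),
    # maj_vote_spk_count=bool(clustering_params.get('maj_vote_spk_count', False)),
    # nme_mat_size=int(clustering_params.get('nme_mat_size', 512)),
)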


Practical tips

  • Ensure many base-scale subsegments: shorter windows and smaller shifts increase segment density, which helps counting and boundary accuracy.
  • Prefer stronger embedding models (titanet_large) for mixed or noisy domains.
  • For long recordings, raise chunk_cluster_count and choose embeddings_per_chunk that fits memory.
  • Always sanity-check with oracle_num_speakers=true once; if the quality is good at the correct K, focus on counting (NME) rather than separation.

Quick recipes (copy/paste)

Minimal multiscale setup (stable + reactive):

diarizer:
  speaker_embeddings:
    model_path: titanet_large
    parameters:
      window_length_in_sec: [1.5, 1.0, 0.5, 0.25]
      shift_length_in_sec:  [0.75, 0.5, 0.25, 0.125]
      multiscale_weights:   [1.0, 1.0, 1.0, 1.2]
  clustering:
    parameters:
      max_num_speakers: 32
      max_rp_threshold: 0.25
      sparse_search_volume: 60
      embeddings_per_chunk: 10000
      chunk_cluster_count: 80

Counting underestimates K? (advanced; after wiring fixed_thres):

diarizer:
  clustering:
    parameters:
      max_num_speakers: 64
      # Bypass NME search and set p ≈ fixed_thres · N
      fixed_thres: 0.14   # sweep 0.10–0.20
      # Optional to make the fixed ratio more faithful:
      # nme_mat_size: 2000
      embeddings_per_chunk: 10000
      chunk_cluster_count: 128

Oracle K sanity check:

diarizer:
  clustering:
    parameters:
      oracle_num_speakers: true
# In your manifest, add per file: "num_speakers": 11
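
For reference, a manifest line can be written with a few lines of Python. The field names below follow the usual NeMo diarization manifest layout; the paths are placeholders, so adjust them to your data.

# Illustrative: one manifest line per audio file, with num_speakers set for the oracle run.
import json

entry = {
    "audio_filepath": "/data/meeting_01.wav",
    "offset": 0,
    "duration": None,            # null = use the whole file
    "label": "infer",
    "text": "-",
    "num_speakers": 11,          # consumed when oracle_num_speakers: true
    "rttm_filepath": None,       # optional reference RTTM for scoring
}
with open("manifest.json", "w") as f:
    f.write(json.dumps(entry) + "\n")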

Glossary (fast lookup)

  • Scale: subsegment slicing rule (window, shift) applied only on VAD spans. Shortest scale = base.
  • Base scale: shortest window; the final label timeline.
  • Affinity fusion: weighted sum of per-scale similarity matrices, expanded to base size.
  • Feature fusion: weighted sum of per-scale embeddings → single fused base embedding sequence.
  • NME‑SC: method that picks k‑NN density and speaker count by maximizing Laplacian eigengap.
  • Overclustering: local “too many clusters” per window to produce pure centroids for cheap global clustering.
  • MSDD: multiscale diarization decoder predicting per‑speaker activity over time; overlap capable; uses speakers discovered by clustering.

Observed phenomenon: women’s voices often merge into one cluster

  • In practice we observed that male voices are differentiated more easily, whereas multiple female speakers frequently collapse into a single cluster.
  • Likely causes:
    • Embedding space geometry: female voices occupy a tighter acoustic region (F0/formants), making cosine affinities look too similar; communities blur and NME‑SC undercounts.
    • Graph density: denser k‑NN graphs (larger p) connect speakers across communities; eigengap then favors fewer clusters.
    • Segmentation extremes: very short base windows emphasize transient similarities; very long windows average timbre and mask differences.
  • Mitigations:
    • Prefer strong embedders (e.g., titanet_large) or domain‑tuned models with balanced female data.
    • Rebalance scale weights toward mid windows (e.g., 0.5–1.5 s) and down‑weight extremes.
    • In long‑form, raise chunk_cluster_count (e.g., ≥128) to avoid premature merges.
    • If possible, expose and sweep fixed_thres to enforce sparser k‑NN graphs (≈0.10–0.16), which tends to separate communities.

Notes from experiments

  • Applying the tips and the tuning playbook typically resulted in only small changes to the estimated number of speakers. In particular, introducing very short base scales (< 1.0 s) often decreased the estimated count and degraded overall diarization on my data.
  • In practice, the estimated speaker count is largely governed by NME‑SC’s eigengap analysis. Usual tuning affects it marginally; it rarely jumps from, say, 7–8 to 11 on its own.
  • To materially steer the count, it is often necessary to expose and use parameters like fixed_thres (and related knobs), i.e., modify the code path to bypass the standard NME scan and set the graph density directly.