	`pyannote`	`NeMo`
Voice Activity Detection (VAD)	PyanNet (derived from SincNet)	MarbleNet
Audio embedding	ECAPA-TDNN	TitaNet
Clustering	Hidden Markov Model clustering	Multi-scale clustering (MSDD)

	`pyannote`	`NeMo`
Pre-trained models available	✅	✅
Good overlapping speakers detection (multilabel segmentation)	✅	➖
Easy integration with ASR task and downstream NLP tasks	➖	✅
Possibility to specify the number of speakers as a parameter for inference	✅	✅
Automatic detection of the number of speakers	✅	✅
Models available for specific use cases (phone calls, outdoor conversations, high-quality audio, …)	❌	✅
Highly customizable pipeline	➖	✅

Model	DER	CDER	BER	MD	FA	SC
pyannote	0.10	0.14	0.18	0.01	0.06	0.03
NeMo - default parameters	0.37	0.32	0.44	0.36	0.01	0.01
NeMo - optimized VAD parameters	0.11	0.16	0.15	0.04	0.06	0.01

Model	DER	CDER	BER	MD	FA	SC
pyannote - 7 clusters specified	0.49	0.80	0.74	0.07	0.23	0.19
NeMo - no cluster number specified	0.17	0.84	0.24	0.08	0.07	0.02
NeMo - manual parameter tuning	0.12		0.15

Error	Definition
False Alarm (FA)	Speech predicted where no one is speaking (false positive from the VAD model)
Missed Detection (MD)	No speech detected where someone is speaking (false negative from the VAD model)
Confusion (SC)	Speech assigned to the wrong speaker cluster (error from the clustering model)
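
The three error types above are the components usually summed to obtain the diarization error rate (DER). The source does not spell out the formula, so the following is a minimal sketch assuming the standard definition: DER is the total duration of missed detection, false alarm, and confusion divided by the total duration of reference speech. The rows in the results tables are consistent with this reading (for example, 0.01 + 0.06 + 0.03 = 0.10 for pyannote). The function name and the durations are hypothetical and chosen only for illustration.

```python
def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """Return DER under the assumed standard definition.

    All arguments are durations in seconds:
      missed       - reference speech that was not detected (MD)
      false_alarm  - speech predicted where nobody speaks (FA)
      confusion    - speech attributed to the wrong speaker (SC)
      total_speech - total duration of reference speech
    """
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (missed + false_alarm + confusion) / total_speech


# Hypothetical durations (not taken from the source), for a recording
# that contains 100 seconds of reference speech.
der = diarization_error_rate(
    missed=1.0, false_alarm=6.0, confusion=3.0, total_speech=100.0
)
print(f"DER = {der:.2f}")  # prints "DER = 0.10"
```

Because false alarms add to the numerator without adding to the denominator, a DER computed this way can exceed 1.0 on recordings where the system predicts far more speech than the reference contains.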
