Component | pyannote | NeMo |
---|---|---|
Voice Activity Detection (VAD) | PyanNet (derived from SincNet) | MarbleNet |
Audio embedding | ECAPA-TDNN | TitaNet |
Clustering | Hidden Markov Model clustering | Multi-scale clustering (MSDD) |

Feature | pyannote | NeMo |
---|---|---|
Pre-trained models available | ✅ | ✅ |
Good detection of overlapping speakers (multilabel segmentation) | ✅ | ➖ |
Easy integration with ASR and downstream NLP tasks | ➖ | ✅ |
Number of speakers can be specified as an inference parameter | ✅ | ✅ |
Automatic detection of the number of speakers | ✅ | ✅ |
Models available for specific use cases (phone call, outdoor conversation, high quality, …) | ❌ | ✅ |
Highly customizable pipeline | ➖ | ✅ |

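
Both libraries can detect the number of speakers automatically. One common way to do this is an eigengap heuristic on a spectral-clustering affinity matrix. The sketch below is a simplified version of that idea. It assumes one embedding vector per speech segment, for example from ECAPA-TDNN or TitaNet. The names `estimate_num_speakers` and `max_speakers` are hypothetical, and this is not the code either library actually runs.

```python
import numpy as np


def estimate_num_speakers(embeddings: np.ndarray, max_speakers: int = 8) -> int:
    """Guess the speaker count from segment embeddings (n_segments, dim).

    Simplified eigengap heuristic; not the exact algorithm used by pyannote or NeMo.
    """
    if len(embeddings) < 2:
        return len(embeddings)

    # Cosine-similarity affinity matrix, clipped to [0, 1].
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    affinity = np.clip(unit @ unit.T, 0.0, 1.0)

    # Symmetric normalized Laplacian: L = I - D^{-1/2} A D^{-1/2}.
    d_inv_sqrt = 1.0 / np.sqrt(affinity.sum(axis=1))
    laplacian = np.eye(len(affinity)) - d_inv_sqrt[:, None] * affinity * d_inv_sqrt[None, :]

    # Eigenvalues come back in ascending order. k well-separated clusters give
    # k near-zero eigenvalues followed by a jump, so the largest gap marks k.
    eigvals = np.linalg.eigvalsh(laplacian)
    k_max = min(max_speakers, len(eigvals) - 1)
    gaps = np.diff(eigvals[: k_max + 1])
    return int(np.argmax(gaps)) + 1


# Synthetic check: three well-separated "speakers", 10 segments each.
rng = np.random.default_rng(0)
emb = np.repeat(np.eye(16)[:3], 10, axis=0) + 0.02 * rng.standard_normal((30, 16))
print(estimate_num_speakers(emb))
```

Specifying the number of speakers, the other option in the table, amounts to skipping this estimate and passing k to the clustering step directly.
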
Model | DER | CDER | BER | MD | FA | SC |
---|---|---|---|---|---|---|
pyannote | 0.10 | 0.14 | 0.18 | 0.01 | 0.06 | 0.03 |
NeMo - default parameters | 0.37 | 0.32 | 0.44 | 0.36 | 0.01 | 0.01 |
NeMo - optimized VAD parameters | 0.11 | 0.16 | 0.15 | 0.04 | 0.06 | 0.01 |
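
Moving NeMo from default to optimized VAD parameters changes MD from 0.36 to 0.04 and FA from 0.01 to 0.06. This is consistent with the usual trade-off when turning frame-level speech probabilities into segments: looser thresholds miss less speech but accept more non-speech. The minimal sketch below shows hysteresis thresholding and assumes probabilities sampled at a fixed frame rate. The names `binarize`, `onset`, `offset` and `frame_sec` are illustrative, and the code is not NeMo's actual VAD post-processing.

```python
from typing import List, Optional, Tuple


def binarize(probs: List[float], onset: float, offset: float,
             frame_sec: float = 0.01) -> List[Tuple[float, float]]:
    """Convert frame-level speech probabilities into (start, end) segments in seconds.

    Illustrative sketch, not NeMo's implementation. A segment opens when a
    probability rises above `onset` and closes when it falls below `offset`
    (hysteresis; expects offset <= onset).
    """
    segments: List[Tuple[float, float]] = []
    start: Optional[int] = None
    for i, p in enumerate(probs):
        if start is None and p > onset:
            start = i  # speech begins at this frame
        elif start is not None and p < offset:
            segments.append((start * frame_sec, i * frame_sec))  # speech ends before frame i
            start = None
    if start is not None:
        # Close a segment that is still open when the audio ends.
        segments.append((start * frame_sec, len(probs) * frame_sec))
    return segments


probs = [0.1, 0.6, 0.9, 0.4, 0.3, 0.8, 0.2, 0.1]
print(binarize(probs, onset=0.7, offset=0.5, frame_sec=1.0))   # [(2.0, 3.0), (5.0, 6.0)]
print(binarize(probs, onset=0.5, offset=0.25, frame_sec=1.0))  # [(1.0, 6.0)]
```

The strict setting labels 2 frames as speech and the lenient one labels 5. On real audio, the lenient setting would lower missed detection at the cost of more false alarm.
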

Model | DER | CDER | BER | MD | FA | SC |
---|---|---|---|---|---|---|
pyannote - 7 clusters specified | 0.49 | 0.80 | 0.74 | 0.07 | 0.23 | 0.19 |
NeMo - no cluster number specified | 0.17 | 0.84 | 0.24 | 0.08 | 0.07 | 0.02 |
NeMo - manual parameter tuning | 0.12 | 0.15 | | | | |

Error | Definition |
---|---|
False Alarm (FA) | Speech predicted where no speaker is talking (false positive from the VAD model) |
Missed Detection (MD) | No speech detected where a speaker is talking (false negative from the VAD model) |
Speaker Confusion (SC) | Speech attributed to the wrong speaker cluster (error from the clustering model) |
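
These three error types add up to the diarization error rate (DER) reported in the results tables: DER = (FA + MD + SC) / total duration of reference speech. In both tables, each DER equals the sum of its MD, FA and SC columns up to rounding. For example, pyannote in the first table gives 0.01 + 0.06 + 0.03 = 0.10. The minimal sketch below assumes the per-type durations have already been measured. `ErrorDurations` and `diarization_error_rate` are hypothetical names. Real scorers must also align reference and hypothesis segments and map predicted clusters to reference speakers before these durations exist.

```python
from dataclasses import dataclass


@dataclass
class ErrorDurations:
    """Total duration (seconds) of each error type over a recording (assumed precomputed)."""
    false_alarm: float
    missed_detection: float
    confusion: float


def diarization_error_rate(errors: ErrorDurations, total_speech: float) -> float:
    """DER = (false alarm + missed detection + confusion) / total reference speech time."""
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (errors.false_alarm + errors.missed_detection + errors.confusion) / total_speech


# Hypothetical recording: 100 s of reference speech.
errors = ErrorDurations(false_alarm=6.0, missed_detection=1.0, confusion=3.0)
print(diarization_error_rate(errors, total_speech=100.0))  # 0.1
```
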
