Whilst the current ASR landscape is very promising, much of it is benchmarked on rather "clean" datasets. This often creates a false sense of confidence in an architecture, which might not translate to real-world conditions.
- Gaussian White Noise
- Real World Noise
- Choppy audio (random 1-2 s segments removed from the audio snippet)
- Speed up (random 10 s snippets sped up relative to the rest of the audio)
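
As a rough illustration of two of the perturbations above (Gaussian white noise and choppy audio), here is a minimal sketch. It assumes the audio is a mono waveform held in a NumPy array; the function names `add_white_noise` and `drop_random_segment` and their parameters are hypothetical, not part of any existing codebase for this project.

```python
import numpy as np


def add_white_noise(signal, snr_db, rng=None):
    """Add Gaussian white noise so that the result has roughly the requested SNR (dB)."""
    if rng is None:
        rng = np.random.default_rng()
    signal = np.asarray(signal, dtype=np.float64)
    signal_power = np.mean(signal ** 2)
    # SNR_dB = 10 * log10(P_signal / P_noise)  =>  P_noise = P_signal / 10^(SNR_dB / 10)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise


def drop_random_segment(signal, sample_rate, min_s=1.0, max_s=2.0, rng=None):
    """Remove one randomly placed segment lasting between min_s and max_s seconds."""
    if rng is None:
        rng = np.random.default_rng()
    signal = np.asarray(signal)
    # Segment length in samples, capped at the length of the clip itself.
    length = min(int(rng.uniform(min_s, max_s) * sample_rate), len(signal))
    start = int(rng.integers(0, len(signal) - length + 1))
    return np.concatenate([signal[:start], signal[start + length:]])


if __name__ == "__main__":
    sr = 16_000  # assumed sample rate
    t = np.arange(5 * sr) / sr
    clean = np.sin(2 * np.pi * 440.0 * t)  # stand-in for a speech clip
    noisy = add_white_noise(clean, snr_db=10, rng=np.random.default_rng(0))
    choppy = drop_random_segment(clean, sr, rng=np.random.default_rng(0))
    print(len(clean), len(noisy), len(choppy))
```

Lowering `snr_db` step by step gives one possible way to produce a decreasing-SNR sweep. Real-world noise and the speed-up perturbation are not covered by this sketch.
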
- WER (word error rate)
- CER (character error rate)
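
For reference, both metrics are edit distances normalised by the reference length: WER works on words, CER on characters. The sketch below is a plain-Python illustration with no text normalisation (casing, punctuation); the helper names are hypothetical, and in practice an established metrics library would likely be used instead.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


if __name__ == "__main__":
    ref = "the cat sat on the mat"
    hyp = "the cat sit on mat"
    print(f"WER={wer(ref, hyp):.3f} CER={cer(ref, hyp):.3f}")
```

Note that both values can exceed 1.0 when the hypothesis contains many insertions.
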
- English
- German
- Robust Speech Dataset
- Common Voice
- Tracking evaluation metrics across noise types with decreasing signal-to-noise ratio (SNR), with and without a language model (LM)
- Tracking evaluation metrics across noise types with an explicit speech enhancement preprocessing step, with and without an LM
- Wav2Vec2
- HuBERT
- Data2Vec
- UniSpeech