Robust ASR: An applied survey of current SoTA ASR architectures

Motivation

While the current ASR landscape is promising, much of it is benchmarked on relatively clean datasets. This can create a false sense of confidence in an architecture whose results may not carry over to real-world audio.

Types of Noise

Gaussian White Noise
Real World Noise
Choppy audio (random 1-2 s segments removed from the audio snippet)
Speed-up (random 10 s snippets sped up relative to the rest)
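
The three synthetic corruptions above could be generated roughly as follows. This is a minimal sketch, not the project's actual pipeline: the function names, the speed-up factor of 1.5 and the use of linear-interpolation resampling are all assumptions, and plain Python lists are used only to keep the example dependency-free (real audio would normally live in NumPy or torch arrays). White noise is scaled so that 10·log10(P_signal / P_noise) equals the requested SNR in dB. Real-world noise is not covered, since it requires recorded noise clips.

```python
import math
import random


def power(samples):
    """Mean squared amplitude of a non-empty sequence of samples."""
    return sum(s * s for s in samples) / len(samples)


def add_white_noise(samples, snr_db, rng):
    """Add Gaussian white noise so the result has roughly the target SNR (dB)."""
    noise_power = power(samples) / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in samples]


def drop_segment(samples, sample_rate, rng, min_s=1.0, max_s=2.0):
    """Remove one random segment of min_s..max_s seconds (choppy audio)."""
    n_drop = min(int(rng.uniform(min_s, max_s) * sample_rate), len(samples))
    start = rng.randint(0, len(samples) - n_drop)
    return samples[:start] + samples[start + n_drop:]


def speed_up_segment(samples, sample_rate, rng, seg_s=10.0, factor=1.5):
    """Play one random seg_s-second segment `factor` times faster.

    Assumption: the speed-up is done by linear-interpolation resampling,
    which also raises the pitch of the affected segment.
    """
    n_seg = min(int(seg_s * sample_rate), len(samples))
    start = rng.randint(0, len(samples) - n_seg)
    segment = samples[start:start + n_seg]
    fast = []
    for i in range(int(n_seg / factor)):
        pos = i * factor                 # fractional read position
        lo = int(pos)
        hi = min(lo + 1, n_seg - 1)
        frac = pos - lo
        fast.append(segment[lo] * (1 - frac) + segment[hi] * frac)
    return samples[:start] + fast + samples[start + n_seg:]
```

Passing an explicit `random.Random(seed)` as `rng` keeps the corruptions reproducible, so that every candidate architecture is evaluated on the same degraded audio.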

Evaluation

Languages (open to more)

English
German

Dataset

Robust Speech Dataset
Common Voice

Experiments (Across noise types)

Tracking evaluation metrics across noise types with decreasing Signal-to-Noise Ratio (SNR), with and without a language model (LM)
Tracking evaluation metrics across noise types with an explicit Speech Enhancement preprocessing step, with and without an LM
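
The outline does not name the evaluation metrics. As an illustration only, the sketch below assumes word error rate (WER), computed at corpus level as total word-level edit distance divided by total reference words, and sweeps it over a list of SNR levels. The `transcribe` and `corrupt` callables are hypothetical placeholders for a candidate model and a noise function; neither is defined by this document.

```python
def edit_distance(ref_words, hyp_words):
    """Levenshtein distance between two word lists (sub/ins/del cost 1)."""
    prev = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, start=1):
        curr = [i] + [0] * len(hyp_words)
        for j, h in enumerate(hyp_words, start=1):
            curr[j] = min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            )
        prev = curr
    return prev[-1]


def word_error_rate(references, hypotheses):
    """Corpus-level WER: total edits divided by total reference words."""
    errors = 0
    words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_words = ref.split()
        errors += edit_distance(ref_words, hyp.split())
        words += len(ref_words)
    if words == 0:
        raise ValueError("references contain no words")
    return errors / words


def wer_by_snr(dataset, transcribe, corrupt, snr_levels_db):
    """Return {snr_db: WER} for a list of (audio, reference_text) pairs.

    `transcribe(audio) -> str` and `corrupt(audio, snr_db) -> audio`
    are hypothetical callables supplied by the caller.
    """
    references = [ref for _, ref in dataset]
    results = {}
    for snr_db in snr_levels_db:
        hypotheses = [transcribe(corrupt(audio, snr_db)) for audio, _ in dataset]
        results[snr_db] = word_error_rate(references, hypotheses)
    return results
```

The with-LM and without-LM conditions, or the speech enhancement step, could then be compared by calling `wer_by_snr` once per configuration with a different `transcribe` callable and the same `snr_levels_db`.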

Candidate Architectures

Wav2Vec2
HuBERT
Data2Vec
UniSpeech

Vaibhavs10/robust-asr.md