Proposal: Whisper Announcement Post

Whisper in 🤗 Transformers

Whisper is available in the Hugging Face Transformers library from version 4.23.1, with both PyTorch and TensorFlow implementations. All the official checkpoints can be found on the Hugging Face Hub, alongside documentation and example scripts.
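To check whether an environment meets the minimum version stated above, a small helper along the following lines may be useful. It is only a sketch: `parse_version`, `supports_whisper`, and `installed_transformers_supports_whisper` are hypothetical names introduced here, and the parsing deliberately ignores pre-release suffixes such as `rc1`.

```python
import re
from importlib.metadata import PackageNotFoundError, version as installed_version

# Minimum Transformers release that ships Whisper, as stated in the text above.
MIN_WHISPER_VERSION = (4, 23, 1)


def parse_version(version):
    """Return the leading numeric components, e.g. '4.24.0.dev0' -> (4, 24, 0)."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component (e.g. 'dev0')
        parts.append(int(match.group()))
    return tuple(parts)


def supports_whisper(version):
    """True if the given version string is at least 4.23.1 (simplified comparison)."""
    return parse_version(version) >= MIN_WHISPER_VERSION


def installed_transformers_supports_whisper():
    """Check the locally installed transformers package, if there is one."""
    try:
        return supports_whisper(installed_version("transformers"))
    except PackageNotFoundError:
        return False
```

Calling `installed_transformers_supports_whisper()` before importing the Whisper classes gives an early, readable failure instead of an `ImportError` deeper in a script.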

Fine-Tuning

Using the 🤗 Trainer, Whisper can be fine-tuned for speech recognition and speech translation tasks, which improves the model's performance, particularly on low-resource languages. Refer to the blog post for a complete guide on fine-tuning Whisper. If you're interested in fine-tuning Whisper in your language, join us for our two-week Whisper fine-tuning event!
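As a rough illustration of what a fine-tuning configuration might look like, the sketch below collects keyword arguments intended for `Seq2SeqTrainingArguments`. The argument names are real `Seq2SeqTrainingArguments` fields, but every value is an illustrative assumption rather than a recommended setting, and the helper names `build_finetuning_kwargs` and `to_training_arguments` are hypothetical.

```python
def build_finetuning_kwargs(output_dir, use_fp16=False):
    """Collect illustrative hyperparameters for fine-tuning Whisper.

    All values below are assumptions chosen for demonstration only.
    """
    return {
        "output_dir": output_dir,
        "per_device_train_batch_size": 16,
        "gradient_accumulation_steps": 1,
        "learning_rate": 1e-5,
        "warmup_steps": 500,
        "max_steps": 4000,
        "fp16": use_fp16,  # enable only when training on a GPU that supports it
        "predict_with_generate": True,  # decode with generate() during evaluation
        "generation_max_length": 225,
        "logging_steps": 25,
    }


def to_training_arguments(kwargs):
    """Turn the kwargs into a Seq2SeqTrainingArguments object (requires transformers)."""
    from transformers import Seq2SeqTrainingArguments  # imported lazily

    return Seq2SeqTrainingArguments(**kwargs)
```

The resulting object would then be passed to a `Seq2SeqTrainer` together with the model, datasets, and a data collator; those pieces are covered by the blog post referenced above and are not reproduced here.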

Evaluation

See the following example for evaluating Whisper on the LibriSpeech ASR dataset.

First, install the relevant Hugging Face packages:

pip install -U transformers datasets evaluate jiwer

Next, run the Python code cell to evaluate on the "test-clean" subset of LibriSpeech. You can change the model checkpoint to any one of the official checkpoints on the Hugging Face Hub.

from datasets import load_dataset
from transformers import WhisperForConditionalGeneration, WhisperProcessor
import torch
from evaluate import load

# Load the "test" split of the LibriSpeech "clean" configuration
librispeech_test_clean = load_dataset("librispeech_asr", "clean", split="test")

# The processor bundles the feature extractor and the tokenizer
processor = WhisperProcessor.from_pretrained("openai/whisper-base.en")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base.en").to("cuda")

def map_to_pred(batch):
    audio = batch["audio"]
    # Convert the raw waveform into log-Mel input features
    input_features = processor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt").input_features
    # Normalise the reference text so it is compared on equal terms with the prediction
    batch["reference"] = processor.tokenizer._normalize(batch["text"])

    # Generate token ids without tracking gradients
    with torch.no_grad():
        predicted_ids = model.generate(input_features.to("cuda"))[0]
    transcription = processor.decode(predicted_ids)
    batch["prediction"] = processor.tokenizer._normalize(transcription)
    return batch

# Transcribe every example in the split
result = librispeech_test_clean.map(map_to_pred)

# Compute the word error rate (WER) and report it as a percentage
wer = load("wer")
print(100 * wer.compute(references=result["reference"], predictions=result["prediction"]))

Print Output:

4.254436419182681

Multi-Dataset Evaluation

We provide a Google Colab notebook for evaluating Whisper on eight English speech recognition datasets in one script. It serves as a template for performing multi-dataset evaluation in a style similar to the official Whisper paper. LibriSpeech contributes two test sets (clean and other), so the table below has nine rows.

Dataset	Whisper tiny.en WER (%)	Whisper small.en WER (%)
LibriSpeech Clean	5.66	3.05
LibriSpeech Other	15.38	7.53
Common Voice	31.17	15.20
VoxPopuli	12.58	8.45
TEDLIUM	14.28	12.21
GigaSpeech	14.07	11.36
SPGISpeech	5.82	3.63
Earnings-22	13.79	16.40
AMI	24.68	17.88
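One way to summarise a table like this is an unweighted mean over test sets, together with a check for rows where the larger checkpoint scores worse than the smaller one. The sketch below does both with the values copied from the table, assuming they are WER percentages; `macro_average` and `regressions` are hypothetical helper names.

```python
# WER (%) values copied from the table above.
RESULTS = {
    "LibriSpeech Clean": {"tiny.en": 5.66, "small.en": 3.05},
    "LibriSpeech Other": {"tiny.en": 15.38, "small.en": 7.53},
    "Common Voice": {"tiny.en": 31.17, "small.en": 15.20},
    "VoxPopuli": {"tiny.en": 12.58, "small.en": 8.45},
    "TEDLIUM": {"tiny.en": 14.28, "small.en": 12.21},
    "GigaSpeech": {"tiny.en": 14.07, "small.en": 11.36},
    "SPGISpeech": {"tiny.en": 5.82, "small.en": 3.63},
    "Earnings-22": {"tiny.en": 13.79, "small.en": 16.40},
    "AMI": {"tiny.en": 24.68, "small.en": 17.88},
}


def macro_average(results, model):
    """Unweighted mean WER of one model across all test sets."""
    scores = [row[model] for row in results.values()]
    return sum(scores) / len(scores)


def regressions(results, smaller, larger):
    """Test sets where the larger model has a higher WER than the smaller one."""
    return [name for name, row in results.items() if row[larger] > row[smaller]]


for model in ("tiny.en", "small.en"):
    print(f"{model}: {macro_average(RESULTS, model):.2f}")
print("small.en worse than tiny.en on:", regressions(RESULTS, "tiny.en", "small.en"))
```

A row flagged by `regressions` is worth re-running before drawing conclusions, since a larger checkpoint scoring worse may point to a transcription or normalisation issue rather than a genuine difference between models.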
