Proposal: Whisper Announcement Post

Whisper in 🤗 Transformers

Whisper is available in the Hugging Face Transformers library from version 4.23.1, with both PyTorch and TensorFlow implementations. All the official checkpoints can be found on the Hugging Face Hub, alongside documentation and example scripts.
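For a quick start, the snippet below shows one way of transcribing a short audio file with the pipeline API. It is a minimal sketch: the choice of checkpoint and the "audio.mp3" path are placeholder assumptions.

from transformers import pipeline

# load the Whisper base checkpoint into an automatic speech recognition pipeline
pipe = pipeline("automatic-speech-recognition", model="openai/whisper-base")

# transcribe a local audio file ("audio.mp3" is a placeholder path)
print(pipe("audio.mp3")["text"])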

Fine-Tuning

Using the 🤗 Trainer, Whisper can be fine-tuned for speech recognition and speech translation tasks, boosting the performance of the model, especially on low-resource languages. Refer to the blog post for a complete guide to fine-tuning Whisper. If you're interested in fine-tuning Whisper in your language, join us for our two-week Whisper fine-tuning event!
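As a taster, a condensed version of the training setup is sketched below. It assumes the dataset has already been pre-processed into log-Mel input_features and tokenised labels (vectorized_dataset is a placeholder variable); the full walkthrough, including data preparation and evaluation, is in the blog post.

import torch
from dataclasses import dataclass
from transformers import (
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    WhisperForConditionalGeneration,
    WhisperProcessor,
)

# load the processor and model; language/task configure the tokenizer prompts for fine-tuning
processor = WhisperProcessor.from_pretrained("openai/whisper-small", language="hindi", task="transcribe")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: WhisperProcessor

    def __call__(self, features):
        # pad the log-Mel input features and the label token ids independently
        input_features = [{"input_features": f["input_features"]} for f in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        label_features = [{"input_ids": f["labels"]} for f in features]
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # replace padding with -100 so it is ignored when computing the loss
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)
        batch["labels"] = labels
        return batch

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-hi",  # placeholder output directory
    per_device_train_batch_size=16,
    learning_rate=1e-5,
    max_steps=4000,
    fp16=True,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=vectorized_dataset["train"],  # placeholder: your pre-processed dataset
    data_collator=DataCollatorSpeechSeq2SeqWithPadding(processor=processor),
    tokenizer=processor.feature_extractor,
)

trainer.train()

The two padding steps differ by design: the feature extractor always pads the log-Mel inputs to a fixed 30-second window, whereas the labels are padded to the longest sequence in the batch and masked with -100 so the padded positions do not contribute to the loss.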

Evaluation

See the following example for evaluating Whisper on the LibriSpeech ASR dataset.

First, install the relevant Hugging Face packages:

pip install --upgrade transformers datasets evaluate

Next, run the following Python code to evaluate Whisper on the "test-clean" split of LibriSpeech. You can change the model checkpoint to any of the official checkpoints on the Hugging Face Hub.

from datasets import load_dataset
from transformers import WhisperForConditionalGeneration, WhisperProcessor
import torch
from evaluate import load

# load the "test" split of the LibriSpeech "clean" subset
librispeech_test_clean = load_dataset("librispeech_asr", "clean", split="test")

# load the processor (feature extractor + tokenizer) and model, and place the model on the GPU
processor = WhisperProcessor.from_pretrained("openai/whisper-base.en")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base.en").to("cuda")

def map_to_pred(batch):
    audio = batch["audio"]
    # compute the log-Mel input features from the raw audio array
    input_features = processor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt").input_features
    # normalise the reference transcription (casing, punctuation, spelling)
    batch["reference"] = processor.tokenizer._normalize(batch["text"])

    # auto-regressively generate the predicted token ids
    with torch.no_grad():
        predicted_ids = model.generate(input_features.to("cuda"))[0]
    # decode the token ids to text and normalise the prediction
    transcription = processor.decode(predicted_ids)
    batch["prediction"] = processor.tokenizer._normalize(transcription)
    return batch

# run inference over the entire dataset
result = librispeech_test_clean.map(map_to_pred)

# compute the word error rate (WER) as a percentage
wer = load("wer")
print(100 * wer.compute(references=result["reference"], predictions=result["prediction"]))

Print Output:

4.254436419182681

Multi-Dataset Evaluation

We provide a Google Colab for evaluating Whisper on eight English speech recognition datasets with one script. It serves as a template for performing multi-dataset evaluation in the style of the official Whisper paper. The table below reports the word error rate (WER, %) for two checkpoints; a schematic of the evaluation loop is sketched after the table.

Dataset           | Whisper tiny.en (WER %) | Whisper small.en (WER %)
LibriSpeech Clean | 5.66                    | 3.05
LibriSpeech Other | 15.38                   | 7.53
Common Voice      | 31.17                   | 15.20
VoxPopuli         | 12.58                   | 8.45
TEDLIUM           | 14.28                   | 12.21
GigaSpeech        | 14.07                   | 11.36
SPGISpeech        | 5.82                    | 3.63
Earnings-22       | 13.79                   | 16.40
AMI               | 24.68                   | 17.88
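Schematically, the loop behind the Colab reduces to iterating the single-dataset recipe above over a registry of datasets. The sketch below is illustrative rather than the Colab itself: the registry entries are placeholder assumptions, and map_to_pred is the function defined in the LibriSpeech example.

from datasets import load_dataset
from evaluate import load

# hypothetical registry: name -> (Hub path, config, split, transcription column)
EVAL_DATASETS = {
    "LibriSpeech Clean": ("librispeech_asr", "clean", "test", "text"),
    "LibriSpeech Other": ("librispeech_asr", "other", "test", "text"),
}

wer_metric = load("wer")

for name, (path, config, split, text_column) in EVAL_DATASETS.items():
    dataset = load_dataset(path, config, split=split)
    # unify the transcription column name so map_to_pred can read batch["text"]
    if text_column != "text":
        dataset = dataset.rename_column(text_column, "text")
    # map_to_pred is defined in the LibriSpeech example above
    result = dataset.map(map_to_pred)
    wer = 100 * wer_metric.compute(references=result["reference"], predictions=result["prediction"])
    print(f"{name}: WER = {wer:.2f}%")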