Whisper is available in the Hugging Face Transformers library from version 4.23.1, with both PyTorch and TensorFlow implementations. All the official checkpoints can be found on the Hugging Face Hub, alongside documentation and example scripts.
Using the 🤗 Trainer, Whisper can be fine-tuned for speech recognition and speech translation tasks, which improves the model's performance, particularly on low-resource languages. Refer to the blog post for a complete guide on fine-tuning Whisper. If you're interested in fine-tuning Whisper in your language, join us for our two-week Whisper fine-tuning event!
The following example evaluates Whisper on the LibriSpeech ASR dataset.
First, install the relevant Hugging Face packages:
pip install -U transformers datasets evaluate
Next, run the Python code cell to evaluate on the "test-clean" subset of LibriSpeech. You can change the model checkpoint to any one of the official checkpoints on the Hugging Face Hub.
from datasets import load_dataset
from transformers import WhisperForConditionalGeneration, WhisperProcessor
import torch
from evaluate import load

# Load the "test-clean" split of LibriSpeech
librispeech_test_clean = load_dataset("librispeech_asr", "clean", split="test")

# Load the processor (feature extractor + tokenizer) and the model
processor = WhisperProcessor.from_pretrained("openai/whisper-base.en")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base.en").to("cuda")

def map_to_pred(batch):
    # Convert the raw waveform into model input features
    audio = batch["audio"]
    input_features = processor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt").input_features
    # Normalize the reference so it is compared on the same footing as the prediction
    batch["reference"] = processor.tokenizer._normalize(batch["text"])

    # Generate token ids without tracking gradients, then decode them to text
    with torch.no_grad():
        predicted_ids = model.generate(input_features.to("cuda"))[0]
    transcription = processor.decode(predicted_ids)
    batch["prediction"] = processor.tokenizer._normalize(transcription)
    return batch

# Transcribe every example in the split
result = librispeech_test_clean.map(map_to_pred)

# Compute the word error rate (WER) and print it as a percentage
wer = load("wer")
print(100 * wer.compute(references=result["reference"], predictions=result["prediction"]))
Print Output:
4.254436419182681
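To make the printed number easier to interpret, here is a minimal pure-Python sketch of a word error rate (WER) computation. It is an illustration, not the code behind the `wer` metric used above. The names `simple_normalize`, `edit_distance`, and `word_error_rate` are hypothetical, and `simple_normalize` is a deliberately simplified stand-in for Whisper's English text normalizer. The sketch assumes a corpus-level definition: total word edits divided by total reference words.

```python
import re


def simple_normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace.

    Simplified stand-in (an assumption), not Whisper's own normalizer.
    """
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)  # keep word characters and apostrophes
    return " ".join(text.split())


def edit_distance(ref_words, pred_words):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    previous = list(range(len(pred_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, pred_word in enumerate(pred_words, start=1):
            cost = 0 if ref_word == pred_word else 1
            current.append(min(
                previous[j] + 1,         # reference word deleted
                current[j - 1] + 1,      # extra word inserted
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def word_error_rate(references, predictions):
    """Corpus-level WER: total word edits / total reference words."""
    total_edits = 0
    total_words = 0
    for ref, pred in zip(references, predictions):
        ref_words, pred_words = ref.split(), pred.split()
        total_edits += edit_distance(ref_words, pred_words)
        total_words += len(ref_words)
    if total_words == 0:
        raise ValueError("references contain no words")
    return total_edits / total_words


refs = [simple_normalize("The cat sat on the mat.")]
preds = [simple_normalize("the cat sit on mat")]
print(100 * word_error_rate(refs, preds))
```

In this toy case the prediction has one substituted word and one missing word against six reference words, so the WER is about 33.3%. Normalizing both the reference and the prediction before scoring, as the evaluation code does, keeps casing and punctuation differences from counting as errors.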
We provide a Google Colab for evaluating Whisper on eight English speech recognition datasets in one script. This serves as a template for performing multi-dataset evaluation in a style similar to the official Whisper paper.
| Dataset name | Whisper tiny.en WER (%) | Whisper small.en WER (%) |
|---|---|---|
| LibriSpeech Clean | 5.66 | 3.05 |
| LibriSpeech Other | 15.38 | 7.53 |
| Common Voice | 31.17 | 15.20 |
| VoxPopuli | 12.58 | 8.45 |
| TEDLIUM | 14.28 | 12.21 |
| GigaSpeech | 14.07 | 11.36 |
| SPGISpeech | 5.82 | 3.63 |
| Earnings-22 | 13.79 | 16.40 |
| AMI | 24.68 | 17.88 |
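One way to read the table is to compute the relative change in error rate between the two checkpoints on each dataset. The sketch below copies the numbers from the table and assumes they are word error rates in percent. The helper `relative_wer_change` and the dictionary `wer_table` are hypothetical names introduced only for this illustration.

```python
from statistics import mean

# Values copied from the table above: (tiny.en, small.en), assumed to be WER in %
wer_table = {
    "LibriSpeech Clean": (5.66, 3.05),
    "LibriSpeech Other": (15.38, 7.53),
    "Common Voice": (31.17, 15.20),
    "VoxPopuli": (12.58, 8.45),
    "TEDLIUM": (14.28, 12.21),
    "GigaSpeech": (14.07, 11.36),
    "SPGISpeech": (5.82, 3.63),
    "Earnings-22": (13.79, 16.40),
    "AMI": (24.68, 17.88),
}


def relative_wer_change(baseline: float, candidate: float) -> float:
    """Relative change in percent; negative means the candidate has lower WER."""
    return 100.0 * (candidate - baseline) / baseline


# Per-dataset relative change from tiny.en to small.en
for name, (tiny, small) in wer_table.items():
    print(f"{name}: {relative_wer_change(tiny, small):+.1f}%")

# Unweighted (macro) average WER per checkpoint across the listed datasets
tiny_avg = mean(tiny for tiny, _ in wer_table.values())
small_avg = mean(small for _, small in wer_table.values())
print(f"macro-average WER: tiny.en {tiny_avg:.2f}, small.en {small_avg:.2f}")
```

A macro average weights every dataset equally regardless of its size, which is a design choice for this sketch and may differ from how other reports aggregate results.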