To allow the Hugging Face version of Whisper to predict word-level timestamps, a new property alignment_heads must be added to the GenerationConfig object. This is a list of [layer, head] pairs that select the cross-attention heads that are highly correlated with word-level timing.
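As a rough illustration of the shape of this property, the sketch below builds a list of [layer, head] pairs and checks that each pair points at a head that exists. The values, the helper name validate_alignment_heads, and the layer and head counts are hypothetical placeholders, not the real alignment heads of any checkpoint. It also assumes the indices are zero-based and refer to decoder layers.

```python
# Hypothetical placeholder values, NOT the real alignment heads of any checkpoint.
alignment_heads = [[2, 1], [3, 0], [3, 5]]


def validate_alignment_heads(heads, num_layers, num_heads):
    """Check that every [layer, head] pair indexes an existing attention head.

    Assumes zero-based indices: layer in [0, num_layers), head in [0, num_heads).
    """
    for pair in heads:
        if len(pair) != 2:
            raise ValueError(f"expected a [layer, head] pair, got {pair!r}")
        layer, head = pair
        if not 0 <= layer < num_layers:
            raise ValueError(f"layer {layer} out of range for {num_layers} layers")
        if not 0 <= head < num_heads:
            raise ValueError(f"head {head} out of range for {num_heads} heads")
    return True


# Assumed sizes for illustration: 4 decoder layers with 6 attention heads each.
validate_alignment_heads(alignment_heads, num_layers=4, num_heads=6)
```

A check like this can catch a list copied from a checkpoint of a different size before it is written into the generation config.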
If your Whisper checkpoint does not have the alignment_heads property yet, it can be added in two possible ways.
Method 1. Change the model.generation_config property:
from transformers import WhisperForConditionalGeneration

# load the model
model = WhisperForConditionalGeneration.from_pretrained("your_checkpoint")