To allow the Hugging Face version of Whisper to predict word-level timestamps, a new property `alignment_heads` must be added to the model's `GenerationConfig` object. This is a list of `[layer, head]` pairs that select the cross-attention heads that are strongly correlated with word-level timing.

If your Whisper checkpoint does not have the `alignment_heads` property yet, it can be added in one of two ways.
Method 1. Set the property on `model.generation_config`:

```python
from transformers import WhisperForConditionalGeneration

# load the model
model = WhisperForConditionalGeneration.from_pretrained("your_checkpoint")

# set the new property
model.generation_config.alignment_heads = [[2, 2], [3, 0], [3, 2], [3, 3], [3, 4], [3, 5]]
```
Method 2. Add a new line to the `generation_config.json` file:

```json
"alignment_heads": [[2, 2], [3, 0], [3, 2], [3, 3], [3, 4], [3, 5]],
```
After you're done, use `push_to_hub` to make these changes permanent:

```python
model.push_to_hub("your_pretrained_checkpoint", use_auth_token="your_token_if_not_logged_in", create_pr=True)
```
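Once `alignment_heads` is in place, you can check that word-level timestamps come out as expected. Here is a minimal sketch using the automatic-speech-recognition pipeline; the file name `audio.wav` is a placeholder:

```python
from transformers import pipeline

# load the checkpoint that now carries alignment_heads in its generation config
pipe = pipeline("automatic-speech-recognition", model="your_checkpoint")

# request word-level timestamps; each chunk is a word with a (start, end) tuple in seconds
result = pipe("audio.wav", return_timestamps="word")
for chunk in result["chunks"]:
    print(chunk["timestamp"], chunk["text"])
```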
The correct values for `alignment_heads` depend on the size of the model. Here are the appropriate values for the different Whisper model sizes, taken from the OpenAI checkpoints. If you fine-tuned your own checkpoint, you may need to inspect the cross-attention heads to find the layers and heads that track word timing best; see the sketch at the end of this section.
- whisper-tiny: [[2, 2], [3, 0], [3, 2], [3, 3], [3, 4], [3, 5]]
- whisper-tiny.en: [[1, 0], [2, 0], [2, 5], [3, 0], [3, 1], [3, 2], [3, 3], [3, 4]]
- whisper-base: [[3, 1], [4, 2], [4, 3], [4, 7], [5, 1], [5, 2], [5, 4], [5, 6]]
- whisper-base.en: [[3, 3], [4, 7], [5, 1], [5, 5], [5, 7]]
- whisper-small: [[5, 3], [5, 9], [8, 0], [8, 4], [8, 7], [8, 8], [9, 0], [9, 7], [9, 9], [10, 5]]
- whisper-small.en: [[6, 6], [7, 0], [7, 3], [7, 8], [8, 2], [8, 5], [8, 7], [9, 0], [9, 4], [9, 8], [9, 10], [10, 0], [10, 1], [10, 2], [10, 3], [10, 6], [10, 11], [11, 2], [11, 4]]
- whisper-medium: [[13, 15], [15, 4], [15, 15], [16, 1], [20, 0], [23, 4]]
- whisper-medium.en: [[11, 4], [14, 1], [14, 12], [14, 14], [15, 4], [16, 0], [16, 4], [16, 9], [17, 12], [17, 14], [18, 7], [18, 10], [18, 15], [20, 0], [20, 3], [20, 9], [20, 14], [21, 12]]
- whisper-large-v1: [[9, 19], [11, 2], [11, 4], [11, 17], [22, 7], [22, 11], [22, 17], [23, 2], [23, 15]]
- whisper-large-v2: [[10, 12], [13, 17], [16, 11], [16, 12], [16, 13], [17, 15], [17, 16], [18, 4], [18, 11], [18, 19], [19, 11], [21, 2], [21, 3], [22, 3], [22, 9], [22, 12], [23, 5], [23, 7], [23, 13], [25, 5], [26, 1], [26, 12], [27, 15]]
- whisper-large: same as large-v2
For the best individual alignment heads of the large-v2 and large-v3 variants, check out this file:
https://github.com/nyrahealth/CrisperWhisper/blob/develop/run_experiments/experiments/head_results.json

The repo also contains useful code to calculate these head results for other models, as long as you have a dataset with high-quality timestamps such as TIMIT (https://paperswithcode.com/dataset/timit). The idea is to evaluate the timings produced by DTW for each individual alignment head against ground-truth data and keep the heads that score highest. CrisperWhisper improves on this further by specifically training alignment heads:
https://github.com/nyrahealth/CrisperWhisper
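If you want to run this kind of search yourself, here is a minimal sketch of the idea (not the CrisperWhisper code): it scores every decoder cross-attention head in isolation by the onset error of its DTW word timestamps against reference timings. `eval_set` is a hypothetical list of `(audio_path, reference_words)` pairs, where `reference_words` is a list of `(word, start_sec, end_sec)` tuples from a corpus like TIMIT; `your_finetuned_checkpoint` is a placeholder.

```python
from transformers import pipeline

# eval_set: hypothetical list of (audio_path, [(word, start_sec, end_sec), ...]) pairs,
# e.g. built from TIMIT. You need to supply this yourself.
pipe = pipeline("automatic-speech-recognition", model="your_finetuned_checkpoint")
model = pipe.model

num_layers = model.config.decoder_layers
num_heads = model.config.decoder_attention_heads

def onset_error(pred_chunks, ref_words):
    # Mean absolute difference between predicted and reference word onsets.
    # Words are paired by position; a real evaluation should align the
    # hypothesis and reference word sequences (e.g. by edit distance) first.
    n = min(len(pred_chunks), len(ref_words))
    if n == 0:
        return float("inf")
    return sum(abs(pred_chunks[i]["timestamp"][0] - ref_words[i][1]) for i in range(n)) / n

scores = {}
for layer in range(num_layers):
    for head in range(num_heads):
        # Use this single cross-attention head as the only alignment head.
        model.generation_config.alignment_heads = [[layer, head]]
        errors = [
            onset_error(pipe(audio_path, return_timestamps="word")["chunks"], ref_words)
            for audio_path, ref_words in eval_set
        ]
        scores[(layer, head)] = sum(errors) / len(errors)

# The lowest-error heads are the candidates for alignment_heads.
best = sorted(scores, key=scores.get)[:10]
print([[layer, head] for layer, head in best])
```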