To allow the Hugging Face version of Whisper to predict word-level timestamps, a new property `alignment_heads` must be added to the model's `GenerationConfig` object. This is a list of `[layer, head]` pairs that select the cross-attention heads that are strongly correlated with word-level timing.

If your Whisper checkpoint does not have the `alignment_heads` property yet, it can be added in one of two ways.
Method 1. Set the property on `model.generation_config`:

```python
from transformers import WhisperForConditionalGeneration

# load the model
model = WhisperForConditionalGeneration.from_pretrained("your_checkpoint")

# set the new property
model.generation_config.alignment_heads = [[2, 2], [3, 0], [3, 2], [3, 3], [3, 4], [3, 5]]
```
Method 2. Add a new line to the `generation_config.json` file:

```json
"alignment_heads": [[2, 2], [3, 0], [3, 2], [3, 3], [3, 4], [3, 5]],
```
After you're done, use `push_to_hub` to make these changes permanent:

```python
model.push_to_hub("your_pretrained_checkpoint", use_auth_token="your_token_if_not_logged_in", create_pr=True)
```
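Once `alignment_heads` is in place, you can check that word-level timestamps come out as expected. Here is a minimal sketch using the automatic-speech-recognition pipeline; the file name `audio.wav` is a placeholder:

```python
from transformers import pipeline

# load the checkpoint that now carries alignment_heads in its generation config
pipe = pipeline("automatic-speech-recognition", model="your_checkpoint")

# request word-level timestamps; each chunk is a word with a (start, end) tuple in seconds
result = pipe("audio.wav", return_timestamps="word")
for chunk in result["chunks"]:
    print(chunk["timestamp"], chunk["text"])
```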
The correct values for `alignment_heads` depend on the size of the model. Here are the appropriate values for the different Whisper model sizes, taken from the OpenAI checkpoints. If you fine-tuned your own checkpoint, you may need to inspect the cross-attention heads to find the layers and heads that track word timing best; see the sketch at the end of this section.
- whisper-tiny: [[2, 2], [3, 0], [3, 2], [3, 3], [3, 4], [3, 5]]
- whisper-tiny.en: [[1, 0], [2, 0], [2, 5], [3, 0], [3, 1], [3, 2], [3, 3], [3, 4]]
- whisper-base: [[3, 1], [4, 2], [4, 3], [4, 7], [5, 1], [5, 2], [5, 4], [5, 6]]
- whisper-base.en: [[3, 3], [4, 7], [5, 1], [5, 5], [5, 7]]
- whisper-small: [[5, 3], [5, 9], [8, 0], [8, 4], [8, 7], [8, 8], [9, 0], [9, 7], [9, 9], [10, 5]]
- whisper-small.en: [[6, 6], [7, 0], [7, 3], [7, 8], [8, 2], [8, 5], [8, 7], [9, 0], [9, 4], [9, 8], [9, 10], [10, 0], [10, 1], [10, 2], [10, 3], [10, 6], [10, 11], [11, 2], [11, 4]]
- whisper-medium: [[13, 15], [15, 4], [15, 15], [16, 1], [20, 0], [23, 4]]
- whisper-medium.en: [[11, 4], [14, 1], [14, 12], [14, 14], [15, 4], [16, 0], [16, 4], [16, 9], [17, 12], [17, 14], [18, 7], [18, 10], [18, 15], [20, 0], [20, 3], [20, 9], [20, 14], [21, 12]]
- whisper-large-v1: [[9, 19], [11, 2], [11, 4], [11, 17], [22, 7], [22, 11], [22, 17], [23, 2], [23, 15]]
- whisper-large-v2: [[10, 12], [13, 17], [16, 11], [16, 12], [16, 13], [17, 15], [17, 16], [18, 4], [18, 11], [18, 19], [19, 11], [21, 2], [21, 3], [22, 3], [22, 9], [22, 12], [23, 5], [23, 7], [23, 13], [25, 5], [26, 1], [26, 12], [27, 15]]
- whisper-large: same as large-v2
For the best individual alignment heads of the large-v2 and large-v3 variants, check out this file:
https://github.com/nyrahealth/CrisperWhisper/blob/develop/run_experiments/experiments/head_results.json

The repo also contains useful code to calculate these head results for other models, as long as you have a dataset with high-quality timestamps such as TIMIT (https://paperswithcode.com/dataset/timit). The idea is to evaluate the timings produced by DTW for each individual alignment head against ground-truth data and keep the heads that score highest. CrisperWhisper improves on this further by specifically training alignment heads:
https://github.com/nyrahealth/CrisperWhisper
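If you want to run this kind of search yourself, here is a minimal sketch of the idea (not the CrisperWhisper code): it scores every decoder cross-attention head in isolation by the onset error of its DTW word timestamps against reference timings. `eval_set` is a hypothetical list of `(audio_path, reference_words)` pairs, where `reference_words` is a list of `(word, start_sec, end_sec)` tuples from a corpus like TIMIT; `your_finetuned_checkpoint` is a placeholder.

```python
from transformers import pipeline

# eval_set: hypothetical list of (audio_path, [(word, start_sec, end_sec), ...]) pairs,
# e.g. built from TIMIT. You need to supply this yourself.
pipe = pipeline("automatic-speech-recognition", model="your_finetuned_checkpoint")
model = pipe.model

num_layers = model.config.decoder_layers
num_heads = model.config.decoder_attention_heads

def onset_error(pred_chunks, ref_words):
    # Mean absolute difference between predicted and reference word onsets.
    # Words are paired by position; a real evaluation should align the
    # hypothesis and reference word sequences (e.g. by edit distance) first.
    n = min(len(pred_chunks), len(ref_words))
    if n == 0:
        return float("inf")
    return sum(abs(pred_chunks[i]["timestamp"][0] - ref_words[i][1]) for i in range(n)) / n

scores = {}
for layer in range(num_layers):
    for head in range(num_heads):
        # Use this single cross-attention head as the only alignment head.
        model.generation_config.alignment_heads = [[layer, head]]
        errors = [
            onset_error(pipe(audio_path, return_timestamps="word")["chunks"], ref_words)
            for audio_path, ref_words in eval_set
        ]
        scores[(layer, head)] = sum(errors) / len(errors)

# The lowest-error heads are the candidates for alignment_heads.
best = sorted(scores, key=scores.get)[:10]
print([[layer, head] for layer, head in best])
```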