Trying whisperfile

llamafile v0.8.13 (and whisperfile) is out:

This release introduces whisperfile, which is a single-file implementation of OpenAI's Whisper model. It lets you transcribe speech to text and even translate it. Our implementation is based on Georgi Gerganov's whisper.cpp project.

The project to turn it into a whisperfile was founded by CJ Pais, who has handed over maintenance of his awesome work.

I want to kick the tires of whisperfile by transcribing a podcast episode with it.

Download Whisper medium Q5_0 quantized model weights:

$ curl -L -o ggml-medium-q5_0.bin https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-medium-q5_0.bin
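
The whisperfile binary itself comes from the llamafile release page. A minimal sketch, assuming the 0.8.13 release asset is named whisperfile-0.8.13 (check the Mozilla-Ocho/llamafile releases page for the exact asset name):

$ curl -L -o whisperfile-0.8.13 https://github.com/Mozilla-Ocho/llamafile/releases/download/0.8.13/whisperfile-0.8.13
$ chmod +x whisperfile-0.8.13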

Ask Llama 3.1 405B to write a CLI command for converting m4a to wav format: https://hf.co/chat/r/_no86nj

$ ffmpeg -i input.m4a -ar 16000 -ac 2 output.wav
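
Whisper models expect 16 kHz input, hence -ar 16000. Keeping two channels (-ac 2) preserves stereo, which the --diarize flag used later depends on; for plain transcription, mono (-ac 1) works just as well.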

Transcribe audio with whisperfile:

$ ./whisperfile-0.8.13 -m ggml-medium-q5_0.bin -f pod-1.wav --output-json --output-file pod-1-transcript --flash-attn
whisper_init_from_file_with_params_no_state: loading model from 'ggml-medium-q5_0.bin'
whisper_init_with_params_no_state: cuda gpu   = 0
whisper_init_with_params_no_state: metal gpu  = 0
whisper_init_with_params_no_state: flash attn = 1
whisper_init_with_params_no_state: gpu_device = 0
whisper_init_with_params_no_state: dtw        = 0
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1024
whisper_model_load: n_audio_head  = 16
whisper_model_load: n_audio_layer = 24
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1024
whisper_model_load: n_text_head   = 16
whisper_model_load: n_text_layer  = 24
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 8
whisper_model_load: qntvr         = 1
whisper_model_load: type          = 4 (medium)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: n_langs       = 99
whisper_model_load:      CPU total size =   538.59 MB
whisper_model_load: model size    =  538.59 MB
whisper_init_state: kv self size  =  150.99 MB
whisper_init_state: kv cross size =  150.99 MB
whisper_init_state: kv pad  size  =    6.29 MB
whisper_init_state: compute buffer (conv)   =   28.81 MB
whisper_init_state: compute buffer (encode) =  180.01 MB
whisper_init_state: compute buffer (cross)  =    7.98 MB
whisper_init_state: compute buffer (decode) =   98.45 MB

system_info: n_threads = 4 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 0 | COREML = 0 | OPENVINO = 0

main: processing 'pod-1.wav' (15623168 samples, 976.4 sec), 4 threads, 1 processors, 5 beams + best of 5, lang = en, task = transcribe, timestamps = 1 ...

[... truncated ...]

output_json: saving output to 'pod-1-transcript-.json'

whisper_print_timings:     load time =  8084.27 ms
whisper_print_timings:     fallbacks =   1 p /   2 h
whisper_print_timings:      mel time =  1293.30 ms
whisper_print_timings:   sample time = 16927.53 ms / 22203 runs (    0.76 ms per run)
whisper_print_timings:   encode time = -536733.75 ms /    35 runs (-15335.25 ms per run)
whisper_print_timings:   decode time =  2142.09 ms /    41 runs (   52.25 ms per run)
whisper_print_timings:   batchd time = 748232.56 ms / 21985 runs (   34.03 ms per run)
whisper_print_timings:   prompt time = 218609.02 ms /  7743 runs (   28.23 ms per run)
whisper_print_timings:    total time = 458874.28 ms

Note that the previous command ran whisper with flash attention (--flash-attn).
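
To peek at the result, you can pull the segment text out of the JSON. A minimal sketch with jq, assuming whisper.cpp's output layout, where segments sit in a top-level transcription array, each with a text field:

$ jq -r '.transcription[].text' pod-1-transcript-.json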

You can let whisper auto-detect CUDA and use your GPU if you want:

$ ./whisperfile-0.8.13 -m ggml-medium-q5_0.bin -f pod-1.wav --output-json --output-file pod-1-transcript --gpu auto
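
As I understand the --gpu auto behavior, whisperfile picks up a supported GPU when one is available and falls back to CPU inference otherwise.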

whisperfile also supports speaker diarization (separating who said what) if you set the --diarize flag:

$ ./whisperfile-0.8.13 -m ggml-medium-q5_0.bin -f pod-1.wav --output-json --output-file pod-1-transcript --diarize
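
This is whisper.cpp's simple stereo-based diarization, which, as I understand it, labels each segment by which of the two channels carries more energy; that is why the ffmpeg conversion above kept two channels (-ac 2).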

I tweeted about this: https://x.com/cedric_chee/status/1825778199823004073
