While working on an audio player, I wanted to extract waveform data from audio files so I could paint the waveform dynamically in the browser on a <canvas> element.
Initially I used the bbc/audiowaveform package, but this proved problematic for two reasons. First, I wasn't able to install the package (or build the binary) on macOS for local development. Second, the only platform I could figure out how to install it on was Ubuntu, so I couldn't use it on Alpine (for Docker images) or in other environments such as cloud functions.
I found out from these docs that it's possible to paint a waveform with ffmpeg by extracting raw audio data:
https://trac.ffmpeg.org/wiki/Waveform#UsingGnuplot
The idea is that you can feed basically anything into ffmpeg (any audio or video file) and output raw PCM audio to stdout or a file. Then you can read that raw audio data and turn it into something usable to paint your waveform.
I got it working after a bit of tinkering, but this approach requires you to downsample the audio. Otherwise you will produce a lot of raw audio data, especially for long audio files. When downsampling you can lose a lot of detail in the audio, which produces bad waveforms.
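To get a feel for how much raw data is involved, here is a small sketch of the size arithmetic for uncompressed PCM. The function name `raw_pcm_bytes` is made up for illustration; the formula is simply duration × sample rate × channels × bytes per sample.

```python
def raw_pcm_bytes(duration_s: float, sample_rate: int,
                  bits_per_sample: int, channels: int = 1) -> int:
    """Size in bytes of uncompressed PCM audio (hypothetical helper)."""
    bytes_per_sample = bits_per_sample // 8
    return int(duration_s * sample_rate * channels * bytes_per_sample)


# One minute of stereo 16-bit audio at 44100 samples per second
print(raw_pcm_bytes(60, 44100, 16, channels=2))  # 10584000

# One minute of mono 16-bit audio downsampled to 8000 samples per second
print(raw_pcm_bytes(60, 8000, 16))  # 960000
```

Even after downsampling, a one-hour file at the second setting is still tens of megabytes, which is why the sample rate matters so much here.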
I'm documenting this approach here for reference:
ffmpeg -i test.wav -ac 1 -filter:a aresample=8000 -map 0:a -c:a pcm_s16le -f data -
Explanation
-i test.wav
  input file
-ac 1
  mix all audio channels into one
-filter:a aresample=8000
  downsample to 8000 samples per second to reduce the amount of data (audio is typically 44100 samples per second)
-map 0:a
  select all audio streams from input 0
-c:a pcm_s16le
  encode as signed 16-bit little-endian PCM, so each sample is a value between -32,768 and 32,767 (even if the source audio is 24 or 32 bits)
-f data -
  output binary data to stdout
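As a rough sketch of how the stdout data could be consumed, the Python below runs the command above and decodes the bytes into signed 16-bit samples. The function names `read_pcm_from_ffmpeg` and `decode_s16le` are made up for illustration, and the sketch assumes `ffmpeg` is available on the PATH.

```python
import struct
import subprocess


def decode_s16le(raw: bytes) -> list[int]:
    """Decode signed 16-bit little-endian PCM bytes into integer samples."""
    count = len(raw) // 2  # two bytes per sample; a trailing odd byte is ignored
    return list(struct.unpack(f"<{count}h", raw[:count * 2]))


def read_pcm_from_ffmpeg(path: str, sample_rate: int = 8000) -> list[int]:
    """Run ffmpeg (assumed to be on the PATH) and return mono samples."""
    command = [
        "ffmpeg", "-i", path,
        "-ac", "1",
        "-filter:a", f"aresample={sample_rate}",
        "-map", "0:a",
        "-c:a", "pcm_s16le",
        "-f", "data", "-",
    ]
    # ffmpeg logs to stderr, so stdout only carries the raw sample bytes
    result = subprocess.run(command, stdout=subprocess.PIPE,
                            stderr=subprocess.DEVNULL, check=True)
    return decode_s16le(result.stdout)
```

Calling `read_pcm_from_ffmpeg("test.wav")` would return one integer per sample, each between -32768 and 32767.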
The same command, writing to a file instead of stdout:
ffmpeg -i test.wav -ac 1 -filter:a aresample=8000 -map 0:a -c:a pcm_s16le -f data data.txt
Explanation
-i test.wav
  input file
-ac 1
  mix all audio channels into one
-filter:a aresample=8000
  downsample to 8000 samples per second
-map 0:a
  select all audio streams from input 0
-c:a pcm_s16le
  sets the sample format to signed 16 bits
-f data data.txt
  write the binary data to data.txt for further processing (despite the extension, the file contains raw bytes, not text)
Depending on what you want to do, even downsampling to 8000 samples per second at 16 bits per sample can still be far too much data. My goal was to paint a waveform for an audio player, so I didn't need that much resolution and went as low as 500 samples per second and 8 bits per sample.
ffmpeg -i test.wav -ac 1 -filter:a aresample=500 -map 0:a -c:a pcm_u8 -f data data.txt
Explanation
-i test.wav
  input file
-ac 1
  mix all audio channels into one
-filter:a aresample=500
  downsample to 500 samples per second
-map 0:a
  select all audio streams from input 0
-c:a pcm_u8
  unsigned 8 bits per sample (so values between 0 and 255, with silence at 128)
-f data data.txt
  write the binary data to data.txt for further processing (despite the extension, the file contains raw bytes, not text)
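This produced about 30 kB of raw data per minute of audio (500 one-byte samples per second), which is easily parsed. Unfortunately, as I explained before, the generated waveform doesn't really resemble the actual audio: at 500 samples per second, everything above 250 Hz is discarded.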
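For reference, here is a minimal sketch of how the unsigned 8-bit data could be reduced to one peak per bar of the waveform. The function name `peaks_from_u8` and the bucket count are assumptions for illustration, not part of the original workflow.

```python
def peaks_from_u8(raw: bytes, buckets: int) -> list[float]:
    """Reduce unsigned 8-bit PCM to `buckets` peak amplitudes from 0.0 to 1.0."""
    if not raw or buckets <= 0:
        return []
    buckets = min(buckets, len(raw))  # never more bars than samples
    peaks = []
    for i in range(buckets):
        start = i * len(raw) // buckets
        end = (i + 1) * len(raw) // buckets
        # Silence sits at 128 in unsigned 8-bit PCM, so measure distance from it
        peak = max(abs(sample - 128) for sample in raw[start:end])
        peaks.append(min(peak / 127, 1.0))
    return peaks


# Usage sketch: one value per horizontal pixel of a 600px wide canvas
# with open("data.txt", "rb") as f:
#     bars = peaks_from_u8(f.read(), 600)
```

Taking the maximum of each bucket rather than the average keeps short transients visible, although it cannot restore the detail already lost by downsampling.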
The better approach is to use the astats filter, which reports statistics such as the peak level in decibels for a series of chunks (or rather, frames) of samples.
ffmpeg -i audio.wav -af "aresample=44100,asetnsamples=4000,astats=reset=1:metadata=1,ametadata=print:key='lavfi.astats.Overall.Peak_level':file=stats.log" -f null -
Explanation:
aresample=44100
  resamples the audio to 44100 samples per second, in case your source uses a higher sample rate
asetnsamples=4000
  defines the chunk size: each frame contains 4000 samples, so at 44100 samples per second each chunk covers approximately 1/11th of a second
astats=reset=1:metadata=1
  recalculates the statistics for every frame (reset=1) and attaches them to the frame as metadata (metadata=1)
ametadata=print:key='lavfi.astats.Overall.Peak_level'
  the value that will be printed to the file. If you check the astats docs, there are many more values that can be printed, such as the RMS level
file=stats.log
  where the data will be written to
This is the result you will get in the stats.log file, which can be easily parsed:
frame:0 pts:0 pts_time:0
lavfi.astats.Overall.Peak_level=-72.246934
frame:1 pts:4000 pts_time:0.0907029
lavfi.astats.Overall.Peak_level=-72.246934
frame:2 pts:8000 pts_time:0.181406
lavfi.astats.Overall.Peak_level=-71.223883
frame:3 pts:12000 pts_time:0.272109
lavfi.astats.Overall.Peak_level=-71.223883
frame:4 pts:16000 pts_time:0.362812
lavfi.astats.Overall.Peak_level=-70.308734
frame:5 pts:20000 pts_time:0.453515
lavfi.astats.Overall.Peak_level=-69.480880
frame:6 pts:24000 pts_time:0.544218
lavfi.astats.Overall.Peak_level=-68.725109
frame:7 pts:28000 pts_time:0.634921
lavfi.astats.Overall.Peak_level=-49.640259
frame:8 pts:32000 pts_time:0.725624
lavfi.astats.Overall.Peak_level=-40.565966
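A minimal parsing sketch for the log format shown above: it pairs each `frame:` line with the `Peak_level` line that follows it. The function name `parse_stats` is made up for illustration, and the sketch assumes the file contains only this one metadata key.

```python
import re

FRAME_RE = re.compile(r"frame:(\d+)\s+pts:(-?\d+)\s+pts_time:(\S+)")
KEY = "lavfi.astats.Overall.Peak_level="


def parse_stats(text: str) -> list[tuple[float, float]]:
    """Return (pts_time, peak_db) pairs from the ametadata print output."""
    result = []
    current_time = None
    for line in text.splitlines():
        line = line.strip()
        match = FRAME_RE.match(line)
        if match:
            current_time = float(match.group(3))
        elif line.startswith(KEY) and current_time is not None:
            # float() also accepts "-inf", in case a silent frame is logged that way
            result.append((current_time, float(line[len(KEY):])))
    return result


# Usage sketch:
# with open("stats.log") as f:
#     levels = parse_stats(f.read())
```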
So for the first chunk, the value you want is -72.246934, which is a logarithmic value in decibels relative to full scale (so 0 dB is the maximum value).
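To draw bars from these values, the decibel figures need to be mapped to a 0 to 1 range. The sketch below shows two options; the function names and the -60 dB floor are assumptions chosen for illustration, not something prescribed by ffmpeg.

```python
def db_to_amplitude(db: float) -> float:
    """Convert a peak level in dB (0 dB = full scale) to linear amplitude."""
    if db == float("-inf"):
        return 0.0  # digital silence
    return min(10 ** (db / 20), 1.0)


def db_to_bar_height(db: float, floor_db: float = -60.0) -> float:
    """Map the range floor_db..0 dB linearly onto 0.0..1.0, clamping outside it.

    floor_db is an arbitrary cutoff below which a bar is drawn as empty.
    """
    return max(0.0, min((db - floor_db) / -floor_db, 1.0))
```

The linear conversion is the physically accurate one, but quiet passages become almost invisible (a -72 dB peak corresponds to roughly 0.00025 of full scale), so scaling the decibel value directly may give a more readable waveform for a player.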
Thank you for this informative gist! I've tried to do this with .mp3 using lavfi but it doesn't generate the data sometimes.