This summer I worked on dav1d, the AV1 decoder used in VLC, Firefox, Chromium and many others. My task was to try to add GPU support to it: it could be faster, it could reduce power consumption, or it could simply free the CPU to do other work in the meantime. The plan was to write a shader for one of the decoding stages, invoke it through the different graphics APIs, benchmark it, and then do the same for another decoding stage.
To be accepted in GSoC, VideoLAN asked me to write a simple patch for dav1d. I added a few benchmarking options to the dav1d CLI, notably a realtime mode that can be used to benchmark power consumption under realistic playback conditions.
https://code.videolan.org/videolan/dav1d/merge_requests/670
I first worked on an OpenGL ES 3.1 version of the sgr filter, but that didn't work well in terms of performance. At all. The main bottleneck seemed to be memory management, which cannot really be worked around in OpenGL without extensions that only a few vendors support. So I moved to Vulkan.
https://code.videolan.org/stebler/dav1d/compare/master...gles
On Vulkan it is possible to directly access the memory of integrated GPUs (this whole project mainly targets mobile devices), which simplifies memory transfers a lot. This time I worked on the simpler Wiener filter, so the bringup was faster, but performance was lacking here too. The reason was that the program sent commands to the GPU for every single restoration unit (a block of at most 384 x 64 pixels). So I tried a different approach: first fill a temporary buffer with multiple restoration units, then let the GPU process them all at once. The code became quite messy, so I refactored most of it and moved to a separate branch.
https://code.videolan.org/stebler/dav1d/compare/master...vulkan
I also added sgr filter support for Vulkan. Performance was way better but still not good enough (~3 times slower than the optimized AVX2 asm for sgr, ~5 times for Wiener), even when using multiple threads (so that some can do other work while others wait on the GPU). So I had to focus on decoding stages that account for more than 1-2% of the total time. The cdef filter was one of them (~11% for the sample I worked with). I wrote a shader for a special case of it too, and this time the performance was a little more promising. But my code had become a mess again, so I worked on a refactoring.
I replaced a lot of constants that happened to work well for my machine with automatic detection. I added conditional compilation of shaders, full cdef filter support, and 10- and 12-bit-per-channel support. I also cleaned up a lot of the code.
https://code.videolan.org/stebler/dav1d/compare/master...vulkan3
At the moment my branch is absolutely not ready to be merged; there is still a lot to do. First of all I have to run more tests to determine, for each of the filters, whether running it on the GPU is worthwhile at all. That may depend a lot on the hardware. Then I will have to improve the stability of the code by doing more testing, as well as the code style.