Last active
January 16, 2018 04:03
GSoC vp9 decoder improvements report
During GSoC I was able to accomplish the following: optimize part of the ipred functions and implement tile threading support.
Links for optimized avx2 ipred functions:
http://git.videolan.org/?p=ffmpeg.git;a=commit;h=35a5d9715dd82fd00f1d1401ec6be2d3e2eea81c
http://git.videolan.org/?p=ffmpeg.git;a=commit;h=81fc617c125734aa6f3b3d938af75fef6db750e7
http://git.videolan.org/?p=ffmpeg.git;a=commit;h=73d9a9a6af5d00cfa9b98c7d9fc9abd0c734ba8e
Links for the tile threading code:
http://ffmpeg.org/pipermail/ffmpeg-devel/2017-August/215363.html
http://ffmpeg.org/pipermail/ffmpeg-devel/2017-August/215361.html
http://ffmpeg.org/pipermail/ffmpeg-devel/2017-August/215393.html
Tile threading support is not committed to the main repository yet, as it is still being reviewed by the developers.
After my changes I got these performance numbers:
Tile threading is ~45% faster at 2 threads vs 1.
Frame threading is ~55% faster at 2 threads vs 1.
ffvp9 tile threading is ~25% faster than libvpx-vp9 at 2 threads.
There were a few challenging parts in writing tile threading support.
The first was debugging a multithreaded application in general, which is not easy and which I had little prior experience with.
I learned a lot and overall it has been a good experience, although I spent a lot of time on it.
The second, more specific, was making the loopfilter work with a small VP9Filter *lflvl allocation covering something like 4 superblock rows.
We don't know how far behind the working threads the loopfilter is, so the loopfilter and the working threads have to be synchronized,
so that the working threads don't overwrite the lflvl structure with information for rows that are further ahead.
That was quite challenging: the synchronization didn't work on the second frame, and it was hard to debug and see what the problem was.
I spent a lot of time on this.
To avoid losing more time, I solved it by allocating the lflvl structure for as many superblock rows as there are in a frame.
That eliminates this race condition completely, but requires a little more memory.
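The difference between the two allocation schemes can be sketched roughly as follows. This is only an illustration and not the code from the patches: the FilterLevels struct, RING_ROWS, and all function names are hypothetical stand-ins, and the only fact taken from VP9 itself is that superblocks are 64x64 pixels.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the per-superblock loopfilter data.
 * This is NOT FFmpeg's VP9Filter, just a placeholder of some size. */
typedef struct FilterLevels {
    unsigned char level[64];
} FilterLevels;

/* Assumed size of the small ring allocation ("like 4 superblock rows"). */
#define RING_ROWS 4

/* Number of 64x64 superblock rows needed to cover a frame height. */
int sb_rows_for_height(int height)
{
    assert(height > 0);
    return (height + 63) / 64;
}

/* Small allocation: rows that are RING_ROWS apart share one slot. */
int ring_slot(int sb_row)
{
    assert(sb_row >= 0);
    return sb_row % RING_ROWS;
}

/* With the ring, a worker may only write row sb_row once the loopfilter
 * has finished the row that previously used the same slot. rows_filtered
 * is the number of rows the loopfilter has completed so far. */
int worker_may_write_ring(int sb_row, int rows_filtered)
{
    return sb_row < rows_filtered + RING_ROWS;
}

/* Full-frame allocation: every row has its own slot, so a worker never
 * has to wait for the loopfilter before writing. */
int frame_slot(int sb_row)
{
    assert(sb_row >= 0);
    return sb_row;
}

/* Allocate one entry per superblock for the whole frame.
 * The caller owns the result and releases it with free(). */
FilterLevels *alloc_full_frame(int height, int sb_cols)
{
    size_t rows = (size_t)sb_rows_for_height(height);
    return calloc(rows * (size_t)sb_cols, sizeof(FilterLevels));
}
```

For a frame that is 1080 pixels high this means 17 superblock rows are allocated instead of 4, which is the extra memory mentioned above. In exchange, the check in worker_may_write_ring() and the synchronization around it are no longer needed.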
Overall, I am really happy with the work I have done, although I had hoped to write a lot more code for vp9.
This is my first time working on such a big project, and I gained a lot of experience.
I still have to do: avx2 assembly for the loopfilter, alpha channel support, and the rest of the avx2 assembly
for the ipred functions. Looking forward to that!
UPD: As of 08.09.2017 tile threading has been committed to the main repository, links:
http://git.videolan.org/?p=ffmpeg.git;a=commit;h=e59da0f7ff129d570adb72c6479f7ce07cf5a0f9
http://git.videolan.org/?p=ffmpeg.git;a=commit;h=83c12fefd22fc2326a000019e5c1a33e90a874e8
@rubdos I haven't had a chance to test the SIMD code on a Ryzen cpu yet, so I can't give exact numbers.
Are there any numbers that show what kind of performance improvement AVX brings in vp9 decoding? All the example numbers compare 2 vs 1 threads. What do the numbers look like at 1 vs 1 thread?
I'm interested in knowing how this performs on Ryzen, given that AVX is implemented in halves. Any insights?