Little GPU ffmpeg NVENC encoding benchmarks

Intro
NVIDIA RTX A4000
NVIDIA RTX A2000
RTX 4000
NVIDIA T1000
NVIDIA RTX 2080 Ti
Quadro P4000
Conslusion

Intro

Input data:

Input #0, mov,mp4,m4a,3gp,3g2,mj2, from './video.mp4':
  Metadata:
    major_brand     : isom
    minor_version   : 512
    compatible_brands: isomiso2avc1mp41
    encoder         : Lavf58.29.100
  Duration: 01:23:05.33, start: 0.000000, bitrate: 3005 kb/s
    Stream #0:0(und): Video: h264 (avc1 / 0x31637661), yuv420p, 1920x1080 [SAR 1:1 DAR 16:9], 2870 kb/s, 30 fps, 30 tbr, 90k tbn, 60 tbc (default)
    Metadata:
      handler_name    : VideoHandler
    Stream #0:1(und): Audio: aac (mp4a / 0x6134706D), 48000 Hz, stereo, fltp, 128 kb/s (default)
    Metadata:
      handler_name    : SoundHandler

PC Specs:

i9-10900k water cooled
64 GB Ram @ 3Ghz
Ubuntu 21.10 kernel 5.13.0-28-generic
nvidia-headless-470
nvidia-docker2

Custom build FFMPEG(bethrezen/ffmpeg-rtmp-nvenc docker image(sha256:0c17d7133066e0e305aba51d9abd28a4b07259c8dd719c1d072d073fd3baa6fb) based on Cuda 10.2):

ffmpeg version 4.3.1 Copyright (c) 2000-2020 the FFmpeg developers
  built with gcc 7 (Ubuntu 7.5.0-3ubuntu1~18.04)
  configuration: --enable-version3 --enable-gpl --enable-nonfree --enable-small --enable-libmp3lame --enable-libx264 --enable-libtheora --enable-libvorbis --enable-libopus --enable-libfdk-aac --enable-libass --enable-librtmp --enable-postproc --enable-avresample --enable-libfreetype --enable-openssl --enable-avfilter --enable-libxvid --enable-pic
--enable-shared --enable-pthreads --enable-shared --enable-nvenc --enable-cuda --enable-opencl --enable-cuvid --enable-libnpp --disable-stripping --disable-static --disable-debug --extra-cflags=-I/usr/local/cuda/include --extra-ldflags=-L/usr/local/cuda/lib64
  libavutil      56. 51.100 / 56. 51.100
  libavcodec     58. 91.100 / 58. 91.100
  libavformat    58. 45.100 / 58. 45.100
  libavdevice    58. 10.100 / 58. 10.100
  libavfilter     7. 85.100 /  7. 85.100
  libavresample   4.  0.  0 /  4.  0.  0
  libswscale      5.  7.100 /  5.  7.100
  libswresample   3.  7.100 /  3.  7.100
  libpostproc    55.  7.100 / 55.  7.100

Fullhd -> Fullhd command:

time ffmpeg -y -v info -hwaccel cuvid  -hwaccel_device 0 -c:v h264_cuvid -i ./video.mp4 \
-c:a aac -b:a 128k -c:v h264_nvenc -b:v 3500k -maxrate 3500k -preset fast -profile:v baseline -g 30 -r 30 out_1080.mp4

Transcoding command:

time ffmpeg -y -v info -hwaccel cuvid  -hwaccel_device 0 -c:v h264_cuvid -i ./video.mp4 \
-c:a aac -b:a 128k -c:v h264_nvenc -b:v 3500k -maxrate 3500k -preset fast -profile:v baseline -g 30 -r 30 out_1080.mp4 \
-c:a aac -b:a 128k -c:v h264_nvenc -b:v 2000k -maxrate 2000k -vf hwupload,scale_npp=w=1280:h=720 -preset fast -profile:v baseline -g 30 -r 30  out_720.mp4 \
-c:a aac -b:a 128k -c:v h264_nvenc -b:v 1000k -maxrate 1000k -vf hwupload,scale_npp=w=854:h=480 -preset fast -profile:v baseline -g 30 -r 30  out_480.mp4

NVIDIA RTX A4000

Specs from datasheet:

16 GB GDDR6 256-bit 448 GB/s
Cuda Cores (Ampere): 6,144
Tensor cores (3rd gen): 192
RT Cores (2nd gen): 48
Single-precision performance: 19.2 TFLOPS
Total power consumption: 140W
Encoding streams: unlimited

Additional power: PCIe 6-pin

PCIe slots: 1

Purchase price: $2053

Fullhd -> Fullhd

~ 317 MB VRAM ± 65-74W

frame=149565 fps=825 q=22.0 Lsize= 1703147kB time=01:23:05.46 bitrate=2798.6kbits/s dup=5 drop=0 speed=27.5x
video:1627352kB audio:71575kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.248380%
[aac @ 0x562f54b56c40] Qavg: 7124.271

real    3m1.579s
user    1m8.791s
sys     0m3.517s

Transcoding Fullhd -> Fullhd + HD + SD

~ 379 MB VRAM ± 78-90W

frame=149565 fps=436 q=22.0 Lq=21.0 q=23.0 size= 1703147kB time=01:23:05.46 bitrate=2798.6kbits/s dup=15 drop=0 speed=14.5x
video:3157909kB audio:214724kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: unknown
[aac @ 0x55c0d10231c0] Qavg: 7124.271
[aac @ 0x55c0d101ed80] Qavg: 7124.271
[aac @ 0x55c0d1034280] Qavg: 7124.271

real    5m43.207s
user    3m14.678s
sys     0m4.525s

NVIDIA RTX A2000

Specs from datasheet:

6 GB GDDR6 192-bit 288 GB/s
Cuda Cores (Ampere): 3,328
Tensor cores (3rd gen): 104
RT Cores (2nd gen): 26
Single-precision performance: 8.0 TFLOPS
Total power consumption: 70W
Encoding streams: unlimited

Additional power: not required

PCIe slots: 2

Purchase price: $964

Fullhd -> Fullhd

~ 256 MB VRAM ± 55W

frame=149565 fps=816 q=22.0 Lsize= 1703147kB time=01:23:05.46 bitrate=2798.6kbits/s dup=5 drop=0 speed=27.2x
video:1627352kB audio:71575kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.248380%
[aac @ 0x5619cec43cc0] Qavg: 7124.271

real    3m3.436s
user    1m7.428s
sys     0m2.941s

UPD8: Launching the same test on i9-12900K CPU but on PCIe 3.0 4x lane resulted in 8% fps penalty. Seems like PCIe bandwidth matters a lot here.

Transcoding Fullhd -> Fullhd + HD + SD

~ 317 MB VRAM ± 65W

frame=149565 fps=460 q=22.0 Lq=21.0 q=23.0 size= 1703147kB time=01:23:05.46 bitrate=2798.6kbits/s dup=15 drop=0 speed=15.3x
video:3157909kB audio:214724kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: unknown
[aac @ 0x5596ac0e11c0] Qavg: 7124.271
[aac @ 0x5596ac0dcd80] Qavg: 7124.271
[aac @ 0x5596ac0f2280] Qavg: 7124.271

real    5m25.119s
user    2m51.941s
sys     0m4.682s

RTX 4000

Specs from datasheet:

8 GB GDDR6 192-bit 416 GB/s
Cuda Cores (Turing): 2,304
Tensor cores (? gen): 288
RT Cores (? gen): 36
Single-precision performance: 7.1 TFLOPS
Total power consumption: 125W
Encoding streams: unlimited

Additional power: PCIe 6+2-pin

PCIe slots: 1

Purchase price: $1287

Fullhd -> Fullhd

~ 256 MB VRAM ± 61W

frame=149565 fps=876 q=22.0 Lsize= 1737670kB time=01:23:05.46 bitrate=2855.3kbits/s dup=5 drop=0 speed=29.2x
video:1661875kB audio:71575kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.243433%
[aac @ 0x55d81d057cc0] Qavg: 7124.271

real    2m51.125s
user    1m7.820s
sys     0m3.456s

Transcoding Fullhd -> Fullhd + HD + SD

~ 315 MB VRAM ± 70W

frame=149565 fps=492 q=22.0 Lq=21.0 q=23.0 size= 1737670kB time=01:23:05.46 bitrate=2855.3kbits/s dup=15 drop=0 speed=16.4x
video:3214019kB audio:214724kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: unknown
[aac @ 0x562a5beda1c0] Qavg: 7124.271
[aac @ 0x562a5bed5d80] Qavg: 7124.271
[aac @ 0x562a5beeb280] Qavg: 7124.271

real    5m4.469s
user    2m48.625s
sys     0m4.851s

NVIDIA T1000

Specs from datasheet:

4 GB GDDR6 128-bit 160 GB/s
Cuda Cores (Turing): 896
Tensor cores: none
RT Cores: none
Single-precision performance: <= 2.5 TFLOPS
Total power consumption: 50W
Encoding streams: 3

Additional power: not required

PCIe slots: 1

Purchase price: $435

Fullhd -> Fullhd

~ 209 MB VRAM <= 50W (nvidia-smi reports N/A)

frame=149565 fps=732 q=21.0 Lsize= 1732661kB time=01:23:05.46 bitrate=2847.1kbits/s dup=5 drop=0 speed=24.4x
video:1656867kB audio:71575kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.244139%
[aac @ 0x55f4f4a0ecc0] Qavg: 7124.271

real    3m24.583s
user    1m12.531s
sys     0m3.221s

Transcoding Fullhd -> Fullhd + HD + SD

~ 273 MB VRAM <= 50W (nvidia-smi reports N/A)

frame=149565 fps=405 q=21.0 Lq=20.0 q=22.0 size= 1732661kB time=01:23:05.46 bitrate=2847.1kbits/s dup=15 drop=0 speed=13.5x
video:3269555kB audio:214724kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: unknown
[aac @ 0x564098f571c0] Qavg: 7124.271
[aac @ 0x564098f52d80] Qavg: 7124.271
[aac @ 0x564098f68280] Qavg: 7124.271

real    6m9.938s
user    3m38.598s
sys     0m5.236s

NVIDIA RTX 2080 Ti

Disclaimer: this test was made on the same video but on a bit different server with i9-12900K CPU and 3600 Mhz RAM. Driver version: 510.47.03

Specs from datasheet:

11 GB GDDR6 352-bit 616 GB/s
Cuda Cores (Turing): 4352
Tensor cores: 544 ?
RT Cores: 68 ?
Single-precision performance: 13.45 TFLOPS?
Total power consumption: 260W
Encoding streams: 3

Additional power: PCIe 2x 6+2-pin

PCIe slots: 3

Purchase price: N/A

Fullhd -> Fullhd

~ 370 MB VRAM ±93W

frame=149565 fps=911 q=22.0 Lsize= 1737670kB time=01:23:05.46 bitrate=2855.3kbits/s dup=5 drop=0 speed=30.4x
video:1661875kB audio:71575kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.243433%
[aac @ 0x55a480458c40] Qavg: 7124.271

real    2m44.401s
user    1m21.541s
sys     0m2.731s

Transcoding Fullhd -> Fullhd + HD + SD

~ 370 MB VRAM ±104W

frame=149565 fps=521 q=22.0 Lq=21.0 q=23.0 size= 1737670kB time=01:23:05.46 bitrate=2855.3kbits/s dup=15 drop=0 speed=17.4x
video:3214019kB audio:214724kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: unknown
[aac @ 0x55ea220f81c0] Qavg: 7124.271
[aac @ 0x55ea220f3d80] Qavg: 7124.271
[aac @ 0x55ea22109280] Qavg: 7124.271

real    4m47.268s
user    3m17.701s
sys     0m4.718s

Quadro P4000

Disclaimer: this test was made on the same video but on a bit different server with AMD Ryzen 9 3950X and some load on it(but no nvenc load). Driver version: 450.119.03

Specs from datasheet:

8 GB GDDR5 256-bit 243 GB/s
Cuda Cores (Pascal): 1792
Tensor cores: none
RT Cores: none
Single-precision performance: 5.2 TFLOPS
Total power consumption: 105W
Encoding streams: unlimited

Additional power: PCIe 6-pin

PCIe slots: 1

Purchase price: N/A

Fullhd -> Fullhd

~ 234 MB VRAM ±43W

frame=149565 fps=644 q=21.0 Lsize= 1732661kB time=01:23:05.46 bitrate=2847.1kbits/s dup=5 drop=0 speed=21.5x
video:1656867kB audio:71575kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.244139%
[aac @ 0x56269dd00c40] Qavg: 7124.271

real	3m52.991s
user	1m38.497s
sys	0m3.769s

Transcoding Fullhd -> Fullhd + HD + SD

~ 294 MB VRAM ±50W

frame=149565 fps=384 q=21.0 Lq=20.0 q=22.0 size= 1732661kB time=01:23:05.46 bitrate=2847.1kbits/s dup=15 drop=0 speed=12.8x
video:3269555kB audio:214724kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: unknown
[aac @ 0x55cfae2e21c0] Qavg: 7124.271
[aac @ 0x55cfae2ddd80] Qavg: 7124.271
[aac @ 0x55cfae2f3280] Qavg: 7124.271

real	6m30.079s
user	3m59.463s
sys	0m5.913s

Conclusion

For video encoding NVIDIA T1000 (while having only 3 encoding streams) is the perfect cost/fps. If you need some real CUDA and/or Tensor cores - NVIDIA RTX A2000 is your choice. They both doesn't require additional PCIe power connectors.

For home-usage in some little machine learning things you choose between cheaper 8GB RTX 4000 and more expensive 16GB NVIDIA RTX A4000.

bethrezen/ffmpeg-benchmarks.md

Little GPU ffmpeg NVENC encoding benchmarks

Table of Contents

Intro

NVIDIA RTX A4000

Fullhd -> Fullhd

Transcoding Fullhd -> Fullhd + HD + SD

NVIDIA RTX A2000

Fullhd -> Fullhd

Transcoding Fullhd -> Fullhd + HD + SD

RTX 4000

Fullhd -> Fullhd

Transcoding Fullhd -> Fullhd + HD + SD

NVIDIA T1000

Fullhd -> Fullhd

Transcoding Fullhd -> Fullhd + HD + SD

NVIDIA RTX 2080 Ti

Fullhd -> Fullhd

Transcoding Fullhd -> Fullhd + HD + SD

Quadro P4000

Fullhd -> Fullhd

Transcoding Fullhd -> Fullhd + HD + SD

Conclusion

References