Skip to content

Instantly share code, notes, and snippets.

@bethrezen
Last active January 31, 2025 10:28
Show Gist options
  • Save bethrezen/ceeb2a9f01ca0f4bafe3ba9157a865fe to your computer and use it in GitHub Desktop.
Save bethrezen/ceeb2a9f01ca0f4bafe3ba9157a865fe to your computer and use it in GitHub Desktop.
Little GPU ffmpeg NVENC encoding benchmarks

Little GPU ffmpeg NVENC encoding benchmarks

Table of Contents

Intro

Input data:

Input #0, mov,mp4,m4a,3gp,3g2,mj2, from './video.mp4':
  Metadata:
    major_brand     : isom
    minor_version   : 512
    compatible_brands: isomiso2avc1mp41
    encoder         : Lavf58.29.100
  Duration: 01:23:05.33, start: 0.000000, bitrate: 3005 kb/s
    Stream #0:0(und): Video: h264 (avc1 / 0x31637661), yuv420p, 1920x1080 [SAR 1:1 DAR 16:9], 2870 kb/s, 30 fps, 30 tbr, 90k tbn, 60 tbc (default)
    Metadata:
      handler_name    : VideoHandler
    Stream #0:1(und): Audio: aac (mp4a / 0x6134706D), 48000 Hz, stereo, fltp, 128 kb/s (default)
    Metadata:
      handler_name    : SoundHandler

PC Specs:

  • i9-10900k water cooled
  • 64 GB Ram @ 3Ghz
  • Ubuntu 21.10 kernel 5.13.0-28-generic
  • nvidia-headless-470
  • nvidia-docker2

Custom build FFMPEG(bethrezen/ffmpeg-rtmp-nvenc docker image(sha256:0c17d7133066e0e305aba51d9abd28a4b07259c8dd719c1d072d073fd3baa6fb) based on Cuda 10.2):

ffmpeg version 4.3.1 Copyright (c) 2000-2020 the FFmpeg developers
  built with gcc 7 (Ubuntu 7.5.0-3ubuntu1~18.04)
  configuration: --enable-version3 --enable-gpl --enable-nonfree --enable-small --enable-libmp3lame --enable-libx264 --enable-libtheora --enable-libvorbis --enable-libopus --enable-libfdk-aac --enable-libass --enable-librtmp --enable-postproc --enable-avresample --enable-libfreetype --enable-openssl --enable-avfilter --enable-libxvid --enable-pic
--enable-shared --enable-pthreads --enable-shared --enable-nvenc --enable-cuda --enable-opencl --enable-cuvid --enable-libnpp --disable-stripping --disable-static --disable-debug --extra-cflags=-I/usr/local/cuda/include --extra-ldflags=-L/usr/local/cuda/lib64
  libavutil      56. 51.100 / 56. 51.100
  libavcodec     58. 91.100 / 58. 91.100
  libavformat    58. 45.100 / 58. 45.100
  libavdevice    58. 10.100 / 58. 10.100
  libavfilter     7. 85.100 /  7. 85.100
  libavresample   4.  0.  0 /  4.  0.  0
  libswscale      5.  7.100 /  5.  7.100
  libswresample   3.  7.100 /  3.  7.100
  libpostproc    55.  7.100 / 55.  7.100

Fullhd -> Fullhd command:

time ffmpeg -y -v info -hwaccel cuvid  -hwaccel_device 0 -c:v h264_cuvid -i ./video.mp4 \
-c:a aac -b:a 128k -c:v h264_nvenc -b:v 3500k -maxrate 3500k -preset fast -profile:v baseline -g 30 -r 30 out_1080.mp4

Transcoding command:

time ffmpeg -y -v info -hwaccel cuvid  -hwaccel_device 0 -c:v h264_cuvid -i ./video.mp4 \
-c:a aac -b:a 128k -c:v h264_nvenc -b:v 3500k -maxrate 3500k -preset fast -profile:v baseline -g 30 -r 30 out_1080.mp4 \
-c:a aac -b:a 128k -c:v h264_nvenc -b:v 2000k -maxrate 2000k -vf hwupload,scale_npp=w=1280:h=720 -preset fast -profile:v baseline -g 30 -r 30  out_720.mp4 \
-c:a aac -b:a 128k -c:v h264_nvenc -b:v 1000k -maxrate 1000k -vf hwupload,scale_npp=w=854:h=480 -preset fast -profile:v baseline -g 30 -r 30  out_480.mp4

NVIDIA RTX A4000

Specs from datasheet:

  • 16 GB GDDR6 256-bit 448 GB/s
  • Cuda Cores (Ampere): 6,144
  • Tensor cores (3rd gen): 192
  • RT Cores (2nd gen): 48
  • Single-precision performance: 19.2 TFLOPS
  • Total power consumption: 140W
  • Encoding streams: unlimited

Additional power: PCIe 6-pin

PCIe slots: 1

Purchase price: $2053

Fullhd -> Fullhd

~ 317 MB VRAM ± 65-74W

frame=149565 fps=825 q=22.0 Lsize= 1703147kB time=01:23:05.46 bitrate=2798.6kbits/s dup=5 drop=0 speed=27.5x
video:1627352kB audio:71575kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.248380%
[aac @ 0x562f54b56c40] Qavg: 7124.271

real    3m1.579s
user    1m8.791s
sys     0m3.517s

Transcoding Fullhd -> Fullhd + HD + SD

~ 379 MB VRAM ± 78-90W

frame=149565 fps=436 q=22.0 Lq=21.0 q=23.0 size= 1703147kB time=01:23:05.46 bitrate=2798.6kbits/s dup=15 drop=0 speed=14.5x
video:3157909kB audio:214724kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: unknown
[aac @ 0x55c0d10231c0] Qavg: 7124.271
[aac @ 0x55c0d101ed80] Qavg: 7124.271
[aac @ 0x55c0d1034280] Qavg: 7124.271

real    5m43.207s
user    3m14.678s
sys     0m4.525s

NVIDIA RTX A2000

Specs from datasheet:

  • 6 GB GDDR6 192-bit 288 GB/s
  • Cuda Cores (Ampere): 3,328
  • Tensor cores (3rd gen): 104
  • RT Cores (2nd gen): 26
  • Single-precision performance: 8.0 TFLOPS
  • Total power consumption: 70W
  • Encoding streams: unlimited

Additional power: not required

PCIe slots: 2

Purchase price: $964

Fullhd -> Fullhd

~ 256 MB VRAM ± 55W

frame=149565 fps=816 q=22.0 Lsize= 1703147kB time=01:23:05.46 bitrate=2798.6kbits/s dup=5 drop=0 speed=27.2x
video:1627352kB audio:71575kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.248380%
[aac @ 0x5619cec43cc0] Qavg: 7124.271

real    3m3.436s
user    1m7.428s
sys     0m2.941s

UPD8: Launching the same test on i9-12900K CPU but on PCIe 3.0 4x lane resulted in 8% fps penalty. Seems like PCIe bandwidth matters a lot here.

Transcoding Fullhd -> Fullhd + HD + SD

~ 317 MB VRAM ± 65W

frame=149565 fps=460 q=22.0 Lq=21.0 q=23.0 size= 1703147kB time=01:23:05.46 bitrate=2798.6kbits/s dup=15 drop=0 speed=15.3x
video:3157909kB audio:214724kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: unknown
[aac @ 0x5596ac0e11c0] Qavg: 7124.271
[aac @ 0x5596ac0dcd80] Qavg: 7124.271
[aac @ 0x5596ac0f2280] Qavg: 7124.271

real    5m25.119s
user    2m51.941s
sys     0m4.682s

RTX 4000

Specs from datasheet:

  • 8 GB GDDR6 192-bit 416 GB/s
  • Cuda Cores (Turing): 2,304
  • Tensor cores (? gen): 288
  • RT Cores (? gen): 36
  • Single-precision performance: 7.1 TFLOPS
  • Total power consumption: 125W
  • Encoding streams: unlimited

Additional power: PCIe 6+2-pin

PCIe slots: 1

Purchase price: $1287

Fullhd -> Fullhd

~ 256 MB VRAM ± 61W

frame=149565 fps=876 q=22.0 Lsize= 1737670kB time=01:23:05.46 bitrate=2855.3kbits/s dup=5 drop=0 speed=29.2x
video:1661875kB audio:71575kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.243433%
[aac @ 0x55d81d057cc0] Qavg: 7124.271

real    2m51.125s
user    1m7.820s
sys     0m3.456s

Transcoding Fullhd -> Fullhd + HD + SD

~ 315 MB VRAM ± 70W

frame=149565 fps=492 q=22.0 Lq=21.0 q=23.0 size= 1737670kB time=01:23:05.46 bitrate=2855.3kbits/s dup=15 drop=0 speed=16.4x
video:3214019kB audio:214724kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: unknown
[aac @ 0x562a5beda1c0] Qavg: 7124.271
[aac @ 0x562a5bed5d80] Qavg: 7124.271
[aac @ 0x562a5beeb280] Qavg: 7124.271

real    5m4.469s
user    2m48.625s
sys     0m4.851s

NVIDIA T1000

Specs from datasheet:

  • 4 GB GDDR6 128-bit 160 GB/s
  • Cuda Cores (Turing): 896
  • Tensor cores: none
  • RT Cores: none
  • Single-precision performance: <= 2.5 TFLOPS
  • Total power consumption: 50W
  • Encoding streams: 3

Additional power: not required

PCIe slots: 1

Purchase price: $435

Fullhd -> Fullhd

~ 209 MB VRAM <= 50W (nvidia-smi reports N/A)

frame=149565 fps=732 q=21.0 Lsize= 1732661kB time=01:23:05.46 bitrate=2847.1kbits/s dup=5 drop=0 speed=24.4x
video:1656867kB audio:71575kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.244139%
[aac @ 0x55f4f4a0ecc0] Qavg: 7124.271

real    3m24.583s
user    1m12.531s
sys     0m3.221s

Transcoding Fullhd -> Fullhd + HD + SD

~ 273 MB VRAM <= 50W (nvidia-smi reports N/A)

frame=149565 fps=405 q=21.0 Lq=20.0 q=22.0 size= 1732661kB time=01:23:05.46 bitrate=2847.1kbits/s dup=15 drop=0 speed=13.5x
video:3269555kB audio:214724kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: unknown
[aac @ 0x564098f571c0] Qavg: 7124.271
[aac @ 0x564098f52d80] Qavg: 7124.271
[aac @ 0x564098f68280] Qavg: 7124.271

real    6m9.938s
user    3m38.598s
sys     0m5.236s

NVIDIA RTX 2080 Ti

Disclaimer: this test was made on the same video but on a bit different server with i9-12900K CPU and 3600 Mhz RAM. Driver version: 510.47.03

Specs from datasheet:

  • 11 GB GDDR6 352-bit 616 GB/s
  • Cuda Cores (Turing): 4352
  • Tensor cores: 544 ?
  • RT Cores: 68 ?
  • Single-precision performance: 13.45 TFLOPS?
  • Total power consumption: 260W
  • Encoding streams: 3

Additional power: PCIe 2x 6+2-pin

PCIe slots: 3

Purchase price: N/A

Fullhd -> Fullhd

~ 370 MB VRAM ±93W

frame=149565 fps=911 q=22.0 Lsize= 1737670kB time=01:23:05.46 bitrate=2855.3kbits/s dup=5 drop=0 speed=30.4x
video:1661875kB audio:71575kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.243433%
[aac @ 0x55a480458c40] Qavg: 7124.271

real    2m44.401s
user    1m21.541s
sys     0m2.731s

Transcoding Fullhd -> Fullhd + HD + SD

~ 370 MB VRAM ±104W

frame=149565 fps=521 q=22.0 Lq=21.0 q=23.0 size= 1737670kB time=01:23:05.46 bitrate=2855.3kbits/s dup=15 drop=0 speed=17.4x
video:3214019kB audio:214724kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: unknown
[aac @ 0x55ea220f81c0] Qavg: 7124.271
[aac @ 0x55ea220f3d80] Qavg: 7124.271
[aac @ 0x55ea22109280] Qavg: 7124.271

real    4m47.268s
user    3m17.701s
sys     0m4.718s

Quadro P4000

Disclaimer: this test was made on the same video but on a bit different server with AMD Ryzen 9 3950X and some load on it(but no nvenc load). Driver version: 450.119.03

Specs from datasheet:

  • 8 GB GDDR5 256-bit 243 GB/s
  • Cuda Cores (Pascal): 1792
  • Tensor cores: none
  • RT Cores: none
  • Single-precision performance: 5.2 TFLOPS
  • Total power consumption: 105W
  • Encoding streams: unlimited

Additional power: PCIe 6-pin

PCIe slots: 1

Purchase price: N/A

Fullhd -> Fullhd

~ 234 MB VRAM ±43W

frame=149565 fps=644 q=21.0 Lsize= 1732661kB time=01:23:05.46 bitrate=2847.1kbits/s dup=5 drop=0 speed=21.5x
video:1656867kB audio:71575kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.244139%
[aac @ 0x56269dd00c40] Qavg: 7124.271

real	3m52.991s
user	1m38.497s
sys	0m3.769s

Transcoding Fullhd -> Fullhd + HD + SD

~ 294 MB VRAM ±50W

frame=149565 fps=384 q=21.0 Lq=20.0 q=22.0 size= 1732661kB time=01:23:05.46 bitrate=2847.1kbits/s dup=15 drop=0 speed=12.8x
video:3269555kB audio:214724kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: unknown
[aac @ 0x55cfae2e21c0] Qavg: 7124.271
[aac @ 0x55cfae2ddd80] Qavg: 7124.271
[aac @ 0x55cfae2f3280] Qavg: 7124.271

real	6m30.079s
user	3m59.463s
sys	0m5.913s

Conclusion

For video encoding NVIDIA T1000 (while having only 3 encoding streams) is the perfect cost/fps. If you need some real CUDA and/or Tensor cores - NVIDIA RTX A2000 is your choice. They both doesn't require additional PCIe power connectors.

For home-usage in some little machine learning things you choose between cheaper 8GB RTX 4000 and more expensive 16GB NVIDIA RTX A4000.

References

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment