This gist will show you how to livestream your Linux desktop to a client via FFmpeg, using a GPU-accelerated video encoder (NVENC- or VAAPI-based).

Low-Latency Live Streaming for your Desktop using ffmpeg and netcat:

Preamble:

In this post I will explore how to stream a video and audio capture from one computer to another using ffmpeg and netcat, with a latency below 100ms, which is good enough for presentations and general-purpose remote display tasks on a local network.

The problem:

Streaming low-latency live content is quite hard, because most software-based video codecs are designed for the best compression, not the best latency. This makes sense: most movies are encoded once and decoded often, so spending more time on encoding than on decoding is a good trade-off.

However, some encoders, particularly the NVENC-based h264_nvenc and hevc_nvenc, are very good at handling low-latency encoding situations (they ship dedicated encoder presets tuned for it), and they are a perfect fit for this problem.

I wrote a quick Python script that outputs the current time in milliseconds, to measure the end-to-end system and encoder latency:

#!/usr/bin/python3
# Print the current time in milliseconds (last four digits only),
# overwriting the same line roughly once per millisecond.
import time
import sys

while True:
    time.sleep(0.001)
    print('%s\r' % (int(time.time() * 1000) % 10000), end='')
    sys.stdout.flush()

Using this script, you can encode and decode the stream on your desktop at the same time. Taking a screenshot with both the original desktop clock and the streamed copy visible side by side gives you the total video encode latency.
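
A rough sketch of that measurement loop, assuming you saved the script above as msclock.py (a hypothetical filename) and have a screenshot tool such as scrot installed:

python3 msclock.py &            # run the millisecond clock on the captured desktop
# ...start the stream, then open the client player next to the original window...
scrot latency_test.png          # the difference between the two visible clocks is your latency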

Solution:

We will be using ffmpeg for the desktop and audio capture on the local system:

ffmpeg -loglevel debug \
    -f x11grab -s 1920x1080 -framerate 60 -i :0.0 \
    -thread_queue_size 1024 -f alsa -ac 2 -ar 44100 -i hw:Loopback,1,0 \
    -c:v h264_nvenc -preset:v llhq \
    -rc:v vbr_minqp -qmin:v 19 \
    -f mpegts - | nc -l -p 9000

To capture without audio:

ffmpeg -loglevel debug \
        -f x11grab -s 1920x1080 -framerate 60 -i :0.0 \
        -c:v h264_nvenc -preset:v llhq  \
        -rc:v vbr_minqp -qmin:v 19  \
        -f mpegts - | nc -l -p 9000

For further NVENC encoder options, check out this guide.
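
You can also dump the full option list for your local build straight from ffmpeg:

ffmpeg -h encoder=h264_nvenc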

Note: Ensure that you select the appropriate video resolution (see the -s option passed to ffmpeg), otherwise the capture will fail. Most commodity laptops have a resolution of 1366x768, and in such a case, you'd set ffmpeg to capture such a display via -s 1366x768.
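
If you're unsure of your current X11 resolution, you can query it before capturing; one possible approach:

xdpyinfo | awk '/dimensions/ {print $2}'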

Re-scaling to 1080p is possible; merely insert a video filter into the encode, as illustrated here for NVENC and here for VAAPI.
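
For the x11grab path shown above, where frames reach the encoder in system memory, a plain software scale filter is the simplest option. A minimal sketch, assuming a 1366x768 panel upscaled to 1080p:

ffmpeg -f x11grab -s 1366x768 -framerate 60 -i :0.0 \
    -vf scale=1920:1080 \
    -c:v h264_nvenc -preset:v llhq \
    -rc:v vbr_minqp -qmin:v 19 \
    -f mpegts - | nc -l -p 9000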

With HEVC: On supported hardware (Nvidia Maxwell Gen 2 and above, and Intel Skylake and above), you may want to capture with the HEVC codec for lower bitrates. See this guide for NVENC, and refer to the ffmpeg -h encoder=hevc_vaapi output to tune the VAAPI encoder. If you follow the HEVC path, ensure that the receiving device also supports HEVC decode in hardware (most modern low-cost ARM development boards, such as the Odroid-C2, support the feature).
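
A minimal HEVC variant of the NVENC capture command above might look like this (hevc_nvenc exposes the same llhq preset on supported builds):

ffmpeg -f x11grab -s 1920x1080 -framerate 60 -i :0.0 \
    -c:v hevc_nvenc -preset:v llhq \
    -rc:v vbr_minqp -qmin:v 19 \
    -f mpegts - | nc -l -p 9000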

A special note on Nvidia Optimus systems:

On Nvidia Optimus systems, capturing the X11 display with ffmpeg as shown above will not work, because all output is wired through the iGPU. If this is your case, use an alternate method such as SSR or OBS Studio, or the Intel QuickSync method (via VAAPI) as shown below. This limit can be overcome if:

(a). The Nvidia GPU is wired to an output panel, as is the case with some HP Z-Book SKUs.
(b). Nvidia Optimus can be safely disabled in the BIOS settings of the notebook, such that only the Nvidia GPU is available for use. Note that this option is often dependent on (a) above.

On such systems, it's best to capture from the first KMS device, as the Intel GPU is the primary card, as shown below:

ffmpeg -loglevel debug \
    -device /dev/dri/card0 -f kmsgrab -i - \
    -thread_queue_size 1024 -f alsa -ac 2 -ar 44100 -i hw:Loopback,1,0 \
    -c:v h264_nvenc -preset:v llhq \
    -rc:v vbr_minqp -qmin:v 19 \
    -f mpegts - | nc -l -p 9000

To capture without audio:

ffmpeg -loglevel debug \
        -device /dev/dri/card0 -f kmsgrab -i - \
        -c:v h264_nvenc -preset:v llhq  \
        -rc:v vbr_minqp -qmin:v 19  \
        -f mpegts - | nc -l -p 9000
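
Note that kmsgrab needs DRM master or the CAP_SYS_ADMIN capability, so it will usually fail for a regular user. One common workaround is to grant the capability to the ffmpeg binary:

sudo setcap cap_sys_admin+ep "$(which ffmpeg)"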

On capturing the Audio stream from a running application:

Note that in the provided examples, we are capturing audio output from a running application on the desktop using the snd_aloop module. Read more on its usage here so you can tune it as you see fit.

Note that you must direct the audio to the new loopback device, otherwise it won't be captured. Use a tool such as pavucontrol to set the default output device on the host.
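
If you prefer the command line over pavucontrol, PulseAudio's pactl can do the same; the sink name below is a placeholder you'd pick from the list:

pactl list short sinks                    # find the loopback sink's name
pactl set-default-sink <loopback_sink_name>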

FFmpeg has an excellent Wiki covering the same audio capture topic here.

For the module configuration options on boot, see the "Setting up modprobe and kmod support" section here.
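
As a quick sketch, loading the module immediately and persisting it across reboots on a systemd-based distribution looks like this:

sudo modprobe snd-aloop
echo "snd-aloop" | sudo tee /etc/modules-load.d/snd-aloop.conf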

On the client:

We will use netcat (nc) and mplayer to view (play back) the generated live stream from the remote host:

nc <host_ip_address> 9000 | mplayer -benchmark -

You can even save the results of the livestream to a file on the client, in any container format supported by ffmpeg, if you so desire. Note that tee writes the raw MPEG-TS bytes as they arrive, so name the file accordingly:

nc <host_ip_address> 9000 | tee file_containing_the_video.ts | mplayer - -benchmark
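
Since tee dumps the raw transport stream, you can losslessly remux the capture into MP4 (or another container) afterwards:

ffmpeg -i file_containing_the_video.ts -c copy file_containing_the_video.mp4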

Note: It is advisable to use the -benchmark flag on the client-side. -framedrop might help as well, especially on slower clients where video decode may present a challenge. Ensure that netcat's specified ports are open on the firewalls on both the local and the remote hosts, and that both the local and the remote netcat instances are using the same port.
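
As an alternative client, ffplay can play the same pipe; a sketch with its low-latency knobs enabled:

nc <host_ip_address> 9000 | ffplay -fflags nobuffer -flags low_delay -probesize 32 -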

Experimental: Using the h264_vaapi encoder on Intel-based hardware:

If you have a supported SKU (a system with an Intel Core or supported Pentium/Atom or Core-M with an integrated Ivybridge, Haswell, Broadwell, Skylake or higher GPU), you may also use the VAAPI-based encoders (h264_vaapi and hevc_vaapi where supported) as you see fit:

ffmpeg -loglevel debug \
-device /dev/dri/card0 -f kmsgrab -i - \
-thread_queue_size 1024 -f alsa -ac 2 -ar 44100 -i hw:Loopback,1,0 \
-vaapi_device /dev/dri/renderD128 -vf 'hwmap=derive_device=vaapi,scale_vaapi=w=1920:h=1080:format=nv12' \
-c:v h264_vaapi -qp:v 19 -bf 4 -threads 4 -aspect 16:9 \
-f mpegts - | nc -l -p 9000

And to capture the screen without audio:

ffmpeg -loglevel debug -thread_queue_size 512  \
    -device /dev/dri/card0 -f kmsgrab -i - \
    -vaapi_device /dev/dri/renderD128 -vf 'hwmap=derive_device=vaapi,scale_vaapi=w=1920:h=1080:format=nv12' \
    -c:v h264_vaapi -qp:v 19 -bf 4 -threads 4 -aspect 16:9 \
    -f mpegts - | nc -l -p 9000

Note that we are capturing the screen from the first active KMS plane, which is driven by the Intel GPU.

Depending on your platform's hardware, your video quality may differ greatly. As a rule of thumb, Sandybridge will give you considerably worse video quality, whereas Haswell and above have greatly improved QuickSync engines and strike a much better balance between quality and encoder performance. In the same vein, encoder latency for QuickSync may be higher than that encountered with the likes of NVENC, as tested on a multi-GPU system at my disposal.

AMD hardware with a VCE block is also supported, on both the radeon(si) and amdgpu drivers. Ensure that you're running the latest linux-firmware package so that your GPU's firmware can be loaded to initialize the VCE block.
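
Before going down the VAAPI path on any vendor's hardware, you can confirm that the driver actually exposes an encode entrypoint with vainfo (from the libva-utils package):

vainfo | grep -i encslice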

Extra tips:

If you have lower network bandwidth and/or a much weaker processor and GPU combination (Intel's case applies here), you can halve the frame rate to 30, albeit with higher latency spikes. Using a software-based encoder implementation (say, libx264) will always result in higher quality at comparable settings, at the cost of a much higher system load.
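
A sketch of that lighter configuration, halving the frame rate and swapping in libx264 with its latency-oriented tune:

ffmpeg -f x11grab -s 1920x1080 -framerate 30 -i :0.0 \
    -c:v libx264 -preset ultrafast -tune zerolatency \
    -f mpegts - | nc -l -p 9000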

If you want to tweak this setup even further, you can switch netcat to UDP on both ends for even lower latency: replace nc -l -p 9000 with nc -l -u -p 9000 on the host, and pass -u on the client too, using the -quiet option of mplayer to see what the decoder is up to:

nc -u <host_ip_address> 9000 | mplayer -quiet - -benchmark

You can also pipe the host's ffmpeg output directly into mplayer on the same machine, bypassing the network entirely, to isolate the encode and decode latency.

Have fun out there :-)

PS: You may refer to netcat's advanced usage options here.


Arcitec commented Dec 1, 2021

This is a seriously awesome guide. Thanks @Brainiarc7 !

Your solution may be the lowest latency one on Linux. You get 100ms. I saw OBS "low latency tuning" guides that only achieve 500ms.

NVIDIA themselves, in their totally native GameStream feature on Windows, achieve a latency between 20ms and 200ms (usually around 50-60ms) with pure hardware and the most efficient path possible.

So the fact that you're getting close to Windows NVIDIA driver speeds, is very impressive!

I read up on thread_queue_size and it seems to adjust how much buffering of incoming/unencoded data FFmpeg does. It needs raising if the computer can't keep up with encoding in realtime. But I wonder if latency is reduced by lowering it, if the computer is very fast. It depends on whether the number means "queue this much data before we encode" or just "the maximum amount of storage space allowed in the buffer for incoming frames". I suspect it may be the latter. If so, latency won't change if we lower this value. But I am gonna try it.

I think the default value is 8, and you're using 1024 here. All I know so far is that it needs to be high enough to avoid the ffmpeg errors that say "Thread message queue blocking; consider raising the thread_queue_size option (current value: 8)".

The other thing I'm gonna have to investigate is how to capture pipewire audio in ffmpeg, since I don't use pulseaudio. But I expect that to be easy! :) If not, I can always use pulseaudio since pipewire contains pulseaudio client/server emulation.


Brainiarc7 commented Dec 1, 2021 via email


Arcitec commented Dec 1, 2021

@Brainiarc7 Have you tried the nc -u flag to force output via UDP for even lower latency?

@Brainiarc7

@Bananaman not yet, I'll definitely be doing that and will report back with results!
