WireGuard / NetBird Mesh Throughput Tuning — A Field Guide

A practical, end-to-end walkthrough of diagnosing and tuning a NetBird (WireGuard-based) mesh on Azure VMs, based on a real session of measurement → bottleneck identification → tuning → validation.

Environment used in this guide:

2× Azure VMs, 4 vCPU each
NIC: Mellanox ConnectX (mlx5) with Accelerated Networking (SR-IOV)
Hyper-V hypervisor
NetBird overlay using kernel WireGuard (wt0 interface)
CGNAT-range mesh IPs (100.64.0.0/10)

1. Conceptual Foundations

1.1 Throughput tests vs. load tests

iperf3 -c <peer> -t 30 -P 8 is not a load test. It's a bandwidth/throughput measurement. The distinction matters:

What iperf3 measures	What iperf3 does NOT measure
Maximum sustained TCP bitrate	Request/response latency (p50/p95/p99)
Per-flow CPU ceiling	Connection setup cost (TCP handshake, TLS)
Path capacity through tunnels	Application-layer behavior (HTTP, gRPC)
Retransmits and cwnd evolution	Connection churn / conntrack pressure

Use iperf3 as the baseline — it tells you the ceiling of the path. Real load testing (wrk2, vegeta, k6, ghz) builds on top of that baseline.

1.2 Why WireGuard is a special case

WireGuard has architectural constraints that influence every tuning decision:

Per-peer crypto serialization: ChaCha20-Poly1305 is processed under a per-peer mutex. Single-flow throughput is bounded by one CPU core, regardless of how many you have.
Encapsulation overhead: Each packet gets a WireGuard header + Poly1305 auth tag. Small packets (e.g., 64-byte UDP) suffer disproportionately — PPS ceiling drops sharply.
Single UDP flow on the wire: Even with iperf3 -P 8 (8 TCP streams), the underlay sees one UDP src/dst port pair. NIC RSS hashing lands them all on the same RX queue → same IRQ → same core.
Virtual interface, no hardware IRQ: wt0 itself has no IRQ. Hardware IRQs are on the underlay NIC (eth0). Tuning IRQ affinity on wt0 is meaningless; tune the underlay.

These four facts dictate the entire tuning playbook below.
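
To put a number on the encapsulation penalty, the per-packet overhead can be worked out by hand. The sketch below assumes an IPv4 underlay and an IPv4/UDP inner packet (28 bytes of inner headers, 28 of outer headers), the 32-byte WireGuard data framing (16-byte header plus 16-byte Poly1305 tag), and WireGuard's padding of the inner packet to a multiple of 16 bytes. The function name wg_wire_bytes is made up for this illustration.

```shell
# wg_wire_bytes <udp-payload-bytes>
# Prints the size of the outer IP packet that carries one inner UDP datagram.
# Assumptions: IPv4 inside and outside, no fragmentation, padding not capped by MTU.
wg_wire_bytes() {
  inner=$(( $1 + 28 ))                    # inner IPv4 (20) + UDP (8) headers
  padded=$(( (inner + 15) / 16 * 16 ))    # WireGuard pads to a 16-byte multiple
  echo $(( padded + 32 + 28 ))            # WG header+tag (32), outer IPv4+UDP (28)
}

for payload in 64 1200; do
  wire=$(wg_wire_bytes "$payload")
  echo "payload=$payload wire=$wire efficiency=$(( payload * 100 / wire ))%"
done
```

Under these assumptions a 64-byte payload occupies 156 bytes on the wire, so well under half of the transmitted bytes are payload, while a 1200-byte payload occupies 1292 bytes and stays above 90%.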

2. Establishing a Baseline

Before any tuning, capture multiple dimensions of behavior. A single number lies; a profile tells the truth.

2.1 Multi-scenario throughput baseline

# TCP forward
iperf3 -c <peer> -t 60 -P 8 -J > tcp_fwd.json

# TCP reverse (asymmetric paths matter, especially through NVAs)
iperf3 -c <peer> -t 60 -P 8 -R -J > tcp_rev.json

# UDP — exposes packet loss, jitter, MTU issues
iperf3 -c <peer> -t 60 -u -b 0 -l 1200 -J > udp.json

# Small-packet PPS test — overlays die here
iperf3 -c <peer> -t 60 -u -b 0 -l 64 -J > udp_small.json

# Single-flow ceiling: the per-core crypto limit
iperf3 -c <peer> -t 60 -P 1 -J > tcp_single.json

2.2 Why each test matters

-P 1 vs -P 8 delta tells you per-flow CPU bottleneck severity.
TCP vs UDP with same payload: gap reveals TCP stack overhead vs raw path capacity.
-l 64 vs -l 1200 delta exposes encapsulation overhead pain.
Forward vs reverse uncovers asymmetric NIC offloads, NAT path differences, or one-direction queue saturation.
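
The deltas above are easier to compare when the numbers are pulled out of the saved JSON. A minimal sketch, assuming jq is installed and that the single-flow run was also saved with -J (tcp_single.json is a hypothetical filename); the helper names are illustrative. The .end.sum_received path applies to TCP runs.

```shell
# tcp_gbps <iperf3-json-file>: receiver-side throughput of a TCP run, in Gbit/s
tcp_gbps() {
  jq -r '.end.sum_received.bits_per_second / 1e9' "$1"
}

# scaling <single-flow-gbps> <multi-flow-gbps>: how much -P 8 gains over -P 1
scaling() {
  awk -v s="$1" -v m="$2" 'BEGIN { printf "%.2fx\n", m / s }'
}

# Example (file names are assumptions):
#   scaling "$(tcp_gbps tcp_single.json)" "$(tcp_gbps tcp_fwd.json)"
```

A ratio close to 1x means the extra streams add nothing, which points at a bottleneck shared by all flows rather than a per-flow limit.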

3. Diagnosing the Bottleneck

3.1 The four primary observation points

Run these simultaneously during a test in separate terminals:

# Per-core CPU breakdown — find which core is pinned and on what
mpstat -P ALL 1

# Hardware IRQ distribution — find which IRQ is hot
watch -n 1 'cat /proc/interrupts | grep -E "CPU|mlx5|eth|virtio"'

# Software IRQ distribution — confirms softirq saturation per core
watch -n 1 'cat /proc/softirqs | grep -E "CPU|NET_RX|NET_TX"'

# WireGuard internals
watch -n 1 'wg show all dump'

3.2 Reading `mpstat`

The columns that matter for network tuning:

Column	Meaning	When it's the bottleneck
`%soft`	Time in software interrupts (network stack work)	Single core hits ~90%+
`%sys`	Kernel mode (syscalls, crypto, copies)	All cores high → CPU-bound globally
`%usr`	User mode (iperf3 itself)	Rare bottleneck for network tests
`%idle`	Free capacity	Low idle on one core = local bottleneck

Real example from the session:

CPU    %usr    %sys    %soft   %idle
0      1.10    47.25   6.59    45.05
1      0.00    5.43    88.04   6.52    ← bottleneck: softirq pinned
2      1.16    46.51   8.14    44.19
3      2.27    46.59   5.68    45.45

CPU1 was at %soft 88% while three other cores sat half-idle. The work was not distributed.
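
Spotting the pinned core by eye gets tedious across many runs. The filter below is a small sketch that reads mpstat output and prints only the cores whose %soft crosses a threshold; the function name and the default of 80 are assumptions, not values from the session.

```shell
# hot_softirq_cores [threshold]
# Reads mpstat output on stdin and prints cores whose %soft is at or above the
# threshold (default 80). Column positions are taken from the header line, so
# 12-hour and 24-hour timestamp layouts both work.
hot_softirq_cores() {
  awk -v limit="${1:-80}" '
    /%soft/ {                                 # header row: locate the columns
      for (i = 1; i <= NF; i++) {
        if ($i == "CPU")   cpu_col  = i
        if ($i == "%soft") soft_col = i
      }
      next
    }
    $1 == "Average:" { next }                 # skip the summary block
    cpu_col && $cpu_col ~ /^[0-9]+$/ && $soft_col + 0 >= limit {
      print "cpu" $cpu_col, "%soft=" $soft_col
    }
  '
}

# Example: mpstat -P ALL 1 5 | hot_softirq_cores 80
```

Fed the four rows from the example above, it prints only the CPU1 line.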

3.3 IRQ delta is more useful than IRQ totals

/proc/interrupts shows cumulative counts since boot — useless for live diagnosis. Take a delta during load:

cat /proc/interrupts | grep mlx5_comp > /tmp/irq1.txt
sleep 5
cat /proc/interrupts | grep mlx5_comp > /tmp/irq2.txt
diff /tmp/irq1.txt /tmp/irq2.txt

The IRQ whose count is climbing fastest, and the CPU column it's climbing in, is the live hot spot.
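
diff only shows which lines changed, not by how much. A minimal sketch that subtracts the two snapshots column by column, assuming the files hold raw /proc/interrupts lines as captured above; irq_delta is a hypothetical helper name.

```shell
# irq_delta <before-file> <after-file> <ncpus>
# Prints per-CPU interrupt deltas for every IRQ line in the second snapshot.
# Expects lines as they appear in /proc/interrupts: IRQ number, one count per
# CPU, then the chip and name columns.
irq_delta() {
  awk -v n="$3" '
    NR == FNR {                               # first file: remember the counts
      for (i = 2; i <= n + 1; i++) before[$1, i] = $i
      next
    }
    {
      printf "%s %s", $1, $NF                 # IRQ number and its name
      for (i = 2; i <= n + 1; i++)
        printf " cpu%d=+%d", i - 2, $i - before[$1, i]
      printf "\n"
    }
  ' "$1" "$2"
}

# Example: irq_delta /tmp/irq1.txt /tmp/irq2.txt "$(nproc)"
```

The largest delta on each line identifies the CPU that is servicing that IRQ right now.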

3.4 Verify WireGuard mode (kernel vs userspace)

Userspace wireguard-go is 3-5× slower than the kernel module. No tuning recovers this gap.

lsmod | grep wireguard         # kernel module loaded?
ip link show wt0               # type should mention wireguard, not tun
ps -eLf | grep wg-crypt        # kernel WG spawns wg-crypt-<iface> threads

4. The Tuning Playbook

Apply changes one at a time and re-measure. Multi-change tuning sessions can't attribute wins to specific knobs.

4.1 NIC multi-queue

# Check current state
ethtool -l eth0

# Enable max queues
sudo ethtool -L eth0 combined 4   # match vCPU count

Multi-queue alone doesn't help if all queues' IRQs land on one core — see §4.2.

4.2 IRQ affinity (manual pinning)

irqbalance often does the wrong thing on virtio_net and mlx5 in cloud VMs (it tends to power-save by collapsing IRQs onto a single core). Disable and pin manually:

sudo systemctl stop irqbalance
sudo systemctl disable irqbalance

# Distribute mlx5 RX/TX IRQs across cores
IRQS=$(grep 'mlx5_comp' /proc/interrupts | awk '{print $1}' | tr -d ':')
CPU=0
for irq in $IRQS; do
  MASK=$(printf '%x' $((1 << CPU)))
  echo $MASK | sudo tee /proc/irq/$irq/smp_affinity
  CPU=$(( (CPU + 1) % $(nproc) ))
done

Note for virtio_net: TX/RX share IRQs (combined queues). Only virtio*-input.N IRQs exist; there are no separate output IRQs. Don't search for them.

Note for mlx5: data-path IRQs appear as mlx5_comp0, mlx5_comp1, etc. The control-plane IRQs (mlx5_async0 and similar) should usually be left alone.

4.3 RPS (Receive Packet Steering) — software-level distribution

When RSS can't help (single UDP flow → single queue, as with WireGuard), RPS distributes work in software based on packet hashing:

# Determine CPU count and build mask
CPU_COUNT=$(nproc)
MASK=$(printf '%x' $(( (1 << CPU_COUNT) - 1 )))

# Apply to all RX queues on the underlay NIC
for q in /sys/class/net/eth0/queues/rx-*/rps_cpus; do
  echo $MASK | sudo tee $q
done

# RFS — flow-aware steering, keeps connection on same CPU as its socket consumer
echo 32768 | sudo tee /proc/sys/net/core/rps_sock_flow_entries
for q in /sys/class/net/eth0/queues/rx-*/rps_flow_cnt; do
  echo 4096 | sudo tee $q
done

# Also apply to wt0 (the WireGuard virtual interface)
echo $MASK | sudo tee /sys/class/net/wt0/queues/rx-0/rps_cpus
echo 4096 | sudo tee /sys/class/net/wt0/queues/rx-0/rps_flow_cnt

Common pitfall: writing ff for a 4-CPU system fails with Value too large for defined data type. The mask must match nproc:

CPU count	Mask
2	`3`
4	`f`
8	`ff`
16	`ffff`
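
The table follows from setting one bit per CPU. A small sketch of that calculation; rps_mask is a hypothetical helper name. Beyond 32 CPUs the kernel expects the mask as comma-separated 32-bit words, which the plain shift-and-subtract form does not produce, so the helper emits that layout.

```shell
# rps_mask <ncpus>: hex CPU mask with the lowest <ncpus> bits set,
# written as comma-separated 32-bit words (most significant word first).
rps_mask() {
  n=$1
  words=$(( n / 32 ))            # number of completely filled 32-bit words
  rem=$(( n % 32 ))              # bits left over for the top word
  mask=""
  [ "$rem" -gt 0 ] && mask=$(printf '%x' $(( (1 << rem) - 1 )))
  while [ "$words" -gt 0 ]; do
    mask="${mask:+$mask,}ffffffff"
    words=$(( words - 1 ))
  done
  printf '%s\n' "$mask"
}

rps_mask 4     # f
rps_mask 16    # ffff
rps_mask 48    # ffff,ffffffff
```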

4.4 RSS hash configuration

If you do have multiple UDP flows (e.g., multiple WireGuard peers), make sure the NIC hashes on ports too, not just IPs:

# Inspect current
ethtool -n eth0 rx-flow-hash udp4

# Include source/destination ports in hash
sudo ethtool -N eth0 rx-flow-hash udp4 sdfn

# Inspect the indirection table, then spread it evenly across the queues
ethtool -x eth0
sudo ethtool -X eth0 equal $(nproc)   # assumes one queue per vCPU (see §4.1)

sdfn = Source IP, Destination IP, source port (f: bytes 0-1 of the L4 header), destination port (n: bytes 2-3 of the L4 header).

4.5 Backlog and budget — preventing softirq drops

When RPS distributes packets, it queues them onto each target CPU's backlog. If the backlog overflows, packets are silently dropped. This was the root cause of drops appearing after enabling RPS in the session that produced this guide.

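Symptoms: column 2 of /proc/net/softnet_stat climbing during load. The values are hex; convert with gawk '{print strtonum("0x"$2)}' (strtonum is a gawk extension, so the default mawk on Debian/Ubuntu rejects it).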

sudo sysctl -w net.core.netdev_max_backlog=10000
sudo sysctl -w net.core.netdev_budget=600
sudo sysctl -w net.core.netdev_budget_usecs=8000

Knob	Default	Tuned	Purpose
`netdev_max_backlog`	1000	10000	Per-CPU backlog queue depth
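`netdev_budget`	300	600	Max packets per softirq run
`netdev_budget_usecs`	2000	8000	Max time per softirq run (µs)

These three knobs are what fixed the drop regression after enabling RPS. Check the running values with sysctl before changing them: the kernel derives the netdev_budget_usecs default from its tick rate (two ticks), so it is 2000 only at HZ=1000 and is already 8000 on HZ=250 kernels.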
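
Per-CPU numbers show whether the backlog is overflowing on every RPS target or only on one. The sketch below decodes /proc/net/softnet_stat without awk extensions; softnet_drops is an illustrative name, and it assumes rows appear in CPU order. Column 3 (time_squeeze) counts softirq runs that ended with work still pending because the budget or time limit ran out, which is the signal that netdev_budget is too small.

```shell
# softnet_drops [file]
# Prints one line per row of softnet_stat with the dropped (column 2) and
# time_squeeze (column 3) counters converted from hex to decimal.
softnet_drops() {
  cpu=0
  while read -r _processed dropped squeezed _rest; do
    printf 'cpu%d dropped=%d squeezed=%d\n' "$cpu" "0x$dropped" "0x$squeezed"
    cpu=$(( cpu + 1 ))
  done < "${1:-/proc/net/softnet_stat}"
}

# Example: watch -n 1 the output during a test, or snapshot before and after:
#   softnet_drops > /tmp/softnet_before.txt
```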

4.6 NIC ring buffers

If drops appear in ethtool -S as rx_no_buffer, rx_missed, or rx_dropped (counter names vary by driver; on mlx5 look for rx_out_of_buffer), the NIC's hardware queue is overflowing before the kernel reads it.

ethtool -g eth0
# If Current << Pre-set maximum:
sudo ethtool -G eth0 rx 4096 tx 4096

4.7 TCP buffer sizing

Larger buffers help bandwidth-delay product (BDP) saturation, but too large causes memory pressure and packet drops. For 1-3 Gbps WireGuard scenarios, 16 MB cap is appropriate. Do not blindly set 128 MB — it caused regressions in the session.

sudo sysctl -w net.core.rmem_max=16777216
sudo sysctl -w net.core.wmem_max=16777216
sudo sysctl -w net.ipv4.tcp_rmem="4096 131072 16777216"
sudo sysctl -w net.ipv4.tcp_wmem="4096 16384 16777216"
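
The cap can be sanity-checked against the bandwidth-delay product, BDP = bandwidth × RTT. A small sketch; bdp_bytes is an illustrative helper and the RTT values are placeholders, not measurements from the session.

```shell
# bdp_bytes <gbit-per-s> <rtt-ms>: bytes in flight needed to keep the path full
bdp_bytes() {
  awk -v g="$1" -v r="$2" 'BEGIN { printf "%d\n", g * 1e9 / 8 * r / 1000 }'
}

bdp_bytes 3 10    # 3 Gbit/s at 10 ms RTT -> 3750000
bdp_bytes 1 40    # 1 Gbit/s at 40 ms RTT -> 5000000
```

Both results sit well below 16777216, which is why a 16 MB ceiling leaves headroom at these rates without reserving memory the path cannot use.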

4.8 BBR + fq

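BBR generally outperforms CUBIC over overlay tunnels (less retransmit-driven backoff). Pair it with the fq qdisc: kernels before 4.13 require fq for BBR's pacing, and later kernels fall back to TCP-internal pacing but still behave best with fq. Note that default_qdisc only affects qdiscs created after the change, so existing interfaces keep their current qdisc until it is replaced (tc qdisc replace) or the host reboots: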

sudo modprobe tcp_bbr
sudo sysctl -w net.core.default_qdisc=fq
sudo sysctl -w net.ipv4.tcp_congestion_control=bbr

# Verify
sysctl net.ipv4.tcp_congestion_control

4.9 MTU sizing

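The single largest gain available, and the most overlooked. NetBird default MTU is often 1280; underlay typically supports 1500. WireGuard adds 60 bytes of overhead over an IPv4 underlay (20 IP + 8 UDP + 32 WireGuard) and 80 over IPv6, so wt0 can usually be 1420, the conventional value that is safe for both.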

# Probe path MTU with don't-fragment ping
ping -M do -s 1372 <peer-ip>   # 1372 ICMP payload + 28 header bytes = 1400-byte IP packet
ping -M do -s 1412 <peer-ip>   # try larger (1440-byte IP packet)

# Apply
sudo ip link set wt0 mtu 1420

# Or persist via NetBird config
netbird up --mtu 1420
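
The MTU arithmetic is worth writing down once so the probe sizes are not guessed. A minimal sketch, assuming the usual header sizes (IPv4 20, IPv6 40, UDP 8, ICMP 8, WireGuard 32); wg_mtu and ping_size are hypothetical helper names.

```shell
# wg_mtu <underlay-mtu> <4|6>: largest tunnel MTU that avoids outer fragmentation
wg_mtu() {
  case "$2" in
    4) echo $(( $1 - 60 )) ;;   # outer IPv4 20 + UDP 8 + WireGuard 32
    6) echo $(( $1 - 80 )) ;;   # outer IPv6 40 + UDP 8 + WireGuard 32
    *) echo "usage: wg_mtu <underlay-mtu> <4|6>" >&2; return 1 ;;
  esac
}

# ping_size <mtu>: ICMP payload for `ping -M do -s` that exactly fills <mtu> over IPv4
ping_size() {
  echo $(( $1 - 28 ))           # IPv4 20 + ICMP 8
}

wg_mtu 1500 4                   # 1440
wg_mtu 1500 6                   # 1420
ping_size "$(wg_mtu 1500 6)"    # 1392: largest payload that fits a 1420 tunnel
```

After raising the MTU, a don't-fragment ping of that size across the overlay should succeed, and one byte more should fail.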

5. Drop Diagnosis Reference

When drops appear, find the layer before changing knobs.

5.1 The four drop counters that matter

# 1. NIC hardware drops
ethtool -S eth0 | grep -iE 'drop|miss|error|discard' | grep -v ': 0$'

# 2. Kernel softirq backlog drops (column 2 of softnet_stat is hex; strtonum needs gawk)
gawk 'BEGIN {sum=0} {sum += strtonum("0x"$2)} END {print "softnet_dropped:", sum}' /proc/net/softnet_stat

# 3. Interface-level drops
ip -s link show eth0
ip -s link show wt0

# 4. UDP-layer errors (relevant for WireGuard underlay)
nstat -a | grep -iE 'udp.*err|udp.*drop|noport'

5.2 Mapping drop location → fix

Drop counter increasing	Likely cause	Fix
`rx_no_buffer`, `rx_missed` (NIC)	Ring buffer too small	`ethtool -G eth0 rx 4096`
`softnet_stat` column 2	RPS backlog overflow	`netdev_max_backlog`, `netdev_budget`
`UdpInErrors`, `UdpRcvbufErrors` (nstat)	UDP socket buffer full	`net.core.rmem_max`, app-level socket size
`tx_dropped` (interface)	Qdisc queue overflow	Larger qdisc limit, or `fq` instead of `pfifo_fast`
Conntrack `insert_failed`	Conntrack table full	`nf_conntrack_max`

5.3 Before/after quantification

Always quantify the change. The session pattern:

# Snapshot before (strtonum needs gawk; sum+0 prints 0 when nothing matches)
DROP1=$(gawk '{sum += strtonum("0x"$2)} END {print sum+0}' /proc/net/softnet_stat)
NIC1=$(ethtool -S eth0 | awk '/dropped|missed/ {sum+=$2} END {print sum+0}')

# Run test
iperf3 -c <peer> -t 30 -P 8

# Snapshot after
DROP2=$(gawk '{sum += strtonum("0x"$2)} END {print sum+0}' /proc/net/softnet_stat)
NIC2=$(ethtool -S eth0 | awk '/dropped|missed/ {sum+=$2} END {print sum+0}')

echo "softnet delta: $((DROP2-DROP1))"
echo "nic delta:     $((NIC2-NIC1))"

A few drops in a 30-second test is normal. Thousands per second is a real problem.

6. Counterintuitive Observations Worth Internalizing

6.1 "CPU usage went up but throughput also went up — is that good?"

Yes, this is the desired outcome. When one core was at %soft 96% while three others were at %soft 8%, you weren't using ~75% of available capacity. Distributing the work means more cores are busy — not because there's "more" work, but because previously-idle cores are now contributing.

The question to ask is not "is CPU high?" but "is throughput improving and are drops staying low?" If yes, the tuning is working.

6.2 RPS doesn't always help WireGuard

WireGuard's per-peer mutex means single-flow throughput has a hard ceiling at one core's crypto rate, no matter what you do at the kernel network layer. RPS distributes packet handling, but not the encryption itself. If you're testing single-peer single-flow and seeing one core saturated on %sys (not %soft), that's the WireGuard crypto thread — and it's a wall.

The only ways past it: more peers, larger MTU (fewer packets per byte), or a faster cipher (not user-tunable in WireGuard).

6.3 `irqbalance` is often actively harmful

On mlx5 in Hyper-V/Azure, and on virtio_net in cloud VMs generally, irqbalance tends to collapse IRQs onto a single core for power-save reasons. If you see one CPU's IRQ count vastly higher than others despite multi-queue being enabled, suspect irqbalance first. Stop, disable, and pin manually.

6.4 `-P 8` does not mean 8 parallel flows on the wire

iperf3's -P 8 opens 8 TCP streams. After WireGuard encapsulation, all 8 ride a single UDP flow (one src port, one dst port). RSS sees one 5-tuple and hashes it to one queue. To get true multi-flow on the underlay you need multiple WireGuard peers, since each peer pair has its own outer UDP 5-tuple. Running more iperf3 instances on different ports does not help: their traffic still enters the same tunnel and leaves as the same outer flow.

6.5 IRQ counts are cumulative — always take a delta

/proc/interrupts totals are since boot. A "hot" CPU in the totals may simply have been hot once, hours ago. Sample twice during load, diff, and look at the rate of change.

7. Recommended Workflow

Baseline: run all five iperf3 scenarios, save JSON.
Identify: mpstat + IRQ delta + softirq delta during load.
Hypothesize: name the bottleneck before changing anything.
Change one knob, re-run a single scenario, compare.
Watch drops: every change can introduce them. Quantify.
Record: build a comparison table of (config version × throughput × drops × top-core %soft).
Stop when: throughput plateau is reached, drops are minimal, and CPU is reasonably distributed across cores.

7.1 Sample comparison table

Config	Throughput	Top core %soft	softnet drops/30s	NIC drops/30s
Defaults	X	96% (1 core)	0	0
+ IRQ affinity	Y	88% (1 core)	0	0
+ RPS/RFS	Z	35% (avg)	many	low
+ backlog/budget	Z'	35% (avg)	0	0
+ MTU 1420	Z''	similar	0	0

This is what you submit as evidence that tuning succeeded.

8. Persistence (Beyond the Session)

Manual settings are lost on reboot. To make them stick:

8.1 Sysctl

sudo tee /etc/sysctl.d/99-netbird-tuning.conf <<EOF
net.core.netdev_max_backlog = 10000
net.core.netdev_budget = 600
net.core.netdev_budget_usecs = 8000
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 131072 16777216
net.ipv4.tcp_wmem = 4096 16384 16777216
net.core.rps_sock_flow_entries = 32768
net.core.default_qdisc = fq
net.ipv4.tcp_congestion_control = bbr
EOF
sudo sysctl --system

8.2 RPS, IRQ affinity, ring buffers

These are not sysctl — wrap them in a systemd unit:

# /etc/systemd/system/network-tuning.service
[Unit]
Description=Network tuning (RPS, IRQ affinity, ring buffers)
After=network-online.target netbird.service
Wants=network-online.target

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/network-tune.sh
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target

Where /usr/local/sbin/network-tune.sh contains the IRQ-pinning and RPS-write logic.
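
As a starting point, the RPS portion of such a script might look like the sketch below. The function name and the SYSFS override are assumptions made for illustration (the override lets the function be dry-run against a scratch directory); the IRQ-pinning loop from §4.2 and the ethtool -G call from §4.6 would sit alongside it. Because the unit runs as root, no sudo is needed.

```shell
#!/bin/sh
# Sketch of the RPS part of a hypothetical network-tune.sh.
# SYSFS is an assumed override for dry runs; leave it unset on a real host.
SYSFS="${SYSFS:-/sys}"

# apply_rps <iface> <hex-mask> <flow-count>
# Writes the RPS CPU mask and RFS flow count to every RX queue of <iface>.
apply_rps() {
  iface=$1 mask=$2 flows=$3
  for dir in "$SYSFS/class/net/$iface/queues"/rx-*; do
    [ -d "$dir" ] || continue          # glob did not match: no RX queues
    echo "$mask"  > "$dir/rps_cpus"
    echo "$flows" > "$dir/rps_flow_cnt"
  done
}

# In the real script, calls like these would follow (values from §4.3):
#   apply_rps eth0 f 4096
#   apply_rps wt0  f 4096
```

Keeping each tuning step in its own function makes it possible to rerun a single step by hand when comparing configurations.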

8.3 NetBird MTU

Put --mtu 1420 (or your tested value) into the NetBird daemon config or unit file so it's applied on every netbird up.

9. Quick-Reference Cheat Sheet

# === Live diagnostics ===
mpstat -P ALL 1                                    # per-core load
watch -n 1 'cat /proc/softirqs | grep NET_RX'      # softirq distribution
ethtool -S eth0 | grep -iE 'drop|miss'             # NIC drops
gawk '{print strtonum("0x"$2)}' /proc/net/softnet_stat  # backlog drops (strtonum needs gawk)

# === Quick tuning sequence ===
sudo systemctl stop irqbalance
sudo ethtool -L eth0 combined $(nproc)
sudo ethtool -G eth0 rx 4096 tx 4096

CPU=0; for i in $(grep mlx5_comp /proc/interrupts | awk '{print $1}' | tr -d ':'); do
  printf '%x' $((1<<CPU)) | sudo tee /proc/irq/$i/smp_affinity
  CPU=$(((CPU+1)%$(nproc)))
done

MASK=$(printf '%x' $(( (1 << $(nproc)) - 1 )))
for q in /sys/class/net/eth0/queues/rx-*/rps_cpus; do echo $MASK | sudo tee $q; done

sudo sysctl -w net.core.netdev_max_backlog=10000
sudo sysctl -w net.core.netdev_budget=600
sudo sysctl -w net.core.netdev_budget_usecs=8000

# === Validation ===
iperf3 -c <peer> -t 30 -P 8

10. When Tuning Has Reached Its Limit

You've hit the end of the road when:

All cores are 60-90% busy during load (work is distributed)
Drops are zero or trivial (sub-100 per 30s test)
wg show reports steady transfer, no handshake churn
Single-flow throughput equals one core's crypto rate
Multi-flow throughput approaches the NIC's underlay bandwidth

Beyond this, the next gains require:

More peers to break per-peer crypto serialization
Larger MTU on underlay (jumbo frames in datacenter, not generally available in cloud)
Faster hardware: a newer CPU with faster SIMD (AVX2/AVX-512 speeds up ChaCha20-Poly1305; AES-NI does not apply to WireGuard), or a different VM SKU with more vCPU
Different topology — terminating tunnels on dedicated NVAs instead of workload nodes

At that point you're no longer tuning; you're architecting.
