A practical, end-to-end walkthrough of diagnosing and tuning a NetBird (WireGuard-based) mesh on Azure VMs, based on a real session of measurement → bottleneck identification → tuning → validation.
Environment used in this guide:
- 2× Azure VMs, 4 vCPU each
- NIC: Mellanox ConnectX (mlx5) with Accelerated Networking (SR-IOV)
- Hyper-V hypervisor
- NetBird overlay using kernel WireGuard (wt0 interface)
- CGNAT-range mesh IPs (100.64.0.0/10)
iperf3 -c <peer> -t 30 -P 8 is not a load test. It's a bandwidth/throughput measurement. The distinction matters:
| What iperf3 measures | What iperf3 does NOT measure |
|---|---|
| Maximum sustained TCP bitrate | Request/response latency (p50/p95/p99) |
| Per-flow CPU ceiling | Connection setup cost (TCP handshake, TLS) |
| Path capacity through tunnels | Application-layer behavior (HTTP, gRPC) |
| Retransmits and cwnd evolution | Connection churn / conntrack pressure |
Use iperf3 as the baseline — it tells you the ceiling of the path. Real load testing (wrk2, vegeta, k6, ghz) builds on top of that baseline.
WireGuard has architectural constraints that influence every tuning decision:
- Per-peer crypto serialization: ChaCha20-Poly1305 is processed under a per-peer mutex. Single-flow throughput is bounded by one CPU core, regardless of how many you have.
- Encapsulation overhead: Each packet gets a WireGuard header + Poly1305 auth tag. Small packets (e.g., 64-byte UDP) suffer disproportionately — PPS ceiling drops sharply.
- Single UDP flow on the wire: Even with iperf3 -P 8 (8 TCP streams), the underlay sees one UDP src/dst port pair. NIC RSS hashing lands them all on the same RX queue → same IRQ → same core.
- Virtual interface, no hardware IRQ: wt0 itself has no IRQ. Hardware IRQs are on the underlay NIC (eth0). Tuning IRQ affinity on wt0 is meaningless; tune the underlay.
These four facts dictate the entire tuning playbook below.
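To put numbers on the encapsulation point, the per-packet cost can be worked out directly. The following is a minimal sketch, assuming an IPv4 underlay (20-byte IP header plus 8-byte UDP header), WireGuard's 16-byte data header and 16-byte Poly1305 tag, and padding of the inner packet to a multiple of 16 bytes capped at the tunnel MTU. The function names are illustrative and not part of any tool.

```shell
# Outer (on-the-wire) IPv4 packet size for a given inner IP packet size.
# Assumptions: IPv4 underlay; tunnel MTU defaults to 1420.
wg_outer_size() {
    local inner="$1" mtu="${2:-1420}" padded
    padded=$(( (inner + 15) / 16 * 16 ))               # pad to a 16-byte multiple
    if [ "$padded" -gt "$mtu" ]; then padded=$mtu; fi  # padding never exceeds the tunnel MTU
    echo $(( padded + 16 + 16 + 8 + 20 ))              # + WG header + auth tag + UDP + IPv4
}

# Share of wire bytes that carry the inner packet, as a whole-number percentage.
wg_efficiency_pct() {
    local inner="$1"
    echo $(( inner * 100 / $(wg_outer_size "$inner") ))
}

# 64-byte UDP payload -> 92-byte inner IP packet; 1200-byte payload -> 1228-byte inner packet
for inner in 92 1228; do
    echo "inner=$inner outer=$(wg_outer_size "$inner") efficiency=$(wg_efficiency_pct "$inner")%"
done
```

Under these assumptions a 64-byte UDP payload is roughly 58% efficient on the wire, against roughly 95% for a 1200-byte payload, which is why small-packet tests hit a packets-per-second ceiling long before a bandwidth ceiling.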
Before any tuning, capture multiple dimensions of behavior. A single number lies; a profile tells the truth.
# TCP forward
iperf3 -c <peer> -t 60 -P 8 -J > tcp_fwd.json
# TCP reverse (asymmetric paths matter, especially through NVAs)
iperf3 -c <peer> -t 60 -P 8 -R -J > tcp_rev.json
# UDP — exposes packet loss, jitter, MTU issues
iperf3 -c <peer> -t 60 -u -b 0 -l 1200 -J > udp.json
# Small-packet PPS test — overlays die here
iperf3 -c <peer> -t 60 -u -b 0 -l 64
# Single-flow ceiling — the per-core crypto limit
iperf3 -c <peer> -t 60 -P 1

- The -P 1 vs -P 8 delta tells you per-flow CPU bottleneck severity.
- TCP vs UDP with same payload: gap reveals TCP stack overhead vs raw path capacity.
- The -l 64 vs -l 1200 delta exposes encapsulation overhead pain.
- Forward vs reverse uncovers asymmetric NIC offloads, NAT path differences, or one-direction queue saturation.
Run these simultaneously during a test in separate terminals:
# Per-core CPU breakdown — find which core is pinned and on what
mpstat -P ALL 1
# Hardware IRQ distribution — find which IRQ is hot
watch -n 1 'cat /proc/interrupts | grep -E "CPU|mlx5|eth|virtio"'
# Software IRQ distribution — confirms softirq saturation per core
watch -n 1 'cat /proc/softirqs | grep -E "CPU|NET_RX|NET_TX"'
# WireGuard internals
watch -n 1 'wg show all dump'

The mpstat columns that matter for network tuning:
| Column | Meaning | When it's the bottleneck |
|---|---|---|
| %soft | Time in software interrupts (network stack work) | Single core hits ~90%+ |
| %sys | Kernel mode (syscalls, crypto, copies) | All cores high → CPU-bound globally |
| %usr | User mode (iperf3 itself) | Rare bottleneck for network tests |
| %idle | Free capacity | Low idle on one core = local bottleneck |
Real example from the session:
CPU %usr %sys %soft %idle
0 1.10 47.25 6.59 45.05
1 0.00 5.43 88.04 6.52 ← bottleneck: softirq pinned
2 1.16 46.51 8.14 44.19
3 2.27 46.59 5.68 45.45
CPU1 was at %soft 88% while three other cores sat half-idle. The work was not distributed.
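Spotting the pinned core by eye works for four cores but gets tedious on larger VMs. The sketch below filters mpstat output for cores whose %soft exceeds a threshold; the function name and the default threshold of 80 are assumptions chosen for illustration.

```shell
# Print "<cpu> <%soft>" for every core whose %soft exceeds the threshold.
# Column positions are read from the header row, so extra columns
# (%nice, %irq, timestamps, ...) do not matter.
hot_softirq_cores() {
    local threshold="${1:-80}"
    awk -v thr="$threshold" '
        /%soft/ {
            for (i = 1; i <= NF; i++) {
                if ($i == "CPU")   c = i
                if ($i == "%soft") s = i
            }
            next
        }
        c && s && $c ~ /^[0-9]+$/ && ($s + 0) > thr { print $c, $s }
    '
}

# Typical use during a test (LC_ALL=C keeps "." as the decimal separator):
#   LC_ALL=C mpstat -P ALL 1 1 | hot_softirq_cores 80
```

Fed the sample above, it reports only CPU 1. With a sample count, mpstat also prints an Average block, so a hot core may be listed twice.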
/proc/interrupts shows cumulative counts since boot — useless for live diagnosis. Take a delta during load:
cat /proc/interrupts | grep mlx5_comp > /tmp/irq1.txt
sleep 5
cat /proc/interrupts | grep mlx5_comp > /tmp/irq2.txt
diff /tmp/irq1.txt /tmp/irq2.txt

The IRQ whose count is climbing fastest, and the CPU column it's climbing in, is the live hot spot.
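A plain diff shows which lines changed but leaves the subtraction to the reader. A small helper can compute the per-CPU delta for each IRQ from the two snapshots. This is a sketch: irq_delta is a hypothetical name, and it assumes the snapshot lines keep the /proc/interrupts layout (IRQ number, one count per CPU, then descriptors).

```shell
# Usage: irq_delta <before-file> <after-file> [cpu-count]
# Prints: <irq> <name> <delta-cpu0> <delta-cpu1> ...
irq_delta() {
    local before="$1" after="$2" ncpu="${3:-$(nproc)}"
    awk -v n="$ncpu" '
        # First file: remember every per-CPU count, keyed by IRQ and column.
        NR == FNR { for (i = 2; i <= n + 1; i++) prev[$1, i] = $i; next }
        # Second file: print the difference for each CPU column.
        {
            line = $1 " " $NF
            for (i = 2; i <= n + 1; i++) line = line " " ($i - prev[$1, i])
            print line
        }
    ' "$before" "$after"
}

# Typical use with the snapshots taken above:
#   irq_delta /tmp/irq1.txt /tmp/irq2.txt
```

The column with the largest delta identifies the CPU servicing that IRQ during the sampling window.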
Userspace wireguard-go is 3-5× slower than the kernel module. No tuning recovers this gap.
lsmod | grep wireguard # kernel module loaded?
ip link show wt0 # type should mention wireguard, not tun
ps -eLf | grep wg-crypt # kernel WG spawns wg-crypt-<iface> threads

Apply changes one at a time and re-measure. Multi-change tuning sessions can't attribute wins to specific knobs.
# Check current state
ethtool -l eth0
# Enable max queues
sudo ethtool -L eth0 combined 4 # match vCPU count

Multi-queue alone doesn't help if all queues' IRQs land on one core; see §4.2.
irqbalance often does the wrong thing on virtio_net and mlx5 in cloud VMs (it tends to power-save by collapsing IRQs onto a single core). Disable and pin manually:
sudo systemctl stop irqbalance
sudo systemctl disable irqbalance
# Distribute mlx5 RX/TX IRQs across cores
IRQS=$(grep 'mlx5_comp' /proc/interrupts | awk '{print $1}' | tr -d ':')
CPU=0
for irq in $IRQS; do
    MASK=$(printf '%x' $((1 << CPU)))
    echo $MASK | sudo tee /proc/irq/$irq/smp_affinity
    CPU=$(( (CPU + 1) % $(nproc) ))
done

Note for virtio_net: TX/RX share IRQs (combined queues). Only virtio*-input.N IRQs exist; there are no separate output IRQs. Don't search for them.
Note for mlx5: data-path IRQs appear as mlx5_comp0, mlx5_comp1, etc. The mlx5_async0 IRQ handles control-plane events and should usually be left alone; the loop above matches only mlx5_comp IRQs.
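Before writing to /proc, it can help to preview the round-robin assignment the pinning loop would produce. The following dry-run sketch prints the IRQ-to-mask plan without changing anything; plan_irq_affinity is a hypothetical helper, not an existing tool.

```shell
# Usage: plan_irq_affinity <cpu-count> <irq>...
# Prints "<irq> <hex-mask>" per IRQ without touching /proc.
plan_irq_affinity() {
    local ncpu="$1" cpu=0 irq
    shift
    for irq in "$@"; do
        printf '%s %x\n' "$irq" $(( 1 << cpu ))   # one bit set: exactly one target CPU
        cpu=$(( (cpu + 1) % ncpu ))               # wrap around after the last CPU
    done
}

# Typical use:
#   plan_irq_affinity "$(nproc)" $(grep mlx5_comp /proc/interrupts | awk '{print $1}' | tr -d ':')
```

If there are more IRQs than CPUs, the plan wraps and some cores receive two IRQs, which is expected.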
When RSS can't help (single UDP flow → single queue, as with WireGuard), RPS distributes work in software based on packet hashing:
# Determine CPU count and build mask
CPU_COUNT=$(nproc)
MASK=$(printf '%x' $(( (1 << CPU_COUNT) - 1 )))
# Apply to all RX queues on the underlay NIC
for q in /sys/class/net/eth0/queues/rx-*/rps_cpus; do
    echo $MASK | sudo tee $q
done
# RFS: flow-aware steering, keeps a connection on the same CPU as its socket consumer
echo 32768 | sudo tee /proc/sys/net/core/rps_sock_flow_entries
for q in /sys/class/net/eth0/queues/rx-*/rps_flow_cnt; do
    echo 4096 | sudo tee $q
done
# Also apply to wt0 (the WireGuard virtual interface)
echo $MASK | sudo tee /sys/class/net/wt0/queues/rx-0/rps_cpus
echo 4096 | sudo tee /sys/class/net/wt0/queues/rx-0/rps_flow_cnt

Common pitfall: writing ff for a 4-CPU system fails with Value too large for defined data type. The mask must match nproc:
| CPU count | Mask |
|---|---|
| 2 | 3 |
| 4 | f |
| 8 | ff |
| 16 | ffff |
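The table follows from setting the lowest N bits, where N is the CPU count. A small sketch that computes the mask and refuses counts it cannot handle is shown below. The name cpu_mask and the 32-CPU limit are choices made for this example: larger systems need comma-separated 32-bit groups, which the sketch does not build.

```shell
# Hex bitmask with the lowest <cpu-count> bits set, as expected by rps_cpus.
# Usage: cpu_mask [cpu-count]   (defaults to nproc)
cpu_mask() {
    local ncpu="${1:-$(nproc)}"
    if [ "$ncpu" -lt 1 ] || [ "$ncpu" -gt 32 ]; then
        echo "cpu_mask: unsupported CPU count: $ncpu" >&2
        return 1
    fi
    printf '%x\n' $(( (1 << ncpu) - 1 ))
}
```

Calling cpu_mask with 2, 4, 8, and 16 reproduces the four rows of the table.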
If you do have multiple UDP flows (e.g., multiple WireGuard peers), make sure the NIC hashes on ports too, not just IPs:
# Inspect current
ethtool -n eth0 rx-flow-hash udp4
# Include source/destination ports in hash
sudo ethtool -N eth0 rx-flow-hash udp4 sdfn
# Verify indirection table is even
ethtool -x eth0
sudo ethtool -X eth0 equal $(nproc)

sdfn = source IP (s), destination IP (d), source port (f, bytes 0-1 of the L4 header), destination port (n, bytes 2-3 of the L4 header).
When RPS distributes packets, it queues them onto each target CPU's backlog. If the backlog overflows, packets are silently dropped. This was the root cause of drops appearing after enabling RPS in the session that produced this guide.
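Symptoms: column 2 of /proc/net/softnet_stat climbing during load. The column is hexadecimal; convert it with awk '{print strtonum("0x"$2)}' (strtonum is a gawk extension, so this needs gawk rather than mawk).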
sudo sysctl -w net.core.netdev_max_backlog=10000
sudo sysctl -w net.core.netdev_budget=600
sudo sysctl -w net.core.netdev_budget_usecs=8000

| Knob | Default | Tuned | Purpose |
|---|---|---|---|
| netdev_max_backlog | 1000 | 10000 | Per-CPU backlog queue depth |
| netdev_budget | 300 | 600 | Packets per softirq run |
| netdev_budget_usecs | 2000 | 8000 | Max time per softirq run |
These three knobs are what fixed the drop regression after enabling RPS.
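The summed drop counter hides which CPU's backlog is overflowing. Since each row of softnet_stat corresponds to one online CPU, a per-row view can be sketched without gawk by letting printf do the hex conversion. The function name is hypothetical, and the row index is only an approximation of the CPU number when some CPUs are offline.

```shell
# Usage: softnet_drops [file]
# Prints one line per row of softnet_stat with the drop counter (column 2) in decimal.
softnet_drops() {
    local file="${1:-/proc/net/softnet_stat}" row=0 processed dropped rest
    while read -r processed dropped rest; do
        printf 'row%d dropped=%d\n' "$row" "0x$dropped"   # "0x" prefix makes printf parse hex
        row=$(( row + 1 ))
    done < "$file"
}

# Typical use during a test:
#   watch -n 1 'sh -c ". ./softnet.sh; softnet_drops"'   # assuming the function is saved in softnet.sh
```

A single row climbing while the others stay flat points at one RPS target CPU whose backlog is too shallow or which is oversubscribed.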
If drops appear in ethtool -S as rx_no_buffer, rx_missed, or rx_dropped, the NIC's hardware queue is overflowing before the kernel reads it.
ethtool -g eth0
# If Current << Pre-set maximum:
sudo ethtool -G eth0 rx 4096 tx 4096

Larger socket buffers help saturate the bandwidth-delay product (BDP), but oversized buffers cause memory pressure and packet drops. For 1-3 Gbps WireGuard scenarios, a 16 MB cap is appropriate. Do not blindly set 128 MB; it caused regressions in the session.
sudo sysctl -w net.core.rmem_max=16777216
sudo sysctl -w net.core.wmem_max=16777216
sudo sysctl -w net.ipv4.tcp_rmem="4096 131072 16777216"
sudo sysctl -w net.ipv4.tcp_wmem="4096 16384 16777216"

BBR generally outperforms CUBIC over overlay tunnels (less retransmit-driven backoff). Pair it with the fq qdisc, which kernels before 4.13 require for BBR's pacing:
sudo modprobe tcp_bbr
sudo sysctl -w net.core.default_qdisc=fq
sudo sysctl -w net.ipv4.tcp_congestion_control=bbr
# Verify
sysctl net.ipv4.tcp_congestion_control

MTU is the single largest gain available, and the most overlooked. NetBird default MTU is often 1280; underlay typically supports 1500. WireGuard adds 60 bytes of overhead over an IPv4 underlay (80 bytes over IPv6), so wt0 can usually be 1420, which fits a 1500-byte underlay in either case.
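The arithmetic behind that 1420 figure can be sketched as follows. The helper names are hypothetical. The constants assume standard header sizes: 20-byte IPv4 or 40-byte IPv6 outer header, 8-byte UDP header, 32 bytes of WireGuard header plus authentication tag, and an 8-byte ICMP header for the ping probe.

```shell
# Largest tunnel MTU that fits the underlay without fragmentation.
# Usage: wg_tunnel_mtu <underlay-mtu> [4|6]
wg_tunnel_mtu() {
    local underlay="$1" ipver="${2:-4}" ip_hdr
    case "$ipver" in
        4) ip_hdr=20 ;;
        6) ip_hdr=40 ;;
        *) echo "wg_tunnel_mtu: IP version must be 4 or 6" >&2; return 1 ;;
    esac
    echo $(( underlay - ip_hdr - 8 - 32 ))   # minus outer IP, UDP, WireGuard header + tag
}

# ICMP payload size for `ping -M do -s` that yields an IPv4 packet of exactly <mtu> bytes.
ping_payload_for_mtu() {
    echo $(( $1 - 20 - 8 ))                  # minus IPv4 header and ICMP header
}
```

For a 1500-byte underlay this gives 1440 over IPv4 and 1420 over IPv6, and ping_payload_for_mtu 1400 returns the 1372 used in the first probe.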
# Probe path MTU with don't-fragment ping
ping -M do -s 1372 <peer-ip> # 1372 + 28 (IP + ICMP headers) = 1400-byte packet
ping -M do -s 1412 <peer-ip> # try larger
# Apply
sudo ip link set wt0 mtu 1420
# Or persist via NetBird config
netbird up --mtu 1420

When drops appear, find the layer before changing knobs.
# 1. NIC hardware drops
ethtool -S eth0 | grep -iE 'drop|miss|error|discard' | grep -v ': 0$'
# 2. Kernel softirq backlog drops (column 2 of softnet_stat is hex)
awk 'BEGIN {sum=0} {sum += strtonum("0x"$2)} END {print "softnet_dropped:", sum}' /proc/net/softnet_stat
# 3. Interface-level drops
ip -s link show eth0
ip -s link show wt0
# 4. UDP-layer errors (relevant for WireGuard underlay)
nstat -a | grep -iE 'udp.*err|udp.*drop|noport'

| Drop counter increasing | Likely cause | Fix |
|---|---|---|
| rx_no_buffer, rx_missed (NIC) | Ring buffer too small | ethtool -G eth0 rx 4096 |
| softnet_stat column 2 | RPS backlog overflow | netdev_max_backlog, netdev_budget |
| udp_in_errors (nstat) | UDP socket buffer full | net.core.rmem_max, app-level socket size |
| tx_dropped (interface) | Qdisc queue overflow | Larger qdisc limit, or fq instead of pfifo_fast |
| Conntrack insert_failed | Conntrack table full | nf_conntrack_max |
Always quantify the change. The session pattern:
# Snapshot before
DROP1=$(awk '{sum += strtonum("0x"$2)} END {print sum}' /proc/net/softnet_stat)
NIC1=$(ethtool -S eth0 | awk '/dropped|missed/ {sum+=$2} END {print sum}')
# Run test
iperf3 -c <peer> -t 30 -P 8
# Snapshot after
DROP2=$(awk '{sum += strtonum("0x"$2)} END {print sum}' /proc/net/softnet_stat)
NIC2=$(ethtool -S eth0 | awk '/dropped|missed/ {sum+=$2} END {print sum}')
echo "softnet delta: $((DROP2-DROP1))"
echo "nic delta: $((NIC2-NIC1))"

A few drops in a 30-second test are normal. Thousands per second is a real problem.
Higher CPU usage across all cores after tuning is the desired outcome. When one core was at %soft 96% while three others were at %soft 8%, you weren't using ~75% of available capacity. Distributing the work means more cores are busy, not because there's "more" work, but because previously idle cores are now contributing.
The question to ask is not "is CPU high?" but "is throughput improving and are drops staying low?" If yes, the tuning is working.
WireGuard's per-peer mutex means single-flow throughput has a hard ceiling at one core's crypto rate, no matter what you do at the kernel network layer. RPS distributes packet handling, but not the encryption itself. If you're testing single-peer single-flow and seeing one core saturated on %sys (not %soft), that's the WireGuard crypto thread — and it's a wall.
The only ways past it: more peers, larger MTU (fewer packets per byte), or a faster cipher (not user-tunable in WireGuard).
On mlx5 in Hyper-V/Azure, and on virtio_net in cloud VMs generally, irqbalance tends to collapse IRQs onto a single core for power-save reasons. If you see one CPU's IRQ count vastly higher than others despite multi-queue being enabled, suspect irqbalance first. Stop, disable, and pin manually.
iperf3's -P 8 opens 8 TCP streams. After WireGuard encapsulation, all 8 ride a single UDP flow (one src port, one dst port). RSS sees one 5-tuple and hashes it to one queue. To get true multi-flow on the underlay you need multiple WireGuard peers; extra iperf3 instances on different ports toward the same peer still share that peer's single outer UDP flow.
/proc/interrupts totals are since boot. A "hot" CPU in the totals may simply have been hot once, hours ago. Sample twice during load, diff, and look at the rate of change.
- Baseline: run all five iperf3 scenarios, save JSON.
- Identify: mpstat + IRQ delta + softirq delta during load.
- Hypothesize: name the bottleneck before changing anything.
- Change one knob, re-run a single scenario, compare.
- Watch drops: every change can introduce them. Quantify.
- Record: build a comparison table of (config version × throughput × drops × p99 CPU).
- Stop when: throughput plateau is reached, drops are minimal, and CPU is reasonably distributed across cores.
| Config | Throughput | Top core %soft | softnet drops/30s | NIC drops/30s |
|---|---|---|---|---|
| Defaults | X | 96% (1 core) | 0 | 0 |
| + IRQ affinity | Y | 88% (1 core) | 0 | 0 |
| + RPS/RFS | Z | 35% (avg) | many | low |
| + backlog/budget | Z' | 35% (avg) | 0 | 0 |
| + MTU 1420 | Z'' | similar | 0 | 0 |
This is what you submit as evidence that tuning succeeded.
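Building the table by hand invites transcription errors, so each run can append its own row. A minimal sketch follows; record_result and the results.md file name are assumptions for illustration, and the variables in the usage comment stand for values captured during the run.

```shell
# Usage: record_result <config> <throughput> <top-core-%soft> <softnet-drops> <nic-drops>
# Emits one Markdown table row matching the comparison table's columns.
record_result() {
    printf '| %s | %s | %s | %s | %s |\n' "$1" "$2" "$3" "$4" "$5"
}

# Example (GBPS and TOP_SOFT are placeholders for values you measured):
#   record_result "+ RPS/RFS" "$GBPS" "$TOP_SOFT" "$((DROP2-DROP1))" "$((NIC2-NIC1))" >> results.md
```

Appending one row per configuration change keeps the evidence aligned with the one-knob-at-a-time workflow.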
Manual settings are lost on reboot. To make them stick:
sudo tee /etc/sysctl.d/99-netbird-tuning.conf <<EOF
net.core.netdev_max_backlog = 10000
net.core.netdev_budget = 600
net.core.netdev_budget_usecs = 8000
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 131072 16777216
net.ipv4.tcp_wmem = 4096 16384 16777216
net.core.rps_sock_flow_entries = 32768
net.core.default_qdisc = fq
net.ipv4.tcp_congestion_control = bbr
EOF
sudo sysctl --system

IRQ affinity, RPS masks, and ring buffer sizes are not sysctls; wrap them in a systemd unit:
# /etc/systemd/system/network-tuning.service
[Unit]
Description=Network tuning (RPS, IRQ affinity, ring buffers)
After=network-online.target netbird.service
Wants=network-online.target
[Service]
Type=oneshot
ExecStart=/usr/local/sbin/network-tune.sh
RemainAfterExit=yes
[Install]
WantedBy=multi-user.target

Where /usr/local/sbin/network-tune.sh contains the IRQ-pinning and RPS-write logic.
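The script itself is not shown in the session notes, so the following is only a sketch of building blocks it could be assembled from. The function names and the optional root-directory argument are assumptions; the override exists so the logic can be exercised against a scratch directory before it touches the real /sys and /proc.

```shell
# Hypothetical helpers for /usr/local/sbin/network-tune.sh (runs as root, so no sudo).

# Write an all-CPUs RPS mask to every RX queue of a device.
# Usage: apply_rps <dev> [cpu-count] [sys-root]
apply_rps() {
    local dev="$1" ncpu="${2:-$(nproc)}" sys_root="${3:-/sys}" mask q
    mask=$(printf '%x' $(( (1 << ncpu) - 1 )))
    for q in "$sys_root/class/net/$dev"/queues/rx-*/rps_cpus; do
        [ -e "$q" ] || continue      # glob unmatched: interface not present yet
        echo "$mask" > "$q"
    done
}

# Pin every IRQ whose /proc/interrupts line matches <pattern>, round-robin across CPUs.
# Usage: pin_irqs <pattern> [cpu-count] [proc-root]
pin_irqs() {
    local pattern="$1" ncpu="${2:-$(nproc)}" proc_root="${3:-/proc}" cpu=0 irq
    for irq in $(grep "$pattern" "$proc_root/interrupts" | awk '{print $1}' | tr -d ':'); do
        printf '%x\n' $(( 1 << cpu )) > "$proc_root/irq/$irq/smp_affinity"
        cpu=$(( (cpu + 1) % ncpu ))
    done
}

# Script body, mirroring the manual steps in this guide:
#   pin_irqs mlx5_comp
#   apply_rps eth0
#   apply_rps wt0
#   ethtool -G eth0 rx 4096 tx 4096
```

Because wt0 may appear some time after netbird.service starts, apply_rps skips a missing interface silently; re-running the unit once the tunnel is up covers that case.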
Put --mtu 1420 (or your tested value) into the NetBird daemon config or unit file so it's applied on every netbird up.
# === Live diagnostics ===
mpstat -P ALL 1 # per-core load
watch -n 1 'cat /proc/softirqs | grep NET_RX' # softirq distribution
ethtool -S eth0 | grep -iE 'drop|miss' # NIC drops
awk '{print strtonum("0x"$2)}' /proc/net/softnet_stat # backlog drops
# === Quick tuning sequence ===
sudo systemctl stop irqbalance
sudo ethtool -L eth0 combined $(nproc)
sudo ethtool -G eth0 rx 4096 tx 4096
CPU=0; for i in $(grep mlx5_comp /proc/interrupts | awk '{print $1}' | tr -d ':'); do
    printf '%x' $((1<<CPU)) | sudo tee /proc/irq/$i/smp_affinity
    CPU=$(((CPU+1)%$(nproc)))
done
MASK=$(printf '%x' $(( (1 << $(nproc)) - 1 )))
for q in /sys/class/net/eth0/queues/rx-*/rps_cpus; do echo $MASK | sudo tee $q; done
sudo sysctl -w net.core.netdev_max_backlog=10000
sudo sysctl -w net.core.netdev_budget=600
sudo sysctl -w net.core.netdev_budget_usecs=8000
# === Validation ===
iperf3 -c <peer> -t 30 -P 8

You've hit the end of the road when:
- All cores are 60-90% busy during load (work is distributed)
- Drops are zero or trivial (sub-100 per 30s test)
- wg show reports steady transfer, no handshake churn
- Single-flow throughput equals one core's crypto rate
- Multi-flow throughput approaches the NIC's underlay bandwidth
Beyond this, the next gains require:
- More peers to break per-peer crypto serialization
- Larger MTU on underlay (jumbo frames in datacenter, not generally available in cloud)
- Faster hardware: a newer CPU with wider SIMD (AVX2/AVX-512 speeds up ChaCha20-Poly1305; AES-NI does not apply to WireGuard), or a different VM SKU with more vCPUs
- Different topology: terminating tunnels on dedicated NVAs instead of workload nodes
At that point you're no longer tuning; you're architecting.