Or: How I Reproduced the Problem on x86, Tried to Load the Missing Modules on the Real Device, and What That Tells Us About Ubiquiti's Kernel
Ubiquiti markets the Enterprise Fortress Gateway (EFG) as a 25-gigabit-class router. The product page lists two 25 GbE SFP28 ports for WAN/LAN, and Ubiquiti positions the device as a flagship for medium and large enterprise deployments. Its silicon — a Marvell Octeon CN9670 — supports hardware-accelerated forwarding through purpose-built network engines (NIX) that should sustain tens of millions of packets per second. The UDM Beast, Ubiquiti's next-generation gateway, pairs a Marvell Octeon CN10K SoC (with ARM Neoverse N2 cores) with a dedicated Marvell Prestera-class switching ASIC accessed via PCIe — capabilities that, properly used, would offload most of the per-packet forwarding work into hardware.
In practice, real-world enterprise deployments report:
- Inter-VLAN routing: ~1–1.5 Gbps single-stream, regardless of how fast the upstream link is
- PPPoE WAN throughput: ~2–3 Gbps single-stream on 10 Gbps fiber connections, where the ISP requires PPPoE authentication
- Total aggregate throughput: well below the marketed 25 Gbps WAN/LAN figures
- With IDS/IPS enabled: Ubiquiti markets a "12 Gbps with IPS" rate, claimed against internet (LAN→WAN) traffic. In practice this is also unachievable on single-stream measurements — LAN→WAN traffic crosses a subnet boundary, hits NAT, and traverses the same single-core kernel forwarding path the writeup documents for inter-VLAN traffic, plus the additional cost of NAT mangling, plus PPPoE encapsulation if applicable. The 12 Gbps figure is achievable only as aggregate throughput across many parallel TCP flows with TSO/checksum offload and CPU work spread across cores via RSS. Single-stream LAN→WAN throughput with IPS enabled is bounded by the same per-core kernel forwarding ceiling as inter-VLAN — typically 1–2 Gbps.
This document analyzes both bottlenecks. It reproduces both problems in a controlled lab environment on x86 hardware, identifies the specific software architectural choices that cause them, demonstrates fixes whose effects can be measured to a precision of a few hundred Mbps, and documents in detail what happened when we attempted to apply the most surgical of those fixes — adding the missing nftables flowtable module — to a real production EFG.
We will show that the EFG's stock configuration delivers between 5% and 15% of the throughput its silicon is capable of. We will show three independent fixes that together can push it from ~1 Gbps single-stream to over 25 Gbps single-stream — without adding hardware. Two of those fixes are pure software configuration changes; the third is a kernel module that exists in mainline Linux and is shipped by Marvell themselves, but is not present in Ubiquiti's kernel build.
We then attempt to install the missing module on a real EFG. Building it against vanilla Linux 5.15.72 produces a kernel module with byte-perfect vermagic — and crashes the device on load. Building it against Marvell's complete published OCTEON BSP source from the Yocto Project produces another byte-perfect module that crashes at the identical function offset. Symbol-level analysis of the running EFG kernel reveals 6,357 unique symbols that exist in neither vanilla Linux nor Marvell's complete public BSP. These include conntrack extensions for proprietary DPI integration (nf_ct_ext_dpi_destroy, nf_conntrack_dpi_init), a 116-symbol tdts namespace exposing kernel internals to a closed-source Trend Micro DPI engine, and significant hardware abstraction additions.
To rule out the possibility that the EFG is an outlier, the same diagnostic methodology is applied to a second Ubiquiti gateway — the UDM Beast, with newer Marvell Octeon CN10K silicon (ARM Neoverse N2 cores), a kernel newer by 18 months (6.6.46), and a dedicated Marvell switching ASIC accessed via PCIe. The result: the same architectural pattern across silicon generations. The ASIC is physically present and processes 1.27 billion intra-VLAN packets, but switchdev offload is hard-disabled across every interface (hw-tc-offload: off [fixed]), tc filter rules report not_in_hw, and 67 GB of WAN traffic has gone through a CPU-only software path mirrored to an ifb device. Inter-VLAN routing, like on the EFG, runs entirely in the kernel software stack. A faster CPU moves the floor up; the architecture is unchanged.
The conclusion: Ubiquiti has built a substantially modified kernel that they have not released sources for, and Ubiquiti's open-source download page no longer exists. Their GitHub organization contains no firmware or kernel sources. Closed-source tdts and t_miner modules link directly against kernel symbols and operate as derived works of the kernel. This appears to violate GPL-2.0, and continues a pattern: Ubiquiti was publicly accused of GPL violations in 2015 (resolved after sustained pressure) and again in 2019.
The performance issues in this document have been reported to Ubiquiti through their support channel for approximately one year, including specific implementation guidance pointing to Marvell's published DPDK reference architecture; no substantive engineering response has been received. Separately, security findings about the EFG's deliberate absence of secure boot, module signing, and integrity protection were submitted through Ubiquiti's HackerOne bug bounty program and rejected on the grounds that the attacker would require network access — a rationale that does not survive scrutiny when applied to a network gateway.
This is therefore both a technical analysis and a software-license compliance analysis, and it is published only after the channels designed for vendor engagement have failed to produce a response.
This document was produced collaboratively with an AI assistant (Anthropic's Claude). The AI's role was to help structure findings, draft prose, suggest diagnostic commands, and consolidate the final write-up. All measurements, kernel builds, module load attempts, packet captures, EFG and UDM Beast diagnostics, lab tests, and reproductions described here were performed by me on real hardware and in real VMs that I personally configured and operated. The hardware exists, the commands were run, the crashes happened on my devices, and the outputs are real.
AI assistance does not eliminate human error — and using it can introduce new sources of error when the AI fills in plausible-sounding details that don't match reality. I have done my best to validate the technical claims in this document against my own measurements, kernel source, and external references. A factual error has already been caught and corrected (an earlier revision incorrectly referred to nf_flow_table_pppoe as a separate kernel module — it is not; PPPoE flowtable handling is inline within nf_flow_table.ko). That correction was made because a reader pushed back (in a way that was neither friendly nor constructive, as usual, but I digress), and I'm grateful they did.
If you spot anything else that looks wrong — a command output that doesn't match what your system shows, a kernel-internal claim that contradicts the source you can read on kernel.org, a misidentified piece of silicon, or anything else — please tell me. I'd rather revise this in public than leave a mistake standing or defend a misconception out of fanboyism or guesswork. The goal of the document is to accurately describe how the EFG and UDM Beast actually behave, not to win an argument.
Comments, corrections, reproductions and constructive criticism (like I got already in the comments ❤️) are always welcome.
- The Problem
- Test Environment
- Methodology
- The Reference Run: Real EFG Diagnostics
- Reproducing the Bottleneck — virtio-net Test Matrix
- Closing the Loop — Real Silicon Test Matrix
- Userspace Dataplane — VPP/DPDK Comparison
- The PPPoE Bottleneck — A Related but Distinct Problem
- Cross-Product Confirmation: UDM Beast and UCG Fiber
- Findings: The Architectural Failures
- Recommended Fixes
- Direct Experimental Verification — Building the Missing Modules
- Symbol-Level Forensics on the Running EFG Kernel
- The GPL Compliance Question
- Direct Vendor Engagement: What Ubiquiti Has Already Been Told
- Conclusion
- Appendix: Full Data Sets
Ubiquiti markets the Enterprise Fortress Gateway (EFG) as a 25-gigabit-class router. The product page lists two 25 GbE SFP28 ports for WAN/LAN, and Ubiquiti positions the device as a flagship for medium and large enterprise deployments. Its silicon — a Marvell Octeon CN9670 — supports hardware-accelerated forwarding through purpose-built network engines (NIX) that should sustain tens of millions of packets per second. The UDM Beast, Ubiquiti's next-generation gateway, pairs a Marvell Octeon CN10K SoC (with ARM Neoverse N2 cores) with a dedicated Marvell Prestera-class switching ASIC accessed via PCIe — capabilities that, properly used, would offload most of the per-packet forwarding work into hardware.
In practice, real-world enterprise deployments report:
- Inter-VLAN routing: ~1–1.5 Gbps single-stream, regardless of how fast the upstream link is
- PPPoE WAN throughput: ~2–3 Gbps single-stream on 10 Gbps fiber connections, where the ISP requires PPPoE authentication
- NAT throughput: similar single-flow ceilings whenever IPS, deep-packet-inspection, or threat management features are enabled
Customers complain, post mpstat screenshots showing one CPU core saturated while the other 17 sit idle, and get told it is a hardware limitation.
It is not. The CPUs are not the bottleneck. The silicon is not the bottleneck. The bottleneck is the configuration of the Linux kernel network stack that ships on the device, including:
- Hardware offload features that are explicitly disabled
- A modern kernel fast-path feature (nf_flow_table) that is not loaded
- A user-space inspection engine running on the same CPU core that is forwarding packets
- A 5-deep iptables FORWARD chain that every new connection must traverse
- Conntrack protocol helpers loaded by default for legacy protocols (PPTP, H.323)
- Per-VLAN bridges instead of a vlan-aware single bridge
- No DPDK fast-path despite Marvell shipping first-class DPDK PMDs (cnxk) for these exact SoCs
Each one of these contributes measurable overhead. Combined, they drop forwarding throughput by an order of magnitude. The point of this article is to measure each contribution independently and show what a properly-configured Linux router looks like on the same workload.
- CPU: AMD Ryzen Threadripper Pro 7995WX, 96 cores / 192 threads, base 2.5 GHz, boost 5.1 GHz, Zen 4 microarchitecture
- RAM: 754 GB DDR5 ECC
- Hypervisor: Proxmox VE 9.0.11
- Kernel: Linux 6.14.11-4-pve
- Storage: NVMe ZFS root pool (rpool)
- Networking: Mellanox ConnectX-6 Dx dual-port 100 Gbps NIC (MT2892), bonded LACP 802.3ad
- IOMMU: AMD-Vi enabled in passthrough mode
- Hugepages: 64 × 1 GB = 64 GB reserved at boot
- EFG (Enterprise Fortress Gateway), Ubiquiti Networks
- Marvell Octeon CN9670 SoC, 18 ARM v8.2 cores @ 2.0 GHz
- 64 GB RAM
- Linux 5.15.72-ui-cn9670 (vendor build)
- Live production firewall, 8 days uptime at capture, 7 active VLANs in an enterprise office network
Three Ubuntu 24.04 LTS VMs were cloned from a common template, each pinned via a Proxmox hookscript to a dedicated CCD on the host (8 vCPUs each):
192.168.6.0/24 (mgmt — for SSH, never used for test traffic)
| | |
+----------+----------+
| | |
gw-router client1 client2
(VM 200) (VM 201) (VM 202)
8 cores 8 cores 8 cores
16 GB RAM 8 GB RAM 8 GB RAM
Cores 8-15 Cores 16-23 Cores 24-31
For test traffic (multiple network paths used in different tests):
client1 ────[VLAN 10]──── gw-router ────[VLAN 20]──── client2
10.10.10.10 10.10.20.10
↕ ↕
gw-router gw-router
10.10.10.1 10.10.20.1
The VMs received traffic through one of three I/O paths during testing:
- virtio-net through Linux bridges with VLAN tagging (vmbr1 on the host)
- ConnectX-6 Dx VFs via SR-IOV passthrough (4 VFs total, 2 to gw-router, 1 to each client)
- VPP/DPDK with the same VFs polled directly by VPP's worker threads in userspace
Single TCP stream iperf3 at MTU 1500 was used as the primary measurement. Multi-stream tests with -P 8 were used in select cases to demonstrate scaling behavior. Each measurement ran for 30 seconds with per-second reporting; the values reported are the iperf3 sender/receiver final summary, which agree to within 0.1 Gbps in all cases.
To prove the architectural argument we needed to isolate independent variables:
| Variable | Settings tested |
|---|---|
| I/O fabric | virtio-net (vhost-net backend), ConnectX VF (SR-IOV passthrough) |
| MTU | 1500, 9000 |
| Hardware offloads (GRO/TSO/LRO) | on, off |
| Forwarding rules | none, EFG-replica 5-chain ruleset |
| Forwarder | kernel ip_forward, kernel ip_forward + nftables flowtable, VPP/DPDK userspace |
For each combination, single-stream iperf3 between client1 and client2 (i.e. across the gw-router VM, between two distinct IPv4 subnets) was measured. Because the host CPU does not vary across tests and because vCPU pinning is fixed via a Proxmox hookscript that calls taskset after the VM starts, every test runs on the same physical cores in the same NUMA configuration.
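For reference, the offload and MTU axes of the matrix were toggled with standard tooling on the VM interfaces — a sketch along these lines (enp6s19 is one of the gw-router VM's two lab NICs, as they also appear in the flowtable configuration later; its peer enp6s20 and the client NICs were toggled the same way):

$ # "offloads off" case — disable GRO/GSO/TSO on the lab NIC
$ ethtool -K enp6s19 gro off gso off tso off
$ # "offloads on" case — re-enable them
$ ethtool -K enp6s19 gro on gso on tso on
$ # MTU axis
$ ip link set enp6s19 mtu 9000
$ # verify the resulting feature state
$ ethtool -k enp6s19 | grep -E "segmentation|generic-receive"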
The "EFG-replica 5-chain ruleset" was constructed from observation of the live EFG. It mirrors the EFG's iptables FORWARD structure of ALIEN → TOR → IPS → UBIOS_FORWARD_JUMP → user → default chains, with conntrack lookups, protocol/port matchers, and per-chain counters that force per-packet evaluation in the slow path. The exact ruleset is in the appendix.
Before running anything in the lab, we captured the configuration of a production EFG to know what we needed to reproduce. Every command below was executed on a customer-deployed EFG running stock Ubiquiti firmware. None of these settings are user-configurable from the UI — they are consequences of how the UniFi Web UI provisions the underlying Linux subsystems.
$ uname -a
Linux EFG-Home-SP 5.15.72-ui-cn9670 #5.15.72 SMP Wed Apr 15 23:39:47 CST 2026 aarch64
$ nproc
18
$ free -h
total used free shared buff/cache available
Mem: 63Gi 11Gi 46Gi 106Mi 5.3Gi 44Gi
$ uptime
02:09:29 up 8 days, 5:17, 1 user, load average: 2.52, 1.84, 1.86
Confirmed: Octeon CN9670 (per the kernel build identifier), 18 cores, 64 GB RAM. Kernel 5.15 dates from late 2021 — it predates several material networking improvements in 5.19+ (better flowtable hardware offload, improved nft, better mptcp, PPPoE flowtable acceleration in 6.2+).
$ iptables -L FORWARD -n -v --line-numbers
Chain FORWARD (policy ACCEPT 1033 packets, 157K bytes)
num pkts bytes target source destination
1 555K 775M ALIEN 0.0.0.0/0 0.0.0.0/0
2 2764K 4489M TOR 0.0.0.0/0 0.0.0.0/0
3 238M 354G IPS 0.0.0.0/0 0.0.0.0/0
4 874M 1342G UBIOS_FORWARD_JUMP 0.0.0.0/0 0.0.0.0/0
In 8 days of uptime, this device has pushed:
- 874 million packets through UBIOS_FORWARD_JUMP
- 238 million through the IPS chain
- 2.76 million through TOR
- 555 thousand through ALIEN
Every packet that this gateway routes traverses at least 4 jump targets in sequence, plus whatever rules live inside each. Total rule count across filter, mangle, and nat tables:
$ iptables -t filter -L -n | wc -l
572
$ iptables -t mangle -L -n | wc -l
187
$ iptables -t nat -L -n | wc -l
80
839 rules total. And it's all running on the legacy iptables (xt_*) backend. The modern nft API is not in use:
$ nft list ruleset | wc -l
0
$ nft list flowtables
[empty output]
$ lsmod | grep -iE "flow_table|flowtable"
[empty output]
$ for iface in eth0 eth1 eth2 eth3; do
ethtool -k $iface | grep hw-tc-offload
done
[no output - module not loaded, feature not available]
The nf_flow_table kernel module is not loaded. There is no nft flowtable. There is no hardware tc-flower offload. The kernel's modern fast-path infrastructure — which can bypass conntrack and rule evaluation for established flows — is not even installed on this device.
This single missing piece is, as the lab measurements will show, worth a 3× to 7× single-stream throughput improvement on its own.
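On a stock distribution kernel, checking for and exercising the flowtable infrastructure takes a handful of commands — a sketch, with eth0/eth1 as placeholder interface names (on the EFG every one of these steps fails, because the module simply is not in the build):

$ modinfo nf_flow_table                      # is the module on disk at all?
$ modprobe -n -v nf_flow_table               # dry run: would it load?
$ nft add table inet fastpath
$ nft 'add flowtable inet fastpath ft { hook ingress priority 0; devices = { eth0, eth1 }; }'
$ nft list flowtables                        # should now show the declared flowtable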
$ sysctl net.netfilter.nf_conntrack_max
net.netfilter.nf_conntrack_max = 10485760
$ sysctl net.netfilter.nf_conntrack_count
net.netfilter.nf_conntrack_count = 846
$ lsmod | grep nf_conntrack
nf_conntrack_tftp 262144 1 nf_nat_tftp
nf_conntrack_pptp 327680 1 nf_nat_pptp
nf_conntrack_h323 327680 1 nf_nat_h323
nf_conntrack_ftp 327680 1 nf_nat_ftp
Four conntrack protocol helpers loaded: FTP, PPTP, H.323, TFTP. PPTP is a deprecated VPN protocol from the late 1990s. H.323 is a videoconferencing protocol from 1996, mostly displaced by SIP. TFTP and FTP are increasingly rare in modern enterprise environments.
The actual per-packet cost of having helpers loaded is more nuanced than "every packet is inspected" — see Section 10 Finding 5 for the precise breakdown. The short version: established non-helper flows pay essentially nothing per packet (a pointer check), but every new connection pays a hash lookup against the helper registry, and any flow on a helper-recognized port (FTP/21, etc.) pays the full inspection cost.
A "Firewall Connection Tracking" toggle does exist in the UniFi controller's Gateway settings, allowing administrators to disable individual helpers (FTP, H.323, SIP, GRE, PPTP, TFTP). Disabling them all unloads the helper modules from memory entirely. This addresses the lookup cost on new flows but does not affect already-established TCP throughput (the iperf3 inter-VLAN measurement is unchanged), and does not address the bigger architectural bottlenecks documented in Sections 5-10. Section 10 Finding 5 expands on what helpers actually cost and what would be required to keep helper functionality without the cost.
$ ps -eo pid,pcpu,pmem,comm --sort=-pcpu | head -10
PID %CPU %MEM COMMAND
4098469 39.6 0.0 dpi-flow-stats
3139 12.5 0.1 ubios-udapi-ser
66687 7.8 3.1 java
4891 7.0 0.0 conntrackd
2491041 6.9 1.6 Suricata-Main
5505 6.2 0.0 mcad
8596 3.9 0.9 unifi-core
4482 3.8 0.0 ulogd
dpi-flow-stats consuming 39.6% of one CPU core continuously. Add Suricata IPS (6.9%) and conntrackd (7.0%) and you have ~54% of one core permanently consumed by per-packet inspection processes that don't forward anything — they just observe.
The CPU pinning details matter here, and the writeup's earlier framing — which lumped these userspace processes together as "running on the forwarding core" — oversimplified the picture. The accurate picture is:
Suricata is configured with explicit CPU affinity in /usr/share/ubios-udapi-server/ips_6/config/suricata_ubios_high.yaml:
threading:
set-cpu-affinity: yes
cpu-affinity:
- management-cpu-set:
cpu: [ 0 ]
- receive-cpu-set:
cpu: [ "all" ]
- worker-cpu-set:
cpu: [ "all" ]
prio:
default: "high"
- verdict-cpu-set:
cpu: [ 1 ]
prio:
default: "high"
detect-thread-ratio: 1.0
nfq:
mode: repeat
repeat-mark: 1
repeat-mask: 1
bypass-mark: 1
bypass-mask: 1
fail-open: yes

This tells us several things:
- Management thread is pinned to core 0 — single-threaded, single-core
- Receive and worker threads can use all 18 cores
- Verdict thread is pinned to core 1 — relevant in IPS mode (see below)
- Both pcap: and nfq: sections coexist in the same configuration: the YAML supports either mode, with the active mode determined at Suricata launch time
The kernel command line additionally includes isolcpus=12, isolating core 12 from the general scheduler — likely reserving it for one of Suricata's worker threads when running.
The IDS/IPS toggle in the UniFi controller does not change Suricata's runtime architecture. This was verified directly: with "Intrusion Prevention" toggled ON in the UniFi controller, Suricata is still launched with --pcap, there are zero NFQUEUE rules in iptables (iptables-save | grep -i nfqueue returns empty), and /proc/net/netfilter/nfnetlink_queue is empty. The nfq: section and the verdict-cpu-set: [ 1 ] pinning in the YAML are dead config — they would activate only if Suricata were launched with -q <queue>, which it never is on this device.
Confirmed runtime architecture, from the running /var/log/suricata/suricata.log:
RunModeIdsPcapWorkers initialised
all 6 packet processing threads, 2 management threads initialized, engine started.
Suricata observes packets via libpcap on six bridge interfaces (one worker thread per bridge: br0, br254, br3, br5, br6, br7, configured in /run/ips/config/iface.yaml with threads: 1 per interface). It loads 32,033 signatures (per suricata.log: "32031 rules successfully loaded" + 2 threshold rules) and a closed-source Ubiquiti Suricata plugin: /usr/share/ubios-udapi-server/ips_6/suricata/lib/aarch64-linux-gnu/ubnt-idsips-daemon.so.
The running Suricata version is end-of-life software:
$ /usr/share/ubios-udapi-server/ips_6/suricata/bin/suricata -V
This is Suricata version 6.0.12 RELEASE
The Suricata 6.0.x branch was officially declared end-of-life by the upstream project on August 1st, 2024. Per the official announcement: "This means we'll be providing no more support, releases or (security) fixes for this branch. We strongly encourage everyone who is still using Suricata 6 or older to upgrade to Suricata 7 as soon as possible." The final 6.0.x release was 6.0.20.
Specifically:
- Suricata 6.0.12 was released approximately April 2023
- The EFG ships a version that is 8 patch releases behind the last 6.0.x release
- The 6.0.x branch has received zero security fixes since August 2024 — over 21 months as of this document's publication date in May 2026
- Suricata 7.0.x is the current LTS, supported until September 2026
- Suricata 8.0.x is the latest major
The EFG ships the Suricata upgrade on every device and chooses not to activate it. The filesystem layout on this production EFG:
$ ls -la /usr/share/ubios-udapi-server/
drwxr-xr-x 2 root root 4096 Apr 22 21:09 ips/ ← 68-byte version selector
drwxr-xr-x 1 root root 4096 May 2 20:55 ips_6/ ← Suricata 6.0.12 (EOL, ACTIVE)
drwxr-xr-x 6 root root 81 Apr 8 06:24 ips_8/ ← Suricata 8.0.2 (current, INACTIVE)
$ /usr/share/ubios-udapi-server/ips_8/suricata/bin/suricata -V
This is Suricata version 8.0.2 RELEASE
$ ls /usr/share/ubios-udapi-server/ips_8/config/
afpacket.tmpl category_list.json iface.tmpl
reference.config static_config.json suricata_ubios_high.yaml
The ips_8/ directory is not a placeholder. It contains a fully working Suricata 8.0.2 binary, complete templates, the same suricata_ubios_high.yaml configuration filename used by the active ips_6/, and the full bin/, config/, rules/, and suricata/ packaging structure. The minimal ips/ directory contains only a version.json (68 bytes) — likely a version selector that decides which ips_N/ directory the running daemon points at.
Suricata 8.0.2 is also a substantial functional upgrade over 6.0.12. Per the Suricata 8 help output on this device:
Firewall:
--firewall : enable firewall mode
--firewall-rules-exclusive : path to firewall rule file loaded exclusively
Suricata 8 introduces a native --firewall mode that could replace the iptables IPS chain + ipset pattern entirely with a Suricata-native rule engine. Adopting it would require Ubiquiti to port the closed-source ubnt-idsips-daemon.so plugin from the Suricata 6.x plugin API to the 8.x plugin API and to rewrite the integration glue. That work has not been done, or has been done and not deployed.
Either way, the situation is: Ubiquiti has the supported Suricata staged on every shipping EFG and has actively chosen to point the version selector at the end-of-life binary. This is not a "haven't gotten around to upgrading" situation — the upgrade is sitting on the device, ready to be selected. The decision to keep running EOL Suricata 6.0.12 in May 2026, while Suricata 8.0.2 is shipped on the same device, is deliberate.
This is the inspection engine that an enterprise security gateway uses to detect threats on its data path. It is running unsupported software with no security patches for 21 months, on a device that costs approximately $2,000 and is marketed as a flagship security gateway, while the supported version is staged on the same device's filesystem.
TLS visibility on the EFG is selective at best. Suricata in pcap mode sees encrypted ciphertext on the wire for HTTPS sessions. Without TLS interception (an MITM proxy decrypting and re-encrypting traffic using a CA cert distributed to clients), Suricata cannot inspect HTTP response bodies, exfiltrated data, or C2 traffic inside HTTPS sessions. As of 2026, this represents the majority of internet traffic — making "IDS/IPS" coverage of HTTPS the central question for any inline security product.
This was tested directly. From a host behind the EFG with IPS enabled and 32,033 signatures loaded:
$ curl -s https://testmynids.org/uid/index.html
uid=0(root) gid=0(root) groups=0(root)
The response payload is the canonical test string for the Suricata GPL ATTACK_RESPONSE id check returned root signature — designed to fire on uid=0(root) byte sequences in HTTP response bodies. After 5+ minutes, on the EFG:
$ tail -100 /var/log/suricata/eve.json | grep -i "GPL ATTACK"
[empty]
$ tail -100 /var/log/suricata/fast.log | grep -i "GPL ATTACK"
[empty]
$ journalctl -u syslog-ng | grep -i "attack_response"
[empty]
No alert generated. No entry in any log. The signature payload reached the test host through the EFG's IPS without detection.
Ubiquiti does ship a TLS interception product, branded NeXT AI Inspection in the UniFi controller UI ("NextAI" in shorthand). It is an opt-in feature with three modes (Off, Simple, Advanced) and is not engaged for general HTTPS traffic by default. Per Ubiquiti's documented architecture, NeXT AI Inspection is a separate pipeline that:
- Captures packets selected for inspection (per the configured domain inclusion list — by default "Specific" rather than "All")
- Enqueues them to a RabbitMQ broker running on the EFG itself
- A proprietary SSL inspection process dequeues the packets, decrypts using a UniFi-generated CA certificate, inspects content (including content-type filtering — blocking specific file types like archives, PDFs, and spreadsheets while allowing others), re-encrypts
- Re-enqueues the inspected traffic to an outbound queue
- Another component pulls from the outbound queue and forwards
- Only after decryption does the pipeline send the cleartext to Suricata for signature inspection
This architecture has multiple problems beyond what the curl test demonstrated:
- RabbitMQ in the data path. RabbitMQ is an Erlang-based AMQP message broker designed for inter-service messaging at millisecond timescales. Per-packet routing through an AMQP broker imposes TCP framing for AMQP, routing-key matching, persistence (or the cost of disabling it), at-least-once delivery semantics, and Erlang VM scheduler decisions on every packet. This is fundamentally incompatible with a multi-Gbps data path. Either NeXT AI's actual throughput is substantially lower than advertised or the broker is operating in a degraded mode that defeats most of what one would use AMQP for.
- CA certificate distribution is unenforced and customer-managed. UniFi generates the CA certificate; the customer is responsible for installing it on each client. The UI explicitly notes: "Download and install the NeXT AI Inspection certificate on each client to avoid losing internet access. Use UniFi Identity for seamless certificate distribution." UniFi Identity is a separate product. On most networks, the cert ends up installed on managed corporate laptops but not on BYOD, IoT, mobile devices, guest-network clients, native apps with cert pinning, or any device the IT team doesn't directly control. The curl test above succeeded over HTTPS without certificate errors, meaning the test client correctly received the real testmynids.org certificate — NeXT AI was not in the path for that flow, either because the host lacks the UniFi CA or because the test domain wasn't on the inclusion list.
- Inspection scope is selective by default. The UI's "What to Inspect" defaults to specific domains rather than all traffic. This is a defensible design choice for performance (you don't want to MITM Netflix or financial institutions), but it means even when NeXT AI is enabled, IPS visibility into HTTPS is limited to the inclusion list.
- Suricata sees plaintext only for NeXT-AI-fronted flows. For all other HTTPS sessions — anything outside the inclusion list, anything from clients without the CA installed, anything in flows that bypass NeXT AI for performance — Suricata sees ciphertext only. The 32,033 signatures loaded — most of which target HTTP-layer attack patterns — are matched against encrypted bytes for the bulk of modern web traffic.
The combination is significant. The EFG's "IPS" can only inspect HTTPS traffic where ALL of these conditions hold simultaneously: (a) the destination domain is on the customer-configured NeXT AI inclusion list, (b) the client has the UniFi-generated CA certificate installed and trusted, (c) the flow is routed through the RabbitMQ-based NeXT AI pipeline, and (d) the flow's volume fits within whatever the broker can sustain. For any HTTPS traffic outside that intersection — which is the vast majority of internet traffic on a typical enterprise network — Suricata sees only ciphertext, and the IPS function is effectively non-existent for the most relevant threat vectors.
The architectural contrast with proper inline IPS is sharp here. Suricata 7.0+ supports DPDK mode, where Suricata runs as a pipeline stage on the dataplane workers. With DPDK + VPP + Suricata-on-DPDK and TLS interception integrated as a DPDK pipeline stage (using something like Intel's QAT for offloaded crypto, or even kernel TLS offload), packet decrypt → inspect → encrypt → forward happens entirely in userspace on dedicated cores, with no kernel→userspace copies, no broker hops, and no RabbitMQ. The throughput overhead of TLS-aware inline IPS in this architecture is in the low single-digit percent on modern hardware, not the order-of-magnitude penalty of the EFG's current RabbitMQ-based design.
The "IPS" data path is retroactive blocking by 3-tuple, not inline prevention:
(in-process, closed-source)
┌─────────────────────────────────┐ ┌───────────────────────┐
│ Suricata --pcap │ │ ubnt-idsips-daemon.so │
│ • 6 worker threads (one per │ ───→ │ writes UNIX DGRAMs to │
│ bridge: br0/3/5/6/7/254) │ │ /run/ips/eve_alert │
│ • 2 management threads │ │ .json socket │
│ • 32,033 signatures loaded │ └───────────┬───────────┘
└─────────────────────────────────┘ │
▼
┌───────────────────────┐
│ ubnt-idsips-daemon │
│ (separate userland │
│ process, closed- │
│ source) │
│ • parses alerts │
│ • populates ipset 'ips'
│ when IPS toggle on │
│ • forwards to syslog │
└───────────┬───────────┘
│ netlink
▼
┌───────────────────────┐
│ ipset 'ips' │
│ hash:ip,port,ip │
│ timeout 0, max 65536 │
└───────────┬───────────┘
│
▼
┌───────────────────────┐
│ iptables IPS chain │
│ -m set --match-set │
│ ips dst,dst,src │
│ -j IPSLOGNDROP │
└───────────────────────┘
The IDS/IPS toggle in the UniFi controller appears to control only one specific behavior in the ubnt-idsips-daemon process: whether it populates the ipset when alerts fire. Both modes use the same Suricata invocation, the same --pcap capture, the same workers, the same alerts. The difference is policy in a closed-source userland daemon, not architecture.
The ipset characteristics matter:
$ ipset list ips
Name: ips
Type: hash:ip,port,ip
Revision: 6
Header: family inet hashsize 1024 maxelem 65536 timeout 0 bucketsize 12
Size in memory: 208
References: 2
Number of entries: 0
- Type hash:ip,port,ip: blocking is per-flow-tuple (source IP, destination port, destination IP), not per-source-IP
- timeout 0: entries never expire; once added, blocked until ipset flush or device reboot
- maxelem 65536: maximum 65,536 simultaneously-blocked tuples
- Number of entries: 0: empty across multiple samples, on a production EFG with 8 days of uptime, IPS enabled, processing real multi-VLAN traffic with 32K signatures loaded
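To make the tuple semantics concrete, here is what a populated entry would look like and how the FORWARD rule's dst,dst,src match (shown in the diagram above) would interpret it — addresses are invented (RFC 5737 documentation ranges), and this is illustrative rather than something to run on a production gateway:

$ ipset add ips 192.0.2.10,tcp:3306,203.0.113.5
$ ipset test ips 192.0.2.10,tcp:3306,203.0.113.5     # reports the entry is in the set
$ ipset test ips 192.0.2.10,tcp:3306,198.51.100.9    # different source IP: not in the set
$ # with "-m set --match-set ips dst,dst,src", that entry drops packets destined to
$ # 192.0.2.10:3306 coming from 203.0.113.5 — and nothing else; a new source IP
$ # gets one free shot before its own tuple is added
$ ipset flush ips                                    # timeout 0: entries persist until flushed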
What this means in practice:
- The first packet matching a signature always reaches its destination. Suricata observes it via pcap, generates an alert, the daemon receives the alert from the UNIX socket, parses it, and adds the 3-tuple to the ipset. By the time the kernel can drop based on ipset match, the malicious packet has already been forwarded.
- Subsequent packets matching the same 3-tuple (source IP / dest port / dest IP) are blocked retroactively.
- Detection-to-block latency is seconds — Suricata processes packets in batches, the daemon parses datagrams, ipset population goes through netlink. The end-to-end latency from "first malicious packet observed" to "ipset blocks future packets" is non-trivial.
- If an attacker uses one IP to probe (gets blocked) and a different IP for the actual attack, each new source IP gets one free shot before the ipset entry is added.
- After 8 days of production uptime on a multi-VLAN enterprise gateway with IPS enabled, the ipset is empty. Either no traffic has triggered any of 32,033 signatures, or signatures fire but the daemon's policy threshold for ipset population isn't being met.
This is more accurately described as delayed reactive blocking than as Intrusion Prevention. An "Intrusion Prevention System" that observes an SQL injection payload, alerts on it, and then blocks future traffic from the source IP — after the original payload has already reached the database server — has not prevented the intrusion. It has only prevented follow-up traffic from the same source.
The architectural reason this design exists is interesting: properly inline IPS via NFQUEUE would put every inspected packet through Suricata's worker threads with verdict reinjection through a single core (the YAML's verdict-cpu-set: [ 1 ]). On the EFG's 2 GHz Octeon cores, this would significantly worsen the already-limited inter-VLAN forwarding throughput documented earlier in this document. By doing IPS retroactively via ipset population, Ubiquiti avoids creating a hard single-core verdict bottleneck on the data path — but at the cost of the IPS not actually preventing the malicious traffic it detects. The trade-off makes performance sense; it does not make security sense.
Closed-source surface area: two pieces of closed-source Ubiquiti code interact with GPL software in this pipeline. (1) ubnt-idsips-daemon.so is a Suricata plugin loaded as a .so into the Suricata process — runs in Suricata's address space, links against Suricata's exported plugin API, processes Suricata's internal data structures. (2) ubnt-idsips-daemon is a separate userland daemon that consumes alerts from Suricata's socket and writes to the kernel ipset via netlink. Suricata is GPL-2.0-licensed; whether the plugin is a derived work is the same question raised in Section 14 about the proprietary kernel modules.
dpi-flow-stats has no CPU pinning at all. Its affinity mask reads 0x3ffff (all 18 bits set) — meaning it can run on any core, and the kernel scheduler places it wherever. At the moment of one diagnostic capture, it was running on core 9. Earlier mpstat sampling showed it consuming 39.6% of CPU continuously across whatever cores it landed on, which can and does include the forwarding core for a given flow.
Single-flow softirq lands on a specific core by default for a single TCP flow (RX queue hashing places all packets of one 5-tuple on one queue, which is bound to one core). For the EFG and the typical default RSS configuration, this is often core 0. Whichever core gets the flow becomes the bottleneck core for that flow.
The implication for a single-flow workload (the iperf3 inter-VLAN test, or a Veeam backup, or any single TCP stream):
- The forwarding softirq runs on whichever core RSS picks (typically core 0)
- Suricata's management thread is pinned to core 0 — it competes with forwarding softirq when the flow lands there
- Suricata's verdict thread is pinned to core 1 — separate from forwarding for a flow on core 0, but a constraint if traffic ever lands on core 1
- Suricata's workers are distributed across all cores including the forwarding core
- dpi-flow-stats can land anywhere including the forwarding core, with no restriction
What the lab measurements demonstrate isn't "Suricata is bottlenecking the forwarding core" — Suricata's core consumption is spread by design. What they demonstrate is that userspace processes consuming cycles on the same physical core that's doing forwarding softirq directly reduce that flow's throughput. The forwarding-core contention sources, on this configuration, are: Suricata's management thread (pinned to core 0), Suricata workers (when scheduled to core 0), and dpi-flow-stats (any core including 0). The cumulative effect is observable in per-core CPU samples and reproducible in the lab.
Aggregate vs single-stream: for a multi-flow workload, work spreads across cores via RSS hashing and the per-core contention is less visible because no single core is the bottleneck. The single-stream case is what user-visible problems look like (Veeam replication, large file transfers, single iperf3 streams, individual users on a Fast.com test). That's why the lab measurements isolate single-stream — it's the user-facing failure mode, even though aggregate throughput looks healthier.
Concrete fixes that don't require an architectural rewrite:
- Move Suricata's management-cpu-set from core 0 to core 2 or higher (any core not on the dominant RSS hash path). One-line YAML change.
- Pin dpi-flow-stats away from core 0 via taskset or systemd CPUAffinity=.
- Configure RSS to hash inter-VLAN flows away from cores 0 and 1 entirely, since both are pinned by Suricata threads.
These are small wins — perhaps 10-20% on single-stream throughput at best — compared to the architectural fixes (flowtable, DPDK + VPP, Suricata 7.0+ in DPDK mode), which deliver 5-25× improvements. But they are real, they cost nothing, and they do not require a kernel rebuild or a feature redesign.
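A sketch of what those three mitigations look like as commands (the Suricata YAML path is the one observed on the EFG; the IRQ number is a placeholder that depends on the NIC's queue-to-IRQ mapping, and none of this is guaranteed to survive a firmware update):

$ # 1. one-line YAML change in suricata_ubios_high.yaml:
$ #      management-cpu-set:  cpu: [ 2 ]      (instead of [ 0 ])
$ # 2. pin dpi-flow-stats away from cores 0-1 (pidof may return several PIDs)
$ taskset -pc 2-17 "$(pidof dpi-flow-stats)"
$ # 3. steer the RX queue interrupts away from cores 0-1, e.g. onto core 2
$ grep eth /proc/interrupts                       # find the RX queue IRQ numbers first
$ echo 4 > /proc/irq/<irq-number>/smp_affinity    # placeholder IRQ; mask 0x4 = core 2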
$ mpstat -P ALL 1 3 | grep Average
Average: all 4.07 0.00 3.67 0.07 0.17 0.24 0.00 0.00 0.00 91.78
Average: 0 18.40 0.00 1.39 0.00 0.35 0.00 0.00 0.00 0.00 79.86
Average: 1 13.20 0.00 1.32 0.00 0.33 0.33 0.00 0.00 0.00 84.82
Average: 2 2.68 0.00 2.01 0.00 0.34 0.34 0.00 0.00 0.00 94.63
Average: 3 6.38 0.00 1.68 0.00 0.00 0.00 0.00 0.00 0.00 91.95
[... 14 more cores all near 95–100% idle ...]
91.78% average idle across 18 cores during light load. Under a single-flow stress test the picture is sharper: one core at near-100% softirq (the kernel's softirq context where __netif_receive_skb_core and ip_forward run), seventeen sitting at 0%. Single-flow forwarding is fundamentally a single-thread workload in the Linux kernel network stack: a TCP flow's packets all hash to the same RX queue, the queue is bound to one core, and that core does all the work.
Adding cores does not help. Faster cores help linearly. Removing per-packet kernel-stack work helps dramatically. A userspace dataplane that polls the NIC across multiple worker cores can fix this entirely — see Section 7.
$ ip -br link | grep -E "^br[0-9]"
br0 UP 192.168.196.1/24
br1111 UP [no address shown]
br254 UP 192.168.254.1/24
br3 UP 192.168.3.1/24
br5 UP 192.168.5.1/24
br6 UP 192.168.6.1/24
br7 UP 192.168.7.1/24
Each VLAN gets its own bridge (br3 for VLAN 3, br5 for 5, br6 for 6, etc.) hanging off switch0 subinterfaces (switch0.3, switch0.5, etc.). Inter-VLAN traffic must traverse:
client (VLAN 3) → br3 → switch0.3 → switch0 → kernel L3 lookup
↓
ip_forward
↓
switch0.5 → br5 → client (VLAN 5)
Every L3 hop is a kernel ip_forward operation. A modern vlan-aware single bridge with bridge vlan filtering enabled and nf_flow_table could short-circuit established flows in a software fast-path. This setup cannot.
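For contrast, a minimal sketch of a vlan-aware single-bridge layout on a stock Linux system — the bridge name is hypothetical and the addresses mirror the EFG's VLAN 3/5 subnets purely for illustration; this is not the EFG's actual configuration:

$ ip link add br-lan type bridge vlan_filtering 1
$ ip link set switch0 master br-lan
$ bridge vlan add dev switch0 vid 3
$ bridge vlan add dev switch0 vid 5
$ bridge vlan add dev br-lan vid 3 self          # let tagged frames reach the bridge itself
$ bridge vlan add dev br-lan vid 5 self
$ # one routed SVI per VLAN on top of the single bridge, instead of one bridge per VLAN
$ ip link add link br-lan name br-lan.3 type vlan id 3
$ ip link add link br-lan name br-lan.5 type vlan id 5
$ ip addr add 192.168.3.1/24 dev br-lan.3
$ ip addr add 192.168.5.1/24 dev br-lan.5
$ ip link set br-lan up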
| Finding | Evidence | Impact |
|---|---|---|
| 5-chain iptables FORWARD | 874 M packets through UBIOS_FORWARD_JUMP in 8 days | Lab: 4.95 → 2.36 Gbps when applied (53% drop) |
| No flowtable, no module | nft list flowtables empty, lsmod shows no flow_table | Lab: virtio kernel 2.36 → 7.05 → 17.4 Gbps when added with offloads |
| Userspace inspection competing for forwarding core | dpi-flow-stats 39.6% CPU (no pinning, mask 0x3ffff), Suricata mgmt thread pinned to core 0 (= dominant RSS hash core) | CPU pressure on the specific core handling each flow's softirq |
| Hardware offloads disabled | hw-tc-offload off [fixed], GRO off | Lab: 17 Gbps (on) → 5 Gbps (off) at MTU 1500 |
| Per-VLAN bridges, no offload | 7 separate br* devices | Forces every inter-VLAN packet through kernel L3 |
| Legacy iptables, not nftables | nft list ruleset empty, 839 iptables rules | Slower per-rule, locked out of fast-path features |
| Conntrack helpers loaded by default | nf_conntrack_{ftp,pptp,h323,tftp} all loaded | Helper-registry lookup on every new connection; full inspection on helper ports |
| 18 cores, 1 used at a time | mpstat 91.78% idle average; single-flow saturates one core | Single-flow workloads cannot scale across cores in the kernel |
| Old kernel (5.15) | Predates several networking improvements including PPPoE flowtable | Locks out post-5.19 nftables, flowtable, and PPPoE acceleration |
| No DPDK | No cnxk PMD active despite full vendor support | Forfeits 5-15× throughput available from the same silicon |
The first round of tests used standard virtio-net VMs on Linux bridges — the closest analogue to "hypervisor in front of network silicon" without involving the ConnectX hardware directly. The bridge vmbr1 was configured as VLAN-aware with VIDs 10 and 20.
$ iperf3 -c 10.10.20.10 -t 30
[ ID] Interval Transfer Bitrate
[ 5] 0.00-30.00 sec 59.2 GBytes 16.9 Gbits/sec sender
[ 5] 0.00-30.00 sec 59.2 GBytes 16.9 Gbits/sec receiver
16.9 Gbps. mpstat showed CPU 3 at ~12% softirq during the test. This is what jumbo MTU + GRO/TSO buys you: each "packet" through the forward path is a ~64 KB super-segment that the kernel processes once. Approximately 30,000 forward operations per second, each on one core.
[ 5] 0.00-30.00 sec 60.1 GBytes 17.2 Gbits/sec
17.2 Gbps. Surprisingly similar. With MTU 9000, even without GRO, packets are 8960 bytes each — still only ~6× the per-packet overhead of TSO super-segments. The per-packet kernel cost doesn't dominate yet.
This is the configuration that matches what real Ubiquiti customers experience. Standard internet MTU, no jumbo frames, no offloads.
[ 5] 0.00-30.00 sec 17.3 GBytes 4.95 Gbits/sec
4.95 Gbps. mpstat showed CPU 6 at 100% softirq, all other cores idle. This is the same shape as the EFG diagnostic — one core saturated, the others doing nothing. The Zen 4 core at 5+ GHz, doing nothing but softirq packet forwarding, ceilings at this number.
If we naively scale this for an Octeon ARM core at 2.0 GHz (about 3–5× slower per cycle for this workload), we'd predict ~1.0–1.6 Gbps. Real EFG measurements are in this range. We are reproducing the right physics.
$ sudo modprobe nf_conntrack
$ sudo sysctl -w net.netfilter.nf_conntrack_max=10485760
[ 5] 0.00-30.00 sec 16.9 GBytes 4.84 Gbits/sec
4.84 Gbps. Almost no impact. Module load alone is cheap; conntrack's cost shows up when rules invoke it.
table inet filter {
chain forward {
type filter hook forward priority 0; policy accept;
ct state established,related accept
ct state new accept
}
}
[ 5] 0.00-30.00 sec 16.2 GBytes 4.64 Gbits/sec
4.64 Gbps. A 4% drop from a single conntrack rule. After the first packet of a single-flow iperf3 stream, the conntrack entry exists; lookup is O(1). The cost is real but small for a single long-lived flow.
The full ruleset emulating what we observed on the EFG: 5 jump chains, conntrack per chain, per-rule counters, multiple matchers per rule:
table inet filter {
chain alien_chain { counter; ip protocol tcp counter; ip saddr 10.0.0.0/8 counter }
chain tor_chain { counter; ip protocol tcp counter; tcp flags & (syn|ack) == ack counter }
chain ips_chain { counter; ip protocol tcp counter; meta l4proto tcp counter; tcp dport { 1-65535 } counter }
chain ubios_chain { counter; ip protocol tcp counter; ct state established counter }
chain user_chain { counter; ct state established,related counter; ip saddr 10.10.10.0/24 ip daddr 10.10.20.0/24 counter }
chain forward {
type filter hook forward priority 0; policy accept;
jump alien_chain
jump tor_chain
jump ips_chain
jump ubios_chain
jump user_chain
}
}
[ 5] 0.00-30.00 sec 7.99 GBytes 2.29 Gbits/sec
2.29 Gbps. The smoking gun. A 53% drop from the no-rule baseline of 4.95 Gbps. CPU 5 was pegged at 100% softirq during the entire run.
This is the EFG's per-packet cost on a fast x86 core. Scaling for Octeon ARM at 2.0 GHz: ~500–800 Mbps. Matches user reports of EFG inter-VLAN performance in the wild.
$ iperf3 -c 10.10.20.10 -t 30 -P 8
[SUM] 0.00-30.00 sec 39.7 GBytes 11.4 Gbits/sec
11.4 Gbps aggregate across 8 streams. mpstat showed 2–3 cores busy: different flows hashed to different RX queues, different queues bound to different cores. Multi-flow forwarding scales (somewhat), but single-flow performance does not — each stream caps near the per-core ceiling.
This is why a single backup transfer or large Veeam replication will saturate at 1 Gbps even though the WAN can do 25: the flow is one TCP connection.
We replace the 5-chain ruleset with a flowtable directive:
table inet filter {
flowtable f {
hook ingress priority 0
devices = { enp6s19, enp6s20 }
}
chain forward {
type filter hook forward priority 0; policy accept;
ip protocol { tcp, udp } flow add @f
ct state established,related accept
}
}
[ 5] 0.00-30.00 sec 24.6 GBytes 7.05 Gbits/sec
7.05 Gbps. A 3.0× jump from 2.36 Gbps. flowtable installs an ingress fast-path that, after the first few packets of a flow are tracked, bypasses conntrack lookup and FORWARD chain evaluation entirely. The packet still goes through netfilter ingress hook; the slow path is just skipped.
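A quick way to confirm in the lab that flows are actually taking the fast path: flowtable-handled connections are flagged [OFFLOAD] in the conntrack table. This assumes conntrack-tools is installed in the gw-router VM; the /proc interface shows the same flag:

$ conntrack -L 2>/dev/null | grep -c OFFLOAD
$ grep -c OFFLOAD /proc/net/nf_conntrack
$ nft list flowtable inet filter f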
[ 5] 0.00-30.00 sec 60.9 GBytes 17.4 Gbits/sec
17.4 Gbps. A 7.4× improvement over the EFG-style ruleset baseline (2.36 Gbps). Same hardware. Same kernel. Same single TCP stream. The only changes: flowtable directive added, offloads enabled.
| # | MTU | Offloads | Rules | Single-stream |
|---|---|---|---|---|
| 1 | 9000 | on | none | 16.9 Gbps |
| 2 | 9000 | off | none | 17.2 Gbps |
| 3 | 1500 | off | none | 4.95 Gbps |
| 4 | 1500 | off | + ct module | 4.84 Gbps |
| 5 | 1500 | off | + simple ct rule | 4.64 Gbps |
| 6 | 1500 | off | EFG 5-chain replica | 2.36 Gbps |
| 7 (8-stream) | 1500 | off | EFG 5-chain | 11.4 Gbps agg |
| A | 1500 | off | flowtable | 7.05 Gbps |
| B | 1500 | on | flowtable | 17.4 Gbps |
The virtio tests share a known limitation: virtio-net packets traverse the host's vhost-net kernel thread, which adds its own per-packet cost beyond what's in the guest. To prove that the kernel-stack overheads are independent of virtio's I/O fabric, we ran the same tests with SR-IOV pass-through of ConnectX-6 Dx Virtual Functions.
The ConnectX-6 Dx supports up to 8 SR-IOV Virtual Functions per port. Without disturbing the existing LACP bond:
$ echo 4 > /sys/class/net/enp5s0f0np0/device/sriov_numvfs
$ cat /sys/class/net/enp5s0f0np0/device/sriov_numvfs
4
Four VFs were created (VF0-VF3), assigned dedicated MACs and isolated VLANs (110/120) at the eSwitch level, and passed through to the lab VMs:
- VF0 (
0000:05:00.2) → gw-router as VLAN 10 lab NIC - VF1 (
0000:05:00.3) → gw-router as VLAN 20 lab NIC - VF2 (
0000:05:00.4) → client1 (VLAN 10) - VF3 (
0000:05:00.5) → client2 (VLAN 20)
The ConnectX-6 Dx eSwitch handled L2 between VFs in silicon — no traffic exited the physical port for the VLAN 10/20 lab traffic. The bond and the upstream network were unaffected.
Inside each VM, the VFs appeared as native ConnectX hardware via the mlx5_core driver. The VMs ran kernel ip_forward exactly as before; only the I/O fabric changed.
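For completeness, the VF MAC and eSwitch VLAN assignment was done with the standard ip link VF controls — a sketch (the MAC addresses here are placeholders; the VLAN 110/120 split puts VF0/VF2 on one eSwitch segment and VF1/VF3 on the other):

$ ip link set enp5s0f0np0 vf 0 mac 02:00:00:00:6e:01 vlan 110   # gw-router, VLAN 10 side
$ ip link set enp5s0f0np0 vf 1 mac 02:00:00:00:78:01 vlan 120   # gw-router, VLAN 20 side
$ ip link set enp5s0f0np0 vf 2 mac 02:00:00:00:6e:02 vlan 110   # client1
$ ip link set enp5s0f0np0 vf 3 mac 02:00:00:00:78:02 vlan 120   # client2
$ ip link show enp5s0f0np0 | grep vf                            # verify MAC/VLAN per VF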
[ 5] 0.00-30.00 sec 88.3 GBytes 25.3 Gbits/sec
25.3 Gbps single-stream. A 5.1× improvement over the equivalent virtio test (4.95 Gbps with offloads off). With offloads on, ConnectX hardware GRO is more efficient than virtio's, so the per-superpacket cost is even lower.
[ 5] 0.00-30.00 sec 73.9 GBytes 21.1 Gbits/sec
21.1 Gbps. Only a 17% drop. With GRO collapsing wire packets into super-segments, the rule evaluation cost is amortized across ~40× fewer events. The EFG ruleset is still expensive per-event, but per-packet on the wire it's hidden by GRO.
[ 5] 0.00-30.00 sec 16.6 GBytes 4.74 Gbits/sec
4.74 Gbps. Statistically identical to the virtio-net test (4.95 Gbps). With offloads off, every wire packet hits ip_forward once. The per-packet ceiling on a Zen 4 core is the same regardless of NIC quality. The kernel stack itself is the bottleneck, not the I/O fabric, when offloads are off.
[ 5] 0.00-30.00 sec 16.4 GBytes 4.70 Gbits/sec
4.70 Gbps. Same as K3 within noise. The mlx5 kernel I/O path is heavier per-packet than virtio's vhost-net path — heavy enough that the EFG ruleset cost is hidden inside the I/O cost. Both paths still cap at the single-core software ceiling.
| # | NIC | Offloads | Rules | Single-stream |
|---|---|---|---|---|
| K1 | ConnectX VF | on | none | 25.3 Gbps |
| K2 | ConnectX VF | on | EFG 5-chain | 21.1 Gbps |
| K3 | ConnectX VF | off | none | 4.74 Gbps |
| K4 | ConnectX VF | off | EFG 5-chain | 4.70 Gbps |
The pattern is clear: with offloads off, the I/O fabric does not matter. With offloads on, it does. Hardware offloads collapse the per-packet processing cost in the kernel's hot path. Without them, even the world's fastest networking silicon ceilings around 5 Gbps single-stream because the kernel itself is the limit.
The EFG configuration disables hardware offloads. By doing so, it makes its own silicon irrelevant.
VPP (Vector Packet Processor) is a userspace network dataplane built on DPDK that bypasses the kernel network stack entirely. It is what production-grade open-source routers (TNSR, DANOS) use, and it is what most enterprise-grade NFV appliances build on. We tested it both over virtio-net and over the ConnectX VFs.
A note on relevance to the EFG: Marvell ships a fully-supported DPDK Poll Mode Driver for the OCTEON family — the cnxk PMD, which covers CN9670 (in the EFG) and CN10K (in the UDM Beast). Marvell publishes reference architectures that combine OCTEON SoCs with VPP and DPDK-accelerated Suricata. Suricata itself has had native DPDK input mode since version 7.0 (released 2023). The components Ubiquiti would need to ship a userspace dataplane on the EFG are not research projects — they are vendor-blessed, production-deployed infrastructure that has been available for years.
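For reference, the VPP side of these tests needs only a handful of lines — a sketch for the ConnectX VF runs (the virtio runs are analogous), with the two gw-router VF PCI addresses handed to DPDK in startup.conf; the interface names shown follow VPP's PCI-based naming and will differ on other systems:

$ # startup.conf (excerpt):  dpdk { dev 0000:05:00.2  dev 0000:05:00.3 }
$ vppctl set interface state HundredGigabitEthernet5/0/2 up
$ vppctl set interface state HundredGigabitEthernet5/0/3 up
$ vppctl set interface ip address HundredGigabitEthernet5/0/2 10.10.10.1/24
$ vppctl set interface ip address HundredGigabitEthernet5/0/3 10.10.20.1/24
$ vppctl show ip fib                 # connected routes suffice for the two-subnet test
$ vppctl show runtime                # per-node Vectors/Call and Clocks/Packet quoted below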
[ 5] 0.00-30.00 sec 23.7 GBytes 6.78 Gbits/sec
6.78 Gbps. Roughly equal to ip_forward + flowtable in the equivalent kernel test. VPP's show runtime revealed the cause:
dpdk-input Vectors/Call: 0.05 Clocks/Packet: 1810
ip4-rewrite Vectors/Call: 15.24 Clocks/Packet: 24.2
0.05 vectors per call on the input side. DPDK's whole performance story is amortizing per-syscall and per-context-switch overhead across batches of ~32–256 packets. Virtio-net feeds packets to DPDK one at a time. The polling loop is essentially empty. Userspace dataplane only delivers its promised speedup when paired with a userspace-friendly I/O backend (vhost-user) or real hardware.
[ 5] 0.00-30.00 sec 54.9 GBytes 15.7 Gbits/sec
15.7 Gbps. Better than virtio-VPP (3× better) but actually worse than kernel-on-ConnectX with offloads on (25.3 Gbps). Why? VPP doesn't do GRO. It processes wire packets individually. With offloads off on the clients, every packet on the wire is 1500 bytes, and VPP processes ~1.4 million per second on one worker core.
The per-packet path through VPP is impressively cheap (ip4-input + lookup + rewrite + tx ≈ 78 cycles end-to-end on Zen 4) but it's still doing 40× more "work events" than the kernel + GRO setup, which only sees super-segments.
[ 5] 0.00-30.00 sec 124 GBytes 35.6 Gbits/sec
35.6 Gbps single-stream. Now the clients send fewer, larger TCP segments via TSO. ConnectX hardware can transmit each segment as a single frame on the wire (GSO/TSO offload at the NIC). VPP receives the resulting larger frames and forwards them with its low per-packet cost.
This is the headline number. 35.6 Gbps single-stream userspace dataplane forwarding on real silicon. Compared against the EFG's actual production performance on the same workload (~1 Gbps), this is the 15-35× ceiling that's possible with available open-source software on the same class of hardware.
VPP with show runtime during this test:
ip4-rewrite Vectors/Call: 7.54 Clocks/Packet: 38.6
lab-vlan20-tx Vectors/Call: 8.65 Clocks/Packet: 37.9
VPP itself is doing 75–80 cycles of work per packet. On a 5 GHz core that's ~16 ns per packet. The theoretical ceiling for VPP on this hardware is hundreds of Gbps. The measured 35.6 Gbps is bottlenecked on the clients (their ability to generate packets), not on VPP.
The lab numbers are on Zen 4 at 5+ GHz. To estimate what VPP+DPDK would achieve on the EFG's ARM Cortex-A72-class cores at 2.0 GHz, we lean on published Marvell numbers and the cycle-counting visible in show runtime:
- VPP per-packet cost in the lab: ~80 cycles on Zen 4 for full IP forwarding pipeline
- ARM Cortex-A72 vs Zen 4 IPC for this workload: ~3-4× lower
- Estimated cycles per packet on Octeon CN9670: 240-320 cycles
- At 2.0 GHz: 6.25-8.3 million packets per second per core
- At 1500-byte MTU: 9-12 Gbps single-stream per worker core
- The Octeon CN9670 has dedicated NIX hardware engines that can offload portions of this further
Marvell's own published cnxk PMD benchmarks show single-core forwarding rates of 15-30 Mpps (millions of packets per second) for simple L3 forwarding of small packets; at 1500-byte MTU, sustaining 18-36 Gbps per core requires only 1.5-3 Mpps, a small fraction of those benchmark rates. Across 4-6 worker cores (leaving control plane and inspection cores untouched), aggregate forwarding capacity easily reaches the 50 Gbps line rate of the EFG's two 25G ports, and single-stream throughput in the 15-25 Gbps range is realistic.
This means: on the same EFG silicon, with no hardware changes, a properly-architected DPDK dataplane should deliver 10-25× the inter-VLAN throughput the device achieves today, and eliminate the inspection-vs-forwarding CPU contention by giving each worker its own dedicated core with a vendor-supported PMD.
Many enterprise customers (especially in countries where fiber-to-the-business is delivered via GPON or XGS-PON with PPPoE authentication) report that even when they have a 10 Gbps fiber link, single-stream throughput across their EFG WAN tops out around 2–3 Gbps. This is a separate bottleneck from inter-VLAN routing, but it has the same architectural root cause — and arguably worse manifestation, because the PPPoE path forces the kernel through multiple softirq passes per packet.
PPPoE encapsulates IP traffic in PPP frames inside Ethernet (ether_type 0x8864/0x8863). Every WAN packet must:
- Be encapsulated/decapsulated by the pppoe.ko kernel module on every transit
- Have its effective MTU reduced to 1492 bytes (eight bytes of PPPoE + PPP header overhead), increasing per-packet overhead and forcing Path MTU Discovery
- Be processed by pppd in userspace for LCP/IPCP control plane and link state — packet flow events get notified to userspace
- Pass through an additional packet copy for encapsulation/decapsulation in software
- Bypass the kernel's flowtable fast-path — until kernel 6.2, nf_flow_table had no PPPoE support at all; flows traversing PPPoE could not be offloaded
- Make multiple distinct kernel-stack passes: ingress on the underlying VLAN (eth2.11) → softirq 1 → pppoe_rcv → ip_input → ip_forward → ip_output → softirq 2 → pppoe_xmit → egress on the same or different VLAN
Combined with the per-packet kernel forward cost we measured (4.74 Gbps ceiling on a single Zen 4 core with offloads off), the additional encap/decap work, and the multi-pass softirq pattern, PPPoE single-stream throughput is fundamentally bound by:
- Single-core ip_forward + pppoe.ko packet handling, which on a 2 GHz Octeon core lands in the 1-3 Gbps range — exactly what users report
- No flowtable PPPoE acceleration (kernel 5.15 doesn't have it; the EFG runs 5.15)
- Multiple softirq cores chained together, each handling part of the encap/decap/forward chain — this spreads CPU load across cores but adds latency and inter-core cache misses without actually speeding anything up
- No DPDK PPPoE termination (would require accel-ppp or VPP's native PPPoE plugin in userspace)
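One way to observe the multi-pass pattern directly, independent of top/mpstat sampling, is to diff the per-CPU NET_RX/NET_TX softirq counters across a short transfer window — a sketch:

$ grep -E "NET_RX|NET_TX" /proc/softirqs > /tmp/softirq.before
$ sleep 10        # run the single-flow transfer during this window
$ grep -E "NET_RX|NET_TX" /proc/softirqs > /tmp/softirq.after
$ diff /tmp/softirq.before /tmp/softirq.after   # several columns (cores) grow for a single flow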
The following data was captured during a single Netflix Fast.com speed test from a client device on the LAN, using the EFG's PPPoE WAN connection (Vivo XGS-PON, Brazilian ISP requiring PPPoE auth, link rated 1 Gbps — but the same software path would be used on a 10 Gbps link).
$ top -bn1 -d 1 | head -15
top - 03:43:15 up 8 days, 6:51, load average: 3.81, 2.62, 2.33
%Cpu(s): 5.5 us, 5.0 sy, 0.0 ni, 52.5 id, 0.0 wa, 1.3 hi, 35.6 si, 0.0 st
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
23 root 20 0 0 0 0 R 100.0 0.0 17:11.14 ksoftirqd/2
48 root 20 0 0 0 0 R 100.0 0.0 4:13.86 ksoftirqd/7
63 root 20 0 0 0 0 R 100.0 0.0 21:58.12 ksoftirqd/10
83 root 20 0 0 0 0 R 72.2 0.0 10:05.83 ksoftirqd/14
73 root 20 0 0 0 0 R 66.7 0.0 16:39.62 ksoftirqd/12
12 root 20 0 0 0 0 R 55.6 0.0 16:02.71 ksoftirqd/0
2491041 root 5 -15 1768064 1.1g 19584 S 44.4 1.8 6:18.31 Suricata-Main
3139 root 5 -15 383232 68736 28416 S 22.2 0.1 1495:37 ubios-udapi-ser
8596 root 20 0 20.1g 647232 85440 S 16.7 1.0 474:44 unifi-core
This is the smoking gun for PPPoE. Six different ksoftirqd threads are running at 55-100% simultaneously — cores 0, 2, 7, 10, 12, and 14 — all chewing through softirq work for what is fundamentally a single-flow workload (one TCP stream from Fast.com's backend server, through the PPPoE WAN, to the LAN client).
The reason this is even worse than the inter-VLAN smoking gun: inter-VLAN forwarding has one core saturated. PPPoE has multiple cores in continuous softirq because the path itself does multiple distinct kernel-stack passes per packet (eth2.11 ingress → pppoe_rcv → ip_input → ip_forward → ip_output → pppoe_xmit → eth2.11 egress). Each pass can land on a different core via softirq scheduling. The kernel is doing more total work per packet and spreading it across cores in a way that creates cache-coherence overhead between cores. It's the worst of both worlds — single-flow throughput limited by per-core ceiling, but multi-core CPU consumption.
The corresponding mpstat -P ALL output confirms the picture:
03:43:24 CPU %usr %sys %irq %soft %idle
03:43:24 all 5.65 2.74 1.01 32.49 58.00
03:43:24 0 50.55 0.00 0.00 49.45 0.00
03:43:24 2 0.00 0.00 1.00 81.00 18.00
03:43:24 6 0.00 0.00 0.00 61.62 38.38
03:43:24 10 1.01 1.01 2.02 66.67 29.29
03:43:24 14 0.00 0.00 0.00 85.29 14.71
03:43:24 17 0.00 0.00 0.00 100.00 0.00
Six cores at 50-100% softirq during a single Fast.com speed test. The aggregate %soft of 32.49% across 18 cores corresponds to ~5.85 cores fully consumed by softirq work — for one flow.
While ksoftirqd is burning multiple cores, the inspection processes are also running:
Suricata-Main 44.4% CPU
ubios-udapi-ser 22.2% CPU
unifi-core 16.7% CPU
ulogd 5.6% CPU
That's ~89% of one core equivalent of additional userspace work, often landing on the same cores doing softirq. The result: the cores doing softirq are being preempted by userspace, and the userspace processes are being preempted by softirq, in a continuous round-robin that prevents either from getting clean cycles.
$ lsmod | grep -i ppp
pppoe 327680 2
pppox 262144 1 pppoe
ppp_generic 327680 6 pppox,pppoe
slhc 262144 1 ppp_generic
$ ps -eo pid,pcpu,comm,args | grep pppd
2878806 0.0 pppd /usr/sbin/pppd call ppp1 nodetach
The full software PPPoE stack is loaded: pppoe.ko for PPPoE-specific encap, pppox.ko for PPP-over-X dispatch, ppp_generic.ko for the PPP framing engine, slhc.ko for VJ header compression, and pppd in userspace for control plane (LCP, IPCP, keepalives). Every WAN packet traverses all of these in sequence.
$ ethtool -k ppp1 | grep -E "tcp-segmentation|generic-(receive|segmentation)|large-receive|hw-tc-offload"
tcp-segmentation-offload: off
tx-tcp-segmentation: off [fixed]
generic-segmentation-offload: off [requested on]
generic-receive-offload: on
large-receive-offload: off [fixed]
hw-tc-offload: off [fixed]
The [fixed] flag means the driver reports the feature as permanently unavailable — these are hardcoded off in the ppp_generic driver. Even when generic-segmentation-offload was requested on (likely by a default setting), the kernel refused. Pseudo-interfaces like ppp1 inherently can't do hardware TSO/LRO because there's no hardware behind them — it's a software encap layer. That's normal Linux behavior, but it means every PPPoE WAN packet gets TX-fragmented and RX-aggregated in software before being handed to or received from the underlying VLAN.
Note that generic-receive-offload: on does work for the receive path — but TSO does not exist on the egress side, so every outbound packet traverses the kernel stack individually.
$ modinfo nf_flow_table 2>&1
modinfo: ERROR: Module nf_flow_table not found.
$ find /lib/modules/$(uname -r) -name "nf_flow_table*"
[no output]
Not just unloaded — the kernel module doesn't exist on the system. nf_flow_table.ko is not compiled into Ubiquiti's 5.15.72-ui-cn9670 kernel build, nor is it available as an external module file. Even with root access, a customer cannot load the module to enable flowtable acceleration. The fast-path infrastructure isn't shipped at all.
(Note: PPPoE-specific flowtable handling lives inline within nf_flow_table.ko itself, not as a separate module. There is no nf_flow_table_pppoe.ko in mainline Linux; the PPPoE protocol checks and nf_flow_pppoe_proto() helpers are part of nf_flow_table_ip.c and nf_flow_table_inet.c, both of which compile into nf_flow_table.ko and nf_flow_table_inet.ko respectively. So the absence of nf_flow_table.ko is the absence of all flowtable functionality, including PPPoE acceleration.)
$ ip link show ppp1
ppp1: <POINTOPOINT,MULTICAST,NOARP,UP,LOWER_UP> mtu 1492
$ ip link show eth2.11
eth2.11@eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500
ppp1 MTU 1492 (1500 - 8 byte PPPoE header), eth2.11 MTU 1500. Every payload is 8 bytes smaller than it could be on raw Ethernet, increasing packet count for the same throughput. Small effect compared to the per-packet kernel cost, but it adds up at line rate.
A reasonable question: PPPoE looks complicated, with control plane (PADI/PADO/PADR/PADS handshake, LCP/IPCP negotiation, keepalives, RADIUS) and dataplane (packet encap/decap) entangled. Can DPDK actually handle this, or is it fundamentally a kernel concept?
DPDK handles it well, but with a different architecture than the kernel uses.
The kernel's approach: pppoe.ko is a single module that does both control plane (handshake, LCP/IPCP, keepalives) and dataplane (encap/decap of every packet). Both run in softirq context, on whatever cores the kernel scheduler picks. The result is what we just measured: control plane and dataplane fighting for the same cores, with userspace processes (pppd) added on top.
DPDK splits this in two:
- Control plane stays in userspace as a regular process. Tools like accel-ppp (the most common open-source PPPoE BNG implementation, deployed by ISPs to terminate hundreds of thousands of sessions per box) handle PADI/PADO/PADR/PADS, LCP/IPCP, keepalives, session lifecycle, RADIUS authentication — everything that happens at session establishment or once per second per session. This doesn't need to be fast; it needs to be correct. accel-ppp added DPDK support around 2020 and is what ISP-grade BNGs use today.
- Dataplane runs as a fixed-cost pipeline stage. Once the session is up, every packet just needs an 8-byte header push (egress) or pop (ingress). In VPP (which has had a native PPPoE plugin since 2018), it's literally a node in the packet processing graph:
dpdk-input → ethernet-input → pppoe-input → ip4-input → ip4-lookup
→ ip4-rewrite → pppoe-encap → interface-output
The pppoe-input and pppoe-encap nodes are tiny — they push or pop 8 bytes, update some counters, and pass the packet to the next node in the same vector batch. Per-packet overhead for adding PPPoE to a VPP pipeline is roughly 30-50% above plain L3 forwarding, not 5-10× like the kernel softirq path imposes.
The critical difference: the kernel does control plane + dataplane on the same softirq path, blocking everything. DPDK does control plane in a slow, one-time-per-session userspace daemon, and dataplane as a small fixed-cost pipeline stage running on dedicated worker cores at line rate.
On Marvell silicon specifically: the Octeon CN9670 (the EFG SoC) is explicitly marketed by Marvell as a "Smart NIC and BNG" SoC. Their reference architectures combine:
- The cnxk DPDK PMD handling raw Ethernet frames at line rate from the NIX hardware engines
- accel-ppp running in userspace on dedicated control-plane cores, handling the PPPoE control plane
- Dataplane integrated into VPP's PPPoE plugin or a custom DPDK pipeline
- Suricata in DPDK mode tapping the dataplane for inspection on dedicated worker cores
ISPs deploying this stack on Octeon hardware regularly hit 40+ Gbps PPPoE termination per box with 100K+ concurrent sessions. Companies like Calix, Adtran, and a handful of NFV vendors ship enterprise BNGs based on exactly this silicon, doing exactly this PPPoE workload, at 25+ Gbps per port. This isn't research — it's commodity, vendor-blessed, production-deployed infrastructure that has existed for years.
Two independent fix paths exist:
Kernel path: Linux 6.2 (released February 2023) added PPPoE handling support to nf_flow_table. The PPPoE protocol checks and nf_flow_pppoe_proto() helpers were added inline within nf_flow_table_ip.c, nf_flow_table_inet.c, and nf_flow_table_offload.c — i.e., they compile directly into nf_flow_table.ko and nf_flow_table_inet.ko, not as a separate module. Once enabled, established TCP/UDP flows over PPPoE WAN can be offloaded to the same software fast-path as native L3 traffic, bypassing both pppoe.ko and the netfilter slow path for in-progress flows. Combined with hardware tc-flower offload on supported NICs, modern Linux distros (OpenWrt 23.05+, recent VyOS, MikroTik RouterOS 7) achieve near-line-rate PPPoE throughput on 10 Gbps links through software fast-path acceleration.
The EFG ships kernel 5.15 — released in late 2021, predating PPPoE flowtable acceleration by over a year. A kernel rebase to 6.6 LTS or later, with nf_flow_table.ko loaded and a flowtable directive added to nftables, would dramatically improve PPPoE WAN throughput without any hardware changes and without changing the dataplane architecture. The fix is a kernel module load and one nftables stanza.
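A minimal sketch of that change, assuming a 6.2+ kernel and borrowing interface names from this capture (eth2.11 as the PPPoE underlay; br10/br20 are hypothetical LAN bridges — the real device list would follow the EFG's actual interface layout):
$ modprobe nf_flow_table_inet
$ nft -f - <<'EOF'
table inet wan_fastpath {
    flowtable ft {
        hook ingress priority 0;
        devices = { "eth2.11", "br10", "br20" };
    }
    chain forward {
        type filter hook forward priority -100; policy accept;
        meta l4proto { tcp, udp } flow add @ft
    }
}
EOF
On a 6.2+ kernel the flowtable's IP hook recognizes the VLAN and PPPoE encapsulation on the listed lower devices, so established WAN flows skip pppoe.ko's slow path and the netfilter chains after their first few packets.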
DPDK path: Migrate the PPPoE termination from pppoe.ko in the kernel to accel-ppp + VPP's PPPoE plugin in userspace, on dedicated worker cores. This is the same architectural change as Fix 3 in Section 11 (DPDK + VPP for the dataplane), with PPPoE just being one more pipeline stage. Since Marvell ships full DPDK support for the Octeon CN9670 and publishes reference architectures combining DPDK + accel-ppp + VPP, this is integration work, not invention.
Using the same scaling from Section 7.1:
| Configuration | Single-stream PPPoE throughput | Notes |
|---|---|---|
| Current EFG (kernel 5.15, no flowtable, software pppoe.ko) | ~2-3 Gbps | per user reports; matches our multi-core ksoftirqd evidence |
| EFG + kernel 6.6 + nf_flow_table loaded + flowtable rule | ~5-8 Gbps | flowtable bypasses pppoe.ko + netfilter for established flows |
| EFG + kernel 6.6 + flowtable + hw-tc-offload | ~8-9.5 Gbps | near line-rate on 10G PPPoE links |
| EFG + DPDK (accel-ppp + VPP PPPoE plugin) | line rate on 10 Gbps (and 25G aggregate) | what ISP-grade BNGs achieve on this exact silicon |
The point: PPPoE performance is not a hardware problem either. It is the same architectural failure (single-core kernel forwarding without acceleration) compounded by an additional encapsulation layer that mainline Linux now supports accelerating, and that DPDK has handled at line rate for years. The same fixes apply, with PPPoE benefiting more than inter-VLAN does because the multi-pass softirq pattern is so much more expensive in the current implementation.
The analysis up to this point is grounded in measurements of one Ubiquiti device, the EFG, plus a controlled lab reproduction on x86. A reasonable counter-argument is that the EFG might be an outlier — older silicon, an early-cycle product, an aberrant kernel build that newer products have moved past.
To address this, the same diagnostic methodology was applied to a second-generation Ubiquiti gateway: the UDM Beast. This is a newer, higher-end Ubiquiti product running a different SoC family (Marvell Octeon CN10K instead of CN9K), a substantially newer kernel (6.6.46 vs 5.15.72), and — critically — a dedicated switching ASIC that the EFG does not have.
The diagnostic question was: does the newer silicon, the newer kernel, and the dedicated switching ASIC change the inter-VLAN routing architecture?
It does not. The UDM Beast exhibits the same architectural pattern, with one important new wrinkle: the ASIC is physically present, powered up, and processing billions of packets — but only for intra-VLAN switching. Inter-VLAN routing continues to go through the same kernel software path as the EFG.
$ cat /proc/cpuinfo | head
processor : 0
BogoMIPS : 100.00
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp
asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4
asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs sb
paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3
svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x0
CPU part : 0xd49
CPU revision : 0
Decoded:
- 0x41 = ARM Holdings (the actual ARM, not Marvell-customized)
- 0xd49 = ARM Neoverse N2
- 8 cores @ 2.5 GHz
- Armv9.0-A with the full extension set: SVE2, BF16, I8MM, RNG, BTI, pointer authentication
The kernel reports as 6.6.46-ui-cn10k. CN10K is Marvell's OCTEON 10, the next-generation networking SoC family after the EFG's CN9670. CN10K specifically uses ARM Neoverse cores instead of Marvell-customized ones, with substantially higher per-core throughput.
For comparison:
| Property | EFG | UDM Beast |
|---|---|---|
| SoC | Marvell Octeon CN9670 | Marvell Octeon CN10K |
| ARM core | Marvell custom (A57-class) | ARM Neoverse N2 |
| Clock | 2.0 GHz | 2.5 GHz |
| Cores | 18 | 8 |
| Per-core IPC vs Zen 4 | ~0.55× | ~0.75× |
| Predicted single-core throughput | 22% of Zen 4 | 38% of Zen 4 |
| RAM | 16 GB | 32 GB |
| Kernel | 5.15.72-ui-cn9670 | 6.6.46-ui-cn10k |
So per-core, the UDM Beast should single-thread inter-VLAN route at roughly 2× the EFG's rate. That's still a long way from saturating the 25G ports both devices ship with.
The UDM Beast has a switch0 interface that aggregates physical Ethernet ports eth2 through eth13 as named slaves:
$ ip link show | grep -E '@switch0' | head -8
eth2@switch0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 ...
eth3@switch0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 ...
eth4@switch0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 ...
eth5@switch0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 ...
eth6@switch0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 ...
[...]
The switch0 virtual interface itself reports massive packet throughput:
$ ip -s link show switch0
switch0: ...
RX: bytes packets ...
1350339982078 1274250904 ...
That's 1.27 billion packets / 1.35 TB processed by the switch interface. This traffic does not appear on any individual physical Ethernet's RX/TX counters in the same volume — the kernel sees the aggregate but not the per-port breakdown, because the per-port traffic is being switched in hardware below the kernel's visibility.
A daemon process is actively managing the ASIC:
$ ps -eo pid,pcpu,comm | head -5
2617 root 75.0 cpss-manager
2913 root 0.0 cpss-app
$ ls -la /proc/$(pgrep -f 'cpss-app.*l3')/fd
0 -> /dev/null
3 -> /dev/shm/CPSS_SHM_MALLOC0 ← shared memory with cpss-manager
4 -> /dev/mvdma ← Marvell DMA device (direct ASIC access)
5 -> /sys/devices/.../0008:01:00.0/resource0 ← raw PCIe BAR0 of switch ASIC
6 -> /sys/devices/.../0008:01:00.0/resource2 ← PCIe BAR2
$ cat /proc/$(pgrep -f 'cpss-app.*l3')/cmdline
/usr/bin/cpss-app -l3
CPSS stands for CPU Subsystem Services — Marvell's proprietary management framework for their Prestera switch ASIC family. The cpss-app -l3 invocation suggests the ASIC supports L3 forwarding capability (the -l3 flag), and the daemon has direct memory-mapped PCIe access to the chip via /dev/mvdma and the resource0/resource2 BARs.
Notably, cpss-manager consumes 75% of one CPU core continuously — on an 8-core box, that's roughly 9% of the machine's total CPU spent just on switch ASIC management overhead. This is not anomalous load; it's the steady-state cost of managing the ASIC through Marvell's proprietary framework rather than the kernel's switchdev infrastructure.
switchdev is the Linux kernel framework for offloading network functions to switching hardware. When properly engaged, it lets tc flower rules, bridge VLAN filtering, and L3 routing be programmed into the switch ASIC's hardware tables, with the kernel staying out of the per-packet path entirely.
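For contrast, a sketch of what an engaged switchdev/tc-offload path looks like on a driver that implements it (port names swp1/swp2 and the subnet are illustrative; skip_sw forces the rule into hardware or fails with an error instead of silently falling back to software):
$ ethtool -K swp1 hw-tc-offload on
$ tc qdisc add dev swp1 clsact
$ tc filter add dev swp1 ingress protocol ip flower skip_sw \
    dst_ip 10.0.20.0/24 action mirred egress redirect dev swp2
$ tc -s filter show dev swp1 ingress     # the rule is listed with 'in_hw', not 'not_in_hw'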
On the UDM Beast, switchdev is not engaged. Every interface — physical port, bridge, virtual switch interface — reports the same offload flag pattern:
$ for iface in br0 br10 br20 br30 br50 br199 br200 \
eth1 eth6 eth8 eth9 eth11 eth12 eth13 switch0; do
echo "=== $iface ==="
ethtool -k $iface 2>/dev/null | grep -E "hw-tc-offload|l2-fwd-offload|rx-vlan-filter"
done
=== br0 ===
rx-vlan-filter: off [fixed]
l2-fwd-offload: off [fixed]
hw-tc-offload: off [fixed]
=== br10 ===
rx-vlan-filter: off [fixed]
l2-fwd-offload: off [fixed]
hw-tc-offload: off [fixed]
[... same pattern repeats for every interface ...]
=== switch0 ===
rx-vlan-filter: off [fixed]
l2-fwd-offload: off [fixed]
hw-tc-offload: off
The [fixed] qualifier is critical. It means the kernel driver does not even expose these features as toggleable. Compare this to a Linux box with a properly-supported switchdev driver (such as Mellanox/NVIDIA Spectrum or Marvell Prestera with the upstream prestera driver), where these flags would read on or be administrator-toggleable.
[fixed] here means the driver doesn't implement the switchdev API. The Linux kernel cannot push hardware-offloadable rules to the ASIC, because there's no driver code path to do so.
Bridge phys_switch_id files are present on every interface but are empty:
$ for nic in $(ls /sys/class/net/); do
sw=$(cat /sys/class/net/$nic/phys_switch_id 2>/dev/null)
[ -n "$sw" ] && echo "$nic -> $sw"
done
[no output]
A populated phys_switch_id is how the kernel identifies multiple netdevs as belonging to the same hardware switch — a precondition for switchdev L2 forwarding offload. The files exist (so the netdev infrastructure has been initialized) but are unset, so the kernel does not know that eth2, eth3, etc. are ports of a common switch. Without that knowledge, no offload decision is possible.
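For comparison, what a switchdev-capable driver reports (port name and ID value below are illustrative, not from the UDM Beast):
$ cat /sys/class/net/swp1/phys_switch_id
f4521440fba00ec0
$ ip -d link show swp1 | grep -o 'switchid [0-9a-f]*'
switchid f4521440fba00ec0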
tc filter rules attached to the WAN interfaces explicitly report not_in_hw:
$ tc -s filter show dev eth8 ingress
filter parent ffff: protocol all pref 49152 u32 chain 0 fh 800::800 ... not_in_hw
match 00000000/00000000 at 0
action order 1: connmark zone 0 pipe
Sent 67350180932 bytes 70901285 pkt
action order 2: mirred (Egress Redirect to device ifbeth8) stolen
Sent 67350180932 bytes 70901285 pkt
The not_in_hw flag is the kernel telling you, in plain text: this filter rule is running in software. Each WAN-side packet is being:
- Pulled from the NIC into a kernel skb
- Classified by a u32 filter on the CPU
- Connmark-zoned on the CPU
- Mirrored to an ifbeth8 IFB (intermediate functional block) for traffic shaping on the CPU
- Then run through the iptables FORWARD chain on the CPU
- Then forwarded out
The aggregate counter — 67.3 GB / 70.9 million packets — represents traffic that has gone through that entirely-CPU path on the primary WAN. None of it benefited from the switch ASIC sitting on the same PCB.
$ iptables -L FORWARD -nv | head -10
Chain FORWARD (policy ACCEPT)
pkts bytes target
358M 755 GB ALIEN
366M 769 GB TOR
557M 1159 GB IPS
557M ----- UBIOS_FORWARD_JUMP
Same multi-chain pattern as the EFG: ALIEN for general matching, TOR for Tor-related rules, IPS invoking Suricata-related work, and UBIOS_FORWARD_JUMP chaining into Ubiquiti's deeper rule set. Several hundred million packets have traversed each chain.
A more revealing chain on the UDM Beast is UBIOS_PREROUTING_PBR (policy-based routing), which is even more extensive than what the EFG ships. It contains numerous ipset matches, NFLOG actions, MARK manipulations, and — notably — L7 application classification tags written into ipsets by the userspace dpi-flow-stats daemon:
MARK ... cat 3 app 156 cat 3 app 150 cat 20 app 186 ...
Every packet hits this chain. The DPI pipeline that consumed 39.6% of one core continuously on the EFG (Section 4.5) is doing the equivalent work here, with the additional cost of L7 application-level matching against ipsets that userspace populates.
conntrack continues to track every flow:
$ wc -l /proc/net/nf_conntrack
31245 /proc/net/nf_conntrack
$ cat /proc/sys/net/netfilter/nf_conntrack_max
2097152
31,245 active conntrack entries on a low-traffic snapshot, against a 2-million-entry maximum. Every flow — including inter-VLAN — has a conntrack entry that is created and updated on every packet. No flowtable shortcut, no offload to the ASIC.
$ devlink dev info
pci/0002:01:00.0: driver rvu_af
pci/0002:06:00.0: driver rvu_nicpf
pci/0002:01:00.1-7, 01:01.0-7, 01:02.0: driver rvu_nicvf (16 VFs)
pci/0002:20:00.0: driver rvu_cptpf
pci/0002:02:00.0, 03:00.0, 04:00.0, 05:00.0: driver rvu_nicpf
These are OcteonTX RVU (Resource Virtualization Unit) drivers — the standard upstream Linux drivers for Marvell's CN10K NIC silicon. The CN10K platform supports DPDK, hardware flow tables, and XDP offload through these drivers when properly configured. None of those features are enabled on the UDM Beast.
The drivers are loaded as plain netdevs with no offload features active — the tc-offload flag fixed-off, no xdp programs attached, no flow_offload infrastructure engaged. The hardware capability is present at the silicon level. The software does not use it.
Putting it all together, the path of an inter-VLAN packet on the UDM Beast is:
1. Packet arrives on a switch port (e.g., a host on VLAN 10 sending to VLAN 20)
2. ASIC receives the packet, recognizes it as targeting a different VLAN (or just doesn't have an L3 entry programmed)
3. ASIC punts the packet to the CPU through the switch0.10 virtual port
4. Linux bridge br10 receives the punted frame
5. Kernel routing decision: ip_forward looks up the destination, decides it goes out br20
6. Packet traverses the iptables FORWARD chains (ALIEN, TOR, IPS, UBIOS_FORWARD_JUMP, UBIOS_PREROUTING_PBR)
7. Conntrack updates the flow entry
8. dpi-flow-stats classifies the packet at L7 and may update an ipset
9. The packet is sent via br20 to switch0.20
10. ASIC switches the now-tagged-VLAN-20 frame out the destination port
Steps 4 through 9 happen on a single CPU core in softirq context. The ASIC is bypassed for the routing decision; it only handles the L2 hop on either side of the kernel detour.
This is the same architectural pattern as the EFG, with two differences: the CPU is faster (so the bottleneck moves to a higher floor — perhaps 2-4 Gbps single-stream instead of 1-2), and there is a dedicated switching ASIC sitting unused for the routing path.
The EFG findings could be argued away as a one-off — older silicon, an early product, a lapsed kernel build. The UDM Beast diagnostic forecloses that argument:
- Different SoC family (CN10K, not CN9K)
- Different ARM cores (Neoverse N2, not Marvell-custom)
- Newer kernel by 18 months (6.6, not 5.15)
- Dedicated switching ASIC present (Prestera-class via PCIe)
- Different driver stack (rvu_* family on a 6.6 kernel, with CN10K-era features available)
And yet:
- Same iptables architecture
- Same conntrack-on-every-packet pattern
- Same userspace DPI sitting on the data path
- Same Suricata IPS in pcap mode
- No switchdev offload (across an entire generation of new silicon)
- No flowtable
- No DPDK
- Software-only L3 path — proven by the not_in_hw filter tags and the 67 GB / 70.9M packet counter on the WAN's CPU-mirred chain
This is a multi-generation pattern. Ubiquiti has shipped at least two generations of silicon with substantially different capabilities, both running fundamentally the same software stack, both leaving the silicon's hardware acceleration unused for the inter-VLAN forwarding path. Whatever is preventing the architectural fix is not a hardware constraint and not a kernel-version constraint. It is a software architecture decision that has persisted across product cycles.
The performance gap between the EFG and UDM Beast is roughly the per-core IPC ratio of their CPUs — exactly what you'd predict if the bottleneck were the kernel forwarding path running on one core. A faster CPU moves the floor up. It does not fix the architecture.
A reasonable counter-argument to everything above would be: "Maybe building a hardware-accelerated forwarding integration on a network gateway is just hard, and Ubiquiti hasn't gotten there yet on any product." That argument fails the moment you look at the UCG Fiber.
The UniFi Cloud Gateway Fiber (UCG-Fiber) is one of Ubiquiti's compact desktop gateways, retailing at approximately $279, advertised at "5 Gbps IDS/IPS throughput" with three 10 Gbps ports and four 2.5 Gbps ports. It runs on the Qualcomm IPQ9574 SoC — quad-core ARM Cortex-A73 at 2.2 GHz with 3 GB RAM. A reader of the gist provided diagnostics from a production UCG Fiber:
$ uname -a
Linux UCG-ironionet 5.4.213-ui-ipq9574 #5.4.213 SMP PREEMPT Wed Apr 29 ... aarch64 GNU/Linux
$ ls /usr/share/ubios-udapi-server/
ips/ ips_6/ ips_8/
$ ps -ef | grep suricata
/usr/share/ubios-udapi-server/ips_6/suricata/bin/suricata --pcap
--pidfile /run/suricata.pid
-c /usr/share/ubios-udapi-server/ips_6/config/suricata_ubios_high.yaml
The Suricata setup is identical to the EFG: same ips_6/ and ips_8/ directory structure with EOL Suricata 6.0.12 active and Suricata 8.0.2 staged-but-unused, same --pcap runtime, same suricata_ubios_high.yaml configuration. The IDS/IPS architecture is portable across product lines.
But the data path is completely different:
$ lsmod | grep nss
qca_nss_sfe 1273856 1 ecm
qca_nss_ppe_lag 20480 0
qca_nss_ppe_ds 24576 0
qca_nss_ppe_qdisc 102400 0
qca_nss_ppe_pppoe_mgr 16384 0
pppoe 24576 3 qca_nss_sfe,ecm,qca_nss_ppe_pppoe_mgr
qca_nss_ppe_bridge_mgr 32768 0
qca_ovsmgr 45056 3 qca_mcs,ecm,qca_nss_ppe_bridge_mgr
qca_nss_ppe_vlan 49152 2 qca_nss_ppe_lag,qca_nss_ppe_bridge_mgr
qca_nss_ppe_vp 69632 3 qca_nss_ppe_vlan,ecm,qca_nss_ppe_ds
qca_nss_dp 147456 2 qca_nss_ppe_vp,qca_nss_ppe_ds
bonding 135168 3 qca_nss_ppe_vlan,ecm,qca_nss_ppe_pppoe_mgr
qca_nss_ppe 380928 9 qca_nss_dp,qca_nss_ppe_vp,qca_nss_ppe_vlan,...
qca_ssdk 2191360 4 qca_nss_dp,qca_nss_ppe
This is Qualcomm's NSS (Network Sub-System) PPE (Packet Processing Engine) stack — a hardware data path running on a dedicated network coprocessor inside the IPQ9574 SoC, separate from the main ARM cores. The relevant pieces:
- qca_nss_ppe — the core PPE driver, the umbrella module
- qca_nss_ppe_pppoe_mgr — hardware-offloaded PPPoE session management
- qca_nss_ppe_vlan — hardware VLAN tag handling
- qca_nss_ppe_bridge_mgr — hardware L2 bridging
- qca_nss_ppe_lag — hardware link aggregation
- qca_nss_ppe_ds — direct switching (port-to-port without CPU involvement)
- qca_nss_ppe_vp — virtual ports (for VLAN sub-interfaces)
- qca_nss_ppe_qdisc — hardware queue discipline (QoS in hardware)
- qca_nss_sfe — Shortcut Forwarding Engine, the "fast path" that bypasses the kernel for established flows
- ecm — Enhanced Connection Manager, the connection-manager kernel module that learns flows and programs them into the SFE/PPE
- qca_ssdk — SSDK (Switch SDK) — direct switch ASIC programming
pppoe.ko is loaded with three holders (qca_nss_sfe,ecm,qca_nss_ppe_pppoe_mgr), meaning the standard Linux PPPoE module is integrated with the hardware acceleration path.
And it's actually working in production:
$ cat /sys/kernel/debug/qca-nss-ppe/stats/common_stats | grep flows
[v4_l3_flows]: 174
[v4_l2_flows]: 0
[v4_vp_wifi_flows]: 0
[v4_ds_flows]: 0
[v6_l3_flows]: 0
[v6_l2_flows]: 0
174 IPv4 L3 flows are currently offloaded to hardware on this device. The kernel saw the first few packets of each of these flows, ECM programmed the flow into the NSS coprocessor, and the NSS hardware is now forwarding subsequent packets without CPU involvement at all — including (per the loaded modules) flows that traverse VLAN boundaries, that involve PPPoE encapsulation, and that need to be NATed.
This is exactly the architectural pattern the writeup recommends Ubiquiti adopt for the EFG's Octeon silicon: first-packet through the kernel for policy decisions; subsequent packets fast-pathed through hardware-accelerated dataplane workers. On the Octeon, the equivalent is DPDK + VPP using the Marvell-published reference architecture. On the Qualcomm IPQ9574, the equivalent is NSS/PPE + ECM using Qualcomm's reference architecture. The pattern is the same; only the silicon vendor differs.
The UCG Fiber is the only device in Ubiquiti's gateway portfolio that engages its silicon's hardware acceleration for forwarding. Every other gateway in the lineup — across multiple silicon vendors and multiple SoC generations — has hardware acceleration available but unused.
| Device | Price | SoC | Hardware acceleration available? | Hardware acceleration engaged? |
|---|---|---|---|---|
| UCG Fiber | ~$279 | Qualcomm IPQ9574 | NSS/PPE/SFE/ECM (Qualcomm) | Yes — 174 flows in HW |
| EFG | ~$2,000 | Marvell Octeon CN9670 | DPDK + hardware NIX engines | No |
| UDM Beast | varies | Marvell Octeon CN10K + Prestera ASIC | DPDK, switchdev offload, dedicated switch ASIC | No (ASIC unused for L3; hw-tc-offload: off [fixed]) |
| UDM Pro / UDM SE / UDM Pro Max | varies | Various ARM | TSO/RSS/limited offload | No |
| UCG Max / UCG Ultra / UXG Pro / UXG Lite | varies | Various ARM | TSO/RSS/limited offload | No |
| UDR / UDR-7 / UDR-5G-Max | varies | Mediatek/Qualcomm | Vendor-specific offload | No (per available teardowns) |
The IDS/IPS architecture is identical across all of them — passive Suricata in --pcap, retroactive 3-tuple ipset blocking, EOL Suricata 6.0.12 active with 8.0.2 staged-but-unused. That stack is portable across products and silicon vendors. Ubiquiti has clearly invested in keeping the IDS/IPS architecture consistent across the lineup.
The dataplane integration is the opposite. It got done for exactly one product. The interesting question is why.
The most likely answer is that Ubiquiti ships whatever the silicon vendor's BSP provides, and only the Qualcomm BSP includes a pre-built hardware fast-path.
Qualcomm's IPQ ARM networking platform ships an OpenWrt-based BSP where the NSS/PPE/SFE/ECM stack is pre-integrated into the kernel network stack, ready to use out of the box. The router vendor compiling the BSP gets hardware-accelerated forwarding without doing the dataplane integration themselves — Qualcomm did the integration as part of the BSP. The hardware fast-path "just works" if you ship the BSP as-is.
Marvell's Octeon BSP, by contrast, ships DPDK as a separate userspace SDK. The Octeon kernel BSP gives you the NIC drivers and basic packet I/O, but the high-performance dataplane is a separate layer that the device vendor has to build themselves. Marvell publishes reference architectures (DPDK + VPP + Suricata-on-DPDK) and the silicon supports them, but actually shipping a working DPDK-accelerated gateway requires the vendor to engineer the dataplane application — write or port a control plane, integrate with the management UI, handle config persistence, integrate with the IDS/IPS pipeline, and so on. That's substantially more engineering work than just shipping a BSP.
Same pattern likely applies to the other ARM SoCs in the lineup. If hardware acceleration on a given silicon requires the vendor to build the dataplane, Ubiquiti hasn't built it. If hardware acceleration is pre-built into the BSP, Ubiquiti ships it.
This isn't "Ubiquiti can't build hardware-accelerated dataplanes" — they ship one on the UCG Fiber and it works. It's "Ubiquiti ships whatever the silicon vendor's BSP provides, and doesn't engineer dataplane integration themselves." Where the BSP includes a hardware fast-path, the customer gets one. Where the BSP doesn't — including on the flagship $2,000 EFG and the next-generation UDM Beast — the customer doesn't.
This forecloses the most charitable defense of the EFG's design. The argument would have been: "Building hardware-accelerated forwarding on a network gateway is genuinely difficult, and Ubiquiti hasn't gotten there yet." That argument fails on the UCG Fiber, where they did get there — but only because Qualcomm did the work. The corrected version of the argument would be: Ubiquiti's engineering investment goes into the IDS/IPS pipeline (consistent across products, even on EOL Suricata) and the UI/management plane (consistent across products) — but not into per-silicon dataplane engineering. When the silicon vendor ships a working dataplane in the BSP, customers benefit. When the silicon vendor leaves it as the device vendor's responsibility, Ubiquiti customers get a Linux kernel network stack instead.
The actual situation, then, is even more pointed than "the cheaper product outperforms the flagship." It's: the cheaper product outperforms the flagship because Qualcomm did dataplane engineering Ubiquiti didn't do for Marvell. The performance differential isn't a Ubiquiti achievement on the UCG Fiber — it's a Qualcomm achievement that Ubiquiti benefited from by using their BSP.
The EFG, the UDM Beast, and the rest of the lineup are running silicon whose vendors expected the device builder to engineer the dataplane. Ubiquiti didn't, on any of them.
The performance numbers reflect this. The UCG Fiber advertises 5 Gbps IDS/IPS throughput at $279 because the Qualcomm dataplane is doing the heavy lifting. The EFG, positioned as a flagship and costing ~7× as much, struggles to deliver 1-2 Gbps single-stream inter-VLAN routing — let alone with IPS enabled — because Ubiquiti is running the kernel's general-purpose network stack on the Marvell silicon instead of the dataplane Marvell expected them to build.
Putting together the EFG diagnostics and the lab measurements, the findings are unambiguous.
Finding 1: The kernel network stack on a single core has a ceiling around 5 Gbps single-stream when offloads are off, regardless of NIC
Evidence: virtio-net (4.95 Gbps) and ConnectX VF (4.74 Gbps) measure within experimental error on the same kernel with offloads disabled. The Zen 4 core is identical in both tests. The difference between 4.95 and 4.74 is in the noise.
Implication for the EFG: their 2 GHz Octeon ARM core has its own per-cycle ceiling that's 3-5× slower than Zen 4 for this workload, putting the EFG kernel forwarding ceiling at ~1.0–1.5 Gbps. Reported user numbers match this range. The hardware silicon is not what's limiting them; the per-core kernel stack is.
Finding 2: Hardware and software offloads account for a 3.5-5.3× throughput swing, and the EFG hard-codes them off
Evidence:
- virtio kernel forwarding: 4.95 Gbps (off) → 17.4 Gbps with flowtable (on) — 3.5× swing
- ConnectX VF kernel forwarding: 4.74 Gbps (off) → 25.3 Gbps (on) — 5.3× swing
EFG state: hw-tc-offload: off [fixed], generic-receive-offload: off. Hard-coded off in the firmware build.
Finding 3: The 5-chain iptables FORWARD pattern costs roughly half your throughput when offloads are also off
Evidence:
- virtio-net + offloads off: 4.95 Gbps → 2.36 Gbps when EFG-style rules are applied (52% drop)
- ConnectX VF + offloads on: 25.3 Gbps → 21.1 Gbps when applied (17% drop, hidden by GRO)
EFG state: identical rule structure (ALIEN → TOR → IPS → UBIOS_FORWARD_JUMP → user → default). Confirmed by direct iptables diagnostic showing 874 million packets having traversed UBIOS_FORWARD_JUMP in 8 days.
Finding 4: The nftables flowtable fast-path recovers most of the lost throughput, and the EFG does not ship the module
Evidence:
- virtio + EFG rules: 2.36 Gbps → 7.05 Gbps with flowtable added (3.0×)
- virtio + flowtable + offloads on: 17.4 Gbps (7.4× over 2.36 baseline)
EFG state: nf_flow_table module not loaded. nft list flowtables is empty. The kernel module isn't even installed on the device. This is a one-line configuration change in nftables that Ubiquiti could ship and immediately triple single-stream inter-VLAN performance.
Finding 5: Conntrack helpers cost per new connection, not per packet on established flows
The popular description: "Every packet is inspected by every loaded helper." This is approximately wrong. The actual cost depends on which phase of a flow the packet belongs to.
Phase 1 — New connection (SYN packet, first packet of a flow):
When conntrack creates a new entry for a flow, it walks nf_ct_helper_hash — a hash table keyed by L4 protocol + port — to determine if any registered helper applies. For TCP/21 (FTP control), it finds the FTP helper and attaches it to the conntrack entry. For TCP/443 (HTTPS), it finds nothing and attaches no helper. The per-new-connection cost is one hash lookup against the helper registry. Small but real.
This phase also touches nf_ct_expect_hash — the expectations table — to check if this new flow matches a previously-expected data connection (e.g., the data port that an active FTP control session announced via PORT or PASV). Empty expectations table = essentially zero cost; an active expectations table = small additional lookup.
Phase 2 — Established flow (every subsequent packet):
Once a flow has a conntrack entry, the per-packet helper logic in nf_conntrack_in() reads:
help = nfct_help(ct);                  // pointer load from conntrack entry
if (help && help->helper)              // both NULL for non-helper flows
    help->helper->help(skb, ct, ...);

For a flow with no helper attached — the vast majority of traffic, since helper-relevant ports are rare — this is two pointer loads and a branch. Modern CPUs predict the not-taken branch perfectly. The cost on non-helper flows is essentially zero.
For flows that DO have a helper attached (e.g., active FTP control connection, ongoing SIP call), the helper's ->help() callback runs on every packet to inspect for protocol events (PORT command, RTP setup, etc.). This is genuine per-packet cost, but it only applies to flows on helper-recognized ports.
Why iperf3 throughput doesn't change when helpers are disabled: An iperf3 inter-VLAN test uses a single TCP connection on iperf3's port (5201 by default). That port is not a helper-recognized port. The connection has no helper attached. Phase 2's two-pointer-load-and-branch is essentially free. Disabling helpers via the UI removes the modules from memory, eliminating Phase 1 lookup cost on new connections — but it does not change anything in Phase 2 for non-helper flows.
Why helpers nonetheless matter at scale: An enterprise router doing ~10,000 new connections per second — driven by lots of short HTTP requests, DNS resolutions, and other transient flows — pays the Phase 1 helper-hash-lookup tax 10,000 times per second. Removing helpers eliminates that. It's not a per-packet win on data flows, it's a per-new-connection win.
The proper fix is not removing helpers: a correctly-architected router uses the netfilter flowtable for the data path. With flowtable, established flows bypass the entire netfilter chain (helpers included) and go through the offloaded fast path. Helpers continue to run on connection setup and on the control connection of helper protocols (e.g., FTP control), but the data connection of those protocols can be offloaded. You get full helper functionality and zero per-packet cost on data flows, simultaneously. This is what mainstream Linux distributions ship in 2026.
The EFG's kernel does not have flowtable compiled in (Section 12).
Four implementation approaches that would do this correctly:
- nftables with explicit per-flow helper attachment (the modern, correct approach). Helpers attach only to flows matching explicit nftables rules — no global helper auto-attach, zero cost for any flow not matching the rule. Requires migrating from iptables to nftables (a minimal sketch follows this list).
- Userspace conntrack helpers via netlink (kernel 3.6+). The kernel forwards control packets to a userspace daemon, which parses protocols and inserts expectations back via netlink. Pros: kernel stays small, helper bugs don't crash the kernel, helpers can be updated independently of the kernel. Cons: control-plane latency increase.
- Don't NAT helper-protocol traffic at all. Modern protocols handle NAT traversal in the application layer (FTP passive mode, SIP+STUN/ICE, WebRTC). The kernel doesn't need to do ALG. Most enterprise gateways in 2026 have moved this direction; kernel helpers are legacy.
- Keep helpers, add flowtable (the practical fix for an existing iptables-based system). Helpers run on connection setup and helper-protocol control channels; flowtable handles the data path of every other flow. Best compatibility with existing rule sets.
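A minimal sketch of the first approach, assuming an nftables-based ruleset and the in-kernel FTP helper (table, object, and chain names are illustrative):
$ nft add table inet helpers
$ nft add ct helper inet helpers ftp-std '{ type "ftp" protocol tcp; }'
$ nft add chain inet helpers pre '{ type filter hook prerouting priority 0; }'
$ nft add rule inet helpers pre tcp dport 21 ct helper set "ftp-std"
Only traffic explicitly matched by the rule (here, TCP/21) ever gets a helper attached; every other flow pays nothing.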
EFG state: A "Firewall Connection Tracking" toggle in the UniFi controller's Gateway settings exposes individual checkboxes for FTP, H.323, SIP, GRE, PPTP, and TFTP helpers. Disabling them all unloads the helper modules entirely — which addresses Phase 1 lookup overhead on new connections but does nothing for the bigger architectural issues. The toggle's existence confirms that Ubiquiti's engineering team is aware that helpers cost something. They have implemented a partial fix (the toggle) instead of the proper fix (flowtable). The proper fix would require shipping nf_flow_table.ko, which they have chosen not to do (Section 12).
Finding 6: Adding cores does not fix single-stream throughput; RSS binds each flow to one core
Evidence: 8 parallel streams with the EFG ruleset reach 11.4 Gbps aggregate (~1.4 Gbps per stream). Single stream caps at 2.36 Gbps. The EFG's mpstat shows all 18 cores idle except the one with the active flow.
EFG state: 18 cores, but RSS hashes a single TCP 5-tuple to one queue, which binds to one core. Adding cores to a kernel-based router cannot fix single-flow performance. Faster per-core, fewer per-packet steps, hardware offload, or a userspace dataplane (which can poll across worker cores) can.
Finding 7: DPDK on the same silicon delivers 10-25× the throughput, and the vendor ships full DPDK support
Evidence:
- Lab VPP/DPDK on ConnectX with offloads: 35.6 Gbps single-stream (15× over the EFG-style baseline)
- Marvell's published cnxk PMD benchmarks: 18-36 Gbps single-core on CN9670-class silicon
- Suricata 7.0+: native DPDK input mode shipped 2023
- VPP: native cnxk plugin shipped 2020
- The full reference architecture (DPDK + VPP + Suricata-on-DPDK) is published by Marvell and field-deployed by NFV vendors
EFG state: zero DPDK. The cnxk PMD is not loaded. Suricata runs in pcap mode (per-packet kernel→userspace copy) instead of DPDK mode. Ubiquiti would lose nothing by adopting DPDK — their primary inspection workload (Suricata) supports it, their silicon vendor supports it, and the resulting performance on the same hardware would be 10-25× higher.
Finding 8: Userspace inspection processes contend with the forwarding core, with the contention pattern depending on per-process CPU pinning
Evidence (EFG): dpi-flow-stats at 39.6% CPU (with no CPU pinning — affinity mask 0x3ffff = all 18 cores allowed) + Suricata-Main at 6.9% + conntrackd at 7.0%. Suricata is configured with explicit per-thread pinning: management thread on core 0, verdict thread on core 1, workers across all cores (Section 4.5 details).
The single-stream contention story specifically: A single TCP flow's softirq lands on whichever core RSS hashes the 5-tuple to (typically core 0 by default on the EFG). On that core: Suricata's management thread is pinned there, Suricata workers may be scheduled there, and dpi-flow-stats can run there with no restriction. All of these contend with forwarding softirq for the same core's cycles.
Evidence (lab): a deliberate spinner pinned to a non-forwarding core had no effect on single-stream throughput (correctly isolated). When CPU contention is on the forwarding core, throughput drops proportionally.
Implication: even if Ubiquiti fixed every other issue, single-stream throughput would still depend on which userspace processes happen to land on the same physical core as the active flow's softirq. Mitigations include: explicit taskset/cgroup pinning of dpi-flow-stats off the dominant RSS hash core; relocating Suricata's management-cpu-set off core 0 (one-line YAML change); RSS reconfiguration to hash flows away from the cores Suricata pins to. The proper fix is Suricata in DPDK mode on dedicated worker cores (supported since Suricata 7.0, 2023), which moves all per-packet inspection out of the kernel path entirely.
Finding 9: Per-VLAN bridges instead of vlan-aware single bridge prevent kernel fast-path optimization
Evidence (EFG): br0, br3, br5, br6, br7, br254, br1111 — one bridge per VLAN. Inter-VLAN traffic must traverse multiple bridge hops plus a kernel L3 lookup.
Lab equivalent: vmbr1 with VLAN-aware mode and bridge VID filtering allows a single bridge to handle all VLANs. With flowtable on top, established flows skip the bridge slow path entirely.
Implication: even without flowtable, switching to a vlan-aware bridge architecture would simplify the data path and enable bridge VID hardware offload paths that the current per-bridge structure cannot use.
Finding 10: PPPoE WAN performance is bottlenecked by the same kernel stack, with additional encapsulation cost — and worse multi-core spread
Evidence (deployment reports): enterprise customers on 10 Gbps PPPoE fiber consistently report 2-3 Gbps single-stream WAN throughput on the EFG.
Evidence (live capture during a Netflix Fast.com test on a production EFG): six different ksoftirqd kernel threads simultaneously consuming 55-100% CPU (cores 0, 2, 7, 10, 12, 14), with concurrent userspace inspection load (Suricata 44%, ubios-udapi-ser 22%, unifi-core 16%) competing for the same cores. The PPPoE encap/decap path forces multiple kernel-stack passes per packet, each potentially landing on a different core, multiplying total CPU consumption while not improving single-flow throughput.
Evidence (mainline Linux): kernel 6.2+ ships PPPoE handling within nf_flow_table.ko — the protocol checks and helpers (nf_flow_pppoe_proto, __nf_flow_pppoe_proto, ETH_P_PPP_SES matching) are inline within nf_flow_table_ip.c, nf_flow_table_inet.c, and nf_flow_table_offload.c, all of which compile into the existing nf_flow_table.ko and nf_flow_table_inet.ko modules. The EFG runs kernel 5.15. The nf_flow_table and nf_flow_table_inet modules are not even compiled into Ubiquiti's kernel build — modinfo returns "Module not found" for both, meaning the entire flowtable infrastructure (including any PPPoE acceleration) is absent.
Implication: PPPoE WAN performance is not a hardware limitation. It is the same per-core kernel ceiling as inter-VLAN routing, with an additional encapsulation layer that mainline Linux now supports accelerating, and a multi-pass softirq pattern that is more expensive than plain inter-VLAN forwarding. The fix is a kernel rebase plus the same flowtable directive — or DPDK + accel-ppp + VPP, which Marvell publishes as a reference architecture for this exact silicon.
Finding 11: The EFG's kernel is binary-incompatible with vanilla 5.15.72 despite identifying as such, and the safety net that would catch this is disabled
Evidence: We cross-compiled nf_tables, nf_flow_table, and nf_flow_table_inet from vanilla linux-5.15.72.tar.xz (kernel.org), using the EFG's exposed /proc/config.gz as the build configuration. The resulting modules report a vermagic string identical character-for-character to the EFG's existing in-tree modules: 5.15.72-ui-cn9670 SMP mod_unload aarch64. Loading nf_tables.ko on the device caused an immediate kernel panic (NULL pointer dereference at virtual address 0x120 during module init), forcing a watchdog reboot.
Evidence (config audit):
$ zcat /proc/config.gz | grep -E "MODVERSIONS|TRIM_UNUSED_KSYMS|MODULE_SIG"
CONFIG_HAVE_ASM_MODVERSIONS=y
# CONFIG_MODVERSIONS is not set
# CONFIG_TRIM_UNUSED_KSYMS is not set
[no CONFIG_MODULE_SIG entries]
CONFIG_MODVERSIONS would have caught the binary incompatibility at load time with a clean error message. It is disabled. CONFIG_MODULE_SIG (cryptographic module signing) is not even built into the kernel. lockdown is not enabled. The root filesystem is writable via overlay.
Implication: Two findings, both serious.
First, the EFG's kernel is not actually vanilla 5.15.72 even though it identifies as 5.15.72-ui-cn9670 and reports the upstream version. Ubiquiti has applied undisclosed patches that change netfilter's internal data structures or function signatures. Customers who attempt to enable missing kernel features by building from the announced upstream tag will produce modules that load (because vermagic matches) but crash (because the real ABI doesn't). This is exactly why the GPL exists — it requires vendors to publish the complete corresponding source so customers can rebuild against the actual kernel they received, not the vanilla one it claims to be.
Second, the security configuration is unusually permissive for an enterprise security product: no module signing, no kernel lockdown, no symbol-CRC verification, writable root via overlay. Any process that becomes root can load arbitrary unsigned, unverified kernel modules with no cryptographic check. Combined with the binary-incompatible-but-not-detected ABI, this is a pathway for both accidental crashes and deliberate exploitation.
A GPL source request was filed with opensource-requests@ui.com at the time of this writing. Until it is fulfilled, even a customer with full root access on hardware they own cannot enable the missing performance features safely. Section 12 documents this experiment in detail.
The findings above translate directly to a list of prioritized configuration changes Ubiquiti could ship. None of these require new hardware. All are available in mainline Linux or as vendor-supported infrastructure from Marvell. Several are config changes that do not even require a kernel update.
What: Load the nf_flow_table kernel module and add a flowtable directive to the active nftables ruleset. The hook is software-only (no hardware offload required) and works on any modern kernel (5.4+).
Configuration sketch:
table inet filter {
    flowtable f {
        hook ingress priority 0;
        devices = { eth_lan_vlan10, eth_lan_vlan20, ... };
    }
    chain forward {
        type filter hook forward priority 0; policy accept;
        ip protocol { tcp, udp } flow add @f
        ct state established,related accept
        ... existing security rules ...
    }
}
Measured improvement: 2.36 → 7.05 Gbps single-stream (3.0×) on virtio. Combined with offloads enabled: 17.4 Gbps (7.4×).
Trade-off: Flows in the fast-path bypass conntrack and rule evaluation. Security rules must be applied to the first few packets of a flow, before it's offloaded. Existing iptables/nftables rules continue to work; only established flows are accelerated. The IPS / DPI processes that need every packet would need to be moved to a different inspection point (e.g., promiscuous tap on the bridge, or sFlow sampling) — but most of them only need flow-level visibility, which conntrack already provides.
What: Stop hard-coding hw-tc-offload off [fixed]. Enable GRO and TSO on the kernel side. On the Octeon CN9670 (and CN10K on the UDM Beast), enable the NIX hardware acceleration path — these are first-party Marvell engines designed to forward packets without ARM core involvement.
Measured improvement: 4.74 → 25.3 Gbps single-stream (5.3×) on ConnectX VF with kernel forwarder when offloads enabled. The same pattern applies to any NIC with hardware-accelerated forwarding, including the Octeon NIX.
Trade-off: Hardware offload paths typically require the kernel and the device firmware to agree on which features can be offloaded. Some advanced features (like complex iptables matchers) can't be offloaded; the kernel falls back to software for those packets. This is a graceful degradation, not a failure — the fast path handles the common case, slow path handles edge cases. Modern flowtable in switchdev mode (which ConnectX-6 Dx and Octeon CN9670 both support) hands established TCP/UDP flows directly to silicon.
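What the kernel-side toggles would look like if the driver stopped reporting them as [fixed] (interface name illustrative; on the EFG today these commands are rejected):
$ ethtool -K eth2 gro on gso on tso on
$ ethtool -K eth2 hw-tc-offload on
$ ethtool -k eth2 | grep -E 'segmentation|generic-receive|hw-tc-offload'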
What: Migrate the forwarding plane from kernel ip_forward to VPP with the Marvell-supported cnxk DPDK PMD. Move Suricata to its native DPDK mode (available since Suricata 7.0). Pin VPP worker threads and Suricata workers to dedicated CPU cores, leaving the control plane (UniFi management, control plane protocols, dpi-flow-stats summaries) on a separate core.
Why this is the biggest win: Marvell publishes complete DPDK + VPP reference architectures for the OCTEON family. The cnxk PMD is open-source, well-maintained, and ships with mainline DPDK. Suricata's DPDK mode is production-deployed by major NFV vendors. Every component Ubiquiti needs is already vendor-supported, mainline open-source software. They lose nothing by adopting it.
Estimated improvement on EFG silicon:
- Single-stream inter-VLAN: from ~1 Gbps to 15-25 Gbps (15-25×)
- PPPoE WAN single-stream: from ~3 Gbps to 8-10 Gbps (line rate on 10G PPPoE)
- Aggregate: from a few Gbps to line rate on both 25G ports (50 Gbps)
- Inspection (Suricata): from kernel-pcap mode to DPDK direct, eliminating per-packet kernel→userspace copy
Trade-off: Largest engineering investment of any fix. Ubiquiti would need to rewrite their forwarding plane on top of VPP's API and integrate VPP's CLI/API with their UniFi controller. However, all the heavy lifting (the PMD, the dataplane, the Suricata DPDK integration) already exists. They are integrating, not inventing.
What: Replace br0, br3, br5, ... with a single bridge in bridge_vlan_filtering=1 mode, with VID assignments per port. Combined with nf_flow_table on the same bridge, this enables flowtable to short-circuit established flows entirely within the bridge layer.
Measured improvement: Indirect — enables Fix 1 and Fix 2 to be more effective, particularly for inter-VLAN flows that today must traverse multiple bridges. Direct measurements not made in this study, but Linux upstream has documented order-of-magnitude improvements in similar setups.
Trade-off: Configuration migration. Existing ruleset references to specific bridge devices need updating to reference the unified bridge. Manageable as a firmware update.
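A sketch of the target topology, assuming two VLANs and a single physical port for brevity (names illustrative):
$ ip link add br0 type bridge vlan_filtering 1
$ ip link set eth2 master br0
$ bridge vlan add dev eth2 vid 10
$ bridge vlan add dev eth2 vid 20
$ bridge vlan add dev br0 vid 10 self
$ bridge vlan add dev br0 vid 20 self
$ ip link add link br0 name br0.10 type vlan id 10    # L3 interface for VLAN 10
$ ip link add link br0 name br0.20 type vlan id 20    # L3 interface for VLAN 20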
What: Use cgroup, systemd CPUAffinity=, or taskset to ensure dpi-flow-stats (currently unrestricted, allowed on all 18 cores) is pinned to cores that aren't on the dominant RSS hash path for inter-VLAN traffic. Separately, in /usr/share/ubios-udapi-server/ips_6/config/suricata_ubios_high.yaml, move management-cpu-set from [ 0 ] to a higher core (e.g., [ 2 ]) so the management thread doesn't contend with single-flow forwarding softirq on core 0. Additionally, RSS could be reconfigured to hash inter-VLAN flows away from cores 0 and 1 (which Suricata already pins to for management and verdict threads).
Measured improvement: Indirect, on the order of 10-20% on single-stream throughput, because it frees the specific core that's bottlenecking that flow from cycle competition. Larger benefit on systems where a flow lands on core 0 (the default) by changing where its competitors live.
Trade-off: None of consequence. This is basic Linux performance hygiene that any production router enables. The cost is a few sysfs/systemd-cgroup changes plus one YAML edit. Becomes moot after Fix 3 (with DPDK, each Suricata/dataplane worker has its own dedicated core by design).
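A sketch of the two changes, using the process and file names from the EFG diagnostics (the core lists are illustrative):
$ taskset -pc 4-17 "$(pgrep -x dpi-flow-stats)"      # keep the DPI daemon off cores 0-3
$ grep -n -A1 'management-cpu-set' \
    /usr/share/ubios-udapi-server/ips_6/config/suricata_ubios_high.yaml
# then edit the 'cpu: [ 0 ]' entry under management-cpu-set to e.g. 'cpu: [ 2 ]'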
What: The current ruleset is on the legacy iptables (xt_*) backend with 839 rules. Native nftables is faster per-rule, supports flowtable natively (Fix 1 builds on this), supports atomic ruleset replacement (no flushing), and is the future of Linux netfilter.
Measured improvement: Single-digit percentage points on its own; enables Fix 1 to reach its full potential.
Trade-off: Migration cost. Tools like iptables-translate automate most of it. The tools that produce the existing ruleset (presumably internal Ubiquiti config generators) need to emit nft syntax instead.
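iptables-translate does the mechanical part of the conversion; a hedged example with a rule shaped like the EFG's (chain and interface names from this writeup; the second line is approximately what the tool emits):
$ iptables-translate -A FORWARD -i br3 -o br5 -j ACCEPT
nft add rule ip filter FORWARD iifname "br3" oifname "br5" counter accept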
What: The UniFi controller already exposes a "Firewall Connection Tracking" control in Gateway settings, with checkboxes for FTP, H.323, SIP, GRE, PPTP, and TFTP helpers. Enterprise deployments without those legacy protocols can disable them all to unload the helper modules entirely.
What this actually does: Removes Phase 1 helper-hash-lookup overhead on new connections (see Section 10 Finding 5). On a router doing tens of thousands of new connections per second, this is a meaningful reduction in connection-setup CPU cost.
What this does NOT do: It does not change throughput on already-established TCP flows like an iperf3 test. The Phase 2 per-packet cost on non-helper flows is essentially zero whether helpers are loaded or not. iperf3 inter-VLAN single-stream throughput is unchanged.
Why this is a partial fix: The architecturally correct answer is to use the kernel's flowtable for the data path so that established flows bypass the entire netfilter chain (helpers and all) at line rate, while helpers continue to handle the control connections of legitimate helper-protocol traffic. That requires shipping nf_flow_table.ko, which the EFG does not have (Section 12). The toggle's existence is evidence that Ubiquiti's engineering team understands the helpers-cost-something question; they have shipped a partial mitigation rather than the proper fix.
Recommended action for administrators: If your deployment doesn't use FTP active-mode NAT, H.323 video conferencing, SIP through ALG (most modern SIP deployments use STUN/ICE instead), PPTP VPN, or TFTP, disable all of them. It's a free win on connection-setup costs.
What: Linux 5.15 LTS dates from late 2021. Kernel 6.6 LTS includes substantial nftables, flowtable, and bridge improvements, plus PPPoE flowtable acceleration handled inline within nf_flow_table.ko (added in kernel 6.2+). Kernel 6.12 LTS adds hardware-offloaded flowtable support for several NICs and further per-CPU optimizations.
Measured improvement: Compounding with Fix 1, Fix 2, and the PPPoE acceleration. Recent kernels have made nf_flow_table faster per-packet, made hardware-offload setup easier, and added PPPoE-specific acceleration that the EFG completely lacks today.
Trade-off: Vendor kernel update. The Octeon vendor BSP (Marvell's "ubuntu-cn9670") will need to be rebased on a newer kernel. Not trivial but routine for a hardware vendor; Marvell themselves publish 6.x-based BSP releases.
| Priority | Fix | Effort | Single-stream improvement |
|---|---|---|---|
| 1 | Enable flowtable | Low (config) | 3.0× |
| 2 | Enable hardware offloads | Low–Medium (config + firmware) | up to 5.3× |
| 3 | Adopt DPDK + VPP + Suricata-DPDK | High (engineering) | 15-25× — and fixes PPPoE too |
| 4 | Newer kernel (5.15 → 6.6+) | Medium | enables PPPoE flowtable, +small kernel gains |
| 5 | Pin inspection processes off data-path core | Low (config) | small but additive |
| 6 | Per-VLAN bridges → vlan-aware single bridge | Medium (config migration) | enables 1+2 |
| 7 | iptables → nftables | Medium | enables 1, small direct |
| 8 | Conntrack helper toggles (already shipped — disable in UI) | Free (UI checkbox) | none on iperf3, small on connection setup |
Doing Fix 1 alone gets you 3× the single-stream throughput. Fix 1+2 gets you 7×. Fix 3 — the long-term architectural fix that the silicon vendor literally publishes a reference architecture for — gets you 15-25×. The hardware does not need to change.
The analysis to this point rests on lab measurements made on x86 hardware that reproduces the EFG's software stack. The lab data is reproducible and self-consistent, but a fair reader can ask: would the recommended fixes actually work on the real device?
To find out, we attempted the most surgical of the recommended fixes — adding the missing nftables flowtable kernel modules — to a production EFG. The exercise was instructive in ways we did not anticipate, and the results materially strengthen Section 10's findings about the EFG's kernel.
What follows is a complete, honest record of the attempt. Both attempts ultimately crashed the device. Neither outcome was the desired success path, but the failure modes themselves are diagnostic — they reveal precisely how far Ubiquiti's kernel diverges from any reproducible public source.
Loading a third-party kernel module into a running kernel requires a few prerequisites:
- A matching kernel version (vermagic). The Linux module loader rejects any module whose vermagic string doesn't match the running kernel's exactly.
- Module loading not blocked by signing. If CONFIG_MODULE_SIG_FORCE=y or module.sig_enforce=1, only modules signed by an in-kernel trusted key can load.
- No kernel lockdown. If a Secure Boot lockdown is engaged, module loading from disk is restricted regardless of signing config.
- A writable filesystem location, since module files must be readable from disk by init_module(2) or finit_module(2).
We confirmed each on a production EFG via SSH:
$ cat /proc/cmdline
console=ttyAMA0,115200n8 earlycon=pl011,0x87e028000000 maxcpus=18 isolcpus=12
rootwait rw coherent_pool=16M pcie_aspm=off net.ifnames=0 sysid=ea3d
root=PARTUUID=...
No module.sig_enforce=1. No lockdown= argument. No lsm=lockdown,....
$ cat /sys/module/module/parameters/sig_enforce
N
Module signing not enforced.
$ zcat /proc/config.gz | grep -E "CONFIG_(MODULE_SIG|SECURITY_LOCKDOWN|MODVERSIONS|TRIM_UNUSED_KSYMS)"
# CONFIG_MODULE_SIG is not set
# CONFIG_MODVERSIONS is not set
# CONFIG_TRIM_UNUSED_KSYMS is not set
# CONFIG_SECURITY_LOCKDOWN_LSM is not set
This was both encouraging and concerning. Encouraging because it meant we had a clean path to load a custom-built module if we could match vermagic. Concerning because these missing options are exactly the safeguards a production firmware should have:
- MODULE_SIG: prevents loading unsigned modules. Any process with CAP_SYS_MODULE (root, in containers if not seccomp'd) can load arbitrary kernel code.
- MODVERSIONS: adds CRC checksums to every exported symbol. A module built against a kernel with subtly different struct layouts will be refused at load time rather than crashing the kernel later.
- TRIM_UNUSED_KSYMS: limits the surface area of exposed kernel symbols.
- SECURITY_LOCKDOWN_LSM: restricts what root can do to a running kernel.
The implications of these absences are explored further in Section 10, Finding 11. For the experiment, they meant that load-time symbol mismatches would not be caught — the kernel would happily start executing code with bad assumptions about struct layouts.
The EFG's filesystem is overlayfs root with a writable upper layer at /mnt/.rwfs/data. Modules placed in /tmp survive long enough to load.
The flowtable modules (nf_flow_table.ko, nf_flow_table_inet.ko, plus nf_tables.ko as a dependency) are absent from the EFG's /lib/modules/:
$ find /lib/modules/$(uname -r) -name 'nf_flow_table*' -o -name 'nf_tables.ko'
[no output]
$ modinfo nf_flow_table
modinfo: ERROR: Module nf_flow_table not found
The modules are not merely disabled; they are not present in the build. We needed to compile them ourselves.
A separate build VM was provisioned on the lab host:
- Ubuntu 24.04 LTS, 16 vCPU, 32 GB RAM
- gcc-10-aarch64-linux-gnu 10.5.0 from the noble-universe repository (matches the EFG's compiler family)
- Linux 5.15.72 source tree from kernel.org
The EFG's running kernel reports itself as:
$ uname -r
5.15.72-ui-cn9670
$ uname -a
Linux EFG-Home-SP 5.15.72-ui-cn9670 #5.15.72 SMP Wed Apr 15 23:39:47 CST 2026
aarch64 GNU/Linux
$ strings /lib/modules/5.15.72-ui-cn9670/kernel/net/netfilter/nf_conntrack_ftp.ko \
| grep -E '^(vermagic|name)='
vermagic=5.15.72-ui-cn9670 SMP mod_unload aarch64
name=nf_conntrack_ftp
The build process:
$ export ARCH=arm64
$ export CROSS_COMPILE=aarch64-linux-gnu-
$ export CC=aarch64-linux-gnu-gcc-10
$ cd ~/efg-build/vanilla-5.15.72/linux-5.15.72
$ cp ~/efg-build/efg-running.config .config
# Set CONFIG_LOCALVERSION inside the .config (not the env)
$ ./scripts/config --set-str CONFIG_LOCALVERSION "-ui-cn9670"
# Enable the modules we want to build
$ ./scripts/config --module CONFIG_NF_TABLES
$ ./scripts/config --module CONFIG_NF_FLOW_TABLE
$ ./scripts/config --module CONFIG_NF_FLOW_TABLE_INET
# Disable BTF generation (would require pahole on EFG kernel — not available)
$ ./scripts/config --disable CONFIG_DEBUG_INFO_BTF
# Reconcile
$ make olddefconfig
$ time make -j$(nproc) modules
real 1m52s
$ for ko in net/netfilter/nf_tables.ko \
net/netfilter/nf_flow_table.ko \
net/netfilter/nf_flow_table_inet.ko; do
strings $ko | grep -E '^(vermagic|name)='
done
vermagic=5.15.72-ui-cn9670 SMP mod_unload aarch64
name=nf_tables
vermagic=5.15.72-ui-cn9670 SMP mod_unload aarch64
name=nf_flow_table
vermagic=5.15.72-ui-cn9670 SMP mod_unload aarch64
name=nf_flow_table_inet
Three modules. All vermagic strings byte-perfect matches for the EFG kernel.
The modules were copied to the EFG and loading was attempted in dependency order:
$ scp nf_tables.ko nf_flow_table.ko nf_flow_table_inet.ko \
root@efg-prod:/tmp/
$ ssh root@efg-prod
# cd /tmp
# insmod ./nf_tables.ko
[connection drops, device reboots]
The kernel oops, captured before the watchdog reboot:
[ ... ] Unable to handle kernel NULL pointer dereference at virtual address 0x120
[ ... ] Mem abort info:
[ ... ] ESR = 0x96000004
[ ... ] FSC = 0x4: level 0 translation fault
[ ... ] Internal error: Oops: 96000004 [#1] SMP
[ ... ] Modules linked in: nf_tables(+) wireguard libchacha20poly1305 ...
[ ... ] CPU: 3 PID: 211748 Comm: insmod Tainted: P W O 5.15.72-ui-cn9670 #5.15.72
[ ... ] Hardware name: Marvell OcteonTX CN96XX board (DT)
[ ... ] pc : nf_tables_init_net+0x18/0x94 [nf_tables]
[ ... ] lr : ops_init+0x3c/0x120
[ ... ] Call trace:
[ ... ] nf_tables_init_net+0x18/0x94 [nf_tables]
[ ... ] ops_init+0x3c/0x120
[ ... ] register_pernet_operations+0xec/0x240
[ ... ] register_pernet_subsys+0x2c/0x50
[ ... ] nf_tables_module_init+0x24/0x100 [nf_tables]
The HA secondary in the home cluster failed over within ~8 seconds. Service was restored without operator intervention.
The crash happened at byte 24 (offset 0x18) into nf_tables_init_net, extremely early in the per-network-namespace initialization. nf_tables_init_net is one of the first functions register_pernet_subsys calls as the module starts up. At that point it tries to read a field at offset 0x120 through a pointer it has just obtained from the kernel's per-net data, and that pointer is NULL: the running kernel does not keep what our module expects at the location the module computed.
This isn't a "missing symbol" error or a "wrong function signature" error. The module loaded successfully. Its symbols resolved against the running kernel's symbol table. Execution started. And then, within microseconds, it dereferenced a struct field at an offset where the running kernel doesn't have what our module expected.
That's an ABI mismatch — the structure layout in our build's view of the kernel is different from the structure layout in the EFG's running kernel.
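One way to pin down exactly which access faulted is to disassemble our own module and look 0x18 bytes into nf_tables_init_net; a sketch using the same cross-binutils as the build (exact instructions will vary with compiler version):

$ aarch64-linux-gnu-objdump -d net/netfilter/nf_tables.ko \
    | awk '/<nf_tables_init_net>:/,/^$/' | head -12

The instruction at offset +0x18 is the load that faulted on the EFG; because the module carries full symbols, that load can be matched back to the first per-net field nf_tables_init_net touches in the source we built from.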
The crash happens because:
# CONFIG_MODVERSIONS is not set
# CONFIG_TRIM_UNUSED_KSYMS is not set
Without MODVERSIONS, the kernel module loader has no per-symbol CRC to compare. Vermagic only checks "this is kernel 5.15.72-ui-cn9670 SMP aarch64"; it says nothing about whether struct net keeps a particular field at offset 0x120. If the EFG's nf_tables_pernet or the surrounding per-net structures are laid out differently than vanilla's, the build still produces a module that loads cleanly. It just crashes when execution reaches the first access through a wrong offset.
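For contrast, on a kernel built with CONFIG_MODVERSIONS every module carries a per-symbol CRC table that the loader checks before execution starts. modprobe can display it; our module, like every module Ubiquiti ships, has no such table (paths and CRC values below are illustrative):

# Distro kernel with MODVERSIONS: CRCs embedded in the module's __versions section
$ modprobe --dump-modversions /lib/modules/$(uname -r)/kernel/net/netfilter/nf_tables.ko | head -2
0x2e5a9d3f	module_layout
0x7b1c44e0	nf_register_net_hook

# Our cross-built module on the EFG: nothing for the loader to check
$ modprobe --dump-modversions /tmp/nf_tables.ko
[no output]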
This means either:
- (a) Ubiquiti rebased Linux 5.15.72 on top of patches from a different kernel version, OR
- (b) Ubiquiti or a vendor (Marvell) added fields to internal structures that vanilla 5.15.72 doesn't have, OR
- (c) Both.
Section 12.5 below addresses (b) directly by attempting to build against Marvell's complete published BSP — the largest plausible source of vendor-specific kernel patches for this silicon.
The Marvell OCTEON CN9670 SoC has substantial vendor-specific Linux support that is not in mainline. Marvell maintains kernel patches for their hardware engines (NIX network units, RVU resource virtualization, NPA packet allocator, SSO event scheduler, CPT crypto), and these patches frequently touch core kernel infrastructure including netfilter (where Marvell integrates hardware flow offload acceleration).
Marvell publishes their kernel patches through the Yocto Project's linux-yocto repository, branch v5.15/standard/cn-sdkv5.15/octeon, maintained by Bo Sun (Marvell engineer) and merged by the Yocto Project's kernel maintainer (Bruce Ashfield). This is a public, GPL-licensed source tree.
$ git clone https://git.yoctoproject.org/linux-yocto.git linux-yocto-cnxk-5.15
$ cd linux-yocto-cnxk-5.15
$ git checkout v5.15/standard/cn-sdkv5.15/octeon
$ head -5 Makefile
# SPDX-License-Identifier: GPL-2.0
VERSION = 5
PATCHLEVEL = 15
SUBLEVEL = 203
EXTRAVERSION =
The branch HEAD is at 5.15.203 (a stable update) with the full Marvell OCTEON CN9K patch set applied on top.
Examination of the source tree shows the BSP modifies sixteen netfilter-related header files compared to vanilla Linux 5.15.72:
$ for f in $(find ~/vanilla-5.15.72/include -name "*netfilter*" -o -name "*nf_*"); do
rel=${f#*/include/}
bsp=~/linux-yocto-cnxk-5.15/include/$rel
if [ -f "$bsp" ] && ! diff -q "$f" "$bsp" >/dev/null 2>&1; then
echo "DIFFERS: $rel"
fi
done
DIFFERS: net/netfilter/nf_conntrack.h
DIFFERS: net/netfilter/nf_conntrack_count.h
DIFFERS: net/netfilter/nf_conntrack_timeout.h
DIFFERS: net/netfilter/nf_flow_table.h
DIFFERS: net/netfilter/nf_nat_redirect.h
DIFFERS: net/netfilter/nf_tables.h
DIFFERS: net/netfilter/nf_tables_core.h
DIFFERS: net/netfilter/nf_tproxy.h
DIFFERS: net/netns/netfilter.h
DIFFERS: linux/netfilter.h
DIFFERS: linux/netfilter_defs.h
DIFFERS: linux/netfilter/nf_conntrack_sctp.h
DIFFERS: uapi/linux/netfilter_bridge.h
DIFFERS: uapi/linux/netfilter/nf_conntrack_common.h
DIFFERS: uapi/linux/netfilter/nf_conntrack_sctp.h
DIFFERS: uapi/linux/netfilter/nf_tables.h
Several of these headers contain function-signature changes that explain why a vanilla-built module would crash. For example, in nf_conntrack_count.h:
-unsigned int nf_conncount_count(struct net *net,
- struct nf_conncount_data *data,
- const u32 *key,
- const struct nf_conntrack_tuple *tuple,
- const struct nf_conntrack_zone *zone);
+unsigned int nf_conncount_count_skb(struct net *net,
+ const struct sk_buff *skb,
+ u16 l3num,
+ struct nf_conncount_data *data,
+ const u32 *key);
The function was renamed, and its signature changed. In nf_flow_table.h:
-int flow_offload_route_init(struct flow_offload *flow,
- const struct nf_flow_route *route);
+void flow_offload_route_init(struct flow_offload *flow,
+ struct nf_flow_route *route);
Return type changed from int to void; const removed from the route argument.
The same header backports a feature from kernel 6.2 — PPPoE flowtable acceleration — into 5.15:
+static inline bool nf_flow_pppoe_proto(struct sk_buff *skb, __be16 *inner_proto)
+{
+ if (!pskb_may_pull(skb, ETH_HLEN + PPPOE_SES_HLEN))
+ return false;
+
+ *inner_proto = __nf_flow_pppoe_proto(skb);
+ return true;
+}
This last item is significant: Marvell's BSP includes a PPPoE flowtable backport that mainline 5.15 does not have. If we can build a module against this BSP and load it on the EFG, we should — in principle — get not only inter-VLAN flowtable acceleration but PPPoE flowtable acceleration as well.
The build:
$ cd linux-yocto-cnxk-5.15
# Force SUBLEVEL=72 to match EFG vermagic (BSP HEAD is 5.15.203)
$ sed -i 's/^SUBLEVEL = .*/SUBLEVEL = 72/' Makefile
# Suppress kbuild dirty marker
$ touch .scmversion
# Apply EFG running config and target modules
$ cp ~/efg-build/efg-running.config .config
$ ./scripts/config --set-str CONFIG_LOCALVERSION "-ui-cn9670"
$ ./scripts/config --module CONFIG_NF_TABLES
$ ./scripts/config --enable CONFIG_NF_TABLES_INET
$ ./scripts/config --enable CONFIG_NF_TABLES_IPV4
$ ./scripts/config --enable CONFIG_NF_TABLES_IPV6
$ ./scripts/config --module CONFIG_NF_FLOW_TABLE
$ ./scripts/config --module CONFIG_NF_FLOW_TABLE_INET
$ ./scripts/config --enable CONFIG_NF_FLOW_TABLE_IPV4
$ ./scripts/config --enable CONFIG_NF_FLOW_TABLE_IPV6
$ ./scripts/config --disable CONFIG_DEBUG_INFO_BTF
$ ./scripts/config --disable CONFIG_MODULE_SIG_ALL
$ make olddefconfig
$ make kernelrelease
5.15.72-ui-cn9670
$ time make -j$(nproc)
real 1m59s
Five modules built, all with byte-perfect vermagic:
$ for ko in $(find . -name 'nf_tables.ko' -o -name 'nf_flow_table*.ko' | sort); do
echo "=== $(basename $ko) ==="
strings $ko | grep -E '^(vermagic|name|depends)='
done
=== nf_flow_table.ko ===
name=nf_flow_table
vermagic=5.15.72-ui-cn9670 SMP mod_unload aarch64
=== nf_flow_table_inet.ko ===
name=nf_flow_table_inet
depends=nf_flow_table,nf_tables
vermagic=5.15.72-ui-cn9670 SMP mod_unload aarch64
=== nf_flow_table_ipv4.ko ===
name=nf_flow_table_ipv4
depends=nf_flow_table,nf_tables
vermagic=5.15.72-ui-cn9670 SMP mod_unload aarch64
=== nf_flow_table_ipv6.ko ===
name=nf_flow_table_ipv6
depends=nf_flow_table,nf_tables
vermagic=5.15.72-ui-cn9670 SMP mod_unload aarch64
=== nf_tables.ko ===
name=nf_tables
vermagic=5.15.72-ui-cn9670 SMP mod_unload aarch64
# insmod ./nf_tables.ko
[connection drops, device reboots]
Captured kernel trace before reboot:
[ 3368.013405] Unable to handle kernel NULL pointer dereference at virtual address 0
[ 3368.022216] Mem abort info:
[ 3368.025005] ESR = 0x96000005
[ 3368.028072] EC = 0x25: DABT (current EL), IL = 32 bits
[ 3368.033402] FSC = 0x05: level 1 translation fault
[ 3368.074382] Modules linked in: nf_tables(+) wireguard libchacha20poly1305 ...
xt_geoip(O) nf_app(PO) t_miner(PO) tdts(PO) tm_crypto(O)
xt_dyn_random ip6table_nat xt_conntrack xt_connmark xt_TCPMSS pppoe
pppox bonding xt_dpi(O) ip6table_mangle iptable_mangle ip6table_filter
ip6_tables uio_pdrv_genirq ui_lcm(O) ifb ppp_generic slhc
ubnthal(PO) ubnt_common(PO) drm drm_panel_orientation_quirks
[ 3368.121977] CPU: 3 PID: 211748 Comm: insmod Tainted: P W O 5.15.72-ui-cn9670 #5.15.72
[ 3368.130936] Hardware name: Marvell OcteonTX CN96XX board (DT)
[ 3368.143638] pc : nf_tables_init_net+0x18/0x94 [nf_tables]
[ 3368.149059] lr : ops_init+0x3c/0x120
[ 3368.227314] x2 : ffff00019027b300 x1 : 0000000000000000 x0 : 0000000000000000
[ 3368.229754] Call trace:
[ 3368.234825] nf_tables_init_net+0x18/0x94 [nf_tables]
[ 3368.238053] ops_init+0x3c/0x120
[ 3368.242840] register_pernet_operations+0xec/0x240
[ 3368.247195] register_pernet_subsys+0x2c/0x50
[ 3368.252609] nf_tables_module_init+0x24/0x100 [nf_tables]
Identical crash signature. nf_tables_init_net+0x18, called from the same path.
Two builds:
| Source tree | Result |
|---|---|
| Vanilla Linux 5.15.72 (kernel.org) | Crash at nf_tables_init_net+0x18 |
| Marvell BSP linux-yocto v5.15/standard/cn-sdkv5.15/octeon HEAD with SUBLEVEL forced to 72 | Crash at nf_tables_init_net+0x18 |
If the crash were caused by Marvell BSP patches, the BSP-built module would have crashed somewhere different (or — ideally — not at all). It crashed at the exact same instruction. That tells us:
- The crash is NOT primarily caused by Marvell BSP patches; it's caused by something on top of the BSP
- Ubiquiti has applied additional, non-public patches to the kernel that affect netfilter per-net data layout
- These additional patches are not derivable from any combination of Linux mainline + Marvell's published OCTEON BSP
The Modules linked in line of the panic trace lists the modules already loaded on the EFG when our module tried to initialize:
xt_geoip(O) nf_app(PO) t_miner(PO) tdts(PO) tm_crypto(O)
xt_dyn_random ip6table_nat xt_conntrack xt_connmark ...
xt_dpi(O) ... ui_lcm(O) ... ubnthal(PO) ubnt_common(PO)
The taint flags (O) and (PO) in Linux's module taint vocabulary mean:
- O — out-of-tree module
- P — proprietary (non-GPL) module
- PO — both proprietary and out-of-tree
The presence of t_miner(PO), tdts(PO), nf_app(PO), xt_geoip(O), xt_dyn_random, tm_crypto(O), xt_dpi(O), ui_lcm(O), ubnthal(PO), and ubnt_common(PO) in the running kernel's module list is documentary evidence of the closed-source kernel modules Ubiquiti is shipping.
Section 14 returns to this point to evaluate the GPL implications.
Before drawing conclusions, we examined the EFG's existing kernel modules to determine whether Ubiquiti ships debug information that could aid investigation.
$ file /lib/modules/$(uname -r)/kernel/net/netfilter/nf_conntrack_ftp.ko
/lib/modules/.../nf_conntrack_ftp.ko: ELF 64-bit LSB relocatable, ARM aarch64,
version 1 (SYSV), BuildID[sha1]=5827c50c..., not stripped
$ readelf -S nf_conntrack_ftp.ko | grep -i debug
[30] .gnu_debuglink PROGBITS 0000000000000000 00001ed0
Modules are not stripped — symbol tables are intact, function and variable names are preserved. However, the only debug section is .gnu_debuglink, which holds just the name of a separate debug file plus a 4-byte CRC; in effect it says "the actual debug info lives in a separate file." That separate file (*.ko.debug) is not shipped on the production firmware.
This is by itself a defensible engineering decision (debug files are large), but combined with MODVERSIONS=N and kptr_restrict=0 (see Section 13 below), it creates a peculiar combination:
- A normal user with sufficient privilege can dump the running kernel's complete symbol table at full virtual addresses
- But cannot match those symbols to source-level constructs (struct field names, member offsets) without the debug info
- And cannot rely on the kernel's own ABI-version tracking to detect mismatched modules
The debug info isn't shipped, so reverse-engineering structure layouts requires examining the binary kernel image directly. Section 13 documents what such an examination reveals.
The crash at nf_tables_init_net+0x18 told us that the running kernel's internal layout differs from any combination of public sources we could build against. To quantify how far it diverges, we extracted the kernel image from the EFG and compared its symbol table against the symbol tables of vanilla Linux 5.15.72 and our Marvell BSP build.
The EFG's kernel image is on disk at /boot/vmlinuz-5.15.72-ui-cn9670:
$ ls -la /boot/vmlinuz-5.15.72-ui-cn9670
-rw-r--r-- 1 root root 12071956 ... /boot/vmlinuz-5.15.72-ui-cn9670
$ file /boot/vmlinuz-5.15.72-ui-cn9670
gzip compressed data, max compression, from Unix, original size 28811776
$ gunzip -c /boot/vmlinuz-5.15.72-ui-cn9670 > efg-vmlinuz
$ binwalk efg-vmlinuz | head -3
DECIMAL HEXADECIMAL DESCRIPTION
0 0x0 Linux kernel ARM64 image, load offset: 0x0,
image size: 29818880 bytes, little endian, 64k page size
$ strings -a efg-vmlinuz | grep "Linux version"
Linux version 5.15.72-ui-cn9670 (bdd@builder)
(gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld 2.35.2)
#5.15.72 SMP Wed Apr 15 23:39:47 CST 2026
The kallsyms symbol table is dumped via /proc/kallsyms:
$ wc -l /proc/kallsyms
130789 /proc/kallsyms
$ head -2 /proc/kallsyms
ffff800008000000 T _text
ffff800008010000 T _stext
We note that kallsyms is unrestricted — full virtual addresses are visible. On most production systems, kernel.kptr_restrict is set to 1 or 2, which causes kallsyms to either redact or zero out the address column. The EFG ships with kptr_restrict=0. This is a security observation in its own right (it makes ROP and KASLR-bypass attacks easier), but for our purposes it provided complete ground-truth symbol data.
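For comparison, the effect of the hardened setting is easy to see on any Linux machine (values shown are illustrative):

$ sysctl kernel.kptr_restrict
kernel.kptr_restrict = 0
$ head -1 /proc/kallsyms
ffff800008000000 T _text

# With the hardened setting, addresses are zeroed for all readers
$ sysctl -w kernel.kptr_restrict=2
$ head -1 /proc/kallsyms
0000000000000000 T _text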
We extracted the symbol tables from each source:
# Symbols in EFG's running kernel
$ awk '{print $3}' /tmp/efg-kallsyms.txt | sort -u > /tmp/efg-syms.txt
# Symbols in our Marvell BSP build
$ nm ~/efg-build/marvell-bsp/linux-yocto-cnxk-5.15/vmlinux \
| awk '{print $3}' | sort -u > /tmp/bsp-syms.txt
# Symbols in vanilla 5.15.72
$ nm ~/efg-build/vanilla-5.15.72/linux-5.15.72/vmlinux \
| awk '{print $3}' | sort -u > /tmp/vanilla-syms.txt
$ wc -l /tmp/*-syms.txt
115998 /tmp/bsp-syms.txt
120399 /tmp/efg-syms.txt
112581 /tmp/vanilla-syms.txt
The diff: symbols present in the EFG kernel but absent from BOTH vanilla 5.15.72 AND the Marvell BSP build:
$ comm -23 /tmp/efg-syms.txt \
<(sort -u /tmp/vanilla-syms.txt /tmp/bsp-syms.txt) \
| grep -vE "^(\.L[0-9]+|\.LC[0-9]+|\.LBE|\.LFE|\.LFB|\.Letext|\.Ldebug|\.Lframe|__compound_literal\.|__func__\.|__warned\.|CSWTCH\.)" \
> /tmp/efg-unique-real-syms.txt
$ wc -l /tmp/efg-unique-real-syms.txt
6357 /tmp/efg-unique-real-syms.txt
After filtering out compiler-generated local labels (which vary across every build of every kernel and carry no information), 6,357 unique symbols exist in the EFG's kernel that are present in neither vanilla Linux 5.15.72 nor Marvell's published OCTEON BSP.
Grouping the unique symbols by name pattern reveals what Ubiquiti added:
| Category | Symbol count | Examples |
|---|---|---|
| tdts_* (Trend Micro Deep-packet Threat Surveillance) | 116 | tdts_shell_dpi_l3_skb, tdts_shell_dpi_register_mt |
| tm_* (Trend Micro shared) | 33 | tm_crypto_* family |
| ubnthal_* (Ubiquiti HAL) | 45 | ubnthal_get_controller_host, ubnthal_get_cputype |
| ubnt_* (Ubiquiti utilities) | additional | ubnt_blk_wp_callback, ubnt_mtd_partition_read |
| HTTP protocol decoder (kernel-space) | dozens | BuildHTTP_request_KeywordTries, Create_HTTP_Protocol_Decoder |
| H.323 protocol decoder (kernel-space) | dozens | DecodeQ931, DecodeMultimediaSystemControlMessage |
| nf_*dpi* (Deep Packet Inspection conntrack extensions) | several | nf_conntrack_dpi_init, nf_ct_ext_dpi_destroy, nf_dpi_proc_dir |
| dpi_* (Deep Packet Inspection engine) | dozens | __kstrtab_dpi_main, related classification entry points |
| wg_* (WireGuard, partly upstream) | 113 | wg_* |
| Firmware signing key blobs | a few | UDMENT_CN9670_FW_KEY, UXG_AL324_FW_KEY |
A note on terminology: throughout this section, "DPI" refers to Deep Packet Inspection — the application-layer traffic-classification feature that powers the UniFi dashboard's per-application traffic statistics and threat management. This is distinct from Marvell's hardware DPI block (DMA Packet Interface, also abbreviated DPI), which is a PCIe DMA engine on the OCTEON SoC and shows up in the kernel image as register-name strings like DPI_DMA_CONTROL and DPI_REQQ_INT. Those Marvell hardware-driver symbols are present in the public BSP and don't appear in the 6,357-symbol delta. The dpi_*, tdts_*, nf_*dpi*, and xt_dpi symbols below are the inspection-software layer Ubiquiti added on top.
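The per-prefix counts in the table can be reproduced mechanically from the filtered symbol list; a rough sketch (the prefix list is illustrative, and exact counts depend on whether the __ksymtab_/__kstrtab_ bookkeeping entries are included):

$ for p in tdts_ tm_ ubnthal_ ubnt_ dpi_ wg_; do
    printf '%-12s %s\n' "$p" "$(grep -c "^$p" /tmp/efg-unique-real-syms.txt)"
  done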
Some of these are unsurprising (ubnthal_* is a clean abstraction layer; WireGuard was upstream by 5.6 but Ubiquiti may have backported aspects). Others are deeply diagnostic.
The most consequential finding is in the nf_* namespace:
nf_conntrack_dpi_fini
nf_conntrack_dpi_init
nf_ct_ext_dpi_destroy
nf_dpi_proc_dir
The Linux conntrack subsystem has an extension framework (include/net/netfilter/nf_conntrack_extend.h) that allows kernel modules to attach per-flow metadata to each struct nf_conn. Adding a new extension type requires changes in several places:

- The enum nf_ct_ext_id in nf_conntrack_extend.h (adding a new value)
- The static array nf_ct_ext_types (adding a new entry)
- Anywhere code iterates over extension types
The presence of nf_ct_ext_dpi_destroy is direct evidence that Ubiquiti has added a new conntrack extension (NF_CT_EXT_DPI or similar) to track DPI metadata per flow.
This change is precisely the kind that would alter struct nf_conn layout and per-net data structure layout — exactly the kind of change that would explain why nf_tables.ko built against any public source crashes when it tries to register a pernet_operations against the running kernel.
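This is easy to cross-check against the symbol lists extracted above: the DPI conntrack symbols appear only in the EFG's kallsyms, not in either public tree:

$ grep -E ' (nf_conntrack_dpi_init|nf_ct_ext_dpi_destroy)$' /tmp/efg-kallsyms.txt
[both symbols present, with full addresses]
$ grep -E '^(nf_conntrack_dpi_init|nf_ct_ext_dpi_destroy)$' /tmp/vanilla-syms.txt /tmp/bsp-syms.txt
[no output]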
Examined more closely, the tdts namespace exposes kernel symbols:
__ksymtab_tdts_shell_dpi_l2_eth
__ksymtab_tdts_shell_dpi_l3_data
__ksymtab_tdts_shell_dpi_l3_skb
__ksymtab_tdts_shell_dpi_register_mt
__ksymtab_tdts_shell_dpi_unregister_mt
__ksymtab_dpi_main
The __ksymtab_* and __kstrtab_* symbols are how the kernel records what symbols a module exports. The names dpi_l2_eth, dpi_l3_data, dpi_l3_skb indicate these are functions for handling Ethernet frames and IPv4/IPv6 packets at layer 2 and layer 3 respectively. The _register_mt and _unregister_mt suffixes are netfilter match/target (xt) registration entry points.
The runtime panic dump in Section 12 showed these modules tagged tdts(PO) and t_miner(PO) — proprietary, out-of-tree.
The "tdts" name strongly suggests Trend Micro Smart Protection Network ("TMSPN" — TM Deep-packet Threat Surveillance, abbreviated tdts). Trend Micro licenses their threat-detection engine to network device vendors as a closed-source kernel module. The tm_crypto(O) and t_miner(PO) modules in the same panic trace fit the pattern: t_miner is a content-pattern matcher, tm_crypto is the encrypted-traffic analyzer.
These modules are not Ubiquiti's own code. They are licensed proprietary code from Trend Micro that Ubiquiti has integrated into their firmware. They link directly against kernel symbols (note the xt_dpi(O) netfilter match registered in the kernel's tainted-module list).
The unique symbols also reveal that Ubiquiti has embedded application-layer protocol decoders directly in the kernel:
BuildHTTP_request_KeywordTries
Close_HTTP_Request_Connection
Create_HTTP_Protocol_Decoder
Free_HTTP_Protocol_Decoder
HTTP_Connection_Lost_Count
HTTP_Req_Count
Init_HTTP_Protocol_Decoder
NormalizeURI
Parse_HTTP_Request
ScanHTTPVersion
ScanRequestHeaders
URINormalize
DecodeMultimediaSystemControlMessage
DecodeQ931
DecodeRasMessage
_AdmissionConfirm
_AdmissionRequest
_Alerting_UUIE
The HTTP decoder symbols (camelCase, with _HTTP_ infix) appear to be from a Trend Micro protocol-parsing library running in kernel space. The H.323/Q.931 decoder symbols are similarly out-of-place for a kernel — these would normally live in userspace.
Running parsers for HTTP, H.323, and similar attacker-controllable formats inside the kernel is a substantial security risk. A bug in any of these decoders becomes a kernel vulnerability. Mainstream Linux distributions and other vendors deliberately keep this kind of code in userspace (Suricata, Snort, etc.) for exactly this reason.
To put 6,357 symbols in perspective:
- Vanilla 5.15.72 has 112,581 unique symbols
- Marvell's published BSP adds 3,417 net new symbols on top (a 3% increase)
- Ubiquiti's running kernel has 6,357 symbols beyond Marvell's BSP — a further 5.5% increase
Phrased differently: roughly 1 in 19 symbols in the EFG's running kernel did not come from any source publicly available to a security researcher, GPL-rights-exercising customer, or independent third party.
This is the kernel that handles your VLAN traffic, your firewall rules, your VPN keys, and your DPI inspection. The behavior of this kernel cannot be audited from outside because the source for 5% of it is not published. The technical analysis in Section 12 demonstrates that this 5% includes substantial netfilter modifications.
The Linux kernel is licensed under GPL-2.0. That license imposes specific obligations on anyone who distributes a binary derived from GPL-licensed source. The relevant provisions, summarized:
- The complete corresponding source code must be made available to recipients of the binary, under the same license, for at least three years (GPL-2.0 §3).
- Changes to GPL'd source files must themselves be GPL-licensed (GPL-2.0 §2, the "viral" clause).
- Linking proprietary modules against GPL kernel symbols is a contested legal area. Linus Torvalds and the Linux Foundation's longstanding position is that modules that use only EXPORT_SYMBOL (not EXPORT_SYMBOL_GPL) interfaces and "can plausibly be shown to be independent" may be distributed under non-GPL licenses, but there is no clean legal answer here. The Free Software Foundation's position is stricter: any kernel module is a derived work.
- A written offer to provide source must accompany the binary distribution, valid for at least three years.
- Derived works that combine GPL and proprietary code in linked form typically must be GPL-licensed in their entirety.
Ubiquiti previously maintained an open-source download page at ui.com/download/open-source, but that page no longer exists. As of this writing (May 2026), Ubiquiti's main website does not host any GPL source code archives that we could locate. The Ubiquiti GitHub organization (https://github.com/ubiquiti) contains only two repositories: support-tools and freeswitch. Neither contains kernel sources or firmware sources for any current product.
This is not the first time Ubiquiti's GPL compliance has been questioned. The Wikipedia article on Ubiquiti documents a recurring pattern:
- 2015: Ubiquiti was accused of violating GPL terms for code in their products. Specifically, customers requested the source for the GPL-licensed U-Boot bootloader and Ubiquiti refused, making it impractical for customers to fix a security issue. The source was eventually released after sustained public pressure.
- 2019: Ubiquiti was again reported to be in violation of GPL.
- 2026 (current): The open-source download page that previously hosted source archives has been removed entirely.
For an EFG owner attempting to exercise their GPL rights today, the channels are:
- The Ubiquiti support email (support@ui.com), which redirects GPL requests to a separate address
- A specific email for source requests: opensource-requests@ui.com
- Community forum posts (which historically receive no substantive Ubiquiti response on GPL questions)
- Third-party archives like github.com/unifi-hackers/unifi-gpl and github.com/CodeFetch/Ubiquiti-UBNT-airOS, which contain partial GPL sources that researchers have extracted from firmware images or obtained through pressure
A formal request for the complete kernel source has been filed via opensource-requests@ui.com, the email address Ubiquiti's support team directed users to. The request specifies:
- The full kernel source tree corresponding to the running kernel version
- The build configuration (/proc/config.gz)
- The complete set of patches applied on top of the base kernel
- The Marvell-specific drivers (octeontx2_pf, octeontx2_vf, octeontx2_af, rvu_*, NIX, CPT, SSO, NPA)
- Any other GPL components
The request is pending. Ubiquiti's response (or non-response) to this request is itself a data point.
Section 13 documents 6,357 unique kernel symbols in the running EFG kernel that are not present in either vanilla Linux 5.15.72 or the complete published Marvell OCTEON CN9K BSP. These include:
- Symbols indicating modifications to core netfilter conntrack data structures (nf_ct_ext_dpi_destroy, nf_conntrack_dpi_init)
- A 116-symbol tdts namespace exposing kernel functions to a closed-source DPI engine
- HTTP and H.323 application-layer protocol decoders embedded in the kernel
- A 45-symbol Ubiquiti hardware abstraction layer
For Ubiquiti to be in compliance with GPL-2.0, the source of the changes producing these symbols must be available — at minimum to anyone who has purchased an EFG and exercises their GPL rights to request it.
Section 12 documented the panic trace's Modules linked in list, which included:
xt_geoip(O) nf_app(PO) t_miner(PO) tdts(PO) tm_crypto(O)
xt_dyn_random xt_dpi(O) ui_lcm(O) ubnthal(PO) ubnt_common(PO)
The (PO) taint flag is the kernel's own classification. It means the module is loaded with a MODULE_LICENSE() declaration that is not one of the GPL-compatible strings. The kernel taints itself when such modules are loaded specifically because their continued operation calls into question the kernel's GPL status.
Among these:
- tdts and t_miner are almost certainly licensed proprietary code from Trend Micro. They register xt_match netfilter hooks and export functions like tdts_shell_dpi_l3_skb. They link directly against GPL kernel symbols (the __kstrtab_* and __ksymtab_* infrastructure exists for this purpose).
- nf_app, xt_dpi, and xt_geoip are likely Ubiquiti's own proprietary netfilter extensions that integrate with the DPI engine.
- ubnthal, ubnt_common, and ui_lcm are Ubiquiti's hardware abstraction layer.
The legal status of these modules is contested in general terms. The specific question for Ubiquiti is: are these modules "derived works" of the kernel? The Free Software Foundation says any kernel module is. Linus Torvalds has historically said it depends on whether the module uses EXPORT_SYMBOL_GPL interfaces and on whether the module has independent existence outside of Linux.
For tdts specifically: Trend Micro markets the underlying technology as portable across operating systems (it runs on Windows, FreeBSD, etc.), which would weigh in favor of "independent existence" under Torvalds's standard. For nf_app, xt_dpi, and ubnthal: these are by name and design Ubiquiti-specific kernel-only modules; they have no plausible existence independent of Ubiquiti's Linux distribution. Under either FSF's or Torvalds's standard, nf_app, xt_dpi, and ubnthal would appear to be derived works of the kernel and therefore subject to GPL.
The closed-source modules link against GPL kernel symbols using EXPORT_SYMBOL and EXPORT_SYMBOL_GPL exports. Some of those exports — particularly conntrack extension registration — were added by Ubiquiti's own kernel patches (per Section 13).
In other words: Ubiquiti modified the kernel (a GPL'd derived work, requiring source release) specifically to add GPL'd interfaces that proprietary modules would link against. Whether this is a GPL violation depends on the resolution of the GPL-vs-proprietary-module question, but it is a structurally significant observation: the proprietary modules and the kernel patches are designed to work together as a single integrated system. The kernel cannot be replaced without breaking the proprietary modules; the proprietary modules cannot run on any other kernel.
That tight integration is what FSF would call "a single program in two pieces" — a derived work. Under that interpretation, the entire firmware would need to be GPL-licensed, and the proprietary modules would be in violation.
If you own an EFG, you have a legal right under GPL-2.0 to request the complete source code of the kernel running on your device. That includes:
- The base kernel source, with full version history
- All patches applied by Ubiquiti and any third parties
- The build configuration (.config)
- Any installation/build scripts necessary to reconstruct the binary
- The kernel modules whose source is GPL
This right cannot be waived by EULA. If Ubiquiti refuses to provide this source, that refusal is a violation of GPL-2.0 §3, and the appropriate path forward is:
- Make a written request to opensource-requests@ui.com specifying the firmware version
- If no response within 30 days, escalate to Ubiquiti's legal department
- If still no response, contact the Software Freedom Conservancy at compliance@sfconservancy.org — they handle GPL enforcement on behalf of multiple Linux kernel copyright holders
- The Conservancy can pursue compliance via their kernel-enforcement program
The EFG is a flagship enterprise router from a publicly-traded networking vendor (Ubiquiti, NYSE: UI). It is sold to enterprises, cloud providers, government agencies, and home users. The firmware running on it includes 6,357 kernel symbols that no customer can audit because the source is not published.
Network device firmware is some of the most security-sensitive software in any infrastructure. The kernel running on a firewall or router decides what packets enter and leave the network. Bugs and backdoors in that kernel directly affect every device behind it.
GPL-2.0 was specifically designed to ensure that customers and security researchers can audit the software running on the devices they own. Vendor compliance with the license is not a courtesy — it is a precondition for the trust the GPL ecosystem makes possible.
The findings in this document — that even Marvell's complete public BSP source is insufficient to build modules that work on the EFG, that 6,357 symbols are unique to Ubiquiti's kernel, and that closed-source modules with (PO) taint flags are integrated with the netfilter subsystem — are exactly the kind of findings that demonstrate why GPL compliance is important. The license requires that this kind of analysis be unnecessary, because the source should be available.
Many of the findings in this document have already been raised with Ubiquiti through their official channels. The vendor's responses are themselves part of the record.
The author of this document opened a support ticket with Ubiquiti approximately one year prior to publication, describing the inter-VLAN performance bottleneck on the EFG and proposing the architectural fix in detail — specifically, recommending that Ubiquiti adopt the DPDK + VPP + Suricata-on-DPDK reference architecture that Marvell themselves publish for the OCTEON CN9K silicon family.
The ticket has not received a substantive engineering response. It remains effectively open without resolution.
This means the central technical recommendation of this document — that the EFG can deliver substantially higher throughput by adopting the dataplane architecture its silicon vendor publishes — was already in Ubiquiti's hands a year ago, with implementation guidance, and was not acted upon.
Section 12.1 of this document catalogues the security configuration choices in the EFG's running kernel:
- module.sig_enforce=0 — modules can be loaded without signature verification
- CONFIG_MODULE_SIG not set — the kernel was not even built with signing infrastructure
- No lockdown= argument on the kernel command line — the Secure Boot lockdown LSM is not engaged
- CONFIG_SECURITY_LOCKDOWN_LSM not set in the kernel build
- Overlayfs root filesystem with a writable upper layer — kernel-loadable code can be persisted
- kernel.kptr_restrict=0 — the full kallsyms table with virtual addresses is exposed
Combined with the kernel's CONFIG_MODVERSIONS=N setting (Section 12.4), this means: any process with CAP_SYS_MODULE (root, including any context that escalates to root) can load arbitrary kernel code, and there is no in-kernel mechanism to detect or prevent that loading. The watchdog will reboot the device on a kernel panic, but a successfully-loaded malicious module that doesn't crash the kernel would persist indefinitely.
Separately, the author identified additional security findings on the EFG — notably the presence of private cryptographic key material accessible via the firmware image (per the *_FW_KEY strings observed in Section 13.6's symbol analysis, alongside other findings not detailed here for responsible disclosure reasons).
These findings were submitted through Ubiquiti's HackerOne bug bounty program — the formal, documented channel for security disclosure to the vendor.
Ubiquiti rejected the submission. The stated reason: the attacker would require network access to exploit the issue.
This rationale does not survive scrutiny when applied to a network gateway:
- A network gateway is, by definition, on the network. Network access to the device is the universal precondition for any attack against it.
- The threat model that a security-conscious gateway is designed to defend against is precisely "an attacker who has gained network access" — whether that's a compromised endpoint behind the gateway, a hostile guest device on the same VLAN, or an internal lateral-movement scenario in an enterprise breach.
- Gateway vendors with mature security postures (Cisco, Juniper, Palo Alto, Fortinet, Arista, etc.) routinely accept and remediate vulnerabilities under this threat model. CVEs against these products list "network adjacent" or "network reachable" as the qualifying attack vector, not a disqualifying one.
- The official CVSS v3.1 scoring system explicitly defines "Adjacent Network" (AV:A) and "Network" (AV:N) as valid attack vectors. A vendor declining to engage with vulnerabilities in those classes is declining to engage with most of the vulnerability landscape for their product category.
The rejection is therefore not just a technical disagreement — it is a stated position on what kinds of attacks Ubiquiti considers in scope for their bounty program. By that stated standard, an attacker who has already established a foothold on the network behind the EFG is not a threat the EFG considers itself responsible for defending against. That is an unusual posture for a $2,000 device sold and marketed as an enterprise security gateway.
Putting these data points together with the GPL findings in Section 14:
| Issue raised | Channel | Year | Vendor response |
|---|---|---|---|
| Inter-VLAN performance, with DPDK fix recommendation | Standard support | ~1 year ago | No substantive engineering response |
| Security configuration / private key exposure | HackerOne bug bounty | Recent | Rejected: "requires network access" |
| GPL kernel source release | Email to opensource-requests@ui.com | Pending | Pending |
| GPL kernel source release | Public web page | Historical | Page removed |
The historical context is also relevant: Ubiquiti was publicly accused of GPL violations in 2015 and again in 2019, and the pattern has continued.
The findings in this document are not surprising vendor disclosures. They are issues that engineering, security, and licensing teams within the vendor have either been told about or are demonstrably aware of and have chosen not to act on. The reason this document exists in public form is that the channels designed for these conversations — support tickets, bug bounty programs, GPL compliance contacts — have not produced action.
This investigation began as a performance analysis: why does a $2,000 enterprise router with two 25 GbE SFP28 ports deliver only ~1 Gbps of single-stream inter-VLAN throughput, and ~3 Gbps of single-stream PPPoE WAN throughput? The lab data is unambiguous. The bottlenecks are software-architectural choices, not hardware limitations:
- The kernel network stack on a single core has a ~5 Gbps single-stream ceiling when offloads are off, regardless of CPU vendor.
- Hardware offloads are disabled by default on the EFG. Enabling them is a 4-7× improvement on otherwise-identical configurations.
- The 5-deep iptables FORWARD chain pattern the EFG ships with costs roughly half of single-stream throughput when offloads are also off.
- nftables flowtable, a kernel feature available since Linux 4.16 and shipped enabled by every major distribution, is not even compiled into the EFG's kernel. Adding it gives 3-7× single-stream improvement.
- DPDK + VPP on the same silicon — using software stacks that Marvell themselves publish — would deliver 15-25× the throughput. The Cortex-A72-class cores in the Octeon CN9670 can sustain 6-12 Gbps per core in a userspace dataplane. The chip has 18 of those cores.
- PPPoE forwarding is single-cored in stock Linux because of how ppp_generic is structured. The fix exists in DPDK and was being upstreamed at time of writing.
These are not exotic or research-grade fixes. Three of them are configuration changes. One requires loading a kernel module that's already in mainline. The most architecturally significant — DPDK + VPP — uses Marvell's own published reference architecture. The hardware was designed for this; the firmware just doesn't use it.
The conntrack helper toggle Ubiquiti recently shipped in the UniFi controller (Section 10 Finding 5, Section 11 Fix 7) is informative beyond its narrow effect. It exposes the FTP/H.323/SIP/PPTP/TFTP helpers as administrator-controllable. The toggle's existence proves Ubiquiti's engineering team is actively reasoning about per-flow netfilter overhead — they identified that helpers cost something, and shipped a workaround to let users disable them. They did not ship the proper fix, which is the kernel's flowtable infrastructure, even though the proper fix would address every architectural finding in this document and the partial fix addresses only one. That is a choice, not an oversight.
Section 9 extended the analysis from the EFG to the UDM Beast — Ubiquiti's next-generation gateway with newer Marvell Octeon CN10K silicon, ARM Neoverse N2 cores, an 18-month-newer kernel, and a dedicated Marvell switching ASIC. Direct diagnostics show the same architectural pattern: switchdev offload hard-disabled across every interface, tc filter rules explicitly tagged not_in_hw, 67 GB of WAN traffic processed through CPU-only software paths. The dedicated ASIC handles 1.27 billion packets of intra-VLAN switching but is bypassed for inter-VLAN routing. A faster CPU and a switching ASIC do not fix the architecture; they just raise the floor.
Section 12 documented our attempt to apply the most surgical of these fixes — adding the missing nftables flowtable kernel modules — to a real production EFG. Two builds were attempted:
- Vanilla Linux 5.15.72 from kernel.org → byte-perfect vermagic match → kernel panic at nf_tables_init_net+0x18
- Marvell's complete published OCTEON BSP source (linux-yocto branch v5.15/standard/cn-sdkv5.15/octeon) → byte-perfect vermagic match → kernel panic at the identical instruction
The fact that both crashes occurred at the same function offset proves that the ABI mismatch is not introduced by Marvell's BSP patches. It is introduced by something Ubiquiti has applied on top of Marvell's BSP — patches Ubiquiti has not published.
Section 14 quantified that delta: 6,357 kernel symbols exist in the running EFG kernel that are present in neither vanilla Linux 5.15.72 nor Marvell's complete public BSP. Approximately 1 in 19 symbols in the EFG's kernel is unique to Ubiquiti's build and not derivable from any public source. These include:
- Conntrack extension types for proprietary DPI integration (nf_ct_ext_dpi_destroy, nf_conntrack_dpi_init)
- A 116-symbol tdts namespace exposing kernel internals to a closed-source Trend Micro DPI engine
- HTTP and H.323 application-layer protocol decoders running in kernel space
- A 45-symbol Ubiquiti hardware abstraction layer
Section 14 addressed what these findings mean for GPL-2.0 compliance:
- Ubiquiti has shipped a substantially modified Linux kernel without publishing the corresponding source
- The proprietary kernel modules tdts, t_miner, nf_app, xt_dpi, ubnthal, and ubnt_common link against GPL kernel symbols and operate as integrated components of the running kernel
- Specifically, nf_app, xt_dpi, and ubnthal have no existence independent of Ubiquiti's Linux integration and would be derived works under either FSF's or Linus Torvalds's interpretation of the GPL
- Ubiquiti's open-source download page has been removed; their GitHub presence does not contain firmware sources
- This continues a documented pattern — Ubiquiti was publicly accused of GPL violations in 2015 (resolved only after sustained pressure) and again in 2019
- A formal request has been filed via the channel Ubiquiti's support team specified
The GPL exists specifically so that customers can audit and modify the software running on devices they own. The fact that this analysis required reverse-engineering kernel symbol tables from a binary firmware image — when the GPL requires the source be available on request — is itself the finding.
Section 15 documented direct vendor engagement: a performance ticket open with Ubiquiti for approximately one year recommending the DPDK fix (no substantive engineering response), a security disclosure submitted through Ubiquiti's HackerOne bug bounty program (rejected on the grounds that exploitation requires network access — a position that does not survive scrutiny when applied to a network gateway), and the GPL request now pending. The findings in this document are not novel disclosures to the vendor; they are issues the vendor has been told about, through the channels designed for these conversations, and has chosen not to act on.
If you are evaluating or already operating EFG/UDM/UXG hardware, the questions to put to your Ubiquiti account team are:
- Performance: When will inter-VLAN single-stream throughput on the EFG match the marketed 25 GbE port speeds for normal enterprise workloads (TCP, MTU 1500, with stateful firewall rules)?
- Roadmap: Does Ubiquiti's roadmap include adopting DPDK-based dataplanes (which Marvell's reference architecture for this silicon recommends and supports)?
- Configuration: Will Ubiquiti expose nftables flowtable, hardware offload, and conntrack helper toggles as administrator-controllable settings before any DPDK migration?
- GPL compliance: Will Ubiquiti publish the complete kernel source corresponding to current EFG firmware versions, including all patches, build configuration, and the source of nf_app, xt_dpi, ubnthal, and ubnt_common?
The first three are about getting the performance you paid for. The fourth is about knowing what's running on your network.
The EFG, UDM Beast, UXG-Lite, UXG-Pro, and other Ubiquiti gateways share substantial portions of this kernel and firmware design. The Section 9 cross-generation analysis on the UDM Beast establishes that the architectural pattern is not specific to one product or one silicon generation — it persists across newer SoCs, newer kernels, and even with dedicated switching ASIC hardware available. The performance characteristics documented here for the EFG are likely to apply, with proportional differences in absolute numbers, across the product line.
If your home or small-office workload is dominated by single-stream throughput (a single VPN tunnel, a single large file transfer, a single backup job), you are likely bottlenecked by the issues described above, regardless of how fast your internet connection or LAN switch is.
The most impactful workaround available without firmware changes is to enable hardware offloads where Ubiquiti's UI exposes the toggle. Beyond that, the architectural fix is in Ubiquiti's hands.
| # | NIC | Forwarder | MTU | Offloads | Rules | Single-stream | Notes |
|---|---|---|---|---|---|---|---|
| 1 | virtio | kernel | 9000 | on | none | 16.9 Gbps | naïve baseline |
| 2 | virtio | kernel | 9000 | off | none | 17.2 Gbps | jumbo hides per-packet cost |
| 3 | virtio | kernel | 1500 | off | none | 4.95 Gbps | EFG-realistic baseline; 1 core 100% soft |
| 4 | virtio | kernel | 1500 | off | + ct module | 4.84 Gbps | trivial overhead |
| 5 | virtio | kernel | 1500 | off | + simple ct rule | 4.64 Gbps | 4% drop |
| 6 | virtio | kernel | 1500 | off | EFG 5-chain replica | 2.36 Gbps | smoking gun |
| 7 | virtio | kernel | 1500 | off | EFG (8 streams) | 11.4 Gbps agg | scales with cores |
| A | virtio | kernel | 1500 | off | flowtable | 7.05 Gbps | flowtable alone, 3× over EFG |
| B | virtio | kernel | 1500 | on | flowtable | 17.4 Gbps | one-line config improvement |
| K1 | ConnectX VF | kernel | 1500 | on | none | 25.3 Gbps | real silicon baseline |
| K2 | ConnectX VF | kernel | 1500 | on | EFG 5-chain | 21.1 Gbps | GRO hides per-packet cost |
| K3 | ConnectX VF | kernel | 1500 | off | none | 4.74 Gbps | matches virtio with offloads off |
| K4 | ConnectX VF | kernel | 1500 | off | EFG 5-chain | 4.70 Gbps | I/O is the bottleneck here |
| V0 | virtio | VPP/DPDK | 1500 | off | n/a | 6.78 Gbps | DPDK with virtio-pmd; bottlenecked by vhost-net |
| V1 | ConnectX VF | VPP/DPDK | 1500 | client off | n/a | 15.7 Gbps | wire-packet processing |
| V2 | ConnectX VF | VPP/DPDK | 1500 | client on | n/a | 35.6 Gbps | headline number |
#!/usr/sbin/nft -f
flush ruleset
table inet filter {
chain alien_chain {
counter
ip protocol tcp counter
ip saddr 10.0.0.0/8 counter
}
chain tor_chain {
counter
ip protocol tcp counter
tcp flags & (syn|ack) == ack counter
}
chain ips_chain {
counter
ip protocol tcp counter
meta l4proto tcp counter
tcp dport { 1-65535 } counter
}
chain ubios_chain {
counter
ip protocol tcp counter
ct state established counter
}
chain user_chain {
counter
ct state established,related counter
ip saddr 10.10.10.0/24 ip daddr 10.10.20.0/24 counter
}
chain forward {
type filter hook forward priority 0; policy accept;
jump alien_chain
jump tor_chain
jump ips_chain
jump ubios_chain
jump user_chain
}
}
table ip nat {
chain postrouting {
type nat hook postrouting priority 100;
oifname "enp6s18" masquerade
}
}
#!/usr/sbin/nft -f
flush ruleset
table inet filter {
flowtable f {
hook ingress priority 0
devices = { enp6s19, enp6s20 }
}
chain forward {
type filter hook forward priority 0; policy accept;
ip protocol { tcp, udp } flow add @f
ct state established,related accept
}
}
table ip nat {
chain postrouting {
type nat hook postrouting priority 100;
oifname "enp6s18" masquerade
}
}
unix {
nodaemon
log /var/log/vpp/vpp.log
full-coredump
cli-listen /run/vpp/cli.sock
gid vpp
}
api-trace { on }
api-segment { gid vpp }
socksvr { default }
cpu {
main-core 0
corelist-workers 1
}
buffers {
buffers-per-numa 32768
default data-size 2048
}
dpdk {
dev 0000:01:00.0 {
name lab-vlan10
num-rx-queues 1
num-tx-queues 1
}
dev 0000:02:00.0 {
name lab-vlan20
num-rx-queues 1
num-tx-queues 1
}
}
plugins {
plugin default { enable }
plugin dpdk_plugin.so { enable }
}
Thread 1 vpp_wk_0 (lcore 1)
Time 257.0, vector rate 3.5586e5 in/out, packets/sec
Name Calls Vectors Packet-Clocks Vectors/Call
dpdk-input polling 2683609353 91446442 4.25e3 .03
ethernet-input active 12518445 91446442 9.41e1 7.30
ip4-input-no-checksum 12136093 91446437 3.98e1 7.54
ip4-lookup active 12136093 91446437 5.23e1 7.54
ip4-rewrite active 12136093 91446437 3.86e1 7.54
lab-vlan20-output active 10310280 89229310 1.21e1 8.65
lab-vlan20-tx active 10310280 89229310 3.79e1 8.65
VPP per-packet end-to-end cost on Zen 4: ~80 cycles (ethernet-input + ip4-input + ip4-lookup + ip4-rewrite + interface-output + tx) ≈ 16 nanoseconds per packet at 5 GHz. Theoretical ceiling on this pipeline: ~700+ Gbps single-core.
$ uname -a
Linux EFG-Home-SP 5.15.72-ui-cn9670 #5.15.72 SMP Wed Apr 15 23:39:47 CST 2026 aarch64
$ iptables -L FORWARD -n -v --line-numbers
Chain FORWARD (policy ACCEPT)
1 555K 775M ALIEN
2 2764K 4489M TOR
3 238M 354G IPS
4 874M 1342G UBIOS_FORWARD_JUMP
$ nft list flowtables
[empty]
$ lsmod | grep nf_flow_table
[empty]
$ ps -eo pid,pcpu,comm --sort=-pcpu | head -8
4098469 39.6 dpi-flow-stats
3139 12.5 ubios-udapi-ser
66687 7.8 java
4891 7.0 conntrackd
2491041 6.9 Suricata-Main
5505 6.2 mcad
8596 3.9 unifi-core
$ sysctl net.netfilter.nf_conntrack_max
net.netfilter.nf_conntrack_max = 10485760
$ lsmod | grep nf_conntrack | grep -v '^nf_conntrack '
nf_conntrack_tftp 262144 1 nf_nat_tftp
nf_conntrack_pptp 327680 1 nf_nat_pptp
nf_conntrack_h323 327680 1 nf_nat_h323
nf_conntrack_ftp 327680 1 nf_nat_ftp
Cross-compilation environment:
- Host: Threadripper Pro 7995WX, Ubuntu 24.04 LTS VM, 16 vCPU, 32 GB RAM
- Toolchain: gcc-10-aarch64-linux-gnu 10.5.0 from the Ubuntu universe repo
- Kernel source: linux-5.15.72.tar.xz from kernel.org (verified SHA256)
- Build configuration: the EFG's exposed /proc/config.gz plus three module enables for NF_TABLES, NF_FLOW_TABLE, and NF_FLOW_TABLE_INET
- LOCALVERSION: -ui-cn9670 (matching the EFG's published version string)
- Build time: 1 minute 52 seconds (16-thread parallel build)
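The build commands themselves are not reproduced verbatim here; a hypothetical reconstruction consistent with the description above (the exact cross-compiler binary name depends on how the Ubuntu package installs it):

$ zcat config.gz > .config                        # the EFG's /proc/config.gz, copied off the device
$ ./scripts/config --set-str LOCALVERSION "-ui-cn9670" \
      --module NF_TABLES --module NF_FLOW_TABLE --module NF_FLOW_TABLE_INET
$ make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- olddefconfig
$ make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- -j16 modules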
Modules produced:
net/netfilter/nf_tables.ko (10.3 MB)
net/netfilter/nf_flow_table.ko (1.8 MB)
net/netfilter/nf_flow_table_inet.ko (495 KB)
Vermagic verification (build host):
$ for ko in nf_tables.ko nf_flow_table.ko nf_flow_table_inet.ko; do
strings $ko | grep ^vermagic
done
vermagic=5.15.72-ui-cn9670 SMP mod_unload aarch64
vermagic=5.15.72-ui-cn9670 SMP mod_unload aarch64
vermagic=5.15.72-ui-cn9670 SMP mod_unload aarch64
Vermagic verification (EFG, in-tree module):
$ modinfo nf_conntrack_ftp | grep vermagic
vermagic: 5.15.72-ui-cn9670 SMP mod_unload aarch64
Match: exact, character-for-character.
Kernel panic on load attempt (insmod ./nf_tables.ko):
Unable to handle kernel NULL pointer dereference at virtual address 0x0000000000000120
ESR = 0x96000005, EC = 0x25: DABT (current EL), IL = 32 bits
FSC = 0x05: level 1 translation fault
[0000000000000120] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000
Internal error: Oops: 96000005 [#1] SMP
Code: 910003fd b9432021 f9000bf3 f9455400 (f8615813)
Kernel panic - not syncing: Oops: Fatal exception
Recovery: watchdog hard-reboot, ~2 minute downtime, no permanent damage. Failover to secondary gateway functioned correctly throughout.
Root cause: CONFIG_MODVERSIONS is disabled in the EFG's kernel config, so symbol-CRC verification did not catch the binary ABI mismatch between vanilla 5.15.72 and Ubiquiti's patched 5.15.72-ui-cn9670 build at module load time. The module linked successfully against the running kernel but encountered mismatched struct layouts during init, dereferencing a NULL pointer in the netfilter subsystem.
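One way to hunt for this kind of struct-layout drift is to diff pahole output between builds. That requires DWARF debug info, so it can compare the vanilla and BSP trees against each other (when built with CONFIG_DEBUG_INFO) but not against the EFG's shipped Image; struct net is used here only as an example of a structure the crashing init path touches:

$ pahole -C net ~/efg-build/vanilla-5.15.72/linux-5.15.72/vmlinux > /tmp/struct-net.vanilla
$ pahole -C net ~/efg-build/marvell-bsp/linux-yocto-cnxk-5.15/vmlinux > /tmp/struct-net.bsp
$ diff -u /tmp/struct-net.vanilla /tmp/struct-net.bsp | head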
GPL source request status: filed with opensource-requests@ui.com requesting the complete corresponding source code for kernel 5.15.72-ui-cn9670, including all Ubiquiti and Marvell patches, build configuration, toolchain version, and packaging scripts. Outcome will determine whether the experiment can be re-attempted with a kernel tree that produces ABI-compatible modules.
All measurements were taken on a single physical machine over a continuous test session. Configuration files, scripts, and raw iperf3 outputs are available on request.
Build environment (same VM as A.7):
- Ubuntu 24.04 LTS, 16 vCPU, 32 GB RAM
- gcc-10-aarch64-linux-gnu 10.5.0
- Kernel source: linux-yocto repository, branch v5.15/standard/cn-sdkv5.15/octeon
- Repository URL: https://git.yoctoproject.org/linux-yocto.git
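For completeness, fetching that tree is straightforward; the local directory name below matches the paths used later in this appendix:

$ git clone https://git.yoctoproject.org/linux-yocto.git linux-yocto-cnxk-5.15
$ cd linux-yocto-cnxk-5.15
$ git checkout v5.15/standard/cn-sdkv5.15/octeon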
Tree state:
$ git branch --show-current
v5.15/standard/cn-sdkv5.15/octeon
$ git log --oneline -3
7f33f19a49e6 (HEAD) Merge branch 'v5.15/standard/base' into v5.15/standard/cn-sdkv5.15/octeon
65333c3a0bcd Merge tag 'v5.15.203' into v5.15/standard/base
b9d57c40a767 Linux 5.15.203
Modifications to make HEAD identify as 5.15.72:
$ sed -i 's/^SUBLEVEL = .*/SUBLEVEL = 72/' Makefile
$ touch .scmversion # suppress dirty marker
$ make kernelrelease
5.15.72-ui-cn9670
Configuration (using EFG's /proc/config.gz as base):
CONFIG_LOCALVERSION="-ui-cn9670"
CONFIG_NF_TABLES=m
CONFIG_NF_TABLES_INET=y
CONFIG_NF_TABLES_IPV4=y
CONFIG_NF_TABLES_IPV6=y
CONFIG_NF_FLOW_TABLE=m
CONFIG_NF_FLOW_TABLE_INET=m
CONFIG_NF_FLOW_TABLE_IPV4=m
CONFIG_NF_FLOW_TABLE_IPV6=m
CONFIG_NF_FLOW_TABLE_PROCFS=y
# CONFIG_DEBUG_INFO_BTF is not set
# CONFIG_MODULE_SIG is not set
Build output:
$ time make -j16
real 1m59s
user 23m50s
sys 4m33s
$ for ko in $(find . -name 'nf_tables.ko' -o -name 'nf_flow_table*.ko' | sort); do
echo "=== $(basename $ko) ==="
strings $ko | grep -E '^(vermagic|name|depends|description)='
done
=== nf_flow_table_ipv4.ko ===
description=Netfilter flow table support
depends=nf_flow_table,nf_tables
name=nf_flow_table_ipv4
vermagic=5.15.72-ui-cn9670 SMP mod_unload aarch64
=== nf_flow_table_ipv6.ko ===
description=Netfilter flow table IPv6 module
depends=nf_flow_table,nf_tables
name=nf_flow_table_ipv6
vermagic=5.15.72-ui-cn9670 SMP mod_unload aarch64
=== nf_flow_table.ko ===
description=Netfilter flow table module
depends=
name=nf_flow_table
vermagic=5.15.72-ui-cn9670 SMP mod_unload aarch64
=== nf_flow_table_inet.ko ===
description=Netfilter flow table mixed IPv4/IPv6 module
depends=nf_flow_table,nf_tables
name=nf_flow_table_inet
vermagic=5.15.72-ui-cn9670 SMP mod_unload aarch64
=== nf_tables.ko ===
depends=
name=nf_tables
vermagic=5.15.72-ui-cn9670 SMP mod_unload aarch64
Crash trace from EFG load attempt:
[ 3368.013405] Unable to handle kernel NULL pointer dereference at virtual address 0
[ 3368.022216] Mem abort info:
[ 3368.025005] ESR = 0x96000005
[ 3368.028072] EC = 0x25: DABT (current EL), IL = 32 bits
[ 3368.033402] FSC = 0x05: level 1 translation fault
[ 3368.074382] Modules linked in: nf_tables(+) wireguard libchacha20poly1305 ...
xt_geoip(O) nf_app(PO) t_miner(PO) tdts(PO) tm_crypto(O)
xt_dyn_random ip6table_nat xt_conntrack xt_connmark xt_TCPMSS pppoe
pppox bonding xt_dpi(O) ip6table_mangle iptable_mangle ip6table_filter
ip6_tables uio_pdrv_genirq ui_lcm(O) ifb ppp_generic slhc
ubnthal(PO) ubnt_common(PO) drm drm_panel_orientation_quirks
[ 3368.121977] CPU: 3 PID: 211748 Comm: insmod Tainted: P W O 5.15.72-ui-cn9670 #5.15.72
[ 3368.130936] Hardware name: Marvell OcteonTX CN96XX board (DT)
[ 3368.143638] pc : nf_tables_init_net+0x18/0x94 [nf_tables]
[ 3368.149059] lr : ops_init+0x3c/0x120
[ 3368.227314] x2 : ffff00019027b300 x1 : 0000000000000000 x0 : 0000000000000000
[ 3368.234825] nf_tables_init_net+0x18/0x94 [nf_tables]
[ 3368.238053] ops_init+0x3c/0x120
[ 3368.242840] register_pernet_operations+0xec/0x240
[ 3368.247195] register_pernet_subsys+0x2c/0x50
[ 3368.252609] nf_tables_module_init+0x24/0x100 [nf_tables]
[ 3368.297899] ---[ end trace d3e1e407900e8e95 ]---
[ 3368.316500] Kernel panic - not syncing: Oops: Fatal exception
The HA failover handled the brief outage; service downtime was approximately 8 seconds.
# Step 1: Decompress the EFG kernel image (shipped as a gzip-compressed aarch64 Image with a PE/COFF EFI-stub header)
# from EFG: /boot/vmlinuz-5.15.72-ui-cn9670 (12 MB)
$ gunzip -c /boot/vmlinuz-5.15.72-ui-cn9670 > efg-vmlinuz
$ binwalk efg-vmlinuz | head -3
0 0x0 Linux kernel ARM64 image, image size: 29818880 bytes
# Step 2: Capture running symbol table (kallsyms is unrestricted on EFG)
# from EFG:
$ cat /proc/kallsyms > /tmp/efg-kallsyms.txt
$ wc -l /tmp/efg-kallsyms.txt
130789
# Step 3: Build vanilla 5.15.72 vmlinux (full build, not just modules)
$ cd ~/efg-build/vanilla-5.15.72/linux-5.15.72
$ make -j16 vmlinux
# Step 4: BSP vmlinux (already built for module experiment in 11.5)
# Step 5: Three-way symbol comparison
$ awk '{print $3}' /tmp/efg-kallsyms.txt | sort -u > /tmp/efg-syms.txt
$ nm ~/efg-build/marvell-bsp/linux-yocto-cnxk-5.15/vmlinux 2>/dev/null \
| awk '{print $3}' | sort -u > /tmp/bsp-syms.txt
$ nm ~/efg-build/vanilla-5.15.72/linux-5.15.72/vmlinux 2>/dev/null \
| awk '{print $3}' | sort -u > /tmp/vanilla-syms.txt
$ wc -l /tmp/*-syms.txt
115998 /tmp/bsp-syms.txt
120399 /tmp/efg-syms.txt
112581 /tmp/vanilla-syms.txt
# Step 6: Find symbols in EFG kernel but not in either public source
$ comm -23 /tmp/efg-syms.txt \
<(sort -u /tmp/vanilla-syms.txt /tmp/bsp-syms.txt) \
| grep -vE "^(\.L[0-9]+|\.LC[0-9]+|\.LBE|\.LFE|\.LFB|\.Letext|\.Ldebug|\.Lframe|__compound_literal\.|__func__\.|__warned\.|CSWTCH\.)" \
> /tmp/efg-unique-real-syms.txt
$ wc -l /tmp/efg-unique-real-syms.txt
6357

Filter rationale: The grep -vE pattern excludes compiler-generated local labels (.L<n>, .LC<n>, .LBE, and similar), which differ across every build of every kernel and carry no information about kernel structure. The remaining 6,357 symbols are real exported names, function names, and global variable names.
Top-level breakdown by name prefix:
$ awk -F'_' '{print $1}' /tmp/efg-unique-real-syms.txt | grep -v "^\." \
| sort | uniq -c | sort -rn | head -20
2646 (no prefix or various)
799 drm
195 bond
116 tdts
113 wg
104 my
66 fsv
59 ppp
51 mlxsw
46 shell
45 ubnthal
44 proc
44 get
42 dev
42 bonding
33 tm
32 nf
30 tcp
29 pppoe
27 ppu
Note: the drm count includes graphics-driver code that may have come from a source other than vanilla or the BSP (Ubiquiti uses a MediaTek display panel for the EFG's front-panel LCD). The wg (WireGuard) count likely reflects an upstream backport. The tdts, tm, ubnthal, and nf_*dpi* counts are the diagnostic ones.
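Those families can be pulled straight out of the unique-symbol list for closer inspection, for example (outputs omitted):

$ grep -c '^tdts_' /tmp/efg-unique-real-syms.txt           # the Trend Micro tdts namespace
$ grep 'dpi' /tmp/efg-unique-real-syms.txt | head          # conntrack/DPI integration symbols
$ grep '^ubnthal' /tmp/efg-unique-real-syms.txt | head     # Ubiquiti hardware-abstraction symbols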
The following text was sent to opensource-requests@ui.com:
Subject: GPL Source Request — Enterprise Fortress Gateway (EFG) Kernel Source
I am the owner of an Ubiquiti Enterprise Fortress Gateway (EFG) running firmware version [version], with kernel version 5.15.72-ui-cn9670. Per the terms of GPL-2.0, I am formally requesting the complete corresponding source code for this firmware's GPL-licensed components, including but not limited to:
- The complete Linux kernel source tree corresponding to 5.15.72-ui-cn9670, including:
- The base kernel source
- All patches applied by Ubiquiti and any third parties (Marvell, Trend Micro, etc.)
- The kernel build configuration (.config)
- The Marvell OCTEON CN9670 BSP drivers (octeontx2_pf, octeontx2_vf, octeontx2_af, rvu_*, NIX, CPT, SSO, NPA)
- Source code for any GPL-licensed kernel modules, including those carrying GPL or GPL-compatible MODULE_LICENSE() declarations
- The device tree files (.dts, .dtsi) used by the firmware
- The build system, packaging recipes, and toolchain specification (compiler version, flags) sufficient to reproduce the binary
- Any other GPL components in the firmware (busybox, systemd, etc.)
Per GPL-2.0 §3, this source must be made available under the same license, in a form accessible to me. Acceptable delivery: a downloadable archive, a public git repository link, or physical media at cost.
[contact details]
The escalation path documented in Section 14.3 applies if no response is received.





I came across this whilst researching the EFG and UDM-Beast and to say marvelous work is an understatement.
I was wondering: does Suricata on the EFG run on a single core, the forwarding core (hence the inspection-tax section)? I extracted the Suricata config from my UDM SE and it seems they have a configuration that runs Suricata across all cores (assuming they use the same config on the EFG and UDM Beast), as evidenced by this:
According to Suricata docs:
Can you please check the file /usr/share/ubios-udapi-server/ips/config/suricata_ubios_high.yaml on the EFG and UDM-Beast and see whether they have the same threading config as the UDM-SE? And are you suggesting that Ubiquiti move the management-cpu-set to another core (i.e. core 4, core 2, or some other core)?