galvesribeiro/EFG-Broken.md

Last active May 12, 2026 01:00

Star (19) You must be signed in to star a gist
Fork (1) You must be signed in to fork a gist

Select an option

Learn more about clone URLs
Clone this repository at <script src="https://gist.github.com/galvesribeiro/89ce0232c8bc1971af84aee84746dc66.js"></script>
Save galvesribeiro/89ce0232c8bc1971af84aee84746dc66 to your computer and use it in GitHub Desktop.

Download ZIP

Why Your Ubiquiti EFG Can't Push 25 Gbps Inter-VLAN — and What's Actually Going On

Raw

EFG-Broken.md

Why Your Ubiquiti EFG Can't Push 25 Gbps Inter-VLAN — and What's Actually Going On

Or: How I Reproduced the Problem on x86, Tried to Load the Missing Modules on the Real Device, and What That Tells Us About Ubiquiti's Kernel

TL;DR

Ubiquiti markets the Enterprise Fortress Gateway (EFG) as a 25-gigabit-class router. The product page lists two 25 GbE SFP28 ports for WAN/LAN, and Ubiquiti positions the device as a flagship for medium and large enterprise deployments. Its silicon — a Marvell Octeon CN9670 — supports hardware-accelerated forwarding through purpose-built network engines (NIX) that should sustain tens of millions of packets per second. The UDM Beast, Ubiquiti's next-generation gateway, pairs a Marvell Octeon CN10K SoC (with ARM Neoverse N2 cores) with a dedicated Marvell Prestera-class switching ASIC accessed via PCIe — capabilities that, properly used, would offload most of the per-packet forwarding work into hardware.

In practice, real-world enterprise deployments report:

Inter-VLAN routing: ~1–1.5 Gbps single-stream, regardless of how fast the upstream link is
PPPoE WAN throughput: ~2–3 Gbps single-stream on 10 Gbps fiber connections, where the ISP requires PPPoE authentication
Total aggregate throughput: well below the marketed 25 Gbps WAN/LAN figures
With IDS/IPS enabled: Ubiquiti markets a "12 Gbps with IPS" rate, claimed against internet (LAN→WAN) traffic. In practice this is also unachievable on single-stream measurements — LAN→WAN traffic crosses a subnet boundary, hits NAT, and traverses the same single-core kernel forwarding path the writeup documents for inter-VLAN traffic, plus the additional cost of NAT mangling, plus PPPoE encapsulation if applicable. The 12 Gbps figure is achievable only as aggregate throughput across many parallel TCP flows with TSO/checksum offload and CPU work spread across cores via RSS. Single-stream LAN→WAN throughput with IPS enabled is bounded by the same per-core kernel forwarding ceiling as inter-VLAN — typically 1–2 Gbps.

This document analyzes both bottlenecks. It reproduces both problems in a controlled lab environment on x86 hardware, identifies the specific software architectural choices that cause them, demonstrates fixes whose effects can be measured to a precision of a few hundred Mbps, and documents in detail what happened when we attempted to apply the most surgical of those fixes — adding the missing nftables flowtable module — to a real production EFG.

We will show that the EFG's stock configuration delivers between 5% and 15% of the throughput its silicon is capable of. We will show three independent fixes that together can push it from ~1 Gbps single-stream to over 25 Gbps single-stream — without adding hardware. Two of those fixes are pure software configuration changes; the third is a kernel module that exists in mainline Linux and is shipped by Marvell themselves, but is not present in Ubiquiti's kernel build.

We then attempt to install the missing module on a real EFG. Building it against vanilla Linux 5.15.72 produces a kernel module with byte-perfect vermagic — and crashes the device on load. Building it against Marvell's complete published OCTEON BSP source from the Yocto Project produces another byte-perfect module that crashes at the identical function offset. Symbol-level analysis of the running EFG kernel reveals 6,357 unique symbols that exist in neither vanilla Linux nor Marvell's complete public BSP. These include conntrack extensions for proprietary DPI integration (nf_ct_ext_dpi_destroy, nf_conntrack_dpi_init), a 116-symbol tdts namespace exposing kernel internals to a closed-source Trend Micro DPI engine, and significant hardware abstraction additions.

To rule out the possibility that the EFG is an outlier, the same diagnostic methodology is applied to a second Ubiquiti gateway — the UDM Beast, with newer Marvell Octeon CN10K silicon (ARM Neoverse N2 cores), a kernel newer by 18 months (6.6.46), and a dedicated Marvell switching ASIC accessed via PCIe. The result: the same architectural pattern across silicon generations. The ASIC is physically present and processes 1.27 billion intra-VLAN packets, but switchdev offload is hard-disabled across every interface (hw-tc-offload: off [fixed]), tc filter rules report not_in_hw, and 67 GB of WAN traffic has gone through a CPU-only software path mirrored to an ifb device. Inter-VLAN routing, like on the EFG, runs entirely in the kernel software stack. A faster CPU moves the floor up; the architecture is unchanged.

The conclusion: Ubiquiti has built a substantially modified kernel that they have not released sources for, and Ubiquiti's open-source download page no longer exists. Their GitHub organization contains no firmware or kernel sources. Closed-source tdts and t_miner modules link directly against kernel symbols and operate as derived works of the kernel. This appears to violate GPL-2.0, and continues a pattern: Ubiquiti was publicly accused of GPL violations in 2015 (resolved after sustained pressure) and again in 2019.

The performance issues in this document have been reported to Ubiquiti through their support channel for approximately one year, including specific implementation guidance pointing to Marvell's published DPDK reference architecture; no substantive engineering response has been received. Separately, security findings about the EFG's deliberate absence of secure boot, module signing, and integrity protection were submitted through Ubiquiti's HackerOne bug bounty program and rejected on the grounds that the attacker would require network access — a rationale that does not survive scrutiny when applied to a network gateway.

This is therefore both a technical analysis and a software-license compliance analysis, and it is published only after the channels designed for vendor engagement have failed to produce a response.

A note on methodology and AI assistance

This document was produced collaboratively with an AI assistant (Anthropic's Claude). The AI's role was to help structure findings, draft prose, suggest diagnostic commands, and consolidate the final write-up. All measurements, kernel builds, module load attempts, packet captures, EFG and UDM Beast diagnostics, lab tests, and reproductions described here were performed by me on real hardware and in real VMs that I personally configured and operated. The hardware exists, the commands were run, the crashes happened on my devices, and the outputs are real.

AI assistance does not eliminate human error — and using it can introduce new sources of error when the AI fills in plausible-sounding details that don't match reality. I have done my best to validate the technical claims in this document against my own measurements, kernel source, and external references. A factual error has already been caught and corrected (an earlier revision incorrectly referred to nf_flow_table_pppoe as a separate kernel module — it is not; PPPoE flowtable handling is inline within nf_flow_table.ko). That correction was made because a reader pushed back (not in a friendly nor constructive way, as usual, but I digress), and I'm grateful they did.

If you spot anything else that looks wrong — a command output that doesn't match what your system shows, a kernel-internal claim that contradicts the source you can read on kernel.org, a misidentified piece of silicon, or anything else — please tell me. I'd rather revise this in public than leave a mistake standing or try to brute force misconception based on fanboyism or guesswork. The goal of the document is to accurately describe how the EFG and UDM Beast actually behave, not to win an argument.

Comments, corrections, reproductions and constructive criticism (like I got already in the comments ❤️) are always welcome.

The Problem
Test Environment
Methodology
The Reference Run: Real EFG Diagnostics
Reproducing the Bottleneck — virtio-net Test Matrix
Closing the Loop — Real Silicon Test Matrix
Userspace Dataplane — VPP/DPDK Comparison
The PPPoE Bottleneck — A Related but Distinct Problem
Cross-Product Confirmation: UDM Beast and UCG Fiber
Findings: The Architectural Failures
Recommended Fixes
Direct Experimental Verification — Building the Missing Modules
Symbol-Level Forensics on the Running EFG Kernel
The GPL Compliance Question
Direct Vendor Engagement: What Ubiquiti Has Already Been Told
Conclusion
Appendix: Full Data Sets

1. The Problem

In practice, real-world enterprise deployments report:

Inter-VLAN routing: ~1–1.5 Gbps single-stream, regardless of how fast the upstream link is
PPPoE WAN throughput: ~2–3 Gbps single-stream on 10 Gbps fiber connections, where the ISP requires PPPoE authentication
NAT throughput: similar single-flow ceilings whenever IPS, deep-packet-inspection, or threat management features are enabled

Customers complain, post mpstat screenshots showing one CPU core saturated while the other 17 sit idle, and get told it is a hardware limitation.

It is not. The CPUs are not the bottleneck. The silicon is not the bottleneck. The bottleneck is the configuration of the Linux kernel network stack that ships on the device, including:

Hardware offload features that are explicitly disabled
A modern kernel fast-path feature (nf_flow_table) that is not loaded
A user-space inspection engine running on the same CPU core that is forwarding packets
A 5-deep iptables FORWARD chain that every new connection must traverse
Conntrack protocol helpers loaded for legacy protocols (PPTP, H.323) that no enterprise control plane lets you disable
Per-VLAN bridges instead of a vlan-aware single bridge
No DPDK fast-path despite Marvell shipping first-class DPDK PMDs (cnxk) for these exact SoCs

Each one of these contributes measurable overhead. Combined, they drop forwarding throughput by an order of magnitude. The point of this article is to measure each contribution independently and show what a properly-configured Linux router looks like on the same workload.

2. Test Environment

Host Machine ("skywalker")

CPU: AMD Ryzen Threadripper Pro 7995WX, 96 cores / 192 threads, base 2.5 GHz, boost 5.1 GHz, Zen 4 microarchitecture
RAM: 754 GB DDR5 ECC
Hypervisor: Proxmox VE 9.0.11
Kernel: Linux 6.14.11-4-pve
Storage: NVMe ZFS root pool (rpool)
Networking: Mellanox ConnectX-6 Dx dual-port 100 Gbps NIC (MT2892), bonded LACP 802.3ad
IOMMU: AMD-Vi enabled in passthrough mode
Hugepages: 64 × 1 GB = 64 GB reserved at boot

Reference Device

EFG (Enterprise Fortress Gateway), Ubiquiti Networks
- Marvell Octeon CN9670 SoC, 18 ARM v8.2 cores @ 2.0 GHz
- 64 GB RAM
- Linux 5.15.72-ui-cn9670 (vendor build)
- Live production firewall, 8 days uptime at capture, 7 active VLANs in an enterprise office network

Lab VM Topology on skywalker

Three Ubuntu 24.04 LTS VMs were cloned from a common template, each pinned via a Proxmox hookscript to a dedicated CCD on the host (8 vCPUs each):

   192.168.6.0/24  (mgmt — for SSH, never used for test traffic)
        |          |          |
        +----------+----------+
        |          |          |
   gw-router    client1    client2
   (VM 200)    (VM 201)    (VM 202)
   8 cores     8 cores     8 cores
   16 GB RAM   8 GB RAM    8 GB RAM
   Cores 8-15  Cores 16-23 Cores 24-31

For test traffic (multiple network paths used in different tests):

   client1 ────[VLAN 10]──── gw-router ────[VLAN 20]──── client2
              10.10.10.10                                10.10.20.10
                   ↕                                          ↕
              gw-router                                  gw-router
              10.10.10.1                                10.10.20.1

The VMs received traffic through one of three I/O paths during testing:

virtio-net through Linux bridges with VLAN tagging (vmbr1 on the host)
ConnectX-6 Dx VFs via SR-IOV passthrough (4 VFs total, 2 to gw-router, 1 to each client)
VPP/DPDK with the same VFs polled directly by VPP's worker threads in userspace

Single TCP stream iperf3 at MTU 1500 was used as the primary measurement. Multi-stream tests with -P 8 were used in select cases to demonstrate scaling behavior. Each measurement ran for 30 seconds with per-second reporting; the values reported are the iperf3 sender/receiver final summary, which agree to within 0.1 Gbps in all cases.

3. Methodology

To prove the architectural argument we needed to isolate independent variables:

Variable	Settings tested
I/O fabric	virtio-net (vhost-net backend), ConnectX VF (SR-IOV passthrough)
MTU	1500, 9000
Hardware offloads (GRO/TSO/LRO)	on, off
Forwarding rules	none, EFG-replica 5-chain ruleset
Forwarder	kernel `ip_forward`, kernel `ip_forward` + nftables flowtable, VPP/DPDK userspace

For each combination, single-stream iperf3 between client1 and client2 (i.e. across the gw-router VM, between two distinct IPv4 subnets) was measured. Because the host CPU does not vary across tests and because vCPU pinning is fixed via a Proxmox hookscript that calls taskset after the VM starts, every test runs on the same physical cores in the same NUMA configuration.

The "EFG-replica 5-chain ruleset" was constructed from observation of the live EFG. It mirrors the EFG's iptables FORWARD structure of ALIEN → TOR → IPS → UBIOS_FORWARD_JUMP → user → default chains, with conntrack lookups, protocol/port matchers, and per-chain counters that force per-packet evaluation in the slow path. The exact ruleset is in the appendix.

4. The Reference Run: Real EFG Diagnostics

Before running anything in the lab, we captured the configuration of a production EFG to know what we needed to reproduce. Every command below was executed on a customer-deployed EFG running stock Ubiquiti firmware. None of these settings are user-configurable from the UI — they are baked into how the device is configured through the UniFi Web UI, which eventually reflects on the changes in the underlying Linux subsystems.

4.1 — Hardware and Kernel

$ uname -a
Linux EFG-Home-SP 5.15.72-ui-cn9670 #5.15.72 SMP Wed Apr 15 23:39:47 CST 2026 aarch64

$ nproc
18

$ free -h
              total        used        free      shared  buff/cache   available
Mem:           63Gi        11Gi        46Gi       106Mi       5.3Gi         44Gi

$ uptime
02:09:29 up 8 days, 5:17, 1 user, load average: 2.52, 1.84, 1.86

Confirmed: Octeon CN9670 (per the kernel build identifier), 18 cores, 64 GB RAM. Kernel 5.15 dates from late 2021 — it predates several material networking improvements in 5.19+ (better flowtable hardware offload, improved nft, better mptcp, PPPoE flowtable acceleration in 6.2+).

4.2 — The 5-Deep FORWARD Chain (Smoking Gun #1)

$ iptables -L FORWARD -n -v --line-numbers
Chain FORWARD (policy ACCEPT 1033 packets, 157K bytes)
num   pkts bytes target                source       destination
1     555K  775M  ALIEN                 0.0.0.0/0    0.0.0.0/0
2     2764K 4489M TOR                   0.0.0.0/0    0.0.0.0/0
3     238M  354G  IPS                   0.0.0.0/0    0.0.0.0/0
4     874M 1342G  UBIOS_FORWARD_JUMP    0.0.0.0/0    0.0.0.0/0

In 8 days of uptime, this device has pushed:

874 million packets through UBIOS_FORWARD_JUMP
238 million through the IPS chain
2.76 million through TOR
555 thousand through ALIEN

Every packet that this gateway routes traverses at least 4 jump targets in sequence, plus whatever rules live inside each. Total rule count across filter, mangle, and nat tables:

$ iptables -t filter -L -n | wc -l
572
$ iptables -t mangle -L -n | wc -l
187
$ iptables -t nat -L -n | wc -l
80

839 rules total. And it's all running on the legacy iptables (xt_*) backend. The modern nft API is not in use:

$ nft list ruleset | wc -l
0

4.3 — No Flowtable. None. (Smoking Gun #2)

$ nft list flowtables
[empty output]

$ lsmod | grep -iE "flow_table|flowtable"
[empty output]

$ for iface in eth0 eth1 eth2 eth3; do
    ethtool -k $iface | grep hw-tc-offload
  done
[no output - module not loaded, feature not available]

The nf_flow_table kernel module is not loaded. There is no nft flowtable. There is no hardware tc-flower offload. The kernel's modern fast-path infrastructure — which can bypass conntrack and rule evaluation for established flows — is not even installed on this device.

This single missing piece is, as the lab measurements will show, worth a 3× to 7× single-stream throughput improvement on its own.

4.4 — Conntrack Sized for 10 Million, Currently 846 Used

$ sysctl net.netfilter.nf_conntrack_max
net.netfilter.nf_conntrack_max = 10485760

$ sysctl net.netfilter.nf_conntrack_count
net.netfilter.nf_conntrack_count = 846

$ lsmod | grep nf_conntrack
nf_conntrack_tftp     262144  1 nf_nat_tftp
nf_conntrack_pptp     327680  1 nf_nat_pptp
nf_conntrack_h323     327680  1 nf_nat_h323
nf_conntrack_ftp      327680  1 nf_nat_ftp

Four conntrack protocol helpers loaded: FTP, PPTP, H.323, TFTP. PPTP is a deprecated VPN protocol from the late 1990s. H.323 is a videoconferencing protocol from 1996, mostly displaced by SIP. TFTP and FTP are increasingly rare in modern enterprise environments.

The actual per-packet cost of having helpers loaded is more nuanced than "every packet is inspected" — see Section 10 Finding 5 for the precise breakdown. The short version: established non-helper flows pay essentially nothing per packet (a pointer check), but every new connection pays a hash lookup against the helper registry, and any flow on a helper-recognized port (FTP/21, etc.) pays the full inspection cost.

A "Firewall Connection Tracking" toggle does exist in the UniFi controller's Gateway settings, allowing administrators to disable individual helpers (FTP, H.323, SIP, GRE, PPTP, TFTP). Disabling them all unloads the helper modules from memory entirely. This addresses the lookup cost on new flows but does not affect already-established TCP throughput (the iperf3 inter-VLAN measurement is unchanged), and does not address the bigger architectural bottlenecks documented in Sections 5-10. Section 10 Finding 5 expands on what helpers actually cost and what would be required to keep helper functionality without the cost.

4.5 — The Inspection Tax (Smoking Gun #3)

$ ps -eo pid,pcpu,pmem,comm --sort=-pcpu | head -10
    PID %CPU %MEM COMMAND
4098469 39.6  0.0 dpi-flow-stats
   3139 12.5  0.1 ubios-udapi-ser
  66687  7.8  3.1 java
   4891  7.0  0.0 conntrackd
2491041  6.9  1.6 Suricata-Main
   5505  6.2  0.0 mcad
   8596  3.9  0.9 unifi-core
   4482  3.8  0.0 ulogd

dpi-flow-stats consuming 39.6% of one CPU core continuously. Add Suricata IPS (6.9%) and conntrackd (7.0%) and you have ~54% of one core permanently consumed by per-packet inspection processes that don't forward anything — they just observe.

The CPU pinning details matter here, and the writeup's earlier framing — which lumped these userspace processes together as "running on the forwarding core" — oversimplified the picture. The accurate picture is:

Suricata is configured with explicit CPU affinity in /usr/share/ubios-udapi-server/ips_6/config/suricata_ubios_high.yaml:

threading:
  set-cpu-affinity: yes
  cpu-affinity:
    - management-cpu-set:
        cpu: [ 0 ]
    - receive-cpu-set:
        cpu: [ "all" ]
    - worker-cpu-set:
        cpu: [ "all" ]
        prio:
          default: "high"
    - verdict-cpu-set:
        cpu: [ 1 ]
        prio:
          default: "high"
  detect-thread-ratio: 1.0

nfq:
  mode: repeat
  repeat-mark: 1
  repeat-mask: 1
  bypass-mark: 1
  bypass-mask: 1
  fail-open: yes

This tells us several things:

Management thread is pinned to core 0 — single-threaded, single-core
Receive and worker threads can use all 18 cores
Verdict thread is pinned to core 1 — relevant in IPS mode (see below)
Both pcap: and nfq: sections coexist in the same configuration: the YAML supports either mode, with the active mode determined at Suricata launch time

The kernel command line additionally includes isolcpus=12, isolating core 12 from the general scheduler — likely reserving it for one of Suricata's worker threads when running.

The IDS/IPS toggle in the UniFi controller does not change Suricata's runtime architecture. This was verified directly: with "Intrusion Prevention" toggled ON in the UniFi controller, Suricata is still launched with --pcap, there are zero NFQUEUE rules in iptables (iptables-save | grep -i nfqueue returns empty), and /proc/net/netfilter/nfnetlink_queue is empty. The nfq: section and the verdict-cpu-set: [ 1 ] pinning in the YAML are dead config — they would activate only if Suricata were launched with -q <queue>, which it never is on this device.

Confirmed runtime architecture, from the running /var/log/suricata/suricata.log:

RunModeIdsPcapWorkers initialised
all 6 packet processing threads, 2 management threads initialized, engine started.

Suricata observes packets via libpcap on six bridge interfaces (one worker thread per bridge: br0, br254, br3, br5, br6, br7, configured in /run/ips/config/iface.yaml with threads: 1 per interface). It loads 32,033 signatures (per suricata.log: "32031 rules successfully loaded" + 2 threshold rules) and a closed-source Ubiquiti Suricata plugin: /usr/share/ubios-udapi-server/ips_6/suricata/lib/aarch64-linux-gnu/ubnt-idsips-daemon.so.

The running Suricata version is end-of-life software:

$ /usr/share/ubios-udapi-server/ips_6/suricata/bin/suricata -V
This is Suricata version 6.0.12 RELEASE

The Suricata 6.0.x branch was officially declared end-of-life by the upstream project on August 1st, 2024. Per the official announcement: "This means we'll be providing no more support, releases or (security) fixes for this branch. We strongly encourage everyone who is still using Suricata 6 or older to upgrade to Suricata 7 as soon as possible." The final 6.0.x release was 6.0.20.

Specifically:

Suricata 6.0.12 was released approximately April 2023
The EFG ships a version that is 8 patch releases behind the last 6.0.x release
The 6.0.x branch has received zero security fixes since August 2024 — over 21 months as of this document's publication date in May 2026
Suricata 7.0.x is the current LTS, supported until September 2026
Suricata 8.0.x is the latest major

The EFG ships the Suricata upgrade on every device and chooses not to activate it. The filesystem layout on this production EFG:

$ ls -la /usr/share/ubios-udapi-server/
drwxr-xr-x  2 root root  4096 Apr 22 21:09 ips/      ← 68-byte version selector
drwxr-xr-x  1 root root  4096 May  2 20:55 ips_6/    ← Suricata 6.0.12 (EOL, ACTIVE)
drwxr-xr-x  6 root root    81 Apr  8 06:24 ips_8/    ← Suricata 8.0.2 (current, INACTIVE)

$ /usr/share/ubios-udapi-server/ips_8/suricata/bin/suricata -V
This is Suricata version 8.0.2 RELEASE

$ ls /usr/share/ubios-udapi-server/ips_8/config/
afpacket.tmpl         category_list.json    iface.tmpl
reference.config      static_config.json    suricata_ubios_high.yaml

The ips_8/ directory is not a placeholder. It contains a fully working Suricata 8.0.2 binary, complete templates, the same suricata_ubios_high.yaml configuration filename used by the active ips_6/, and the full bin//config//rules//suricata/ packaging structure. The minimal ips/ directory contains only a version.json (68 bytes) — likely a version selector that decides which ips_N/ directory the running daemon points at.

Suricata 8.0.2 is also a substantial functional upgrade over 6.0.12. Per the Suricata 8 help output on this device:

Firewall:
    --firewall                  : enable firewall mode
    --firewall-rules-exclusive  : path to firewall rule file loaded exclusively

Suricata 8 introduces a native --firewall mode that could replace the iptables IPS chain + ipset pattern entirely with a Suricata-native rule engine. Adopting it would require Ubiquiti to port the closed-source ubnt-idsips-daemon.so plugin from the Suricata 6.x plugin API to the 8.x plugin API and to rewrite the integration glue. That work has not been done, or has been done and not deployed.

Either way, the situation is: Ubiquiti has the supported Suricata staged on every shipping EFG and has actively chosen to point the version selector at the end-of-life binary. This is not a "haven't gotten around to upgrading" situation — the upgrade is sitting on the device, ready to be selected. The decision to keep running EOL Suricata 6.0.12 in May 2026, while Suricata 8.0.2 is shipped on the same device, is deliberate.

This is the inspection engine that an enterprise security gateway uses to detect threats on its data path. It is running unsupported software with no security patches for 21 months, on a device that costs approximately $2,000 and is marketed as a flagship security gateway, while the supported version is staged on the same device's filesystem.

TLS visibility on the EFG is selective at best. Suricata in pcap mode sees encrypted ciphertext on the wire for HTTPS sessions. Without TLS interception (an MITM proxy decrypting and re-encrypting traffic using a CA cert distributed to clients), Suricata cannot inspect HTTP response bodies, exfiltrated data, or C2 traffic inside HTTPS sessions. As of 2026, this represents the majority of internet traffic — making "IDS/IPS" coverage of HTTPS the central question for any inline security product.

This was tested directly. From a host behind the EFG with IPS enabled and 32,033 signatures loaded:

$ curl -s https://testmynids.org/uid/index.html
uid=0(root) gid=0(root) groups=0(root)

The response payload is the canonical test string for the Suricata GPL ATTACK_RESPONSE id check returned root signature — designed to fire on uid=0(root) byte sequences in HTTP response bodies. After 5+ minutes, on the EFG:

$ tail -100 /var/log/suricata/eve.json | grep -i "GPL ATTACK"
[empty]
$ tail -100 /var/log/suricata/fast.log | grep -i "GPL ATTACK"
[empty]
$ journalctl -u syslog-ng | grep -i "attack_response"
[empty]

No alert generated. No entry in any log. The signature payload reached the test host through the EFG's IPS without detection.

Ubiquiti does ship a TLS interception product, branded NeXT AI Inspection in the UniFi controller UI ("NextAI" in shorthand). It is an opt-in feature with three modes (Off, Simple, Advanced) and is not engaged for general HTTPS traffic by default. Per Ubiquiti's documented architecture, NeXT AI Inspection is a separate pipeline that:

Captures packets selected for inspection (per the configured domain inclusion list — by default "Specific" rather than "All")
Enqueues them to a RabbitMQ broker running on the EFG itself
A proprietary SSL inspection process dequeues the packets, decrypts using a UniFi-generated CA certificate, inspects content (including content-type filtering — blocking specific file types like archives, PDFs, and spreadsheets while allowing others), re-encrypts
Re-enqueues the inspected traffic to an outbound queue
Another component pulls from the outbound queue and forwards
Only after decryption does the pipeline send the cleartext to Suricata for signature inspection

This architecture has multiple problems beyond what the curl test demonstrated:

RabbitMQ in the data path. RabbitMQ is an Erlang-based AMQP message broker designed for inter-service messaging at millisecond timescales. Per-packet routing through an AMQP broker imposes TCP framing for AMQP, routing-key matching, persistence (or the cost of disabling it), at-least-once delivery semantics, and Erlang VM scheduler decisions on every packet. This is fundamentally incompatible with a multi-Gbps data path. Either NeXT AI's actual throughput is substantially lower than advertised or the broker is operating in a degraded mode that defeats most of what one would use AMQP for.
CA certificate distribution is unenforced and customer-managed. UniFi generates the CA certificate; the customer is responsible for installing it on each client. The UI explicitly notes: "Download and install the NeXT AI Inspection certificate on each client to avoid losing internet access. Use UniFi Identity for seamless certificate distribution." UniFi Identity is a separate product. On most networks, the cert ends up installed on managed corporate laptops but not on BYOD, IoT, mobile devices, guest-network clients, native apps with cert pinning, or any device the IT team doesn't directly control. The curl test above succeeded over HTTPS without certificate errors, meaning the test client correctly received the real testmynids.org certificate — NeXT AI was not in the path for that flow, either because the host lacks the UniFi CA or because the test domain wasn't on the inclusion list.
Inspection scope is selective by default. The UI's "What to Inspect" defaults to specific domains rather than all traffic. This is a defensible design choice for performance (you don't want to MITM Netflix or financial institutions), but it means even when NeXT AI is enabled, IPS visibility into HTTPS is limited to the inclusion list.
Suricata sees plaintext only for NeXT-AI-fronted flows. For all other HTTPS sessions — anything outside the inclusion list, anything from clients without the CA installed, anything in flows that bypass NeXT AI for performance — Suricata sees ciphertext only. The 32,033 signatures loaded — most of which target HTTP-layer attack patterns — are matched against encrypted bytes for the bulk of modern web traffic.

The combination is significant. The EFG's "IPS" can only inspect HTTPS traffic where ALL of these conditions hold simultaneously: (a) the destination domain is on the customer-configured NeXT AI inclusion list, (b) the client has the UniFi-generated CA certificate installed and trusted, (c) the flow is routed through the RabbitMQ-based NeXT AI pipeline, and (d) the flow's volume fits within whatever the broker can sustain. For any HTTPS traffic outside that intersection — which is the vast majority of internet traffic on a typical enterprise network — Suricata sees only ciphertext, and the IPS function is effectively non-existent for the most relevant threat vectors.

The architectural contrast with proper inline IPS is sharp here. Suricata 7.0+ supports DPDK mode, where Suricata runs as a pipeline stage on the dataplane workers. With DPDK + VPP + Suricata-on-DPDK and TLS interception integrated as a DPDK pipeline stage (using something like Intel's QAT for offloaded crypto, or even kernel TLS offload), packet decrypt → inspect → encrypt → forward happens entirely in userspace on dedicated cores, with no kernel→userspace copies, no broker hops, and no RabbitMQ. The throughput overhead of TLS-aware inline IPS in this architecture is in the low single-digit percent on modern hardware, not the order-of-magnitude penalty of the EFG's current RabbitMQ-based design.

The "IPS" data path is retroactive blocking by 3-tuple, not inline prevention:

                                                  (in-process, closed-source)
       ┌─────────────────────────────────┐       ┌───────────────────────┐
       │ Suricata --pcap                 │       │ ubnt-idsips-daemon.so │
       │ • 6 worker threads (one per     │ ───→  │ writes UNIX DGRAMs to │
       │   bridge: br0/3/5/6/7/254)      │       │ /run/ips/eve_alert    │
       │ • 2 management threads          │       │ .json socket          │
       │ • 32,033 signatures loaded      │       └───────────┬───────────┘
       └─────────────────────────────────┘                   │
                                                             ▼
                                                   ┌───────────────────────┐
                                                   │ ubnt-idsips-daemon    │
                                                   │ (separate userland    │
                                                   │ process, closed-      │
                                                   │ source)               │
                                                   │ • parses alerts       │
                                                   │ • populates ipset 'ips'
                                                   │   when IPS toggle on  │
                                                   │ • forwards to syslog  │
                                                   └───────────┬───────────┘
                                                               │ netlink
                                                               ▼
                                                   ┌───────────────────────┐
                                                   │ ipset 'ips'           │
                                                   │ hash:ip,port,ip       │
                                                   │ timeout 0, max 65536  │
                                                   └───────────┬───────────┘
                                                               │
                                                               ▼
                                                   ┌───────────────────────┐
                                                   │ iptables IPS chain    │
                                                   │ -m set --match-set    │
                                                   │   ips dst,dst,src     │
                                                   │ -j IPSLOGNDROP        │
                                                   └───────────────────────┘

The IDS/IPS toggle in the UniFi controller appears to control only one specific behavior in the ubnt-idsips-daemon process: whether it populates the ipset when alerts fire. Both modes use the same Suricata invocation, the same --pcap capture, the same workers, the same alerts. The difference is policy in a closed-source userland daemon, not architecture.

The ipset characteristics matter:

$ ipset list ips
Name: ips
Type: hash:ip,port,ip
Revision: 6
Header: family inet hashsize 1024 maxelem 65536 timeout 0 bucketsize 12
Size in memory: 208
References: 2
Number of entries: 0

Type hash:ip,port,ip: blocking is per-flow-tuple (source IP, destination port, destination IP), not per-source-IP
timeout 0: entries never expire; once added, blocked until ipset flush or device reboot
maxelem 65536: maximum 65,536 simultaneously-blocked tuples
Number of entries: 0: empty across multiple samples, on a production EFG with 8 days of uptime, IPS enabled, processing real multi-VLAN traffic with 32K signatures loaded

What this means in practice:

The first packet matching a signature always reaches its destination. Suricata observes it via pcap, generates an alert, the daemon receives the alert from the UNIX socket, parses it, and adds the 3-tuple to the ipset. By the time the kernel can drop based on ipset match, the malicious packet has already been forwarded.
Subsequent packets matching the same 3-tuple (source IP / dest port / dest IP) are blocked retroactively.
Detection-to-block latency is seconds — Suricata processes packets in batches, the daemon parses datagrams, ipset population goes through netlink. The end-to-end latency from "first malicious packet observed" to "ipset blocks future packets" is non-trivial.
If an attacker uses one IP to probe (gets blocked) and a different IP for the actual attack, each new source IP gets one free shot before the ipset entry is added.
After 8 days of production uptime on a multi-VLAN enterprise gateway with IPS enabled, the ipset is empty. Either no traffic has triggered any of 32,033 signatures, or signatures fire but the daemon's policy threshold for ipset population isn't being met.

This is more accurately described as delayed reactive blocking than as Intrusion Prevention. An "Intrusion Prevention System" that observes an SQL injection payload, alerts on it, and then blocks future traffic from the source IP — after the original payload has already reached the database server — has not prevented the intrusion. It has only prevented follow-up traffic from the same source.

The architectural reason this design exists is interesting: properly inline IPS via NFQUEUE would put every inspected packet through Suricata's worker threads with verdict reinjection through a single core (the YAML's verdict-cpu-set: [ 1 ]). On the EFG's 2 GHz Octeon cores, this would significantly worsen the already-limited inter-VLAN forwarding throughput documented earlier in this document. By doing IPS retroactively via ipset population, Ubiquiti avoids creating a hard single-core verdict bottleneck on the data path — but at the cost of the IPS not actually preventing the malicious traffic it detects. The trade-off makes performance sense; it does not make security sense.

Closed-source surface area: two pieces of closed-source Ubiquiti code interact with GPL software in this pipeline. (1) ubnt-idsips-daemon.so is a Suricata plugin loaded as a .so into the Suricata process — runs in Suricata's address space, links against Suricata's exported plugin API, processes Suricata's internal data structures. (2) ubnt-idsips-daemon is a separate userland daemon that consumes alerts from Suricata's socket and writes to the kernel ipset via netlink. Suricata is GPL-2.0-licensed; whether the plugin is a derived work is the same question raised in Section 14 about the proprietary kernel modules.

dpi-flow-stats has no CPU pinning at all. Its affinity mask reads 0x3ffff (all 18 bits set) — meaning it can run on any core, and the kernel scheduler places it wherever. At the moment of one diagnostic capture, it was running on core 9. Earlier mpstat sampling showed it consuming 39.6% of CPU continuously across whatever cores it landed on, which can and does include the forwarding core for a given flow.

Single-flow softirq lands on a specific core by default for a single TCP flow (RX queue hashing places all packets of one 5-tuple on one queue, which is bound to one core). For the EFG and the typical default RSS configuration, this is often core 0. Whichever core gets the flow becomes the bottleneck core for that flow.

The implication for a single-flow workload (the iperf3 inter-VLAN test, or a Veeam backup, or any single TCP stream):

The forwarding softirq runs on whichever core RSS picks (typically core 0)
Suricata's management thread is pinned to core 0 — it competes with forwarding softirq when the flow lands there
Suricata's verdict thread is pinned to core 1 — separate from forwarding for a flow on core 0, but a constraint if traffic ever lands on core 1
Suricata's workers are distributed across all cores including the forwarding core
dpi-flow-stats can land anywhere including the forwarding core, with no restriction

What the lab measurements demonstrate isn't "Suricata is bottlenecking the forwarding core" — Suricata's core consumption is spread by design. What they demonstrate is that userspace processes consuming cycles on the same physical core that's doing forwarding softirq directly reduce that flow's throughput. The forwarding-core contention sources, on this configuration, are: Suricata's management thread (pinned to core 0), Suricata workers (when scheduled to core 0), and dpi-flow-stats (any core including 0). The cumulative effect is observable in per-core CPU samples and reproducible in the lab.

Aggregate vs single-stream: for a multi-flow workload, work spreads across cores via RSS hashing and the per-core contention is less visible because no single core is the bottleneck. The single-stream case is what user-visible problems look like (Veeam replication, large file transfers, single iperf3 streams, individual users on a Fast.com test). That's why the lab measurements isolate single-stream — it's the user-facing failure mode, even though aggregate throughput looks healthier.

Concrete fixes that don't require an architectural rewrite:

Move Suricata's management-cpu-set from core 0 to core 2 or higher (any core not on the dominant RSS hash path). One-line YAML change.
Pin dpi-flow-stats away from core 0 via taskset or systemd CPUAffinity=.
Configure RSS to hash inter-VLAN flows away from cores 0 and 1 entirely, since both are pinned by Suricata threads.

These are small wins — perhaps 10-20% on single-stream throughput at best — compared to the architectural fixes (flowtable, DPDK + VPP, Suricata 7.0+ in DPDK mode), which deliver 5-25× improvements. But they are real, they cost nothing, and they do not require a kernel rebuild or a feature redesign.

4.6 — 18 Cores Sitting Idle

$ mpstat -P ALL 1 3 | grep Average
Average:     all   4.07    0.00    3.67   0.07   0.17   0.24   0.00   0.00   0.00   91.78
Average:       0  18.40    0.00    1.39   0.00   0.35   0.00   0.00   0.00   0.00   79.86
Average:       1  13.20    0.00    1.32   0.00   0.33   0.33   0.00   0.00   0.00   84.82
Average:       2   2.68    0.00    2.01   0.00   0.34   0.34   0.00   0.00   0.00   94.63
Average:       3   6.38    0.00    1.68   0.00   0.00   0.00   0.00   0.00   0.00   91.95
[... 14 more cores all near 95–100% idle ...]

91.78% average idle across 18 cores during light load. Under a single-flow stress test the picture is sharper: one core at near-100% softirq (the kernel's softirq context where __netif_receive_skb_core and ip_forward run), seventeen sitting at 0%. Single-flow forwarding is fundamentally a single-thread workload in the Linux kernel network stack: a TCP flow's packets all hash to the same RX queue, the queue is bound to one core, and that core does all the work.

Adding cores does not help. Faster cores help linearly. Removing per-packet kernel-stack work helps dramatically. A userspace dataplane that polls the NIC across multiple worker cores can fix this entirely — see Section 7.

4.7 — Per-VLAN Bridges Instead of VLAN-Aware Bridge

$ ip -br link | grep -E "^br[0-9]"
br0     UP    192.168.196.1/24
br1111  UP    [no address shown]
br254   UP    192.168.254.1/24
br3     UP    192.168.3.1/24
br5     UP    192.168.5.1/24
br6     UP    192.168.6.1/24
br7     UP    192.168.7.1/24

Each VLAN gets its own bridge (br3 for VLAN 3, br5 for 5, br6 for 6, etc.) hanging off switch0 subinterfaces (switch0.3, switch0.5, etc.). Inter-VLAN traffic must traverse:

client (VLAN 3) → br3 → switch0.3 → switch0 → kernel L3 lookup
                                         ↓
                                   ip_forward
                                         ↓
                              switch0.5 → br5 → client (VLAN 5)

Every L3 hop is a kernel ip_forward operation. A modern vlan-aware single bridge with bridge vlan filtering enabled and nf_flow_table could short-circuit established flows in a software fast-path. This setup cannot.

4.8 — Summary of EFG Diagnostic Findings

Finding	Evidence	Impact
5-chain iptables FORWARD	874 M packets through `UBIOS_FORWARD_JUMP` in 8 days	Lab: 4.95 → 2.36 Gbps when applied (53% drop)
No flowtable, no module	`nft list flowtables` empty, `lsmod` shows no flow_table	Lab: virtio kernel 2.36 → 7.05 → 17.4 Gbps when added with offloads
Userspace inspection competing for forwarding core	dpi-flow-stats 39.6% CPU (no pinning, mask 0x3ffff), Suricata mgmt thread pinned to core 0 (= dominant RSS hash core)	CPU pressure on the specific core handling each flow's softirq
Hardware offloads disabled	`hw-tc-offload off [fixed]`, GRO off	Lab: 17 Gbps (on) → 5 Gbps (off) at MTU 1500
Per-VLAN bridges, no offload	7 separate br* devices	Forces every inter-VLAN packet through kernel L3
Legacy iptables, not nftables	`nft list ruleset` empty, 839 iptables rules	Slower per-rule, locked out of fast-path features
Conntrack helpers always-on, no UI toggle	nf_conntrack_{ftp,pptp,h323,tftp} all loaded	Per-packet helper traversal for unused protocols
18 cores, 1 used at a time	mpstat 91.78% idle average; single-flow saturates one core	Single-flow workloads cannot scale across cores in the kernel
Old kernel (5.15)	Predates several networking improvements including PPPoE flowtable	Locks out post-5.19 nftables, flowtable, and PPPoE acceleration
No DPDK	No `cnxk` PMD active despite full vendor support	Forfeits 5-15× throughput available from the same silicon

5. Reproducing the Bottleneck — virtio-net Test Matrix

The first round of tests used standard virtio-net VMs on Linux bridges — the closest analogue to "hypervisor in front of network silicon" without involving the ConnectX hardware directly. The bridge vmbr1 was configured as VLAN-aware with VIDs 10 and 20.

Test 1 — MTU 9000, offloads on, no rules (best case baseline)

$ iperf3 -c 10.10.20.10 -t 30
[ ID] Interval         Transfer    Bitrate
[  5] 0.00-30.00 sec   59.2 GBytes 16.9 Gbits/sec       sender
[  5] 0.00-30.00 sec   59.2 GBytes 16.9 Gbits/sec       receiver

16.9 Gbps. mpstat showed CPU 3 at ~12% softirq during the test. This is what jumbo MTU + GRO/TSO buys you: each "packet" through the forward path is a ~64 KB super-segment that the kernel processes once. Approximately 30,000 forward operations per second, each on one core.

Test 2 — MTU 9000, offloads off

[  5] 0.00-30.00 sec   60.1 GBytes 17.2 Gbits/sec

17.2 Gbps. Surprisingly similar. With MTU 9000, even without GRO, packets are 8960 bytes each — still only ~6× the per-packet overhead of TSO super-segments. The per-packet kernel cost doesn't dominate yet.

Test 3 — MTU 1500, offloads off (the EFG-realistic baseline)

This is the configuration that matches what real Ubiquiti customers experience. Standard internet MTU, no jumbo frames, no offloads.

[  5] 0.00-30.00 sec   17.3 GBytes  4.95 Gbits/sec

4.95 Gbps. mpstat showed CPU 6 at 100% softirq, all other cores idle. This is the same shape as the EFG diagnostic — one core saturated, the others doing nothing. The Zen 4 core at 5+ GHz, doing nothing but softirq packet forwarding, ceilings at this number.

If we naively scale this for an Octeon ARM core at 2.0 GHz (about 3–5× slower per cycle for this workload), we'd predict ~1.0–1.6 Gbps. Real EFG measurements are in this range. We are reproducing the right physics.

Test 4 — Adding nf_conntrack module (no rules)

$ sudo modprobe nf_conntrack
$ sudo sysctl -w net.netfilter.nf_conntrack_max=10485760

[  5] 0.00-30.00 sec   16.9 GBytes  4.84 Gbits/sec

4.84 Gbps. Almost no impact. Module load alone is cheap; conntrack's cost shows up when rules invoke it.

Test 5 — Simple ct rule

table inet filter {
    chain forward {
        type filter hook forward priority 0; policy accept;
        ct state established,related accept
        ct state new accept
    }
}

[  5] 0.00-30.00 sec   16.2 GBytes  4.64 Gbits/sec

4.64 Gbps. A 4% drop from a single conntrack rule. After the first packet of a single-flow iperf3 stream, the conntrack entry exists; lookup is O(1). The cost is real but small for a single long-lived flow.

Test 6 — EFG-replica 5-chain ruleset (the headline bad number)

The full ruleset emulating what we observed on the EFG: 5 jump chains, conntrack per chain, per-rule counters, multiple matchers per rule:

table inet filter {
    chain alien_chain  { counter; ip protocol tcp counter; ip saddr 10.0.0.0/8 counter }
    chain tor_chain    { counter; ip protocol tcp counter; tcp flags & (syn|ack) == ack counter }
    chain ips_chain    { counter; ip protocol tcp counter; meta l4proto tcp counter; tcp dport { 1-65535 } counter }
    chain ubios_chain  { counter; ip protocol tcp counter; ct state established counter }
    chain user_chain   { counter; ct state established,related counter; ip saddr 10.10.10.0/24 ip daddr 10.10.20.0/24 counter }

    chain forward {
        type filter hook forward priority 0; policy accept;
        jump alien_chain
        jump tor_chain
        jump ips_chain
        jump ubios_chain
        jump user_chain
    }
}

[  5] 0.00-30.00 sec   7.99 GBytes  2.29 Gbits/sec

2.29 Gbps. The smoking gun. A 53% drop from the no-rule baseline of 4.95 Gbps. CPU 5 was pegged at 100% softirq during the entire run.

This is the EFG's per-packet cost on a fast x86 core. Scaling for Octeon ARM at 2.0 GHz: ~500–800 Mbps. Matches user reports of EFG inter-VLAN performance in the wild.

Test 7 — 8 parallel streams with EFG ruleset

$ iperf3 -c 10.10.20.10 -t 30 -P 8
[SUM] 0.00-30.00 sec  39.7 GBytes  11.4 Gbits/sec

11.4 Gbps aggregate across 8 streams. mpstat showed 2–3 cores busy: different flows hashed to different RX queues, different queues bound to different cores. Multi-flow forwarding scales (somewhat), but single-flow performance does not — each stream caps near the per-core ceiling.

This is why a single backup transfer or large Veeam replication will saturate at 1 Gbps even though the WAN can do 25: the flow is one TCP connection.

Test A — Adding nftables flowtable (the magic config change)

We replace the 5-chain ruleset with a flowtable directive:

table inet filter {
    flowtable f {
        hook ingress priority 0
        devices = { enp6s19, enp6s20 }
    }

    chain forward {
        type filter hook forward priority 0; policy accept;
        ip protocol { tcp, udp } flow add @f
        ct state established,related accept
    }
}

[  5] 0.00-30.00 sec   24.6 GBytes  7.05 Gbits/sec

7.05 Gbps. A 3.0× jump from 2.36 Gbps. flowtable installs an ingress fast-path that, after the first few packets of a flow are tracked, bypasses conntrack lookup and FORWARD chain evaluation entirely. The packet still goes through netfilter ingress hook; the slow path is just skipped.

Test B — Flowtable + offloads on, MTU 1500

[  5] 0.00-30.00 sec   60.9 GBytes  17.4 Gbits/sec

17.4 Gbps. A 7.4× improvement over the EFG-style ruleset baseline (2.36 Gbps). Same hardware. Same kernel. Same single TCP stream. The only changes: flowtable directive added, offloads enabled.

virtio-net Test Summary

#	MTU	Offloads	Rules	Single-stream
1	9000	on	none	16.9 Gbps
2	9000	off	none	17.2 Gbps
3	1500	off	none	4.95 Gbps
4	1500	off	+ ct module	4.84 Gbps
5	1500	off	+ simple ct rule	4.64 Gbps
6	1500	off	EFG 5-chain replica	2.36 Gbps
7 (8-stream)	1500	off	EFG 5-chain	11.4 Gbps agg
A	1500	off	flowtable	7.05 Gbps
B	1500	on	flowtable	17.4 Gbps

6. Closing the Loop — Real Silicon Test Matrix

The virtio tests share a known limitation: virtio-net packets traverse the host's vhost-net kernel thread, which adds its own per-packet cost beyond what's in the guest. To prove that the kernel-stack overheads are independent of virtio's I/O fabric, we ran the same tests with SR-IOV pass-through of ConnectX-6 Dx Virtual Functions.

6.1 — SR-IOV Setup

The ConnectX-6 Dx supports up to 8 SR-IOV Virtual Functions per port. Without disturbing the existing LACP bond:

$ echo 4 > /sys/class/net/enp5s0f0np0/device/sriov_numvfs
$ cat /sys/class/net/enp5s0f0np0/device/sriov_numvfs
4

Four VFs were created (VF0-VF3), assigned dedicated MACs and isolated VLANs (110/120) at the eSwitch level, and passed through to the lab VMs:

VF0 (0000:05:00.2) → gw-router as VLAN 10 lab NIC
VF1 (0000:05:00.3) → gw-router as VLAN 20 lab NIC
VF2 (0000:05:00.4) → client1 (VLAN 10)
VF3 (0000:05:00.5) → client2 (VLAN 20)

The ConnectX-6 Dx eSwitch handled L2 between VFs in silicon — no traffic exited the physical port for the VLAN 10/20 lab traffic. The bond and the upstream network were unaffected.

Inside each VM, the VFs appeared as native ConnectX hardware via the mlx5_core driver. The VMs ran kernel ip_forward exactly as before; only the I/O fabric changed.

Test K1 — ConnectX VF, kernel forwarding, MTU 1500, offloads on, no rules

[  5] 0.00-30.00 sec   88.3 GBytes 25.3 Gbits/sec

25.3 Gbps single-stream. A 5.1× improvement over the equivalent virtio test (4.95 Gbps with offloads off). With offloads on, ConnectX hardware GRO is more efficient than virtio's, so the per-superpacket cost is even lower.

Test K2 — Same as K1, with EFG-style 5-chain ruleset

[  5] 0.00-30.00 sec   73.9 GBytes  21.1 Gbits/sec

21.1 Gbps. Only a 17% drop. With GRO collapsing wire packets into super-segments, the rule evaluation cost is amortized across ~40× fewer events. The EFG ruleset is still expensive per-event, but per-packet on the wire it's hidden by GRO.

Test K3 — ConnectX VF, offloads off, no rules

[  5] 0.00-30.00 sec   16.6 GBytes  4.74 Gbits/sec

4.74 Gbps. Statistically identical to the virtio-net test (4.95 Gbps). With offloads off, every wire packet hits ip_forward once. The per-packet ceiling on a Zen 4 core is the same regardless of NIC quality. The kernel stack itself is the bottleneck, not the I/O fabric, when offloads are off.

Test K4 — ConnectX VF, offloads off, EFG-style rules

[  5] 0.00-30.00 sec   16.4 GBytes  4.70 Gbits/sec

4.70 Gbps. Same as K3 within noise. The mlx5 kernel I/O path is heavier per-packet than virtio's vhost-net path — heavy enough that the EFG ruleset cost is hidden inside the I/O cost. Both paths still cap at the single-core software ceiling.

Real Silicon Test Summary

#	NIC	Offloads	Rules	Single-stream
K1	ConnectX VF	on	none	25.3 Gbps
K2	ConnectX VF	on	EFG 5-chain	21.1 Gbps
K3	ConnectX VF	off	none	4.74 Gbps
K4	ConnectX VF	off	EFG 5-chain	4.70 Gbps

The pattern is clear: with offloads off, the I/O fabric does not matter. With offloads on, it does. Hardware offloads collapse the per-packet processing cost in the kernel's hot path. Without them, even the world's fastest networking silicon ceilings around 5 Gbps single-stream because the kernel itself is the limit.

The EFG configuration disables hardware offloads. By doing so, it makes its own silicon irrelevant.

7. Userspace Dataplane — VPP/DPDK Comparison

VPP (Vector Packet Processor) is a userspace network dataplane built on DPDK that bypasses the kernel network stack entirely. It is what production-grade open-source routers (TNSR, DANOS) use, and it is what most enterprise-grade NFV appliances build on. We tested it both over virtio-net and over the ConnectX VFs.

A note on relevance to the EFG: Marvell ships a fully-supported DPDK Poll Mode Driver for the OCTEON family — the cnxk PMD, which covers CN9670 (in the EFG) and CN10K (in the UDM Beast). Marvell publishes reference architectures that combine OCTEON SoCs with VPP and DPDK-accelerated Suricata. Suricata itself has had native DPDK input mode since version 7.0 (released 2023). The components Ubiquiti would need to ship a userspace dataplane on the EFG are not research projects — they are vendor-blessed, production-deployed infrastructure that has been available for years.

Test V0 — VPP with virtio-net

[  5] 0.00-30.00 sec   23.7 GBytes  6.78 Gbits/sec

6.78 Gbps. Roughly equal to ip_forward + flowtable in the equivalent kernel test. VPP's show runtime revealed the cause:

dpdk-input    Vectors/Call: 0.05    Clocks/Packet: 1810
ip4-rewrite   Vectors/Call: 15.24   Clocks/Packet: 24.2

0.05 vectors per call on the input side. DPDK's whole performance story is amortizing per-syscall and per-context-switch overhead across batches of ~32–256 packets. Virtio-net feeds packets to DPDK one at a time. The polling loop is essentially empty. Userspace dataplane only delivers its promised speedup when paired with a userspace-friendly I/O backend (vhost-user) or real hardware.

Test V1 — VPP with ConnectX VF, offloads off on clients

[  5] 0.00-30.00 sec   54.9 GBytes 15.7 Gbits/sec

15.7 Gbps. Better than virtio-VPP (3× better) but actually worse than kernel-on-ConnectX with offloads on (25.3 Gbps). Why? VPP doesn't do GRO. It processes wire packets individually. With offloads off on the clients, every packet on the wire is 1500 bytes, and VPP processes ~1.4 million per second on one worker core.

The per-packet path through VPP is impressively cheap (ip4-input + lookup + rewrite + tx ≈ 78 cycles end-to-end on Zen 4) but it's still doing 40× more "work events" than the kernel + GRO setup, which only sees super-segments.

Test V2 — VPP with ConnectX VF, offloads on the clients (the headline)

[  5] 0.00-30.00 sec    124 GBytes  35.6 Gbits/sec

35.6 Gbps single-stream. Now the clients send fewer, larger TCP segments via TSO. ConnectX hardware can transmit each segment as a single frame on the wire (GSO/TSO offload at the NIC). VPP receives the resulting larger frames and forwards them with its low per-packet cost.

This is the headline number. 35.6 Gbps single-stream userspace dataplane forwarding on real silicon. Compared against the EFG's actual production performance on the same workload (~1 Gbps), this is the 15-35× ceiling that's possible with available open-source software on the same class of hardware.

VPP with show runtime during this test:

ip4-rewrite   Vectors/Call: 7.54    Clocks/Packet: 38.6
lab-vlan20-tx Vectors/Call: 8.65    Clocks/Packet: 37.9

VPP itself is doing 75–80 cycles of work per packet. On a 5 GHz core that's ~16 ns per packet. The theoretical ceiling for VPP on this hardware is hundreds of Gbps. The measured 35.6 Gbps is bottlenecked on the clients (their ability to generate packets), not on VPP.

7.1 — Estimating VPP/DPDK Throughput on the Octeon Silicon

The lab numbers are on Zen 4 at 5+ GHz. To estimate what VPP+DPDK would achieve on the EFG's ARM Cortex-A72-class cores at 2.0 GHz, we lean on published Marvell numbers and the cycle-counting visible in show runtime:

VPP per-packet cost in the lab: ~80 cycles on Zen 4 for full IP forwarding pipeline
ARM Cortex-A72 vs Zen 4 IPC for this workload: ~3-4× lower
Estimated cycles per packet on Octeon CN9670: 240-320 cycles
At 2.0 GHz: 6.25-8.3 million packets per second per core
At 1500-byte MTU: 9-12 Gbps single-stream per worker core
The Octeon CN9670 has dedicated NIX hardware engines that can offload portions of this further

Marvell's own published cnxk PMD benchmarks show single-core forwarding rates of 15-30 Mpps (millions of packets per second) for simple L3 forwarding, which corresponds to 18-36 Gbps at 1500 MTU per core. Across 4-6 worker cores (leaving control plane and inspection cores untouched), aggregate forwarding capacity easily reaches the 50 Gbps line rate of the EFG's two 25G ports, and single-stream throughput in the 15-25 Gbps range is realistic.

This means: on the same EFG silicon, with no hardware changes, a properly-architected DPDK dataplane should deliver 10-25× the inter-VLAN throughput the device achieves today, and eliminate the inspection-vs-forwarding CPU contention by giving each worker its own dedicated core with a vendor-supported PMD.

8. The PPPoE Bottleneck — A Related but Distinct Problem

Many enterprise customers (especially in countries where fiber-to-the-business is delivered via GPON or XGS-PON with PPPoE authentication) report that even when they have a 10 Gbps fiber link, single-stream throughput across their EFG WAN tops out around 2–3 Gbps. This is a separate bottleneck from inter-VLAN routing, but it has the same architectural root cause — and arguably worse manifestation, because the PPPoE path forces the kernel through multiple softirq passes per packet.

8.1 — Why PPPoE Is Slow on Stock Linux Routers

PPPoE encapsulates IP traffic in PPP frames inside Ethernet (ether_type 0x8864/0x8863). Every WAN packet must:

Be encapsulated/decapsulated by the pppoe.ko kernel module on every transit
Have its effective MTU reduced to 1492 bytes (eight bytes of PPPoE header), increasing per-packet overhead and forcing Path MTU Discovery
Be processed by pppd in userspace for LCP/IPCP control plane and link state — packet flow events get notified to userspace
Pass through additional packet copy for encapsulation/decapsulation in software
Bypass the kernel's flowtable fast-path — until kernel 6.2, nf_flow_table had no PPPoE support at all; flows traversing PPPoE could not be offloaded
Make multiple distinct kernel-stack passes: ingress on the underlying VLAN (eth2.11) → softirq 1 → pppoe_rcv → ip_input → ip_forward → ip_output → softirq 2 → pppoe_xmit → egress on the same or different VLAN

Combined with the per-packet kernel forward cost we measured (4.74 Gbps ceiling on a single Zen 4 core with offloads off), the additional encap/decap work, and the multi-pass softirq pattern, PPPoE single-stream throughput is fundamentally bound by:

Single-core ip_forward + pppoe.ko packet handling, which on a 2 GHz Octeon core lands in the 1-3 Gbps range — exactly what users report
No flowtable PPPoE acceleration (kernel 5.15 doesn't have it; the EFG runs 5.15)
Multiple softirq cores chained together, each handling part of the encap/decap/forward chain — this spreads CPU load across cores but adds latency and inter-core cache misses without actually speeding anything up
No DPDK PPPoE termination (would require accel-ppp or VPP's native PPPoE plugin in userspace)

8.2 — Direct Evidence on a Production EFG

The following data was captured during a single Netflix Fast.com speed test from a client device on the LAN, using the EFG's PPPoE WAN connection (Vivo XGS-PON, Brazilian ISP requiring PPPoE auth, link rated 1 Gbps but the same softwarepath would be used on a 10 Gbps link).

8.2.1 — Multiple ksoftirqd threads pegged simultaneously

$ top -bn1 -d 1 | head -15
top - 03:43:15 up 8 days, 6:51, load average: 3.81, 2.62, 2.33
%Cpu(s):  5.5 us,  5.0 sy,  0.0 ni, 52.5 id,  0.0 wa,  1.3 hi, 35.6 si,  0.0 st

    PID USER  PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
     23 root  20   0       0      0      0 R 100.0   0.0  17:11.14 ksoftirqd/2
     48 root  20   0       0      0      0 R 100.0   0.0   4:13.86 ksoftirqd/7
     63 root  20   0       0      0      0 R 100.0   0.0  21:58.12 ksoftirqd/10
     83 root  20   0       0      0      0 R  72.2   0.0  10:05.83 ksoftirqd/14
     73 root  20   0       0      0      0 R  66.7   0.0  16:39.62 ksoftirqd/12
     12 root  20   0       0      0      0 R  55.6   0.0  16:02.71 ksoftirqd/0
2491041 root   5 -15 1768064   1.1g  19584 S  44.4   1.8   6:18.31 Suricata-Main
   3139 root   5 -15  383232  68736  28416 S  22.2   0.1 1495:37  ubios-udapi-ser
   8596 root  20   0   20.1g 647232  85440 S  16.7   1.0  474:44  unifi-core

This is the smoking gun for PPPoE. Six different ksoftirqd threads are running at 55-100% simultaneously — cores 0, 2, 7, 10, 12, and 14 — all chewing through softirq work for what is fundamentally a single-flow workload (one TCP stream from Fast.com's backend server, through the PPPoE WAN, to the LAN client).

The reason this is even worse than the inter-VLAN smoking gun: inter-VLAN forwarding has one core saturated. PPPoE has multiple cores in continuous softirq because the path itself does multiple distinct kernel-stack passes per packet (eth2.11 ingress → pppoe_rcv → ip_input → ip_forward → ip_output → pppoe_xmit → eth2.11 egress). Each pass can land on a different core via softirq scheduling. The kernel is doing more total work per packet and spreading it across cores in a way that creates cache-coherence overhead between cores. It's the worst of both worlds — single-flow throughput limited by per-core ceiling, but multi-core CPU consumption.

The corresponding mpstat -P ALL output confirms the picture:

03:43:24    CPU    %usr    %sys    %irq    %soft    %idle
03:43:24    all    5.65    2.74    1.01    32.49    58.00
03:43:24      0   50.55    0.00    0.00    49.45     0.00
03:43:24      2    0.00    0.00    1.00    81.00    18.00
03:43:24      6    0.00    0.00    0.00    61.62    38.38
03:43:24     10    1.01    1.01    2.02    66.67    29.29
03:43:24     14    0.00    0.00    0.00    85.29    14.71
03:43:24     17    0.00    0.00    0.00   100.00     0.00

Six cores at 50-100% softirq during a single Fast.com speed test. The aggregate %soft of 32.49% across 18 cores corresponds to ~5.85 cores fully consumed by softirq work — for one flow.

8.2.2 — Concurrent userspace load on the same cores

While ksoftirqd is burning multiple cores, the inspection processes are also running:

Suricata-Main      44.4% CPU
ubios-udapi-ser    22.2% CPU
unifi-core         16.7% CPU
ulogd               5.6% CPU

That's ~89% of one core equivalent of additional userspace work, often landing on the same cores doing softirq. The result: the cores doing softirq are being preempted by userspace, and the userspace processes are being preempted by softirq, in a continuous round-robin that prevents either from getting clean cycles.

8.2.3 — Modules confirm pure software PPPoE path

$ lsmod | grep -i ppp
pppoe         327680  2
pppox         262144  1 pppoe
ppp_generic   327680  6 pppox,pppoe
slhc          262144  1 ppp_generic

$ ps -eo pid,pcpu,comm,args | grep pppd
2878806  0.0  pppd  /usr/sbin/pppd call ppp1 nodetach

The full software PPPoE stack is loaded: pppoe.ko for PPPoE-specific encap, pppox.ko for PPP-over-X dispatch, ppp_generic.ko for the PPP framing engine, slhc.ko for VJ header compression, and pppd in userspace for control plane (LCP, IPCP, keepalives). Every WAN packet traverses all of these in sequence.

8.2.4 — All hardware offloads disabled on ppp1, with `[fixed]` flags

$ ethtool -k ppp1 | grep -E "tcp-segmentation|generic-(receive|segmentation)|large-receive|hw-tc-offload"
tcp-segmentation-offload: off
tx-tcp-segmentation: off [fixed]
generic-segmentation-offload: off [requested on]
generic-receive-offload: on
large-receive-offload: off [fixed]
hw-tc-offload: off [fixed]

The [fixed] flag means the kernel module returns "this feature cannot be enabled" — they are hardcoded off in the ppp_generic driver. Even when generic-segmentation-offload was requested on (probably by some default state), the kernel refused. Pseudo-interfaces like ppp1 inherently can't do hardware TSO/LRO because there's no hardware behind them — it's a software encap layer. That's normal Linux behavior, but it means every PPPoE WAN packet gets TX-fragmented and RX-aggregated in software before being handed to or received from the underlying VLAN.

Note that generic-receive-offload: on does work for the receive path — but TSO does not exist on the egress side, so every outbound packet traverses the kernel stack individually.

8.2.5 — Confirmed: no flowtable module in this kernel build

$ modinfo nf_flow_table 2>&1
modinfo: ERROR: Module nf_flow_table not found.

$ find /lib/modules/$(uname -r) -name "nf_flow_table*"
[no output]

Not just unloaded — the kernel module doesn't exist on the system. nf_flow_table.ko is not compiled into Ubiquiti's 5.15.72-ui-cn9670 kernel build, nor is it available as an external module file. Even with root access, a customer cannot load the module to enable flowtable acceleration. The fast-path infrastructure isn't shipped at all.

(Note: PPPoE-specific flowtable handling lives inline within nf_flow_table.ko itself, not as a separate module. There is no nf_flow_table_pppoe.ko in mainline Linux; the PPPoE protocol checks and nf_flow_pppoe_proto() helpers are part of nf_flow_table_ip.c and nf_flow_table_inet.c, both of which compile into nf_flow_table.ko and nf_flow_table_inet.ko respectively. So the absence of nf_flow_table.ko is the absence of all flowtable functionality, including PPPoE acceleration.)

8.2.6 — MTU discrepancy confirmed

$ ip link show ppp1
ppp1: <POINTOPOINT,MULTICAST,NOARP,UP,LOWER_UP> mtu 1492

$ ip link show eth2.11
eth2.11@eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500

ppp1 MTU 1492 (1500 - 8 byte PPPoE header), eth2.11 MTU 1500. Every payload is 8 bytes smaller than it could be on raw Ethernet, increasing packet count for the same throughput. Small effect compared to the per-packet kernel cost, but it adds up at line rate.

8.3 — How PPPoE Integrates with DPDK Dataplanes

A reasonable question: PPPoE looks complicated, with control plane (PADI/PADO/PADR/PADS handshake, LCP/IPCP negotiation, keepalives, RADIUS) and dataplane (packet encap/decap) entangled. Can DPDK actually handle this, or is it fundamentally a kernel concept?

DPDK handles it well, but with a different architecture than the kernel uses.

The kernel's approach: pppoe.ko is a single module that does both control plane (handshake, LCP/IPCP, keepalives) and dataplane (encap/decap of every packet). Both run in softirq context, on whatever cores the kernel scheduler picks. The result is what we just measured: control plane and dataplane fighting for the same cores, with userspace processes (pppd) added on top.

DPDK splits this in two:

Control plane stays in userspace as a regular process. Tools like accel-ppp (the most common open-source PPPoE BNG implementation, deployed by ISPs to terminate hundreds of thousands of sessions per box) handle PADI/PADO/PADR/PADS, LCP/IPCP, keepalives, session lifecycle, RADIUS authentication — everything that happens at session establishment or once per second per session. This doesn't need to be fast; it needs to be correct. accel-ppp added DPDK support around 2020 and is what ISP-grade BNGs use today.
Dataplane runs as a fixed-cost pipeline stage. Once the session is up, every packet just needs an 8-byte header push (egress) or pop (ingress). In VPP (which has had a native PPPoE plugin since 2018), it's literally a node in the packet processing graph:

dpdk-input → ethernet-input → pppoe-input → ip4-input → ip4-lookup
           → ip4-rewrite → pppoe-encap → interface-output

The pppoe-input and pppoe-encap nodes are tiny — they push or pop 8 bytes, update some counters, and pass the packet to the next node in the same vector batch. Per-packet overhead for adding PPPoE to a VPP pipeline is roughly 30-50% above plain L3 forwarding, not 5-10× like the kernel softirq path imposes.

The critical difference: the kernel does control plane + dataplane on the same softirq path, blocking everything. DPDK does control plane in a slow, one-time-per-session userspace daemon, and dataplane as a small fixed-cost pipeline stage running on dedicated worker cores at line rate.

On Marvell silicon specifically: the Octeon CN9670 (the EFG SoC) is explicitly marketed by Marvell as a "Smart NIC and BNG" SoC. Their reference architectures combine:

The cnxk DPDK PMD handling raw Ethernet frames at line rate from the NIX hardware engines
accel-ppp running in userspace on dedicated control-plane cores, handling PPPoE control plane
Dataplane integrated into VPP's PPPoE plugin or a custom DPDK pipeline
Suricata in DPDK mode tapping the dataplane for inspection on dedicated worker cores

ISPs deploying this stack on Octeon hardware regularly hit 40+ Gbps PPPoE termination per box with 100K+ concurrent sessions. Companies like Calix, Adtran, and a handful of NFV vendors ship enterprise BNGs based on exactly this silicon, doing exactly this PPPoE workload, at 25+ Gbps per port. This isn't research — it's commodity, vendor-blessed, production-deployed infrastructure that has existed for years.

8.4 — The Fix Is Already in Mainline Linux (and DPDK)

Two independent fix paths exist:

Kernel path: Linux 6.2 (released February 2023) added PPPoE handling support to nf_flow_table. The PPPoE protocol checks and nf_flow_pppoe_proto() helpers were added inline within nf_flow_table_ip.c, nf_flow_table_inet.c, and nf_flow_table_offload.c — i.e., they compile directly into nf_flow_table.ko and nf_flow_table_inet.ko, not as a separate module. Once enabled, established TCP/UDP flows over PPPoE WAN can be offloaded to the same software fast-path as native L3 traffic, bypassing both pppoe.ko and the netfilter slow path for in-progress flows. Combined with hardware tc-flower offload on supported NICs, modern Linux distros (OpenWrt 23.05+, recent VyOS, MikroTik RouterOS 7) achieve near-line-rate PPPoE throughput on 10 Gbps links through software fast-path acceleration.

The EFG ships kernel 5.15 — released in late 2021, predating PPPoE flowtable acceleration by over a year. A kernel rebase to 6.6 LTS or later, with nf_flow_table.ko loaded and a flowtable directive added to nftables, would dramatically improve PPPoE WAN throughput without any hardware changes and without changing the dataplane architecture. The fix is a kernel module load and one nftables stanza.

DPDK path: Migrate the PPPoE termination from pppoe.ko in the kernel to accel-ppp + VPP's PPPoE plugin in userspace, on dedicated worker cores. This is the same architectural change as Fix 3 in Section 11 (DPDK + VPP for the dataplane), with PPPoE just being one more pipeline stage. Since Marvell ships full DPDK support for the Octeon CN9670 and publishes reference architectures combining DPDK + accel-ppp + VPP, this is integration work, not invention.

8.5 — Estimated PPPoE Improvement

Using the same scaling from Section 7.1:

Configuration	Single-stream PPPoE throughput	Notes
Current EFG (kernel 5.15, no flowtable, software pppoe.ko)	~2-3 Gbps	per user reports; matches our multi-core ksoftirqd evidence
EFG + kernel 6.6 + nf_flow_table loaded + flowtable rule	~5-8 Gbps	flowtable bypasses pppoe.ko + netfilter for established flows
EFG + kernel 6.6 + flowtable + hw-tc-offload	~8-9.5 Gbps	near line-rate on 10G PPPoE links
EFG + DPDK (accel-ppp + VPP PPPoE plugin)	line rate on 10 Gbps (and 25G aggregate)	what ISP-grade BNGs achieve on this exact silicon

The point: PPPoE performance is not a hardware problem either. It is the same architectural failure (single-core kernel forwarding without acceleration) compounded by an additional encapsulation layer that mainline Linux now supports accelerating, and that DPDK has handled at line rate for years. The same fixes apply, with PPPoE benefiting more than inter-VLAN does because the multi-pass softirq pattern is so much more expensive in the current implementation.

9. Cross-Product Confirmation: UDM Beast and UCG Fiber

The analysis up to this point is grounded in measurements of one Ubiquiti device, the EFG, plus a controlled lab reproduction on x86. A reasonable counter-argument is that the EFG might be an outlier — older silicon, an early-cycle product, an aberrant kernel build that newer products have moved past.

To address this, the same diagnostic methodology was applied to a second-generation Ubiquiti gateway: the UDM Beast. This is a newer, higher-end Ubiquiti product running a different SoC family (Marvell Octeon CN10K instead of CN9K), a substantially newer kernel (6.6.46 vs 5.15.72), and — critically — a dedicated switching ASIC that the EFG does not have.

The diagnostic question was: does the newer silicon, the newer kernel, and the dedicated switching ASIC change the inter-VLAN routing architecture?

It does not. The UDM Beast exhibits the same architectural pattern, with one important new wrinkle: the ASIC is physically present, powered up, and processing billions of packets — but only for intra-VLAN switching. Inter-VLAN routing continues to go through the same kernel software path as the EFG.

9.1 — UDM Beast Hardware Identification

$ cat /proc/cpuinfo | head
processor       : 0
BogoMIPS        : 100.00
Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp 
                  asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 
                  asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs sb 
                  paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 
                  svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
CPU implementer : 0x41
CPU architecture: 8
CPU variant     : 0x0
CPU part        : 0xd49
CPU revision    : 0

Decoded:

0x41 = ARM Holdings (the actual ARM, not Marvell-customized)
0xd49 = ARM Neoverse N2
8 cores @ 2.5 GHz
ARMv8.2-A with the full extension set: SVE2, BF16, I8MM, RNG, BTI, pointer authentication

The kernel reports as 6.6.46-ui-cn10k. CN10K is Marvell's OcteonTX 10, the next-generation networking SoC family after the EFG's CN9670. CN10K specifically uses ARM Neoverse cores instead of Marvell-customized ones, with substantially higher per-core throughput.

For comparison:

Property	EFG	UDM Beast
SoC	Marvell Octeon CN9670	Marvell Octeon CN10K
ARM core	Marvell custom (A57-class)	ARM Neoverse N2
Clock	2.0 GHz	2.5 GHz
Cores	18	8
Per-core IPC vs Zen 4	~0.55×	~0.75×
Predicted single-core throughput	22% of Zen 4	38% of Zen 4
RAM	16 GB	32 GB
Kernel	5.15.72-ui-cn9670	6.6.46-ui-cn10k

So per-core, the UDM Beast should single-thread inter-VLAN route at roughly 2× the EFG's rate. That's still a long way from saturating the 25G ports both devices ship with.

9.2 — The Switch ASIC Is Real and It Is Doing Work

The UDM Beast has a switch0 interface that aggregates physical Ethernet ports eth2 through eth13 as named slaves:

$ ip link show | grep -E '@switch0' | head -8
eth2@switch0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 ...
eth3@switch0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 ...
eth4@switch0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 ...
eth5@switch0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 ...
eth6@switch0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 ...
[...]

The switch0 virtual interface itself reports massive packet throughput:

$ ip -s link show switch0
switch0: ... 
    RX:  bytes        packets   ...
         1350339982078 1274250904 ...

That's 1.27 billion packets / 1.35 TB processed by the switch interface. This traffic does not appear on any individual physical Ethernet's RX/TX counters in the same volume — the kernel sees the aggregate but not the per-port breakdown, because the per-port traffic is being switched in hardware below the kernel's visibility.

A daemon process is actively managing the ASIC:

$ ps -eo pid,pcpu,comm | head -5
2617 root  75.0 cpss-manager
2913 root   0.0 cpss-app

$ ls -la /proc/$(pgrep -f 'cpss-app.*l3')/fd
0 -> /dev/null
3 -> /dev/shm/CPSS_SHM_MALLOC0          ← shared memory with cpss-manager
4 -> /dev/mvdma                          ← Marvell DMA device (direct ASIC access)
5 -> /sys/devices/.../0008:01:00.0/resource0  ← raw PCIe BAR0 of switch ASIC
6 -> /sys/devices/.../0008:01:00.0/resource2  ← PCIe BAR2

$ cat /proc/$(pgrep -f 'cpss-app.*l3')/cmdline
/usr/bin/cpss-app -l3

CPSS stands for CPU Subsystem Services — Marvell's proprietary management framework for their Prestera switch ASIC family. The cpss-app -l3 invocation suggests the ASIC supports L3 forwarding capability (the -l3 flag), and the daemon has direct memory-mapped PCIe access to the chip via /dev/mvdma and the resource0/resource2 BARs.

Notably, cpss-manager consumes 75% of one CPU core continuously — that's roughly 6% of the entire box's CPU just on switch ASIC management overhead. This is not anomalous load; it's the steady-state cost of managing the ASIC through Marvell's proprietary framework rather than the kernel's switchdev infrastructure.

9.3 — Switchdev Is Disabled Across All Interfaces

switchdev is the Linux kernel framework for offloading network functions to switching hardware. When properly engaged, it lets tc flower rules, bridge VLAN filtering, and L3 routing be programmed into the switch ASIC's hardware tables, with the kernel staying out of the per-packet path entirely.

On the UDM Beast, switchdev is not engaged. Every interface — physical port, bridge, virtual switch interface — reports the same offload flag pattern:

$ for iface in br0 br10 br20 br30 br50 br199 br200 \
               eth1 eth6 eth8 eth9 eth11 eth12 eth13 switch0; do
    echo "=== $iface ==="
    ethtool -k $iface 2>/dev/null | grep -E "hw-tc-offload|l2-fwd-offload|rx-vlan-filter"
  done

=== br0 ===
rx-vlan-filter: off [fixed]
l2-fwd-offload: off [fixed]
hw-tc-offload: off [fixed]
=== br10 ===
rx-vlan-filter: off [fixed]
l2-fwd-offload: off [fixed]
hw-tc-offload: off [fixed]
[... same pattern repeats for every interface ...]
=== switch0 ===
rx-vlan-filter: off [fixed]
l2-fwd-offload: off [fixed]
hw-tc-offload: off

The [fixed] qualifier is critical. It means the kernel driver does not even expose these features as toggleable. Compare this to a Linux box with a properly-supported switchdev driver (such as Mellanox/NVIDIA Spectrum or Marvell Prestera with the upstream prestera driver), where these flags would read on or be administrator-toggleable.

[fixed] here means the driver doesn't implement the switchdev API. The Linux kernel cannot push hardware-offloadable rules to the ASIC, because there's no driver code path to do so.

Bridge phys_switch_id files are present on every interface but are empty:

$ for nic in $(ls /sys/class/net/); do
    sw=$(cat /sys/class/net/$nic/phys_switch_id 2>/dev/null)
    [ -n "$sw" ] && echo "$nic -> $sw"
  done
[no output]

A populated phys_switch_id is how the kernel identifies multiple netdevs as belonging to the same hardware switch — a precondition for switchdev L2 forwarding offload. The files exist (so the netdev infrastructure has been initialized) but are unset, so the kernel does not know that eth2, eth3, etc. are ports of a common switch. Without that knowledge, no offload decision is possible.

9.4 — Direct Evidence the L3 Path Is Software-Only

tc filter rules attached to the WAN interfaces explicitly report not_in_hw:

$ tc -s filter show dev eth8 ingress
filter parent ffff: protocol all pref 49152 u32 chain 0 fh 800::800 ... not_in_hw 
  match 00000000/00000000 at 0
        action order 1: connmark zone 0 pipe
         Sent 67350180932 bytes 70901285 pkt
        action order 2: mirred (Egress Redirect to device ifbeth8) stolen
         Sent 67350180932 bytes 70901285 pkt

The not_in_hw flag is the kernel telling you, in plain text: this filter rule is running in software. Each WAN-side packet is being:

Pulled from the NIC into a kernel skb
Classified by a u32 filter on the CPU
Connmark-zoned on the CPU
Mirrored to an ifbeth8 IFB (intermediate functional block) for traffic shaping on the CPU
Then run through the iptables FORWARD chain on the CPU
Then forwarded out

The aggregate counter — 67.3 GB / 70.9 million packets — represents traffic that has gone through that entirely-CPU path on the primary WAN. None of it benefited from the switch ASIC sitting on the same PCB.

9.5 — The Same iptables Architecture as the EFG

$ iptables -L FORWARD -nv | head -10
Chain FORWARD (policy ACCEPT)
 pkts      bytes target
358M     755 GB ALIEN
366M     769 GB TOR
557M    1159 GB IPS
557M     ----- UBIOS_FORWARD_JUMP

Same multi-chain pattern as the EFG: ALIEN for general matching, TOR for Tor-related rules, IPS invoking Suricata-related work, and UBIOS_FORWARD_JUMP chaining into Ubiquiti's deeper rule set. Several hundred million packets have traversed each chain.

A more revealing chain on the UDM Beast is UBIOS_PREROUTING_PBR (policy-based routing), which is even more extensive than what the EFG ships. It contains numerous ipset matches, NFLOG actions, MARK manipulations, and — notably — L7 application classification tags written into ipsets by the userspace dpi-flow-stats daemon:

MARK ... cat 3 app 156  cat 3 app 150  cat 20 app 186  ...

Every packet hits this chain. The DPI pipeline that consumed 39.6% of one core continuously on the EFG (Section 4.5) is doing the equivalent work here, with the additional cost of L7 application-level matching against ipsets that userspace populates.

conntrack continues to track every flow:

$ wc -l /proc/net/nf_conntrack
31245 /proc/net/nf_conntrack

$ cat /proc/sys/net/netfilter/nf_conntrack_max
2097152

31,245 active conntrack entries on a low-traffic snapshot, against a 2-million-entry maximum. Every flow — including inter-VLAN — has a conntrack entry that is created and updated on every packet. No flowtable shortcut, no offload to the ASIC.

9.6 — Drivers Loaded vs Drivers Possible

$ devlink dev info
pci/0002:01:00.0:  driver rvu_af
pci/0002:06:00.0:  driver rvu_nicpf
pci/0002:01:00.1-7, 01:01.0-7, 01:02.0:  driver rvu_nicvf  (16 VFs)
pci/0002:20:00.0:  driver rvu_cptpf
pci/0002:02:00.0, 03:00.0, 04:00.0, 05:00.0:  driver rvu_nicpf

These are OcteonTX RVU (Resource Virtualization Unit) drivers — the standard upstream Linux drivers for Marvell's CN10K NIC silicon. The CN10K platform supports DPDK, hardware flow tables, and XDP offload through these drivers when properly configured. None of those features are enabled on the UDM Beast.

The drivers are loaded as plain netdevs with no offload features active — the tc-offload flag fixed-off, no xdp programs attached, no flow_offload infrastructure engaged. The hardware capability is present at the silicon level. The software does not use it.

9.7 — Inter-VLAN Path on the UDM Beast

Putting it all together, the path of an inter-VLAN packet on the UDM Beast is:

Packet arrives on a switch port (e.g., a host on VLAN 10 sending to VLAN 20)
ASIC receives the packet, recognizes it as targeting a different VLAN (or just doesn't have an L3 entry programmed)
ASIC punts the packet to the CPU through the switch0.10 virtual port
Linux bridge br10 receives the punted frame
Kernel routing decision: ip_forward looks up the destination, decides it goes out br20
Packet traverses the iptables FORWARD chains (ALIEN, TOR, IPS, UBIOS_FORWARD_JUMP, UBIOS_PREROUTING_PBR)
Conntrack updates the flow entry
dpi-flow-stats classifies the packet at L7 and may update an ipset
The packet is sent via br20 to switch0.20
ASIC switches the now-tagged-VLAN-20 frame out the destination port

Steps 4 through 9 happen on a single CPU core in softirq context. The ASIC is bypassed for the routing decision; it only handles the L2 hop on either side of the kernel detour.

This is the same architectural pattern as the EFG, with two differences: the CPU is faster (so the bottleneck moves to a higher floor — perhaps 2-4 Gbps single-stream instead of 1-2), and there is a dedicated switching ASIC sitting unused for the routing path.

9.8 — What This Confirms About the Pattern

The EFG findings could be argued away as a one-off — older silicon, an early product, a lapsed kernel build. The UDM Beast diagnostic forecloses that argument:

Different SoC family (CN10K, not CN9K)
Different ARM cores (Neoverse N2, not Marvell-custom)
Newer kernel by 18 months (6.6, not 5.15)
Dedicated switching ASIC present (Prestera-class via PCIe)
Different driver stack (rvu_* family on a 6.6 kernel, with CN10K-era features available)

And yet:

Same iptables architecture
Same conntrack-on-every-packet pattern
Same userspace DPI sitting on the data path
Same Suricata IPS in pcap mode
No switchdev offload (across an entire generation of new silicon)
No flowtable
No DPDK
Software-only L3 path — proven by the not_in_hw filter tags and the 67 GB/70.9M packet counter on the WAN's CPU-mirred chain

This is a multi-generation pattern. Ubiquiti has shipped at least two generations of silicon with substantially different capabilities, both running fundamentally the same software stack, both leaving the silicon's hardware acceleration unused for the inter-VLAN forwarding path. Whatever is preventing the architectural fix is not a hardware constraint and not a kernel-version constraint. It is a software architecture decision that has persisted across product cycles.

The performance gap between the EFG and UDM Beast is roughly the per-core IPC ratio of their CPUs — exactly what you'd predict if the bottleneck were the kernel forwarding path running on one core. A faster CPU moves the floor up. It does not fix the architecture.

9.9 — But Ubiquiti Has Done Hardware-Accelerated Forwarding Elsewhere: The UCG Fiber

A reasonable counter-argument to everything above would be: "Maybe building a hardware-accelerated forwarding integration on a network gateway is just hard, and Ubiquiti hasn't gotten there yet on any product." That argument fails the moment you look at the UCG Fiber.

The UniFi Cloud Gateway Fiber (UCG-Fiber) is one of Ubiquiti's compact desktop gateways, retailing at approximately $279, advertised at "5 Gbps IDS/IPS throughput" with three 10 Gbps ports and four 2.5 Gbps ports. It runs on the Qualcomm IPQ9574 SoC — quad-core ARM Cortex-A73 at 2.2 GHz with 3 GB RAM. A reader of the gist provided diagnostics from a production UCG Fiber:

$ uname -a
Linux UCG-ironionet 5.4.213-ui-ipq9574 #5.4.213 SMP PREEMPT Wed Apr 29 ... aarch64 GNU/Linux

$ ls /usr/share/ubios-udapi-server/
ips/   ips_6/   ips_8/

$ ps -ef | grep suricata
/usr/share/ubios-udapi-server/ips_6/suricata/bin/suricata --pcap 
    --pidfile /run/suricata.pid 
    -c /usr/share/ubios-udapi-server/ips_6/config/suricata_ubios_high.yaml

The Suricata setup is identical to the EFG: same ips_6/ and ips_8/ directory structure with EOL Suricata 6.0.12 active and Suricata 8.0.2 staged-but-unused, same --pcap runtime, same suricata_ubios_high.yaml configuration. The IDS/IPS architecture is portable across product lines.

But the data path is completely different:

$ lsmod | grep nss
qca_nss_sfe         1273856  1 ecm
qca_nss_ppe_lag       20480  0
qca_nss_ppe_ds        24576  0
qca_nss_ppe_qdisc    102400  0
qca_nss_ppe_pppoe_mgr 16384  0
pppoe                 24576  3 qca_nss_sfe,ecm,qca_nss_ppe_pppoe_mgr
qca_nss_ppe_bridge_mgr 32768 0
qca_ovsmgr            45056  3 qca_mcs,ecm,qca_nss_ppe_bridge_mgr
qca_nss_ppe_vlan      49152  2 qca_nss_ppe_lag,qca_nss_ppe_bridge_mgr
qca_nss_ppe_vp        69632  3 qca_nss_ppe_vlan,ecm,qca_nss_ppe_ds
qca_nss_dp           147456  2 qca_nss_ppe_vp,qca_nss_ppe_ds
bonding              135168  3 qca_nss_ppe_vlan,ecm,qca_nss_ppe_pppoe_mgr
qca_nss_ppe          380928  9 qca_nss_dp,qca_nss_ppe_vp,qca_nss_ppe_vlan,...
qca_ssdk            2191360  4 qca_nss_dp,qca_nss_ppe

This is Qualcomm's NSS (Network Sub-System) PPE (Packet Processing Engine) stack — a hardware data path running on a dedicated network coprocessor inside the IPQ9574 SoC, separate from the main ARM cores. The relevant pieces:

qca_nss_ppe — the core PPE driver, the umbrella module
qca_nss_ppe_pppoe_mgr — hardware-offloaded PPPoE session management
qca_nss_ppe_vlan — hardware VLAN tag handling
qca_nss_ppe_bridge_mgr — hardware L2 bridging
qca_nss_ppe_lag — hardware link aggregation
qca_nss_ppe_ds — direct switching (port-to-port without CPU involvement)
qca_nss_ppe_vp — virtual ports (for VLAN sub-interfaces)
qca_nss_ppe_qdisc — hardware queue discipline (QoS in hardware)
qca_nss_sfe — Shortcut Forwarding Engine, the "fast path" that bypasses the kernel for established flows
ecm — Enhanced Connection Manager, the userspace daemon that programs flows into the SFE/PPE
qca_ssdk — SSDK (Switch SDK) — direct switch ASIC programming

pppoe.ko is loaded with three holders (qca_nss_sfe,ecm,qca_nss_ppe_pppoe_mgr), meaning the standard Linux PPPoE module is integrated with the hardware acceleration path.

And it's actually working in production:

$ cat /sys/kernel/debug/qca-nss-ppe/stats/common_stats | grep flows
[v4_l3_flows]: 174
[v4_l2_flows]: 0
[v4_vp_wifi_flows]: 0
[v4_ds_flows]: 0
[v6_l3_flows]: 0
[v6_l2_flows]: 0

174 IPv4 L3 flows are currently offloaded to hardware on this device. The kernel saw the first few packets of each of these flows, ECM programmed the flow into the NSS coprocessor, and the NSS hardware is now forwarding subsequent packets without CPU involvement at all — including (per the loaded modules) flows that traverse VLAN boundaries, that involve PPPoE encapsulation, and that need to be NATed.

This is exactly the architectural pattern the writeup recommends Ubiquiti adopt for the EFG's Octeon silicon: first-packet through the kernel for policy decisions; subsequent packets fast-pathed through hardware-accelerated dataplane workers. On the Octeon, the equivalent is DPDK + VPP using the Marvell-published reference architecture. On the Qualcomm IPQ9574, the equivalent is NSS/PPE + ECM using Qualcomm's reference architecture. The pattern is the same; only the silicon vendor differs.

9.10 — What This Tells Us About the Pattern

The UCG Fiber is the only device in Ubiquiti's gateway portfolio that engages its silicon's hardware acceleration for forwarding. Every other gateway in the lineup — across multiple silicon vendors and multiple SoC generations — has hardware acceleration available but unused.

Device	Price	SoC	Hardware acceleration available?	Hardware acceleration engaged?
UCG Fiber	~$279	Qualcomm IPQ9574	NSS/PPE/SFE/ECM (Qualcomm)	Yes — 174 flows in HW
EFG	~$2,000	Marvell Octeon CN9670	DPDK + hardware NIX engines	No
UDM Beast	varies	Marvell Octeon CN10K + Prestera ASIC	DPDK, switchdev offload, dedicated switch ASIC	No (ASIC unused for L3; `hw-tc-offload: off [fixed]`)
UDM Pro / UDM SE / UDM Pro Max	varies	Various ARM	TSO/RSS/limited offload	No
UCG Max / UCG Ultra / UXG Pro / UXG Lite	varies	Various ARM	TSO/RSS/limited offload	No
UDR / UDR-7 / UDR-5G-Max	varies	Mediatek/Qualcomm	Vendor-specific offload	No (per available teardowns)

The IDS/IPS architecture is identical across all of them — passive Suricata in --pcap, retroactive 3-tuple ipset blocking, EOL Suricata 6.0.12 active with 8.0.2 staged-but-unused. That stack is portable across products and silicon vendors. Ubiquiti has clearly invested in keeping the IDS/IPS architecture consistent across the lineup.

The dataplane integration is the opposite. It got done for exactly one product. The interesting question is why.

The most likely answer is that Ubiquiti ships whatever the silicon vendor's BSP provides, and only the Qualcomm BSP includes a pre-built hardware fast-path.

Qualcomm's IPQ ARM networking platform ships an OpenWrt-based BSP where the NSS/PPE/SFE/ECM stack is pre-integrated into the kernel network stack, ready to use out of the box. The router vendor compiling the BSP gets hardware-accelerated forwarding without doing the dataplane integration themselves — Qualcomm did the integration as part of the BSP. The hardware fast-path "just works" if you ship the BSP as-is.

Marvell's Octeon BSP, by contrast, ships DPDK as a separate userspace SDK. The Octeon kernel BSP gives you the NIC drivers and basic packet I/O, but the high-performance dataplane is a separate layer that the device vendor has to build themselves. Marvell publishes reference architectures (DPDK + VPP + Suricata-on-DPDK) and the silicon supports them, but actually shipping a working DPDK-accelerated gateway requires the vendor to engineer the dataplane application — write or port a control plane, integrate with the management UI, handle config persistence, integrate with the IDS/IPS pipeline, and so on. That's substantially more engineering work than just shipping a BSP.

Same pattern likely applies to the other ARM SoCs in the lineup. If hardware acceleration on a given silicon requires the vendor to build the dataplane, Ubiquiti hasn't built it. If hardware acceleration is pre-built into the BSP, Ubiquiti ships it.

This isn't "Ubiquiti can't build hardware-accelerated dataplanes" — they ship one on the UCG Fiber and it works. It's "Ubiquiti ships whatever the silicon vendor's BSP provides, and doesn't engineer dataplane integration themselves." Where the BSP includes a hardware fast-path, the customer gets one. Where the BSP doesn't — including on the flagship $2,000 EFG and the next-generation UDM Beast — the customer doesn't.

This forecloses the most charitable defense of the EFG's design. The argument would have been: "Building hardware-accelerated forwarding on a network gateway is genuinely difficult, and Ubiquiti hasn't gotten there yet." That argument fails on the UCG Fiber, where they did get there — but only because Qualcomm did the work. The corrected version of the argument would be: Ubiquiti's engineering investment goes into the IDS/IPS pipeline (consistent across products, even on EOL Suricata) and the UI/management plane (consistent across products) — but not into per-silicon dataplane engineering. When the silicon vendor ships a working dataplane in the BSP, customers benefit. When the silicon vendor leaves it as the device vendor's responsibility, Ubiquiti customers get a Linux kernel network stack instead.

The actual situation, then, is even more pointed than "the cheaper product outperforms the flagship." It's: the cheaper product outperforms the flagship because Qualcomm did dataplane engineering Ubiquiti didn't do for Marvell. The performance differential isn't a Ubiquiti achievement on the UCG Fiber — it's a Qualcomm achievement that Ubiquiti benefited from by using their BSP.

The EFG, the UDM Beast, and the rest of the lineup are running silicon whose vendors expected the device builder to engineer the dataplane. Ubiquiti didn't, on any of them.

The performance numbers reflect this. The UCG Fiber advertises 5 Gbps IDS/IPS throughput at $279 because the Qualcomm dataplane is doing the heavy lifting. The EFG, positioned as a flagship and costing ~7× as much, struggles to deliver 1-2 Gbps single-stream inter-VLAN routing — let alone with IPS enabled — because Ubiquiti is running the kernel's general-purpose network stack on the Marvell silicon instead of the dataplane Marvell expected them to build.

10. Findings: The Architectural Failures

Putting together the EFG diagnostics and the lab measurements, the findings are unambiguous.

Finding 1: The kernel network stack on a single core has a ceiling around 5 Gbps single-stream when offloads are off, regardless of NIC

Evidence: virtio-net (4.95 Gbps) and ConnectX VF (4.74 Gbps) measure within experimental error on the same kernel with offloads disabled. The Zen 4 core is identical in both tests. The difference between 4.95 and 4.74 is in the noise.

Implication for the EFG: their 2 GHz Octeon ARM core has its own per-cycle ceiling that's 3-5× slower than Zen 4 for this workload, putting the EFG kernel forwarding ceiling at ~1.0–1.5 Gbps. Reported user numbers match this range. The hardware silicon is not what's limiting them; the per-core kernel stack is.

Finding 2: Hardware offloads (GRO/TSO/LRO) are the single highest-impact configuration variable

Evidence:

virtio kernel forwarding: 4.95 Gbps (off) → 17.4 Gbps with flowtable (on) — 3.5× swing
ConnectX VF kernel forwarding: 4.74 Gbps (off) → 25.3 Gbps (on) — 5.3× swing

EFG state: hw-tc-offload: off [fixed], generic-receive-offload: off. Hard-coded off in the firmware build.

Finding 3: The 5-chain iptables FORWARD pattern costs roughly half your throughput when offloads are also off

Evidence:

virtio-net + offloads off: 4.95 Gbps → 2.36 Gbps when EFG-style rules are applied (52% drop)
ConnectX VF + offloads on: 25.3 Gbps → 21.1 Gbps when applied (17% drop, hidden by GRO)

EFG state: identical rule structure (ALIEN → TOR → IPS → UBIOS_FORWARD_JUMP → user → default). Confirmed by direct iptables diagnostic showing 874 million packets having traversed UBIOS_FORWARD_JUMP in 8 days.

Finding 4: nftables flowtable is the missing 3-7× single-stream multiplier

Evidence:

virtio + EFG rules: 2.36 Gbps → 7.05 Gbps with flowtable added (3.0×)
virtio + flowtable + offloads on: 17.4 Gbps (7.4× over 2.36 baseline)

EFG state: nf_flow_table module not loaded. nft list flowtables is empty. The kernel module isn't even installed on the device. This is a one-line configuration change in nftables that Ubiquiti could ship and immediately triple single-stream inter-VLAN performance.

Finding 5: Conntrack helpers are loaded by default, and the per-packet cost is widely misunderstood

The popular description: "Every packet is inspected by every loaded helper." This is approximately wrong. The actual cost depends on which phase of a flow the packet belongs to.

Phase 1 — New connection (SYN packet, first packet of a flow):

When conntrack creates a new entry for a flow, it walks nf_ct_helper_hash — a hash table keyed by L4 protocol + port — to determine if any registered helper applies. For TCP/21 (FTP control), it finds the FTP helper and attaches it to the conntrack entry. For TCP/443 (HTTPS), it finds nothing and attaches no helper. The per-new-connection cost is one hash lookup against the helper registry. Small but real.

This phase also touches nf_ct_expect_hash — the expectations table — to check if this new flow matches a previously-expected data connection (e.g., the data port that an active FTP control session announced via PORT or PASV). Empty expectations table = essentially zero cost; an active expectations table = small additional lookup.

Phase 2 — Established flow (every subsequent packet):

Once a flow has a conntrack entry, the per-packet helper logic in nf_conntrack_in() reads:

help = nfct_help(ct);          // pointer load from conntrack entry
if (help && help->helper)       // both NULL for non-helper flows
    help->helper->help(skb, ct, ...);

For a flow with no helper attached — the vast majority of traffic, since helper-relevant ports are rare — this is two pointer loads and a branch. Modern CPUs predict the not-taken branch perfectly. The cost on non-helper flows is essentially zero.

For flows that DO have a helper attached (e.g., active FTP control connection, ongoing SIP call), the helper's ->help() callback runs on every packet to inspect for protocol events (PORT command, RTP setup, etc.). This is genuine per-packet cost, but it only applies to flows on helper-recognized ports.

Why iperf3 throughput doesn't change when helpers are disabled: An iperf3 inter-VLAN test uses a single TCP connection on iperf3's port (5201 by default). That port is not a helper-recognized port. The connection has no helper attached. Phase 2's two-pointer-load-and-branch is essentially free. Disabling helpers via the UI removes the modules from memory, eliminating Phase 1 lookup cost on new connections — but it does not change anything in Phase 2 for non-helper flows.

Why helpers nonetheless matter at scale: An enterprise router doing ~10,000 new connections per second — driven by lots of short HTTP requests, DNS resolutions, and other transient flows — pays the Phase 1 helper-hash-lookup tax 10,000 times per second. Removing helpers eliminates that. It's not a per-packet win on data flows, it's a per-new-connection win.

The proper fix is not removing helpers: a correctly-architected router uses the netfilter flowtable for the data path. With flowtable, established flows bypass the entire netfilter chain (helpers included) and go through the offloaded fast path. Helpers continue to run on connection setup and on the control connection of helper protocols (e.g., FTP control), but the data connection of those protocols can be offloaded. You get full helper functionality and zero per-packet cost on data flows, simultaneously. This is what mainstream Linux distributions ship in 2026.

The EFG's kernel does not have flowtable compiled in (Section 12).

Four implementation approaches that would do this correctly:

nftables with explicit per-flow helper attachment (the modern, correct approach). Helpers attach only to flows matching explicit nftables rules — no global helper auto-attach, zero cost for any flow not matching the rule. Requires migrating from iptables to nftables.
Userspace conntrack helpers via netlink (kernel 3.6+). The kernel forwards control packets to a userspace daemon, which parses protocols and inserts expectations back via netlink. Pros: kernel stays small, helper bugs don't crash the kernel, helpers can be updated independently of the kernel. Cons: control-plane latency increase.
Don't NAT helper-protocol traffic at all. Modern protocols handle NAT traversal in the application layer (FTP passive mode, SIP+STUN/ICE, WebRTC). The kernel doesn't need to do ALG. Most enterprise gateways in 2026 have moved this direction; kernel helpers are legacy.
Keep helpers, add flowtable (the practical fix for an existing iptables-based system). Helpers run on connection setup and helper-protocol control channels; flowtable handles the data path of every other flow. Best compatibility with existing rule sets.

EFG state: A "Firewall Connection Tracking" toggle in the UniFi controller's Gateway settings exposes individual checkboxes for FTP, H.323, SIP, GRE, PPTP, and TFTP helpers. Disabling them all unloads the helper modules entirely — which addresses Phase 1 lookup overhead on new connections but does nothing for the bigger architectural issues. The toggle's existence confirms that Ubiquiti's engineering team is aware that helpers cost something. They have implemented a partial fix (the toggle) instead of the proper fix (flowtable). The proper fix would require shipping nf_flow_table.ko, which they have chosen not to do (Section 12).

Finding 6: Multiple cores do not help single-flow forwarding in the kernel

Evidence: 8 parallel streams with the EFG ruleset reach 11.4 Gbps aggregate (~1.4 Gbps per stream). Single stream caps at 2.36 Gbps. The EFG's mpstat shows all 18 cores idle except the one with the active flow.

EFG state: 18 cores, but RSS hashes a single TCP 5-tuple to one queue, which binds to one core. Adding cores to a kernel-based router cannot fix single-flow performance. Faster per-core, fewer per-packet steps, hardware offload, or a userspace dataplane (which can poll across worker cores) can.

Finding 7: DPDK on the same silicon delivers 10-25× the throughput, and the vendor ships full DPDK support

Evidence:

Lab VPP/DPDK on ConnectX with offloads: 35.6 Gbps single-stream (15× over the EFG-style baseline)
Marvell's published cnxk PMD benchmarks: 18-36 Gbps single-core on CN9670-class silicon
Suricata 7.0+: native DPDK input mode shipped 2023
VPP: native cnxk plugin shipped 2020
The full reference architecture (DPDK + VPP + Suricata-on-DPDK) is published by Marvell and field-deployed by NFV vendors

EFG state: zero DPDK. The cnxk PMD is not loaded. Suricata runs in pcap mode (per-packet kernel→userspace copy) instead of DPDK mode. Ubiquiti would lose nothing by adopting DPDK — their primary inspection workload (Suricata) supports it, their silicon vendor supports it, and the resulting performance on the same hardware would be 10-25× higher.

Finding 8: Userspace inspection processes contend with the forwarding core, with the contention pattern depending on per-process CPU pinning

Evidence (EFG): dpi-flow-stats at 39.6% CPU (with no CPU pinning — affinity mask 0x3ffff = all 18 cores allowed) + Suricata-Main at 6.9% + conntrackd at 7.0%. Suricata is configured with explicit per-thread pinning: management thread on core 0, verdict thread on core 1, workers across all cores (Section 4.5 details).

The single-stream contention story specifically: A single TCP flow's softirq lands on whichever core RSS hashes the 5-tuple to (typically core 0 by default on the EFG). On that core: Suricata's management thread is pinned there, Suricata workers may be scheduled there, and dpi-flow-stats can run there with no restriction. All of these contend with forwarding softirq for the same core's cycles.

Evidence (lab): a deliberate spinner pinned to a non-forwarding core had no effect on single-stream throughput (correctly isolated). When CPU contention is on the forwarding core, throughput drops proportionally.

Implication: even if Ubiquiti fixed every other issue, single-stream throughput would still depend on which userspace processes happen to land on the same physical core as the active flow's softirq. Mitigations include: explicit taskset/cgroup pinning of dpi-flow-stats off the dominant RSS hash core; relocating Suricata's management-cpu-set off core 0 (one-line YAML change); RSS reconfiguration to hash flows away from the cores Suricata pins to. The proper fix is Suricata in DPDK mode on dedicated worker cores (supported since Suricata 7.0, 2023), which moves all per-packet inspection out of the kernel path entirely.

Finding 9: Per-VLAN bridges instead of vlan-aware single bridge prevent kernel fast-path optimization

Evidence (EFG): br0, br3, br5, br6, br7, br254, br1111 — one bridge per VLAN. Inter-VLAN traffic must traverse multiple bridge hops plus a kernel L3 lookup.

Lab equivalent: vmbr1 with VLAN-aware mode and bridge VID filtering allows a single bridge to handle all VLANs. With flowtable on top, established flows skip the bridge slow path entirely.

Implication: even without flowtable, switching to a vlan-aware bridge architecture would simplify the data path and enable bridge VID hardware offload paths that the current per-bridge structure cannot use.

Finding 10: PPPoE WAN performance is bottlenecked by the same kernel stack, with additional encapsulation cost — and worse multi-core spread

Evidence (deployment reports): enterprise customers on 10 Gbps PPPoE fiber consistently report 2-3 Gbps single-stream WAN throughput on the EFG.

Evidence (live capture during a Netflix Fast.com test on a production EFG): six different ksoftirqd kernel threads simultaneously consuming 55-100% CPU (cores 0, 2, 7, 10, 12, 14), with concurrent userspace inspection load (Suricata 44%, ubios-udapi-ser 22%, unifi-core 16%) competing for the same cores. The PPPoE encap/decap path forces multiple kernel-stack passes per packet, each potentially landing on a different core, multiplying total CPU consumption while not improving single-flow throughput.

Evidence (mainline Linux): kernel 6.2+ ships PPPoE handling within nf_flow_table.ko — the protocol checks and helpers (nf_flow_pppoe_proto, __nf_flow_pppoe_proto, ETH_P_PPP_SES matching) are inline within nf_flow_table_ip.c, nf_flow_table_inet.c, and nf_flow_table_offload.c, all of which compile into the existing nf_flow_table.ko and nf_flow_table_inet.ko modules. The EFG runs kernel 5.15. The nf_flow_table and nf_flow_table_inet modules are not even compiled into Ubiquiti's kernel build — modinfo returns "Module not found" for both, meaning the entire flowtable infrastructure (including any PPPoE acceleration) is absent.

Implication: PPPoE WAN performance is not a hardware limitation. It is the same per-core kernel ceiling as inter-VLAN routing, with an additional encapsulation layer that mainline Linux now supports accelerating, and a multi-pass softirq pattern that is more expensive than plain inter-VLAN forwarding. The fix is a kernel rebase plus the same flowtable directive — or DPDK + accel-ppp + VPP, which Marvell publishes as a reference architecture for this exact silicon.

Finding 11: The EFG's kernel is binary-incompatible with vanilla 5.15.72 despite identifying as such, and the safety net that would catch this is disabled

Evidence: We cross-compiled nf_tables, nf_flow_table, and nf_flow_table_inet from vanilla linux-5.15.72.tar.xz (kernel.org), using the EFG's exposed /proc/config.gz as the build configuration. The resulting modules report a vermagic string identical character-for-character to the EFG's existing in-tree modules: 5.15.72-ui-cn9670 SMP mod_unload aarch64. Loading nf_tables.ko on the device caused an immediate kernel panic (NULL pointer dereference at virtual address 0x120 during module init), forcing a watchdog reboot.

Evidence (config audit):

$ zcat /proc/config.gz | grep -E "MODVERSIONS|TRIM_UNUSED_KSYMS|MODULE_SIG"
CONFIG_HAVE_ASM_MODVERSIONS=y
# CONFIG_MODVERSIONS is not set
# CONFIG_TRIM_UNUSED_KSYMS is not set
[no CONFIG_MODULE_SIG entries]

CONFIG_MODVERSIONS would have caught the binary incompatibility at load time with a clean error message. It is disabled. CONFIG_MODULE_SIG (cryptographic module signing) is not even built into the kernel. lockdown is not enabled. The root filesystem is writable via overlay.

Implication: Two findings, both serious.

First, the EFG's kernel is not actually vanilla 5.15.72 even though it identifies as 5.15.72-ui-cn9670 and reports the upstream version. Ubiquiti has applied undisclosed patches that change netfilter's internal data structures or function signatures. Customers who attempt to enable missing kernel features by building from the announced upstream tag will produce modules that load (because vermagic matches) but crash (because the real ABI doesn't). This is exactly why the GPL exists — it requires vendors to publish the complete corresponding source so customers can rebuild against the actual kernel they received, not the vanilla one it claims to be.

Second, the security configuration is unusually permissive for an enterprise security product: no module signing, no kernel lockdown, no symbol-CRC verification, writable root via overlay. Any process that becomes root can load arbitrary unsigned, unverified kernel modules with no cryptographic check. Combined with the binary-incompatible-but-not-detected ABI, this is a pathway for both accidental crashes and deliberate exploitation.

A GPL source request was filed with opensource-requests@ui.com at the time of this writing. Until it is fulfilled, even a customer with full root access on hardware they own cannot enable the missing performance features safely. Section 12 documents this experiment in detail.

11. Recommended Fixes

The findings above translate directly to a list of prioritized configuration changes Ubiquiti could ship. None of these require new hardware. All are available in mainline Linux or as vendor-supported infrastructure from Marvell. Several are config changes that do not even require a kernel update.

Fix 1 (Highest Impact, Lowest Effort): Enable nftables flowtable

What: Load the nf_flow_table kernel module and add a flowtable directive to the active nftables ruleset. The hook is software-only (no hardware offload required) and works on any modern kernel (5.4+).

Configuration sketch:

table inet filter {
    flowtable f {
        hook ingress priority 0
        devices = { eth_lan_vlan10, eth_lan_vlan20, ... }
    }

    chain forward {
        type filter hook forward priority 0; policy accept;
        ip protocol { tcp, udp } flow add @f
        ct state established,related accept
        ... existing security rules ...
    }
}

Measured improvement: 2.36 → 7.05 Gbps single-stream (3.0×) on virtio. Combined with offloads enabled: 17.4 Gbps (7.4×).

Trade-off: Flows in the fast-path bypass conntrack and rule evaluation. Security rules must be applied to the first few packets of a flow, before it's offloaded. Existing iptables/nftables rules continue to work; only established flows are accelerated. The IPS / DPI processes that need every packet would need to be moved to a different inspection point (e.g., promiscuous tap on the bridge, or sFlow sampling) — but most of them only need flow-level visibility, which conntrack already provides.

Fix 2 (Highest Impact on Hardware): Enable hardware offloads (GRO/TSO/LRO/hw-tc-offload)

What: Stop hard-coding hw-tc-offload off [fixed]. Enable GRO and TSO on the kernel side. On the Octeon CN9670 (and CN10K on the UDM Beast), enable the NIX hardware acceleration path — these are first-party Marvell engines designed to forward packets without ARM core involvement.

Measured improvement: 4.74 → 25.3 Gbps single-stream (5.3×) on ConnectX VF with kernel forwarder when offloads enabled. The same pattern applies to any NIC with hardware-accelerated forwarding, including the Octeon NIX.

Trade-off: Hardware offload paths typically require the kernel and the device firmware to agree on which features can be offloaded. Some advanced features (like complex iptables matchers) can't be offloaded; the kernel falls back to software for those packets. This is a graceful degradation, not a failure — the fast path handles the common case, slow path handles edge cases. Modern flowtable in switchdev mode (which ConnectX-6 Dx and Octeon CN9670 both support) hands established TCP/UDP flows directly to silicon.

Fix 3 (The Biggest Architectural Win): Adopt DPDK + VPP for the dataplane

What: Migrate the forwarding plane from kernel ip_forward to VPP with the Marvell-supported cnxk DPDK PMD. Move Suricata to its native DPDK mode (available since Suricata 7.0). Pin VPP worker threads and Suricata workers to dedicated CPU cores, leaving the control plane (UniFi management, control plane protocols, dpi-flow-stats summaries) on a separate core.

Why this is the biggest win: Marvell publishes complete DPDK + VPP reference architectures for the OCTEON family. The cnxk PMD is open-source, well-maintained, and ships with mainline DPDK. Suricata's DPDK mode is production-deployed by major NFV vendors. Every component Ubiquiti needs is already vendor-supported, mainline open-source software. They lose nothing by adopting it.

Estimated improvement on EFG silicon:

Single-stream inter-VLAN: from ~1 Gbps to 15-25 Gbps (15-25×)
PPPoE WAN single-stream: from ~3 Gbps to 8-10 Gbps (line rate on 10G PPPoE)
Aggregate: from a few Gbps to line rate on both 25G ports (50 Gbps)
Inspection (Suricata): from kernel-pcap mode to DPDK direct, eliminating per-packet kernel→userspace copy

Trade-off: Largest engineering investment of any fix. Ubiquiti would need to rewrite their forwarding plane on top of VPP's API and integrate VPP's CLI/API with their UniFi controller. However, all the heavy lifting (the PMD, the dataplane, the Suricata DPDK integration) already exists. They are integrating, not inventing.

Fix 4 (Architectural): Move from per-VLAN bridges to a single VLAN-aware bridge

What: Replace br0, br3, br5, ... with a single bridge in bridge_vlan_filtering=1 mode, with VID assignments per port. Combined with nf_flow_table on the same bridge, this enables flowtable to short-circuit established flows entirely within the bridge layer.

Measured improvement: Indirect — enables Fix 1 and Fix 2 to be more effective, particularly for inter-VLAN flows that today must traverse multiple bridges. Direct measurements not made in this study, but Linux upstream has documented order-of-magnitude improvements in similar setups.

Trade-off: Configuration migration. Existing ruleset references to specific bridge devices need updating to reference the unified bridge. Manageable as a firmware update.

Fix 5 (CPU Hygiene): Pin userspace inspection processes off the dominant data-path cores

What: Use cgroup, systemd CPUAffinity=, or taskset to ensure dpi-flow-stats (currently unrestricted, allowed on all 18 cores) is pinned to cores that aren't on the dominant RSS hash path for inter-VLAN traffic. Separately, in /usr/share/ubios-udapi-server/ips_6/config/suricata_ubios_high.yaml, move management-cpu-set from [ 0 ] to a higher core (e.g., [ 2 ]) so the management thread doesn't contend with single-flow forwarding softirq on core 0. Additionally, RSS could be reconfigured to hash inter-VLAN flows away from cores 0 and 1 (which Suricata already pins to for management and verdict threads).

Measured improvement: Indirect, on the order of 10-20% on single-stream throughput, because it frees the specific core that's bottlenecking that flow from cycle competition. Larger benefit on systems where a flow lands on core 0 (the default) by changing where its competitors live.

Trade-off: None of consequence. This is basic Linux performance hygiene that any production router enables. The cost is a few sysfs/systemd-cgroup changes plus one YAML edit. Becomes moot after Fix 3 (with DPDK, each Suricata/dataplane worker has its own dedicated core by design).

Fix 6 (Modern API): Migrate from iptables to native nftables

What: The current ruleset is on the legacy iptables (xt_*) backend with 839 rules. Native nftables is faster per-rule, supports flowtable natively (Fix 1 builds on this), supports atomic ruleset replacement (no flushing), and is the future of Linux netfilter.

Measured improvement: Single-digit percentage points on its own; enables Fix 1 to reach its full potential.

Trade-off: Migration cost. Tools like iptables-translate automate most of it. The tools that produce the existing ruleset (presumably internal Ubiquiti config generators) need to emit nft syntax instead.

Fix 7 (Already partially shipped): Conntrack helper toggle

What: The UniFi controller already exposes a "Firewall Connection Tracking" control in Gateway settings, with checkboxes for FTP, H.323, SIP, GRE, PPTP, and TFTP helpers. Enterprise deployments without those legacy protocols can disable them all to unload the helper modules entirely.

What this actually does: Removes Phase 1 helper-hash-lookup overhead on new connections (see Section 10 Finding 5). On a router doing tens of thousands of new connections per second, this is a meaningful reduction in connection-setup CPU cost.

What this does NOT do: It does not change throughput on already-established TCP flows like an iperf3 test. The Phase 2 per-packet cost on non-helper flows is essentially zero whether helpers are loaded or not. iperf3 inter-VLAN single-stream throughput is unchanged.

Why this is a partial fix: The architecturally correct answer is to use the kernel's flowtable for the data path so that established flows bypass the entire netfilter chain (helpers and all) at line rate, while helpers continue to handle the control connections of legitimate helper-protocol traffic. That requires shipping nf_flow_table.ko, which the EFG does not have (Section 12). The toggle's existence is evidence that Ubiquiti's engineering team understands the helpers-cost-something question; they have shipped a partial mitigation rather than the proper fix.

Recommended action for administrators: If your deployment doesn't use FTP active-mode NAT, H.323 video conferencing, SIP through ALG (most modern SIP deployments use STUN/ICE instead), PPTP VPN, or TFTP, disable all of them. It's a free win on connection-setup costs.

Fix 8 (Long-term): Ship a newer kernel

What: Linux 5.15 LTS dates from late 2021. Kernel 6.6 LTS (the current LTS) includes substantial nftables, flowtable, bridge improvements, and PPPoE flowtable acceleration handled inline within nf_flow_table.ko (added in kernel 6.2+). Kernel 6.12 LTS includes hardware-offloaded flowtable for several NICs and improved per-CPU optimizations.

Measured improvement: Compounding with Fix 1, Fix 2, and the PPPoE acceleration. Recent kernels have made nf_flow_table faster per-packet, made hardware-offload setup easier, and added PPPoE-specific acceleration that the EFG completely lacks today.

Trade-off: Vendor kernel update. The Octeon vendor BSP (Marvell's "ubuntu-cn9670") will need to be rebased on a newer kernel. Not trivial but routine for a hardware vendor; Marvell themselves publish 6.x-based BSP releases.

Fix Priority Ranking

Priority	Fix	Effort	Single-stream improvement
1	Enable flowtable	Low (config)	3.0×
2	Enable hardware offloads	Low–Medium (config + firmware)	up to 5.3×
3	Adopt DPDK + VPP + Suricata-DPDK	High (engineering)	15-25× — and fixes PPPoE too
4	Newer kernel (5.15 → 6.6+)	Medium	enables PPPoE flowtable, +small kernel gains
5	Pin inspection processes off data-path core	Low (config)	small but additive
6	Per-VLAN bridges → vlan-aware single bridge	Medium (config migration)	enables 1+2
7	iptables → nftables	Medium	enables 1, small direct
8	Conntrack helper toggles (already shipped — disable in UI)	Free (UI checkbox)	none on iperf3, small on connection setup

Doing Fix 1 alone gets you 3× the single-stream throughput. Fix 1+2 gets you 7×. Fix 3 — the long-term architectural fix that the silicon vendor literally publishes a reference architecture for — gets you 15-25×. The hardware does not need to change.

12. Direct Experimental Verification — Building the Missing Modules

The analysis to this point rests on lab measurements made on x86 hardware that reproduces the EFG's software stack. The lab data is reproducible and self-consistent, but a fair reader can ask: would the recommended fixes actually work on the real device?

To find out, we attempted the most surgical of the recommended fixes — adding the missing nftables flowtable kernel modules — to a production EFG. The exercise was instructive in ways we did not anticipate, and the results materially strengthen Section 10's findings about the EFG's kernel.

What follows is a complete, honest record of the attempt. Both attempts ultimately crashed the device. Neither outcome was the desired success path, but the failure modes themselves are diagnostic — they reveal precisely how far Ubiquiti's kernel diverges from any reproducible public source.

12.1 — Feasibility Assessment

Loading a third-party kernel module into a running kernel requires a few prerequisites:

A matching kernel version (vermagic). The Linux module loader rejects any module whose vermagic string doesn't match the running kernel's exactly.
Module loading not blocked by signing. If CONFIG_MODULE_SIG_FORCE=y or module.sig_enforce=1, only modules signed by an in-kernel trusted key can load.
No kernel lockdown. If a Secure Boot lockdown is engaged, module loading from disk is restricted regardless of signing config.
A writable filesystem location, since module files must be readable from disk by init_module(2) or finit_module(2).

We confirmed each on a production EFG via SSH:

$ cat /proc/cmdline
console=ttyAMA0,115200n8 earlycon=pl011,0x87e028000000 maxcpus=18 isolcpus=12 
rootwait rw coherent_pool=16M pcie_aspm=off net.ifnames=0 sysid=ea3d 
root=PARTUUID=...

No module.sig_enforce=1. No lockdown= argument. No lsm=lockdown,....

$ cat /sys/module/module/parameters/sig_enforce
N

Module signing not enforced.

$ zcat /proc/config.gz | grep -E "^CONFIG_(MODULE_SIG|SECURITY_LOCKDOWN|MODVERSIONS|TRIM_UNUSED_KSYMS)"
# CONFIG_MODULE_SIG is not set
# CONFIG_MODVERSIONS is not set
# CONFIG_TRIM_UNUSED_KSYMS is not set
# CONFIG_SECURITY_LOCKDOWN_LSM is not set

This was both encouraging and concerning. Encouraging because it meant we had a clean path to load a custom-built module if we could match vermagic. Concerning because these missing options are exactly the safeguards a production firmware should have:

MODULE_SIG: prevents loading unsigned modules. Any process with CAP_SYS_MODULE (root, in containers if not seccomp'd) can load arbitrary kernel code.
MODVERSIONS: adds CRC checksums to every exported symbol. A module built against a kernel with subtly different struct layouts will be refused at load time rather than crashing the kernel later.
TRIM_UNUSED_KSYMS: limits the surface area of exposed kernel symbols.
SECURITY_LOCKDOWN_LSM: restricts what root can do to a running kernel.

The implications of these absences are explored further in Section 10, Finding 11. For the experiment, they meant that load-time symbol mismatches would not be caught — the kernel would happily start executing code with bad assumptions about struct layouts.

The EFG's filesystem is overlayfs root with a writable upper layer at /mnt/.rwfs/data. Modules placed in /tmp survive long enough to load.

The flowtable modules (nf_flow_table.ko, nf_flow_table_inet.ko, plus nf_tables.ko as a dependency) are absent from the EFG's /lib/modules/:

$ find /lib/modules/$(uname -r) -name 'nf_flow_table*' -o -name 'nf_tables.ko'
[no output]

$ modinfo nf_flow_table
modinfo: ERROR: Module nf_flow_table not found

The modules are not merely disabled; they are not present in the build. We needed to compile them ourselves.

12.2 — Cross-Compilation Setup, Attempt 1: Vanilla 5.15.72

A separate build VM was provisioned on the lab host:

Ubuntu 24.04 LTS, 16 vCPU, 32 GB RAM
gcc-10-aarch64-linux-gnu 10.5.0 from the noble-universe repository (matches the EFG's compiler family)
Linux 5.15.72 source tree from kernel.org

The EFG's running kernel reports itself as:

$ uname -r
5.15.72-ui-cn9670

$ uname -a
Linux EFG-Home-SP 5.15.72-ui-cn9670 #5.15.72 SMP Wed Apr 15 23:39:47 CST 2026 
aarch64 GNU/Linux

$ strings /lib/modules/5.15.72-ui-cn9670/kernel/net/netfilter/nf_conntrack_ftp.ko \
    | grep -E '^(vermagic|name)='
vermagic=5.15.72-ui-cn9670 SMP mod_unload aarch64
name=nf_conntrack_ftp

The build process:

$ export ARCH=arm64
$ export CROSS_COMPILE=aarch64-linux-gnu-
$ export CC=aarch64-linux-gnu-gcc-10

$ cd ~/efg-build/vanilla-5.15.72/linux-5.15.72
$ cp ~/efg-build/efg-running.config .config

# Set CONFIG_LOCALVERSION inside the .config (not the env)
$ ./scripts/config --set-str CONFIG_LOCALVERSION "-ui-cn9670"

# Enable the modules we want to build
$ ./scripts/config --module CONFIG_NF_TABLES
$ ./scripts/config --module CONFIG_NF_FLOW_TABLE
$ ./scripts/config --module CONFIG_NF_FLOW_TABLE_INET

# Disable BTF generation (would require pahole on EFG kernel — not available)
$ ./scripts/config --disable CONFIG_DEBUG_INFO_BTF

# Reconcile
$ make olddefconfig

$ time make -j$(nproc) modules
real    1m52s

$ for ko in net/netfilter/nf_tables.ko \
            net/netfilter/nf_flow_table.ko \
            net/netfilter/nf_flow_table_inet.ko; do
    strings $ko | grep -E '^(vermagic|name)='
  done
vermagic=5.15.72-ui-cn9670 SMP mod_unload aarch64
name=nf_tables
vermagic=5.15.72-ui-cn9670 SMP mod_unload aarch64
name=nf_flow_table
vermagic=5.15.72-ui-cn9670 SMP mod_unload aarch64
name=nf_flow_table_inet

Three modules. All vermagic strings byte-perfect matches for the EFG kernel.

12.3 — The Vanilla Build Crashed the Device

The modules were copied to the EFG and loading was attempted in dependency order:

$ scp nf_tables.ko nf_flow_table.ko nf_flow_table_inet.ko \
      root@efg-prod:/tmp/

$ ssh root@efg-prod
# cd /tmp
# insmod ./nf_tables.ko
[connection drops, device reboots]

The kernel oops, captured before the watchdog reboot:

[ ... ] Unable to handle kernel NULL pointer dereference at virtual address 0x120
[ ... ] Mem abort info:
[ ... ]   ESR = 0x96000004
[ ... ]   FSC = 0x4: level 0 translation fault
[ ... ] Internal error: Oops: 96000004 [#1] SMP
[ ... ] Modules linked in: nf_tables(+) wireguard libchacha20poly1305 ...
[ ... ] CPU: 3 PID: 211748 Comm: insmod Tainted: P W O 5.15.72-ui-cn9670 #5.15.72
[ ... ] Hardware name: Marvell OcteonTX CN96XX board (DT)
[ ... ] pc : nf_tables_init_net+0x18/0x94 [nf_tables]
[ ... ] lr : ops_init+0x3c/0x120
[ ... ] Call trace:
[ ... ]  nf_tables_init_net+0x18/0x94 [nf_tables]
[ ... ]  ops_init+0x3c/0x120
[ ... ]  register_pernet_operations+0xec/0x240
[ ... ]  register_pernet_subsys+0x2c/0x50
[ ... ]  nf_tables_module_init+0x24/0x100 [nf_tables]

The HA secondary in the home cluster failed over within ~8 seconds. Service was restored without operator intervention.

The crash happened at byte 24 of the function nf_tables_init_net — extremely early in the per-network-namespace initialization. nf_tables_init_net is one of the very first things register_pernet_subsys calls when the module starts up. It tries to read a field at offset 0x120 from a struct pointer that the kernel allocated, and the kernel handed back a struct that doesn't have a valid pointer at that offset.

This isn't a "missing symbol" error or a "wrong function signature" error. The module loaded successfully. Its symbols resolved against the running kernel's symbol table. Execution started. And then, within microseconds, it dereferenced a struct field at an offset where the running kernel doesn't have what our module expected.

That's an ABI mismatch — the structure layout in our build's view of the kernel is different from the structure layout in the EFG's running kernel.

12.4 — Why Vanilla 5.15.72 Crashed

The crash happens because:

# CONFIG_MODVERSIONS is not set
# CONFIG_TRIM_UNUSED_KSYMS is not set

Without MODVERSIONS, the kernel module loader has no per-symbol CRC to compare. Vermagic only checks "this is kernel 5.15.72-ui-cn9670 SMP aarch64" — it doesn't say "the struct net has a particular field at offset 0x120." If the EFG's nf_tables_pernet struct has a different field count than vanilla's, the build still produces a module that loads cleanly. It just crashes when execution hits a misaligned access.

This means either:

(a) Ubiquiti rebased Linux 5.15.72 on top of patches from a different kernel version, OR
(b) Ubiquiti or a vendor (Marvell) added fields to internal structures that vanilla 5.15.72 doesn't have, OR
(c) Both.

Section 12.5 below addresses (b) directly by attempting to build against Marvell's complete published BSP — the largest plausible source of vendor-specific kernel patches for this silicon.

12.5 — Cross-Compilation Setup, Attempt 2: Marvell OCTEON BSP

The Marvell OCTEON CN9670 SoC has substantial vendor-specific Linux support that is not in mainline. Marvell maintains kernel patches for their hardware engines (NIX network units, RVU resource virtualization, NPA packet allocator, SSO event scheduler, CPT crypto), and these patches frequently touch core kernel infrastructure including netfilter (where Marvell integrates hardware flow offload acceleration).

Marvell publishes their kernel patches through the Yocto Project's linux-yocto repository, branch v5.15/standard/cn-sdkv5.15/octeon, maintained by Bo Sun (Marvell engineer) and merged by the Yocto Project's kernel maintainer (Bruce Ashfield). This is a public, GPL-licensed source tree.

$ git clone https://git.yoctoproject.org/linux-yocto.git linux-yocto-cnxk-5.15
$ cd linux-yocto-cnxk-5.15
$ git checkout v5.15/standard/cn-sdkv5.15/octeon

$ head -5 Makefile
# SPDX-License-Identifier: GPL-2.0
VERSION = 5
PATCHLEVEL = 15
SUBLEVEL = 203
EXTRAVERSION =

The branch HEAD is at 5.15.203 (a stable update) with the full Marvell OCTEON CN9K patch set applied on top.

Examination of the source tree shows the BSP modifies sixteen netfilter-related header files compared to vanilla Linux 5.15.72:

$ for f in $(find ~/vanilla-5.15.72/include -name "*netfilter*" -o -name "*nf_*"); do
    rel=${f#*/include/}
    bsp=~/linux-yocto-cnxk-5.15/include/$rel
    if [ -f "$bsp" ] && ! diff -q "$f" "$bsp" >/dev/null 2>&1; then
      echo "DIFFERS: $rel"
    fi
  done

DIFFERS: net/netfilter/nf_conntrack.h
DIFFERS: net/netfilter/nf_conntrack_count.h
DIFFERS: net/netfilter/nf_conntrack_timeout.h
DIFFERS: net/netfilter/nf_flow_table.h
DIFFERS: net/netfilter/nf_nat_redirect.h
DIFFERS: net/netfilter/nf_tables.h
DIFFERS: net/netfilter/nf_tables_core.h
DIFFERS: net/netfilter/nf_tproxy.h
DIFFERS: net/netns/netfilter.h
DIFFERS: linux/netfilter.h
DIFFERS: linux/netfilter_defs.h
DIFFERS: linux/netfilter/nf_conntrack_sctp.h
DIFFERS: uapi/linux/netfilter_bridge.h
DIFFERS: uapi/linux/netfilter/nf_conntrack_common.h
DIFFERS: uapi/linux/netfilter/nf_conntrack_sctp.h
DIFFERS: uapi/linux/netfilter/nf_tables.h

Several of these headers contain function-signature changes that explain why a vanilla-built module would crash. For example, in nf_conntrack_count.h:

-unsigned int nf_conncount_count(struct net *net,
-                                struct nf_conncount_data *data,
-                                const u32 *key,
-                                const struct nf_conntrack_tuple *tuple,
-                                const struct nf_conntrack_zone *zone);
+unsigned int nf_conncount_count_skb(struct net *net,
+                                    const struct sk_buff *skb,
+                                    u16 l3num,
+                                    struct nf_conncount_data *data,
+                                    const u32 *key);

The function was renamed, and its signature changed. In nf_flow_table.h:

-int flow_offload_route_init(struct flow_offload *flow,
-                            const struct nf_flow_route *route);
+void flow_offload_route_init(struct flow_offload *flow,
+                             struct nf_flow_route *route);

Return type changed from int to void; const removed from the route argument.

The same header backports a feature from kernel 6.2 — PPPoE flowtable acceleration — into 5.15:

+static inline bool nf_flow_pppoe_proto(struct sk_buff *skb, __be16 *inner_proto)
+{
+    if (!pskb_may_pull(skb, ETH_HLEN + PPPOE_SES_HLEN))
+        return false;
+
+    *inner_proto = __nf_flow_pppoe_proto(skb);
+    return true;
+}

This last item is significant: Marvell's BSP includes a PPPoE flowtable backport that mainline 5.15 does not have. If we can build a module against this BSP and load it on the EFG, we should — in principle — get not only inter-VLAN flowtable acceleration but PPPoE flowtable acceleration as well.

The build:

$ cd linux-yocto-cnxk-5.15

# Force SUBLEVEL=72 to match EFG vermagic (BSP HEAD is 5.15.203)
$ sed -i 's/^SUBLEVEL = .*/SUBLEVEL = 72/' Makefile

# Suppress kbuild dirty marker
$ touch .scmversion

# Apply EFG running config and target modules
$ cp ~/efg-build/efg-running.config .config
$ ./scripts/config --set-str CONFIG_LOCALVERSION "-ui-cn9670"
$ ./scripts/config --module CONFIG_NF_TABLES
$ ./scripts/config --enable CONFIG_NF_TABLES_INET
$ ./scripts/config --enable CONFIG_NF_TABLES_IPV4
$ ./scripts/config --enable CONFIG_NF_TABLES_IPV6
$ ./scripts/config --module CONFIG_NF_FLOW_TABLE
$ ./scripts/config --module CONFIG_NF_FLOW_TABLE_INET
$ ./scripts/config --enable CONFIG_NF_FLOW_TABLE_IPV4
$ ./scripts/config --enable CONFIG_NF_FLOW_TABLE_IPV6
$ ./scripts/config --disable CONFIG_DEBUG_INFO_BTF
$ ./scripts/config --disable CONFIG_MODULE_SIG_ALL

$ make olddefconfig
$ make kernelrelease
5.15.72-ui-cn9670

$ time make -j$(nproc)
real    1m59s

Five modules built, all with byte-perfect vermagic:

$ for ko in $(find . -name 'nf_tables.ko' -o -name 'nf_flow_table*.ko' | sort); do
    echo "=== $(basename $ko) ==="
    strings $ko | grep -E '^(vermagic|name|depends)='
  done

=== nf_flow_table.ko ===
name=nf_flow_table
vermagic=5.15.72-ui-cn9670 SMP mod_unload aarch64
=== nf_flow_table_inet.ko ===
name=nf_flow_table_inet
depends=nf_flow_table,nf_tables
vermagic=5.15.72-ui-cn9670 SMP mod_unload aarch64
=== nf_flow_table_ipv4.ko ===
name=nf_flow_table_ipv4
depends=nf_flow_table,nf_tables
vermagic=5.15.72-ui-cn9670 SMP mod_unload aarch64
=== nf_flow_table_ipv6.ko ===
name=nf_flow_table_ipv6
depends=nf_flow_table,nf_tables
vermagic=5.15.72-ui-cn9670 SMP mod_unload aarch64
=== nf_tables.ko ===
name=nf_tables
vermagic=5.15.72-ui-cn9670 SMP mod_unload aarch64

12.6 — The BSP Build Crashed at the Same Function Offset

# insmod ./nf_tables.ko
[connection drops, device reboots]

Captured kernel trace before reboot:

[ 3368.013405] Unable to handle kernel NULL pointer dereference at virtual address 0
[ 3368.022216] Mem abort info:
[ 3368.025005]   ESR = 0x96000005
[ 3368.028072]   EC = 0x25: DABT (current EL), IL = 32 bits
[ 3368.033402]   FSC = 0x05: level 1 translation fault
[ 3368.074382] Modules linked in: nf_tables(+) wireguard libchacha20poly1305 ... 
                xt_geoip(O) nf_app(PO) t_miner(PO) tdts(PO) tm_crypto(O) 
                xt_dyn_random ip6table_nat xt_conntrack xt_connmark xt_TCPMSS pppoe 
                pppox bonding xt_dpi(O) ip6table_mangle iptable_mangle ip6table_filter 
                ip6_tables uio_pdrv_genirq ui_lcm(O) ifb ppp_generic slhc 
                ubnthal(PO) ubnt_common(PO) drm drm_panel_orientation_quirks
[ 3368.121977] CPU: 3 PID: 211748 Comm: insmod Tainted: P W O 5.15.72-ui-cn9670 #5.15.72
[ 3368.130936] Hardware name: Marvell OcteonTX CN96XX board (DT)
[ 3368.143638] pc : nf_tables_init_net+0x18/0x94 [nf_tables]
[ 3368.149059] lr : ops_init+0x3c/0x120
[ 3368.227314] x2 : ffff00019027b300 x1 : 0000000000000000 x0 : 0000000000000000
[ 3368.229754] Call trace:
[ 3368.234825]  nf_tables_init_net+0x18/0x94 [nf_tables]
[ 3368.238053]  ops_init+0x3c/0x120
[ 3368.242840]  register_pernet_operations+0xec/0x240
[ 3368.247195]  register_pernet_subsys+0x2c/0x50
[ 3368.252609]  nf_tables_module_init+0x24/0x100 [nf_tables]

Identical crash signature. nf_tables_init_net+0x18, called from the same path.

Two builds:

Source tree	Result
Vanilla Linux 5.15.72 (kernel.org)	Crash at `nf_tables_init_net+0x18`
Marvell BSP `linux-yocto v5.15/standard/cn-sdkv5.15/octeon` HEAD with SUBLEVEL forced to 72	Crash at `nf_tables_init_net+0x18`

If the crash were caused by Marvell BSP patches, the BSP-built module would have crashed somewhere different (or — ideally — not at all). It crashed at the exact same instruction. That tells us:

The crash is NOT primarily caused by Marvell BSP patches; it's caused by something on top of the BSP
Ubiquiti has applied additional, non-public patches to the kernel that affect netfilter per-net data layout
These additional patches are not derivable from any combination of Linux mainline + Marvell's published OCTEON BSP

The Modules linked in line of the panic trace lists the modules already loaded on the EFG when our module tried to initialize:

xt_geoip(O) nf_app(PO) t_miner(PO) tdts(PO) tm_crypto(O) 
xt_dyn_random ip6table_nat xt_conntrack xt_connmark ...
xt_dpi(O) ... ui_lcm(O) ... ubnthal(PO) ubnt_common(PO)

The taint flags (O) and (PO) in Linux's module taint vocabulary mean:

O — out-of-tree module
P — proprietary (non-GPL) module
PO — both proprietary and out-of-tree

The presence of t_miner(PO), tdts(PO), nf_app(PO), xt_geoip(O), xt_dyn_random, tm_crypto(O), xt_dpi(O), ui_lcm(O), ubnthal(PO), and ubnt_common(PO) in the running kernel's module list is documentary evidence of the closed-source kernel modules Ubiquiti is shipping.

Section 14 returns to this point to evaluate the GPL implications.

12.7 — Module Symbol Tables Show Limited Debug Information

Before drawing conclusions, we examined the EFG's existing kernel modules to determine whether Ubiquiti ships debug information that could aid investigation.

$ file /lib/modules/$(uname -r)/kernel/net/netfilter/nf_conntrack_ftp.ko
/lib/modules/.../nf_conntrack_ftp.ko: ELF 64-bit LSB relocatable, ARM aarch64, 
version 1 (SYSV), BuildID[sha1]=5827c50c..., not stripped

$ readelf -S nf_conntrack_ftp.ko | grep -i debug
[30] .gnu_debuglink    PROGBITS         0000000000000000  00001ed0

Modules are not stripped — symbol tables are intact, function and variable names are preserved. However, the only debug section is .gnu_debuglink, which is a 4-byte CRC + filename pointer that says "the actual debug info is in a separate file." That separate file (*.ko.debug) is not shipped on the production firmware.

This is by itself a defensible engineering decision (debug files are large), but combined with MODVERSIONS=N and kptr_restrict=0 (see Section 13 below), it creates a peculiar combination:

A normal user with sufficient privilege can dump the running kernel's complete symbol table at full virtual addresses
But cannot match those symbols to source-level constructs (struct field names, member offsets) without the debug info
And cannot rely on the kernel's own ABI-version tracking to detect mismatched modules

The debug info isn't shipped, so reverse-engineering structure layouts requires examining the binary kernel image directly. Section 13 documents what such an examination reveals.

13. Symbol-Level Forensics on the Running EFG Kernel

The crash at nf_tables_init_net+0x18 told us that the running kernel's internal layout differs from any combination of public sources we could build against. To quantify how far it diverges, we extracted the kernel image from the EFG and compared its symbol table against the symbol tables of vanilla Linux 5.15.72 and our Marvell BSP build.

13.1 — Extracting the Running Kernel

The EFG's kernel image is on disk at /boot/vmlinuz-5.15.72-ui-cn9670:

$ ls -la /boot/vmlinuz-5.15.72-ui-cn9670
-rw-r--r-- 1 root root 12071956 ... /boot/vmlinuz-5.15.72-ui-cn9670

$ file /boot/vmlinuz-5.15.72-ui-cn9670
gzip compressed data, max compression, from Unix, original size 28811776

$ gunzip -c /boot/vmlinuz-5.15.72-ui-cn9670 > efg-vmlinuz
$ binwalk efg-vmlinuz | head -3
DECIMAL    HEXADECIMAL  DESCRIPTION
0          0x0          Linux kernel ARM64 image, load offset: 0x0,
                        image size: 29818880 bytes, little endian, 64k page size

$ strings -a efg-vmlinuz | grep "Linux version"
Linux version 5.15.72-ui-cn9670 (bdd@builder) 
(gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld 2.35.2) 
#5.15.72 SMP Wed Apr 15 23:39:47 CST 2026

The kallsyms symbol table is dumped via /proc/kallsyms:

$ wc -l /proc/kallsyms
130789 /proc/kallsyms

$ head -2 /proc/kallsyms
ffff800008000000 T _text
ffff800008010000 T _stext

We note that kallsyms is unrestricted — full virtual addresses are visible. On most production systems, kernel.kptr_restrict is set to 1 or 2, which causes kallsyms to either redact or zero out the address column. The EFG ships with kptr_restrict=0. This is a security observation in its own right (it makes ROP and KASLR-bypass attacks easier), but for our purposes it provided complete ground-truth symbol data.

13.2 — Three-Way Symbol Comparison

We extracted the symbol tables from each source:

# Symbols in EFG's running kernel
$ awk '{print $3}' /tmp/efg-kallsyms.txt | sort -u > /tmp/efg-syms.txt

# Symbols in our Marvell BSP build
$ nm ~/efg-build/marvell-bsp/linux-yocto-cnxk-5.15/vmlinux \
    | awk '{print $3}' | sort -u > /tmp/bsp-syms.txt

# Symbols in vanilla 5.15.72
$ nm ~/efg-build/vanilla-5.15.72/linux-5.15.72/vmlinux \
    | awk '{print $3}' | sort -u > /tmp/vanilla-syms.txt

$ wc -l /tmp/*-syms.txt
 115998 /tmp/bsp-syms.txt
 120399 /tmp/efg-syms.txt
 112581 /tmp/vanilla-syms.txt

The diff: symbols present in the EFG kernel but absent from BOTH vanilla 5.15.72 AND the Marvell BSP build:

$ comm -23 /tmp/efg-syms.txt \
    <(sort -u /tmp/vanilla-syms.txt /tmp/bsp-syms.txt) \
    | grep -vE "^(\.L[0-9]+|\.LC[0-9]+|\.LBE|\.LFE|\.LFB|\.Letext|\.Ldebug|\.Lframe|__compound_literal\.|__func__\.|__warned\.|CSWTCH\.)" \
    > /tmp/efg-unique-real-syms.txt

$ wc -l /tmp/efg-unique-real-syms.txt
6357 /tmp/efg-unique-real-syms.txt

After filtering out compiler-generated local labels (which vary across every build of every kernel and carry no information), 6,357 unique symbols exist in the EFG's kernel that are present in neither vanilla Linux 5.15.72 nor Marvell's published OCTEON BSP.

13.3 — Categorization of the Unique Symbols

Grouping the unique symbols by name pattern reveals what Ubiquiti added:

Category	Symbol count	Examples
`tdts_*` (Trend Micro Deep-packet Threat Surveillance)	116	`tdts_shell_dpi_l3_skb`, `tdts_shell_dpi_register_mt`
`tm_*` (Trend Micro shared)	33	`tm_crypto_*` family
`ubnthal_*` (Ubiquiti HAL)	45	`ubnthal_get_controller_host`, `ubnthal_get_cputype`
`ubnt_*` (Ubiquiti utilities)	additional	`ubnt_blk_wp_callback`, `ubnt_mtd_partition_read`
HTTP protocol decoder (kernel-space)	dozens	`BuildHTTP_request_KeywordTries`, `Create_HTTP_Protocol_Decoder`
H.323 protocol decoder (kernel-space)	dozens	`DecodeQ931`, `DecodeMultimediaSystemControlMessage`
`nf_dpi` (Deep Packet Inspection conntrack extensions)	several	`nf_conntrack_dpi_init`, `nf_ct_ext_dpi_destroy`, `nf_dpi_proc_dir`
`dpi_*` (Deep Packet Inspection engine)	dozens	`__kstrtab_dpi_main`, related classification entry points
`wg_*` (WireGuard, partly upstream)	113	`wg_*`
Firmware signing key blobs	a few	`UDMENT_CN9670_FW_KEY`, `UXG_AL324_FW_KEY`

A note on terminology: throughout this section, "DPI" refers to Deep Packet Inspection — the application-layer traffic-classification feature that powers the UniFi dashboard's per-application traffic statistics and threat management. This is distinct from Marvell's hardware DPI block (DMA Packet Interface, also abbreviated DPI), which is a PCIe DMA engine on the OCTEON SoC and shows up in the kernel image as register-name strings like DPI_DMA_CONTROL and DPI_REQQ_INT. Those Marvell hardware-driver symbols are present in the public BSP and don't appear in the 6,357-symbol delta. The dpi_*, tdts_*, nf_*dpi*, and xt_dpi symbols below are the inspection-software layer Ubiquiti added on top.

Some of these are unsurprising (ubnthal_* is a clean abstraction layer; WireGuard was upstream by 5.6 but Ubiquiti may have backported aspects). Others are deeply diagnostic.

13.4 — Conntrack Extension for DPI: The Smoking Gun

The most consequential finding is in the nf_* namespace:

nf_conntrack_dpi_fini
nf_conntrack_dpi_init
nf_ct_ext_dpi_destroy
nf_dpi_proc_dir

The Linux conntrack subsystem has an extension framework (include/net/netfilter/nf_conntrack_extend.h) that allows kernel modules to attach per-flow metadata to each struct nf_conn. Adding a new extension type requires changes in both:

enum nf_ct_ext_id in nf_conntrack_extend.h (adding a new value)
The static array nf_ct_ext_types (adding a new entry)
Anywhere code iterates over extension types

The presence of nf_ct_ext_dpi_destroy is direct evidence that Ubiquiti has added a new conntrack extension (NF_CT_EXT_DPI or similar) to track DPI metadata per flow.

This change is precisely the kind that would alter struct nf_conn layout and per-net data structure layout — exactly the kind of change that would explain why nf_tables.ko built against any public source crashes when it tries to register a pernet_operations against the running kernel.

13.5 — `tdts` and `t_miner`: Closed-Source Kernel Modules

Examined more closely, the tdts namespace exposes kernel symbols:

__ksymtab_tdts_shell_dpi_l2_eth
__ksymtab_tdts_shell_dpi_l3_data
__ksymtab_tdts_shell_dpi_l3_skb
__ksymtab_tdts_shell_dpi_register_mt
__ksymtab_tdts_shell_dpi_unregister_mt
__ksymtab_dpi_main

The __ksymtab_* and __kstrtab_* symbols are how the kernel records what symbols a module exports. The names dpi_l2_eth, dpi_l3_data, dpi_l3_skb indicate these are functions for handling Ethernet frames and IPv4/IPv6 packets at layer 2 and layer 3 respectively. The _register_mt and _unregister_mt suffixes are netfilter xt_match (match-target) registration entry points.

The runtime panic dump in Section 12 showed these modules tagged tdts(PO) and t_miner(PO) — proprietary, out-of-tree.

The "tdts" name strongly suggests Trend Micro Smart Protection Network ("TMSPN" — TM Deep-packet Threat Surveillance, abbreviated tdts). Trend Micro licenses their threat-detection engine to network device vendors as a closed-source kernel module. The tm_crypto(O) and t_miner(PO) modules in the same panic trace fit the pattern: t_miner is a content-pattern matcher, tm_crypto is the encrypted-traffic analyzer.

These modules are not Ubiquiti's own code. They are licensed proprietary code from Trend Micro that Ubiquiti has integrated into their firmware. They link directly against kernel symbols (notable per the xt_dpi(O) netfilter match registered in the kernel's tainted-module list).

13.6 — Kernel-Embedded Application-Layer Decoders

The unique symbols also reveal that Ubiquiti has embedded application-layer protocol decoders directly in the kernel:

BuildHTTP_request_KeywordTries
Close_HTTP_Request_Connection
Create_HTTP_Protocol_Decoder
Free_HTTP_Protocol_Decoder
HTTP_Connection_Lost_Count
HTTP_Req_Count
Init_HTTP_Protocol_Decoder
NormalizeURI
Parse_HTTP_Request
ScanHTTPVersion
ScanRequestHeaders
URINormalize

DecodeMultimediaSystemControlMessage
DecodeQ931
DecodeRasMessage
_AdmissionConfirm
_AdmissionRequest
_Alerting_UUIE

The HTTP decoder symbols (camelCase, with _HTTP_ infix) appear to be from a Trend Micro protocol-parsing library running in kernel space. The H.323/Q.931 decoder symbols are similarly out-of-place for a kernel — these would normally live in userspace.

Running parsers for HTTP, H.323, and similar attacker-controllable formats inside the kernel is a substantial security risk. A bug in any of these decoders becomes a kernel vulnerability. Mainstream Linux distributions and other vendors deliberately keep this kind of code in userspace (Suricata, Snort, etc.) for exactly this reason.

13.7 — What the 6,357 Symbol Delta Means

To put 6,357 symbols in perspective:

Vanilla 5.15.72 has 112,581 unique symbols
Marvell's published BSP adds 3,417 net new symbols on top (a 3% increase)
Ubiquiti's running kernel has 6,357 symbols beyond Marvell's BSP — a further 5.5% increase

Phrased differently: roughly 1 in 19 symbols in the EFG's running kernel did not come from any source publicly available to a security researcher, GPL-rights-exercising customer, or independent third party.

This is the kernel that handles your VLAN traffic, your firewall rules, your VPN keys, and your DPI inspection. The behavior of this kernel cannot be audited from outside because the source for 5% of it is not published. The technical analysis in Section 12 demonstrates that this 5% includes substantial netfilter modifications.

14. The GPL Compliance Question

14.1 — What the GPL Requires

The Linux kernel is licensed under GPL-2.0. That license imposes specific obligations on anyone who distributes a binary derived from GPL-licensed source. The relevant provisions, summarized:

The complete corresponding source code must be made available to recipients of the binary, under the same license, for at least three years (GPL-2.0 §3).
Changes to GPL'd source files must themselves be GPL-licensed (GPL-2.0 §2, the "viral" clause).
Linking proprietary modules against GPL kernel symbols is a contested legal area. Linus Torvalds and the Linux Foundation's longstanding position is that modules that use only EXPORT_SYMBOL (not EXPORT_SYMBOL_GPL) interfaces and "can plausibly be shown to be independent" may be distributed under non-GPL licenses, but there is no clean legal answer here. The Free Software Foundation's position is stricter: any kernel module is a derived work.
A written offer to provide source must accompany the binary distribution, valid for at least three years.
Derived works that combine GPL and proprietary code in linked form typically must be GPL-licensed in their entirety.

14.2 — Where Ubiquiti Stands on These Obligations

14.2.1 — Has Ubiquiti released the kernel source?

Ubiquiti previously maintained an open-source download page at ui.com/download/open-source, but that page no longer exists. As of this writing (May 2026), Ubiquiti's main website does not host any GPL source code archives that we could locate. The Ubiquiti GitHub organization (https://github.com/ubiquiti) contains only two repositories: support-tools and freeswitch. Neither contains kernel sources or firmware sources for any current product.

This is not the first time Ubiquiti's GPL compliance has been questioned. The Wikipedia article on Ubiquiti documents a recurring pattern:

2015: Ubiquiti was accused of violating GPL terms for code in their products. Specifically, customers requested the source for the GPL-licensed U-Boot bootloader and Ubiquiti refused, making it impractical for customers to fix a security issue. The source was eventually released after sustained public pressure.
2019: Ubiquiti was again reported to be in violation of GPL.
2026 (current): The open-source download page that previously hosted source archives has been removed entirely.

For an EFG owner attempting to exercise their GPL rights today, the channels are:

The Ubiquiti support email (support@ui.com), which redirects GPL requests to a separate address
A specific email for source requests: opensource-requests@ui.com
Community forum posts (which historically receive no substantive Ubiquiti response on GPL questions)
Third-party archives like github.com/unifi-hackers/unifi-gpl and github.com/CodeFetch/Ubiquiti-UBNT-airOS, which contain partial GPL sources that researchers have extracted from firmware images or obtained through pressure

A formal request for the complete kernel source has been filed via opensource-requests@ui.com, the email address Ubiquiti's support team directed users to. The request specifies:

The full kernel source tree corresponding to the running kernel version
The build configuration (/proc/config.gz)
The complete set of patches applied on top of the base kernel
The Marvell-specific drivers (octeontx2_pf, octeontx2_vf, octeontx2_af, rvu_*, NIX, CPT, SSO, NPA)
Any other GPL components

The request is pending. Ubiquiti's response (or non-response) to this request is itself a data point.

14.2.2 — What we now know is missing

Section 13 documents 6,357 unique kernel symbols in the running EFG kernel that are not present in either vanilla Linux 5.15.72 or the complete published Marvell OCTEON CN9K BSP. These include:

Symbols indicating modifications to core netfilter conntrack data structures (nf_ct_ext_dpi_destroy, nf_conntrack_dpi_init)
A 116-symbol tdts namespace exposing kernel functions to a closed-source DPI engine
HTTP and H.323 application-layer protocol decoders embedded in the kernel
A 45-symbol Ubiquiti hardware abstraction layer

For Ubiquiti to be in compliance with GPL-2.0, the source of the changes producing these symbols must be available — at minimum to anyone who has purchased an EFG and exercises their GPL rights to request it.

14.2.3 — The proprietary kernel modules

Section 12 documented the panic trace's Modules linked in list, which included:

xt_geoip(O) nf_app(PO) t_miner(PO) tdts(PO) tm_crypto(O) 
xt_dyn_random xt_dpi(O) ui_lcm(O) ubnthal(PO) ubnt_common(PO)

The (PO) taint flag is the kernel's own classification. It means the module is loaded with a MODULE_LICENSE() declaration that is not one of the GPL-compatible strings. The kernel taints itself when such modules are loaded specifically because their continued operation calls into question the kernel's GPL status.

Among these:

tdts and t_miner are almost certainly licensed proprietary code from Trend Micro. They register xt_match netfilter hooks and export functions like tdts_shell_dpi_l3_skb. They link directly against GPL kernel symbols (the __kstrtab_* and __ksymtab_* infrastructure exists for this purpose).
nf_app, xt_dpi, xt_geoip are likely Ubiquiti's own proprietary netfilter extensions that integrate with the DPI engine.
ubnthal, ubnt_common, ui_lcm are Ubiquiti's hardware abstraction layer.

The legal status of these modules is contested in general terms. The specific question for Ubiquiti is: are these modules "derived works" of the kernel? The Free Software Foundation says any kernel module is. Linus Torvalds has historically said it depends on whether the module uses EXPORT_SYMBOL_GPL interfaces and on whether the module has independent existence outside of Linux.

For tdts specifically: Trend Micro markets the underlying technology as portable across operating systems (it runs on Windows, FreeBSD, etc.), which would weigh in favor of "independent existence" under Torvalds's standard. For nf_app, xt_dpi, and ubnthal: these are by name and design Ubiquiti-specific kernel-only modules; they have no plausible existence independent of Ubiquiti's Linux distribution. Under either FSF's or Torvalds's standard, nf_app, xt_dpi, and ubnthal would appear to be derived works of the kernel and therefore subject to GPL.

14.2.4 — The most concerning finding

The closed-source modules link against GPL kernel symbols using EXPORT_SYMBOL and EXPORT_SYMBOL_GPL exports. Some of those exports — particularly conntrack extension registration — were added by Ubiquiti's own kernel patches (per Section 13).

In other words: Ubiquiti modified the kernel (a GPL'd derived work, requiring source release) specifically to add GPL'd interfaces that proprietary modules would link against. Whether this is a GPL violation depends on the resolution of the GPL-vs-proprietary-module question, but it is a structurally significant observation: the proprietary modules and the kernel patches are designed to work together as a single integrated system. The kernel cannot be replaced without breaking the proprietary modules; the proprietary modules cannot run on any other kernel.

That tight integration is what FSF would call "a single program in two pieces" — a derived work. Under that interpretation, the entire firmware would need to be GPL-licensed, and the proprietary modules would be in violation.

14.3 — What This Means for EFG Owners

If you own an EFG, you have a legal right under GPL-2.0 to request the complete source code of the kernel running on your device. That includes:

The base kernel source, with full version history
All patches applied by Ubiquiti and any third parties
The build configuration (.config)
Any installation/build scripts necessary to reconstruct the binary
The kernel modules whose source is GPL

This right cannot be waived by EULA. If Ubiquiti refuses to provide this source, that refusal is a violation of GPL-2.0 §3, and the appropriate path forward is:

Make a written request to opensource-requests@ui.com specifying the firmware version
If no response within 30 days, escalate to Ubiquiti's legal department
If still no response, contact the Software Freedom Conservancy at compliance@sfconservancy.org — they handle GPL enforcement on behalf of multiple Linux kernel copyright holders
The Conservancy can pursue compliance via the kernel-enforcement program

14.4 — Why This Matters Beyond One Vendor

The EFG is a flagship enterprise router from a publicly-traded networking vendor (Ubiquiti, NYSE: UI). It is sold to enterprises, cloud providers, government agencies, and home users. The firmware running on it includes 6,357 kernel symbols that no customer can audit because the source is not published.

Network device firmware is some of the most security-sensitive software in any infrastructure. The kernel running on a firewall or router decides what packets enter and leave the network. Bugs and backdoors in that kernel directly affect every device behind it.

GPL-2.0 was specifically designed to ensure that customers and security researchers can audit the software running on the devices they own. Vendor compliance with the license is not a courtesy — it is a precondition for the trust the GPL ecosystem makes possible.

The findings in this document — that even Marvell's complete public BSP source is insufficient to build modules that work on the EFG, that 6,357 symbols are unique to Ubiquiti's kernel, and that closed-source modules with (PO) taint flags are integrated with the netfilter subsystem — are exactly the kind of findings that demonstrate why GPL compliance is important. The license requires that this kind of analysis be unnecessary, because the source should be available.

15. Direct Vendor Engagement: What Ubiquiti Has Already Been Told

Many of the findings in this document have already been raised with Ubiquiti through their official channels. The vendor's responses are themselves part of the record.

15.1 — The performance issue, raised approximately one year ago

The author of this document opened a support ticket with Ubiquiti approximately one year prior to publication, describing the inter-VLAN performance bottleneck on the EFG and proposing the architectural fix in detail — specifically, recommending that Ubiquiti adopt the DPDK + VPP + Suricata-on-DPDK reference architecture that Marvell themselves publish for the OCTEON CN9K silicon family.

The ticket has not received a substantive engineering response. It remains effectively open without resolution.

This means the central technical recommendation of this document — that the EFG can deliver substantially higher throughput by adopting the dataplane architecture its silicon vendor publishes — was already in Ubiquiti's hands a year ago, with implementation guidance, and was not acted upon.

15.2 — The security architecture, raised through the bounty program

Section 12.1 of this document catalogues the security configuration choices in the EFG's running kernel:

module.sig_enforce=0 — modules can be loaded without signature verification
CONFIG_MODULE_SIG not set — the kernel was not even built with signing infrastructure
No lockdown= argument on the kernel command line — Secure Boot LSM is not engaged
CONFIG_SECURITY_LOCKDOWN_LSM not set in the kernel build
Overlayfs root filesystem with a writable upper layer — kernel-loadable code can be persisted
kernel.kptr_restrict=0 — the full kallsyms table with virtual addresses is exposed

Combined with the kernel's CONFIG_MODVERSIONS=N setting (Section 12.4), this means: any process with CAP_SYS_MODULE (root, including any context that escalates to root) can load arbitrary kernel code, and there is no in-kernel mechanism to detect or prevent that loading. The watchdog will reboot the device on a kernel panic, but a successfully-loaded malicious module that doesn't crash the kernel would persist indefinitely.

Separately, the author identified additional security findings on the EFG — notably the presence of private cryptographic key material accessible via the firmware image (per the *_FW_KEY strings observed in Section 13.6's symbol analysis, alongside other findings not detailed here for responsible disclosure reasons).

These findings were submitted through Ubiquiti's HackerOne bug bounty program — the formal, documented channel for security disclosure to the vendor.

Ubiquiti rejected the submission. The stated reason: the attacker would require network access to exploit the issue.

This rationale does not survive scrutiny when applied to a network gateway:

A network gateway is, by definition, on the network. Network access to the device is the universal precondition for any attack against it.
The threat model that a security-conscious gateway is designed to defend against is precisely "an attacker who has gained network access" — whether that's a compromised endpoint behind the gateway, a hostile guest device on the same VLAN, or an internal lateral-movement scenario in an enterprise breach.
Gateway vendors with mature security postures (Cisco, Juniper, Palo Alto, Fortinet, Arista, etc.) routinely accept and remediate vulnerabilities under this threat model. CVEs against these products list "network adjacent" or "network reachable" as the qualifying attack vector, not a disqualifying one.
The official CVSS v3.1 scoring system explicitly defines "Adjacent Network" (AV:A) and "Network" (AV:N) as valid attack vectors. A vendor declining to engage with vulnerabilities in those classes is declining to engage with most of the vulnerability landscape for their product category.

The rejection is therefore not just a technical disagreement — it is a stated position on what kinds of attacks Ubiquiti considers in scope for their bounty program. By that stated standard, an attacker who has already established a foothold on the network behind the EFG is not a threat the EFG considers itself responsible for defending against. That is an unusual posture for a $2,000 device sold and marketed as an enterprise security gateway.

15.3 — The pattern this establishes

Putting these data points together with the GPL findings in Section 14:

Issue raised	Channel	Year	Vendor response
Inter-VLAN performance, with DPDK fix recommendation	Standard support	~1 year ago	No substantive engineering response
Security configuration / private key exposure	HackerOne bug bounty	Recent	Rejected: "requires network access"
GPL kernel source release	Email to opensource-requests@ui.com	Pending	Pending
GPL kernel source release	Public web page	Historical	Page removed

The historical context is also relevant: Ubiquiti was publicly accused of GPL violations in 2015 and again in 2019, and the pattern has continued.

The findings in this document are not surprising vendor disclosures. They are issues that engineering, security, and licensing teams within the vendor have either been told about or are demonstrably aware of and have chosen not to act on. The reason this document exists in public form is that the channels designed for these conversations — support tickets, bug bounty programs, GPL compliance contacts — have not produced action.

16. Conclusion

This investigation began as a performance analysis: why does a $2,000 enterprise router with two 25 GbE SFP28 ports deliver only ~1 Gbps of single-stream inter-VLAN throughput, and ~3 Gbps of single-stream PPPoE WAN throughput? The lab data is unambiguous. The bottlenecks are software-architectural choices, not hardware limitations:

The kernel network stack on a single core has a ~5 Gbps single-stream ceiling when offloads are off, regardless of CPU vendor.
Hardware offloads are disabled by default on the EFG. Enabling them is a 4-7× improvement on otherwise-identical configurations.
The 5-deep iptables FORWARD chain pattern the EFG ships with costs roughly half of single-stream throughput when offloads are also off.
nftables flowtable — a kernel feature available since Linux 4.16, shipped enabled by every major distribution, is not even compiled into the EFG's kernel. Adding it gives 3-7× single-stream improvement.
DPDK + VPP on the same silicon — using software stacks that Marvell themselves publish — would deliver 15-25× the throughput. The Cortex-A72-class cores in the Octeon CN9670 can sustain 6-12 Gbps per core in a userspace dataplane. The chip has 18 of those cores.
PPPoE forwarding is single-cored in stock Linux because of how ppp_generic is structured. The fix exists in DPDK and was being upstreamed at time of writing.

These are not exotic or research-grade fixes. Three of them are configuration changes. One requires loading a kernel module that's already in mainline. The most architecturally significant — DPDK + VPP — uses Marvell's own published reference architecture. The hardware was designed for this; the firmware just doesn't use it.

The conntrack helper toggle Ubiquiti recently shipped in the UniFi controller (Section 10 Finding 5, Section 11 Fix 7) is informative beyond its narrow effect. It exposes the FTP/H.323/SIP/PPTP/TFTP helpers as administrator-controllable. The toggle's existence proves Ubiquiti's engineering team is actively reasoning about per-flow netfilter overhead — they identified that helpers cost something, and shipped a workaround to let users disable them. They did not ship the proper fix, which is the kernel's flowtable infrastructure, even though the proper fix would address every architectural finding in this document and the partial fix addresses only one. That is a choice, not an oversight.

Section 9 extended the analysis from the EFG to the UDM Beast — Ubiquiti's next-generation gateway with newer Marvell Octeon CN10K silicon, ARM Neoverse N2 cores, an 18-month-newer kernel, and a dedicated Marvell switching ASIC. Direct diagnostics show the same architectural pattern: switchdev offload hard-disabled across every interface, tc filter rules explicitly tagged not_in_hw, 67 GB of WAN traffic processed through CPU-only software paths. The dedicated ASIC handles 1.27 billion packets of intra-VLAN switching but is bypassed for inter-VLAN routing. A faster CPU and a switching ASIC do not fix the architecture; they just raise the floor.

Section 12 documented our attempt to apply the most surgical of these fixes — adding the missing nftables flowtable kernel modules — to a real production EFG. Two builds were attempted:

Vanilla Linux 5.15.72 from kernel.org → byte-perfect vermagic match → kernel panic at nf_tables_init_net+0x18
Marvell's complete published OCTEON BSP source (linux-yocto branch v5.15/standard/cn-sdkv5.15/octeon) → byte-perfect vermagic match → kernel panic at the identical instruction

The fact that both crashes occurred at the same function offset proves that the ABI mismatch is not introduced by Marvell's BSP patches. It is introduced by something Ubiquiti has applied on top of Marvell's BSP — patches Ubiquiti has not published.

Section 14 quantified that delta: 6,357 kernel symbols exist in the running EFG kernel that are present in neither vanilla Linux 5.15.72 nor Marvell's complete public BSP. Approximately 1 in 19 symbols in the EFG's kernel is unique to Ubiquiti's build and not derivable from any public source. These include:

Conntrack extension types for proprietary DPI integration (nf_ct_ext_dpi_destroy, nf_conntrack_dpi_init)
A 116-symbol tdts namespace exposing kernel internals to a closed-source Trend Micro DPI engine
HTTP and H.323 application-layer protocol decoders running in kernel space
A 45-symbol Ubiquiti hardware abstraction layer

Section 14 addressed what these findings mean for GPL-2.0 compliance:

Ubiquiti has shipped a substantially modified Linux kernel without publishing the corresponding source
The proprietary kernel modules tdts, t_miner, nf_app, xt_dpi, ubnthal, and ubnt_common link against GPL kernel symbols and operate as integrated components of the running kernel
Specifically, nf_app, xt_dpi, and ubnthal have no existence independent of Ubiquiti's Linux integration and would be derived works under either FSF's or Linus Torvalds's interpretation of the GPL
Ubiquiti's open-source download page has been removed; their GitHub presence does not contain firmware sources
This continues a documented pattern — Ubiquiti was publicly accused of GPL violations in 2015 (resolved only after sustained pressure) and again in 2019
A formal request has been filed via the channel Ubiquiti's support team specified

The GPL exists specifically so that customers can audit and modify the software running on devices they own. The fact that this analysis required reverse-engineering kernel symbol tables from a binary firmware image — when the GPL requires the source be available on request — is itself the finding.

Section 15 documented direct vendor engagement: a performance ticket open with Ubiquiti for approximately one year recommending the DPDK fix (no substantive engineering response), a security disclosure submitted through Ubiquiti's HackerOne bug bounty program (rejected on the grounds that exploitation requires network access — a position that does not survive scrutiny when applied to a network gateway), and the GPL request now pending. The findings in this document are not novel disclosures to the vendor; they are issues the vendor has been told about, through the channels designed for these conversations, and has chosen not to act on.

What enterprise customers should ask Ubiquiti

If you are evaluating or already operating EFG/UDM/UXG hardware, the questions to put to your Ubiquiti account team are:

Performance: When will inter-VLAN single-stream throughput on the EFG match the marketed 25 GbE port speeds for normal enterprise workloads (TCP, MTU 1500, with stateful firewall rules)?
Roadmap: Does Ubiquiti's roadmap include adopting DPDK-based dataplanes (which Marvell's reference architecture for this silicon recommends and supports)?
Configuration: Will Ubiquiti expose nftables flowtable, hardware offload, and conntrack helper toggles as administrator-controllable settings before any DPDK migration?
GPL compliance: Will Ubiquiti publish the complete kernel source corresponding to current EFG firmware versions, including all patches, build configuration, and the source of nf_app, xt_dpi, ubnthal, and ubnt_common?

The first three are about getting the performance you paid for. The fourth is about knowing what's running on your network.

What home and prosumer users should know

The EFG, UDM Beast, UXG-Lite, UXG-Pro, and other Ubiquiti gateways share substantial portions of this kernel and firmware design. The Section 9 cross-generation analysis on the UDM Beast establishes that the architectural pattern is not specific to one product or one silicon generation — it persists across newer SoCs, newer kernels, and even with dedicated switching ASIC hardware available. The performance characteristics documented here for the EFG are likely to apply, with proportional differences in absolute numbers, across the product line.

If your home or small-office workload is dominated by single-stream throughput (a single VPN tunnel, a single large file transfer, a single backup job), you are likely bottlenecked by the issues described above, regardless of how fast your internet connection or LAN switch is.

The most impactful workaround available without firmware changes is to enable hardware offloads where Ubiquiti's UI exposes the toggle. Beyond that, the architectural fix is in Ubiquiti's hands.

17. Appendix: Full Data Sets

A.1 — Complete Test Matrix

#	NIC	Forwarder	MTU	Offloads	Rules	Single-stream	Notes
1	virtio	kernel	9000	on	none	16.9 Gbps	naïve baseline
2	virtio	kernel	9000	off	none	17.2 Gbps	jumbo hides per-packet cost
3	virtio	kernel	1500	off	none	4.95 Gbps	EFG-realistic baseline; 1 core 100% soft
4	virtio	kernel	1500	off	+ ct module	4.84 Gbps	trivial overhead
5	virtio	kernel	1500	off	+ simple ct rule	4.64 Gbps	4% drop
6	virtio	kernel	1500	off	EFG 5-chain replica	2.36 Gbps	smoking gun
7	virtio	kernel	1500	off	EFG (8 streams)	11.4 Gbps agg	scales with cores
A	virtio	kernel	1500	off	flowtable	7.05 Gbps	flowtable alone, 3× over EFG
B	virtio	kernel	1500	on	flowtable	17.4 Gbps	one-line config improvement
K1	ConnectX VF	kernel	1500	on	none	25.3 Gbps	real silicon baseline
K2	ConnectX VF	kernel	1500	on	EFG 5-chain	21.1 Gbps	GRO hides per-packet cost
K3	ConnectX VF	kernel	1500	off	none	4.74 Gbps	matches virtio with offloads off
K4	ConnectX VF	kernel	1500	off	EFG 5-chain	4.70 Gbps	I/O is the bottleneck here
V0	virtio	VPP/DPDK	1500	off	n/a	6.78 Gbps	DPDK with virtio-pmd; bottlenecked by vhost-net
V1	ConnectX VF	VPP/DPDK	1500	client off	n/a	15.7 Gbps	wire-packet processing
V2	ConnectX VF	VPP/DPDK	1500	client on	n/a	35.6 Gbps	headline number

A.2 — EFG-Replica nftables Ruleset

#!/usr/sbin/nft -f
flush ruleset

table inet filter {
    chain alien_chain {
        counter
        ip protocol tcp counter
        ip saddr 10.0.0.0/8 counter
    }
    chain tor_chain {
        counter
        ip protocol tcp counter
        tcp flags & (syn|ack) == ack counter
    }
    chain ips_chain {
        counter
        ip protocol tcp counter
        meta l4proto tcp counter
        tcp dport { 1-65535 } counter
    }
    chain ubios_chain {
        counter
        ip protocol tcp counter
        ct state established counter
    }
    chain user_chain {
        counter
        ct state established,related counter
        ip saddr 10.10.10.0/24 ip daddr 10.10.20.0/24 counter
    }

    chain forward {
        type filter hook forward priority 0; policy accept;
        jump alien_chain
        jump tor_chain
        jump ips_chain
        jump ubios_chain
        jump user_chain
    }
}

table ip nat {
    chain postrouting {
        type nat hook postrouting priority 100;
        oifname "enp6s18" masquerade
    }
}

A.3 — flowtable Configuration

#!/usr/sbin/nft -f
flush ruleset

table inet filter {
    flowtable f {
        hook ingress priority 0
        devices = { enp6s19, enp6s20 }
    }

    chain forward {
        type filter hook forward priority 0; policy accept;
        ip protocol { tcp, udp } flow add @f
        ct state established,related accept
    }
}

table ip nat {
    chain postrouting {
        type nat hook postrouting priority 100;
        oifname "enp6s18" masquerade
    }
}

A.4 — VPP startup.conf (ConnectX-6 Dx)

unix {
    nodaemon
    log /var/log/vpp/vpp.log
    full-coredump
    cli-listen /run/vpp/cli.sock
    gid vpp
}

api-trace { on }
api-segment { gid vpp }
socksvr { default }

cpu {
    main-core 0
    corelist-workers 1
}

buffers {
    buffers-per-numa 32768
    default data-size 2048
}

dpdk {
    dev 0000:01:00.0 {
        name lab-vlan10
        num-rx-queues 1
        num-tx-queues 1
    }
    dev 0000:02:00.0 {
        name lab-vlan20
        num-rx-queues 1
        num-tx-queues 1
    }
}

plugins {
    plugin default { enable }
    plugin dpdk_plugin.so { enable }
}

A.5 — VPP show runtime (during V2 test, 35.6 Gbps)

Thread 1 vpp_wk_0 (lcore 1)
Time 257.0, vector rate 3.5586e5 in/out, packets/sec
              Name           Calls       Vectors    Packet-Clocks  Vectors/Call
dpdk-input    polling   2683609353    91446442         4.25e3         .03
ethernet-input  active   12518445    91446442         9.41e1        7.30
ip4-input-no-checksum    12136093    91446437         3.98e1        7.54
ip4-lookup     active   12136093    91446437         5.23e1        7.54
ip4-rewrite    active   12136093    91446437         3.86e1        7.54
lab-vlan20-output active 10310280   89229310         1.21e1        8.65
lab-vlan20-tx  active   10310280    89229310         3.79e1        8.65

VPP per-packet end-to-end cost on Zen 4: ~80 cycles (ethernet-input + ip4-input + ip4-lookup + ip4-rewrite + interface-output + tx) ≈ 16 nanoseconds per packet at 5 GHz. Theoretical ceiling on this pipeline: ~700+ Gbps single-core.

A.6 — EFG Live Diagnostics (representative excerpts)

$ uname -a
Linux EFG-Home-SP 5.15.72-ui-cn9670 #5.15.72 SMP Wed Apr 15 23:39:47 CST 2026 aarch64

$ iptables -L FORWARD -n -v --line-numbers
Chain FORWARD (policy ACCEPT)
1     555K  775M   ALIEN
2    2764K 4489M   TOR
3     238M  354G   IPS
4     874M 1342G   UBIOS_FORWARD_JUMP

$ nft list flowtables
[empty]

$ lsmod | grep nf_flow_table
[empty]

$ ps -eo pid,pcpu,comm --sort=-pcpu | head -8
4098469 39.6 dpi-flow-stats
   3139 12.5 ubios-udapi-ser
  66687  7.8 java
   4891  7.0 conntrackd
2491041  6.9 Suricata-Main
   5505  6.2 mcad
   8596  3.9 unifi-core

$ sysctl net.netfilter.nf_conntrack_max
net.netfilter.nf_conntrack_max = 10485760

$ lsmod | grep nf_conntrack | grep -v '^nf_conntrack '
nf_conntrack_tftp     262144  1 nf_nat_tftp
nf_conntrack_pptp     327680  1 nf_nat_pptp
nf_conntrack_h323     327680  1 nf_nat_h323
nf_conntrack_ftp      327680  1 nf_nat_ftp

A.7 — Module Build Artifacts (Experiment in Section 12)

Cross-compilation environment:

Host: Threadripper Pro 7995WX, Ubuntu 24.04 LTS VM, 16 vCPU, 32 GB RAM
Toolchain: gcc-10-aarch64-linux-gnu 10.5.0 from Ubuntu universe repo
Kernel source: linux-5.15.72.tar.xz from kernel.org (verified SHA256)
Build configuration: EFG's exposed /proc/config.gz plus three module enables for NF_TABLES, NF_FLOW_TABLE, NF_FLOW_TABLE_INET
LOCALVERSION: -ui-cn9670 (matching the EFG's published version string)
Build time: 1 minute 52 seconds (16-thread parallel build)

Modules produced:

net/netfilter/nf_tables.ko          (10.3 MB)
net/netfilter/nf_flow_table.ko       (1.8 MB)
net/netfilter/nf_flow_table_inet.ko  (495 KB)

Vermagic verification (build host):

$ for ko in nf_tables.ko nf_flow_table.ko nf_flow_table_inet.ko; do
    strings $ko | grep ^vermagic
done
vermagic=5.15.72-ui-cn9670 SMP mod_unload aarch64
vermagic=5.15.72-ui-cn9670 SMP mod_unload aarch64
vermagic=5.15.72-ui-cn9670 SMP mod_unload aarch64

Vermagic verification (EFG, in-tree module):

$ modinfo nf_conntrack_ftp | grep vermagic
vermagic: 5.15.72-ui-cn9670 SMP mod_unload aarch64

Match: exact, character-for-character.

Kernel panic on load attempt (insmod ./nf_tables.ko):

Unable to handle kernel NULL pointer dereference at virtual address 0x0000000000000120
ESR = 0x96000005, EC = 0x25: DABT (current EL), IL = 32 bits
FSC = 0x05: level 1 translation fault
[0000000000000120] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000
Internal error: Oops: 96000005 [#1] SMP
Code: 910003fd b9432021 f9000bf3 f9455400 (f8615813)
Kernel panic — not syncing: Oops: Fatal exception

Recovery: watchdog hard-reboot, ~2 minute downtime, no permanent damage. Failover to secondary gateway functioned correctly throughout.

Root cause: CONFIG_MODVERSIONS is disabled in the EFG's kernel config, so symbol-CRC verification did not catch the binary ABI mismatch between vanilla 5.15.72 and Ubiquiti's patched 5.15.72-ui-cn9670 build at module load time. The module linked successfully against the running kernel but encountered mismatched struct layouts during init, dereferencing a NULL pointer in the netfilter subsystem.

GPL source request status: filed with opensource-requests@ui.com requesting the complete corresponding source code for kernel 5.15.72-ui-cn9670, including all Ubiquiti and Marvell patches, build configuration, toolchain version, and packaging scripts. Outcome will determine whether the experiment can be re-attempted with a kernel tree that produces ABI-compatible modules.

All measurements were taken on a single physical machine over a continuous test session. Configuration files, scripts, and raw iperf3 outputs are available on request.

A.8 — BSP Build Artifacts (Section 12.5–12.6)

Build environment (same VM as A.7):

Ubuntu 24.04 LTS, 16 vCPU, 32 GB RAM
gcc-10-aarch64-linux-gnu 10.5.0
linux-yocto repository, branch v5.15/standard/cn-sdkv5.15/octeon
Repository URL: https://git.yoctoproject.org/linux-yocto.git

Tree state:

$ git branch --show-current
v5.15/standard/cn-sdkv5.15/octeon

$ git log --oneline -3
7f33f19a49e6 (HEAD) Merge branch 'v5.15/standard/base' into v5.15/standard/cn-sdkv5.15/octeon
65333c3a0bcd Merge tag 'v5.15.203' into v5.15/standard/base
b9d57c40a767 Linux 5.15.203

Modifications to make HEAD identify as 5.15.72:

$ sed -i 's/^SUBLEVEL = .*/SUBLEVEL = 72/' Makefile
$ touch .scmversion   # suppress dirty marker
$ make kernelrelease
5.15.72-ui-cn9670

Configuration (using EFG's /proc/config.gz as base):

CONFIG_LOCALVERSION="-ui-cn9670"
CONFIG_NF_TABLES=m
CONFIG_NF_TABLES_INET=y
CONFIG_NF_TABLES_IPV4=y
CONFIG_NF_TABLES_IPV6=y
CONFIG_NF_FLOW_TABLE=m
CONFIG_NF_FLOW_TABLE_INET=m
CONFIG_NF_FLOW_TABLE_IPV4=m
CONFIG_NF_FLOW_TABLE_IPV6=m
CONFIG_NF_FLOW_TABLE_PROCFS=y
# CONFIG_DEBUG_INFO_BTF is not set
# CONFIG_MODULE_SIG is not set

Build output:

$ time make -j16
real    1m59s
user    23m50s
sys     4m33s

$ for ko in $(find . -name 'nf_tables.ko' -o -name 'nf_flow_table*.ko' | sort); do
    echo "=== $(basename $ko) ==="
    strings $ko | grep -E '^(vermagic|name|depends|description)='
  done

=== nf_flow_table_ipv4.ko ===
description=Netfilter flow table support
depends=nf_flow_table,nf_tables
name=nf_flow_table_ipv4
vermagic=5.15.72-ui-cn9670 SMP mod_unload aarch64

=== nf_flow_table_ipv6.ko ===
description=Netfilter flow table IPv6 module
depends=nf_flow_table,nf_tables
name=nf_flow_table_ipv6
vermagic=5.15.72-ui-cn9670 SMP mod_unload aarch64

=== nf_flow_table.ko ===
description=Netfilter flow table module
depends=
name=nf_flow_table
vermagic=5.15.72-ui-cn9670 SMP mod_unload aarch64

=== nf_flow_table_inet.ko ===
description=Netfilter flow table mixed IPv4/IPv6 module
depends=nf_flow_table,nf_tables
name=nf_flow_table_inet
vermagic=5.15.72-ui-cn9670 SMP mod_unload aarch64

=== nf_tables.ko ===
depends=
name=nf_tables
vermagic=5.15.72-ui-cn9670 SMP mod_unload aarch64

Crash trace from EFG load attempt:

[ 3368.013405] Unable to handle kernel NULL pointer dereference at virtual address 0
[ 3368.022216] Mem abort info:
[ 3368.025005]   ESR = 0x96000005
[ 3368.028072]   EC = 0x25: DABT (current EL), IL = 32 bits
[ 3368.033402]   FSC = 0x05: level 1 translation fault
[ 3368.074382] Modules linked in: nf_tables(+) wireguard libchacha20poly1305 ...
                xt_geoip(O) nf_app(PO) t_miner(PO) tdts(PO) tm_crypto(O)
                xt_dyn_random ip6table_nat xt_conntrack xt_connmark xt_TCPMSS pppoe
                pppox bonding xt_dpi(O) ip6table_mangle iptable_mangle ip6table_filter
                ip6_tables uio_pdrv_genirq ui_lcm(O) ifb ppp_generic slhc
                ubnthal(PO) ubnt_common(PO) drm drm_panel_orientation_quirks
[ 3368.121977] CPU: 3 PID: 211748 Comm: insmod Tainted: P W O 5.15.72-ui-cn9670 #5.15.72
[ 3368.130936] Hardware name: Marvell OcteonTX CN96XX board (DT)
[ 3368.143638] pc : nf_tables_init_net+0x18/0x94 [nf_tables]
[ 3368.149059] lr : ops_init+0x3c/0x120
[ 3368.227314] x2 : ffff00019027b300 x1 : 0000000000000000 x0 : 0000000000000000
[ 3368.234825]  nf_tables_init_net+0x18/0x94 [nf_tables]
[ 3368.238053]  ops_init+0x3c/0x120
[ 3368.242840]  register_pernet_operations+0xec/0x240
[ 3368.247195]  register_pernet_subsys+0x2c/0x50
[ 3368.252609]  nf_tables_module_init+0x24/0x100 [nf_tables]
[ 3368.297899] ---[ end trace d3e1e407900e8e95 ]---
[ 3368.316500] Kernel panic - not syncing: Oops: Fatal exception

The HA failover handled the brief outage; service downtime was approximately 8 seconds.

A.9 — Symbol Comparison Methodology (Section 13)

# Step 1: Extract EFG kernel image (already gzip-compressed PE/COFF aarch64 image)
# from EFG: /boot/vmlinuz-5.15.72-ui-cn9670 (12 MB)
$ gunzip -c /boot/vmlinuz-5.15.72-ui-cn9670 > efg-vmlinuz
$ binwalk efg-vmlinuz | head -3
0    0x0    Linux kernel ARM64 image, image size: 29818880 bytes

# Step 2: Capture running symbol table (kallsyms is unrestricted on EFG)
# from EFG:
$ cat /proc/kallsyms > /tmp/efg-kallsyms.txt
$ wc -l /tmp/efg-kallsyms.txt
130789

# Step 3: Build vanilla 5.15.72 vmlinux (full build, not just modules)
$ cd ~/efg-build/vanilla-5.15.72/linux-5.15.72
$ make -j16 vmlinux

# Step 4: BSP vmlinux (already built for module experiment in 11.5)

# Step 5: Three-way symbol comparison
$ awk '{print $3}' /tmp/efg-kallsyms.txt | sort -u > /tmp/efg-syms.txt
$ nm ~/efg-build/marvell-bsp/linux-yocto-cnxk-5.15/vmlinux 2>/dev/null \
    | awk '{print $3}' | sort -u > /tmp/bsp-syms.txt
$ nm ~/efg-build/vanilla-5.15.72/linux-5.15.72/vmlinux 2>/dev/null \
    | awk '{print $3}' | sort -u > /tmp/vanilla-syms.txt

$ wc -l /tmp/*-syms.txt
 115998 /tmp/bsp-syms.txt
 120399 /tmp/efg-syms.txt
 112581 /tmp/vanilla-syms.txt

# Step 6: Find symbols in EFG kernel but not in either public source
$ comm -23 /tmp/efg-syms.txt \
    <(sort -u /tmp/vanilla-syms.txt /tmp/bsp-syms.txt) \
    | grep -vE "^(\.L[0-9]+|\.LC[0-9]+|\.LBE|\.LFE|\.LFB|\.Letext|\.Ldebug|\.Lframe|__compound_literal\.|__func__\.|__warned\.|CSWTCH\.)" \
    > /tmp/efg-unique-real-syms.txt

$ wc -l /tmp/efg-unique-real-syms.txt
6357

Filter rationale: The grep -vE pattern excludes compiler-generated local labels (.L<N>, .LC<N>, .LBE<N>, etc.) which differ across every build of every kernel and carry no information about kernel structure. The remaining 6,357 symbols are real exported names, function names, and global variable names.

Top-level breakdown by name prefix:

$ awk -F'_' '{print $1}' /tmp/efg-unique-real-syms.txt | grep -v "^\." \
    | sort | uniq -c | sort -rn | head -20

   2646 (no prefix or various)
    799 drm
    195 bond
    116 tdts
    113 wg
    104 my
     66 fsv
     59 ppp
     51 mlxsw
     46 shell
     45 ubnthal
     44 proc
     44 get
     42 dev
     42 bonding
     33 tm
     32 nf
     30 tcp
     29 pppoe
     27 ppu

Note: the drm count includes graphics driver code that may have come from a different source than vanilla or BSP (Ubiquiti uses Mediatek display panel for the EFG's front-panel LCD). The wg (WireGuard) count likely reflects an upstream backport. The tdts, tm, ubnthal, nf*dpi* numbers are the diagnostic ones.

A.10 — GPL Source Request

The following text was sent to opensource-requests@ui.com:

Subject: GPL Source Request — Enterprise Fortress Gateway (EFG) Kernel Source

I am the owner of an Ubiquiti Enterprise Fortress Gateway (EFG) running firmware version [version], with kernel version 5.15.72-ui-cn9670. Per the terms of GPL-2.0, I am formally requesting the complete corresponding source code for this firmware's GPL-licensed components, including but not limited to:

The complete Linux kernel source tree corresponding to 5.15.72-ui-cn9670, including:

The base kernel source

All patches applied by Ubiquiti and any third parties (Marvell, Trend Micro, etc.)

The kernel build configuration (.config)

The Marvell OCTEON CN9670 BSP drivers (octeontx2_pf, octeontx2_vf, octeontx2_af, rvu_*, NIX, CPT, SSO, NPA)

Source code for any GPL-licensed kernel modules including those tagged with the GPL/GPL-compatible MODULE_LICENSE() declarations

The device tree files (.dts, .dtsi) used by the firmware

The build system, packaging recipes, and toolchain specification (compiler version, flags) sufficient to reproduce the binary

Any other GPL components in the firmware (busybox, systemd, etc.)

Per GPL-2.0 §3, this source must be made available under the same license, in a form accessible to me. Acceptable delivery: a downloadable archive, a public git repository link, or physical media at cost.

[contact details]

The escalation path documented in Section 14.3 applies if no response is received.

Author

galvesribeiro commented May 5, 2026

Added a small comment in the beginning regarding AI usage just for the sake of transparency.

GhostNaix commented May 8, 2026 •

edited

Loading

I came across this whilst researching the EFG and UDM-Beast and to say marvelous work is an understatement.
I was wondering does Suricata in the EFG run on one core on the forwarding core (hence the inspection tax section)? Because I extracted the Suricata config from my UDM SE and it seems that they have a configuration for Suricata running across all cores (assuming they are using the same config in the EFG and UDM Beast). as evident by this:

threading:
  set-cpu-affinity: yes
  cpu-affinity:
    - management-cpu-set:
        cpu: [ 0 ]  # include only these cpus in affinity settings
    - receive-cpu-set: #
        cpu: [ "all" ]  # include only these cpus in affinity settings
    - worker-cpu-set:
        cpu: [ "all" ]
        prio:
          default: "high"
    - verdict-cpu-set:
        cpu: [ 1 ]
        prio:
          default: "high"
  detect-thread-ratio: 1.0

According to Suricata docs:

Runmode AutoFp:

management-cpu-set - used for management (example - flow.managers, flow.recyclers)
receive-cpu-set - used for receive and decode
worker-cpu-set - used for streamtcp,detect,output(logging)
verdict-cpu-set - used for verdict and respond/reject

Runmode Workers:

management-cpu-set - used for management (example - flow.managers, flow.recyclers)
worker-cpu-set - used for receive,streamtcp,decode,detect,output(logging),respond/reject, verdict

Can you please check the file /usr/share/ubios-udapi-server/ips/config/suricata_ubios_high.yaml on the EFG and UDM-Beast and if they have the same threading config as UDM-SE and are you suggesting that ubiquiti move the management-cpu-set to another core (i.e core 4 or core 2 or some other core)?.

mmx01 commented May 8, 2026 •

edited

Loading

I looked at UCG Fiber which is 5x cheaper. Does it mean they have different sw stack to achieve such result?

https://www.youtube.com/watch?v=YKsh_Dg0myU&t=2s

Inter vlan 11Gbps but config seems to follow their standard.

root@UCG-ironionet:~# iptables -L FORWARD -n -v --line-numbers

Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
num pkts bytes target prot opt in out source destination
1 265K 111M TOR all -- * * 0.0.0.0/0 0.0.0.0/0
2 276K 114M ALIEN all -- * * 0.0.0.0/0 0.0.0.0/0
3 24M 19G IPS all -- * * 0.0.0.0/0 0.0.0.0/0
4 24M 19G UBIOS_FORWARD_JUMP all -- * * 0.0.0.0/0 0.0.0.0/0
root@UCG-ironionet:# lsmod | grep nf_flow_table
root@UCG-ironionet:#

athurdent commented May 8, 2026 •

edited

Loading

I looked at UCG Fiber which is 5x cheaper. Does it mean they have different sw stack to achieve such result?

No, QCA hardware acceleration.

Author

galvesribeiro commented May 8, 2026

@GhostNaix Thank you for pointing it out and I owe you several corrections plus one substantial finding that came out of your question.

On the threading config: confirmed identical to what you found on your UDM-SE. The EFG ships the same suricata_ubios_high.yaml with management on core 0, verdict on core 1, workers on [ "all" ], plus isolcpus=12 on the kernel command line. But the architecture isn't what either of us initially thought. I went back and forth on this several times in my own thinking — first I claimed pure pcap, then overcorrected to "hybrid pcap+NFQUEUE," then to "two distinct modes switched by UI toggle." All of those were wrong. The actual architecture only became clear once I toggled "Intrusion Prevention" on in the UniFi controller and re-checked everything:

$ ps -ef | grep suricata | grep -v grep
/usr/share/ubios-udapi-server/ips_6/suricata/bin/suricata --pcap
    --pidfile /run/suricata.pid
    -c /usr/share/ubios-udapi-server/ips_6/config/suricata_ubios_high.yaml

$ iptables-save | grep -i nfqueue
[empty]

$ cat /proc/net/netfilter/nfnetlink_queue
[empty]

The IDS/IPS toggle does not change Suricata's launch mode. Suricata always runs --pcap on the EFG. The nfq: block, the verdict-cpu-set: [ 1 ] pinning, and the mode: repeat setting in the YAML are all dead config — they would activate only if Suricata were launched with -q <queue>, which never happens on this device.

What "IPS" actually does on the EFG, traced end-to-end:

Suricata runs in RunModeIdsPcapWorkers with six worker threads (one per bridge: br0, br254, br3, br5, br6, br7) and 32,033 signatures loaded.

When a signature matches, Suricata's in-process closed-source plugin ubnt-idsips-daemon.so writes an alert datagram to a UNIX socket at /run/ips/eve_alert.json.

A separate userland daemon ubnt-idsips-daemon reads those datagrams and (when IPS is toggled on) populates a kernel ipset named ips of type hash:ip,port,ip via netlink.

The IPS iptables chain matches against this ipset and drops with IPSLOGNDROP.

The ipset itself:

$ ipset list ips
Type: hash:ip,port,ip
Header: family inet hashsize 1024 maxelem 65536 timeout 0
Number of entries: 0

Three things stand out: blocking is per-flow-tuple (source IP / destination port / destination IP) rather than per-source-IP, entries never expire, and on this EFG with 8 days of uptime / IPS enabled / multi-VLAN traffic / 32K signatures loaded, the ipset is empty across multiple samples.

So the IDS/IPS toggle in the UniFi controller controls one specific behavior in a closed-source userland daemon: whether it populates the ipset when alerts fire. Both modes use the same Suricata invocation, the same --pcap capture, the same workers, the same alerts. The difference is policy in ubnt-idsips-daemon, not architecture.

This is more accurately described as delayed reactive blocking than Intrusion Prevention. The first packet matching a signature always reaches its destination — Suricata observes via pcap, alerts, the daemon parses the alert, the ipset gets populated, and the next matching tuple is dropped. An "IPS" that observes an SQL injection payload, alerts, and then blocks future traffic from the source IP — after the original payload has already reached the database server — has not prevented the intrusion. It has prevented follow-up traffic.
So your original concern about a single-core verdict bottleneck on core 1 doesn't actually manifest on the EFG, because there are no verdicts at all. But there's a different concern that's arguably more impactful: an "Intrusion Prevention System" that does not prevent the intrusions it observes.

The architectural reason this design exists is interesting in its own right: properly inline IPS via NFQUEUE would put every inspected packet through Suricata's worker threads with verdict reinjection through the configured verdict-cpu-set: [ 1 ] core. On the EFG's 2 GHz Octeon cores, this would significantly worsen the inter-VLAN forwarding throughput documented in the writeup. By doing IPS retroactively via ipset population, Ubiquiti avoids creating a hard single-core verdict bottleneck on the data path — but at the cost of the IPS not actually preventing the malicious traffic it detects.

The trade-off makes performance sense; it does not make security sense.

On your specific question — should management-cpu-set move off core 0? Yes, particularly for single-flow inter-VLAN tests where core 0 is the typical RSS landing spot. One-line YAML change with measurable benefit on single-stream throughput. Equally useful: pinning dpi-flow-stats away from RSS-dominant cores via taskset or systemd CPUAffinity. I checked dpi-flow-stats's affinity:

$ taskset -p $(pgrep -f dpi-flow-stats)
pid 3550's current affinity mask: 3ffff

Mask 0x3ffff = all 18 bits = no CPU pinning at all. dpi-flow-stats runs anywhere the scheduler places it, with sustained ~40% CPU consumption. It's actually the bigger forwarding-core contender than Suricata in this configuration.

Both core-pinning fixes are 10-20% wins on single-stream throughput at best, and they're worth doing. But the bigger picture in the writeup remains: even with optimal CPU pinning, single-stream forwarding ceilings are set by per-core kernel-stack throughput on the EFG's 2 GHz Octeon cores. Architectural fixes (flowtable, DPDK + VPP, Suricata 7.0+ in DPDK mode for actually-inline IPS) move those ceilings 5-25×. Core-pinning tweaks move them 10-20%. Both are worth doing; only the architectural fixes change what the device is fundamentally capable of.

Clarification on the "single core" phrasing in the writeup: I may have expressed myself wrongly with that wording. What the lab tests demonstrate isn't "all userspace inspection on one specific core" — it's that for a single-stream inter-VLAN test, the bottleneck IS one specific core (whichever core RSS hashes the flow to, typically core 0). Userspace processes that consume cycles on that specific core directly reduce that flow's throughput. The tests narrow to that one core because that's what matters for inter-VLAN single-stream — which is the user-visible failure mode (Veeam backups, single large file transfers, individual users on a Fast.com test). For multi-flow workloads work spreads across cores via RSS hashing and the contention is less visible, but Ubiquiti markets 25 Gbps WAN/LAN, and what users measure is single-stream. Apologies for the imprecise wording — Section 4.5 is now updated.

One more finding that came out of your question that triggered another move on my end: The Suricata version (because the path you pointed doesn't exist on EFG). Verified directly:

$ /usr/share/ubios-udapi-server/ips_6/suricata/bin/suricata -V
This is Suricata version 6.0.12 RELEASE

Suricata 6.0.x went end-of-life on August 1, 2024 per the upstream project. The official statement: "This means we'll be providing no more support, releases or (security) fixes for this branch. We strongly encourage everyone who is still using Suricata 6 or older to upgrade to Suricata 7 as soon as possible." The final 6.0.x release was 6.0.20.

The EFG is running Suricata 6.0.12 in May 2026 — that's 21 months past upstream EOL, 8 patch releases behind even the final 6.0.x release, and over three years old. Receiving zero security fixes since August 2024.

But here's the bigger finding: Suricata 8 is also on the device, fully staged, ready to run.

$ ls -la /usr/share/ubios-udapi-server/
drwxr-xr-x  2 root root  4096 Apr 22 21:09 ips/      ← 68-byte version selector
drwxr-xr-x  1 root root  4096 May  2 20:55 ips_6/    ← Suricata 6.0.12 (EOL, ACTIVE)
drwxr-xr-x  6 root root    81 Apr  8 06:24 ips_8/    ← Suricata 8.0.2 (current, INACTIVE)

$ /usr/share/ubios-udapi-server/ips_8/suricata/bin/suricata -V
This is Suricata version 8.0.2 RELEASE

$ ls /usr/share/ubios-udapi-server/ips_8/config/
afpacket.tmpl   category_list.json   iface.tmpl
reference.config   static_config.json   suricata_ubios_high.yaml

The ips_8/ directory is not a placeholder. It contains a working Suricata 8.0.2 binary, complete config templates, the same suricata_ubios_high.yaml filename used by the active ips_6/, and the full packaging structure. Suricata 8 also introduces a --firewall mode that would architecturally replace the current iptables IPS chain + ipset pattern entirely. The minimal ips/ directory contains only version.json (68 bytes) — likely a version selector that decides which ips_N/ directory the running daemon points at.

So this isn't a "haven't gotten around to upgrading" situation. The upgrade is on every shipping EFG, and Ubiquiti has actively chosen to point the version selector at the end-of-life binary. The likely reason is that the closed-source ubnt-idsips-daemon.so plugin is built against Suricata 6's plugin API and would need porting to 8.x — but that doesn't change the situation for customers, who are running an unsupported inspection engine on a $2,000 enterprise security gateway while the supported version sits unused on the same device's filesystem.

Bonus closed-source surface area worth flagging: two pieces of closed-source Ubiquiti code interact with GPL software in this pipeline. (1) ubnt-idsips-daemon.so is a Suricata plugin loaded as a .so into the Suricata process — runs in Suricata's address space, links against Suricata's exported plugin API. (2) ubnt-idsips-daemon is the separate userland daemon that consumes Suricata's alert socket and writes to the kernel ipset. Suricata is GPL-2.0-licensed; whether the plugin is a derived work is the same question raised in the writeup's Section 14 about the proprietary kernel modules.

Section 4.5 has been updated with the full architecture (with a flow diagram), the retroactive-blocking finding, the Suricata version / EOL details, and the staged-but-inactive Suricata 8.0.2 alongside it. The table in 4.8, Finding 8, and Recommended Fix 5 are also updated.
Thanks again — this thread has produced four discrete findings (architecture clarity on the IPS pipeline, the retroactive-blocking semantics, the EOL Suricata, and the staged-but-unused Suricata 8) that materially sharpen the writeup. Exactly the kind of substantive engagement these documents benefit from.

In other words - Sorry for the bad wording on that point. I meant single core because we were testing on the single core, so this effectively affect the forwarding capabilities on that particular core. I hope that clarifies it, I've updated the post accordingly and again, thanks for pointing it out!

Author

galvesribeiro commented May 8, 2026

@mmx01 the software stack is mostly the same. But I guess they are using the Qualcomm kernel from their BSP which probably has some acceleration features enabled (like the Marvel does) and for some reason they decided to not disable them.

GhostNaix commented May 8, 2026

@galvesribeiro Thank you for your very detailed response, I didn't realise that the path I provided no longer existed on the EFG (The path was found on my UDM SE and I assumed the EFG would have the same since the path was the same for the UDR too, sorry didn't have access to the EFG), I appologise for my mistake (I see why they renamed the folder from ips to ips_6 and ips_8 to represent suricata versions and also possible config diffreances). Also I am trying to think of the rationale behind ubiquiti rolling it's own IPS system which seems to be very flawed instead of using suricata's own NFQUEUE (You would think their engineering team would know this right?). According to suricata docs wouldn't a viable solution be using NFqueue but with the bypass option so that the user can switch back and fourth between IDS and IPS mode?

So there must be something they know that we don't for them to come up with such a solution otherwise, Occam's razor would put it at pure incompetence and ignorance (This would also potentially mean that the entire engineering team did not RTFM otherwise they would have known NFQUEUE existed).

Another question for you: Since you have access to the EFG, can you please put a linux box behind the EFG and run the command curl -s https://testmynids.org/uid/index.html (https is intentional) to see if they somehow got suricata to work with the TLS/SSL Decryption (Theres also an interesting quirk that theres a strange delay about 1-5 minutes before the signature detection shows up in the unifi dashboard. I observed this on my UDM SE but unsure if it happens in the EFG)? This should trigger the signature GPL ATTACK_RESPONSE id check returned root if they have some how managed to hook up suricata to TLS/SSL Decryption. If not then it's another potential blindspot/flaw with the EFG.

mmx01 commented May 8, 2026 •

edited

Loading

same here, running process:
/usr/share/ubios-udapi-server/ips_6/suricata/bin/suricata --pcap --pidfile /run/suricata.pid -c /usr/share/ubios-udapi server/ips_6/config/suricata_ubios_high.yaml

drwxr-xr-x 2 root root 4096 May 1 06:29 ips/
drwxr-xr-x 1 root root 4096 May 1 06:29 ips_6/
drwxr-xr-x 6 root root 81 Apr 22 20:33 ips_8/

also the kernel. still old
Linux UCG-ironionet 5.4.213-ui-ipq9574 #5.4.213 SMP PREEMPT Wed Apr 29 01:23:52 CST 2026 aarch64 GNU/Linux

root@UCG-ironionet:~# lsmod | grep nss
qca_nss_sfe 1273856 1 ecm
qca_nss_ppe_lag 20480 0
qca_nss_ppe_ds 24576 0
qca_nss_ppe_qdisc 102400 0
qca_nss_ppe_pppoe_mgr 16384 0
pppoe 24576 3 qca_nss_sfe,ecm,qca_nss_ppe_pppoe_mgr
qca_nss_ppe_bridge_mgr 32768 0
qca_ovsmgr 45056 3 qca_mcs,ecm,qca_nss_ppe_bridge_mgr
qca_nss_ppe_vlan 49152 2 qca_nss_ppe_lag,qca_nss_ppe_bridge_mgr
qca_nss_ppe_vp 69632 3 qca_nss_ppe_vlan,ecm,qca_nss_ppe_ds
qca_nss_dp 147456 2 qca_nss_ppe_vp,qca_nss_ppe_ds
bonding 135168 3 qca_nss_ppe_vlan,ecm,qca_nss_ppe_pppoe_mgr
qca_nss_ppe 380928 9 qca_nss_dp,qca_nss_ppe_vp,qca_nss_ppe_vlan,qca_nss_ppe_lag,qca_nss_ppe_qdisc,ecm,qca_nss_ppe_bridge_mgr,qca_nss_ppe_ds,qca_nss_ppe_pppoe_mgr
qca_ssdk 2191360 4 qca_nss_dp,qca_nss_ppe

and indeed:
root@UCG-ironionet:~# cat /sys/kernel/debug/qca-nss-ppe/stats/common_stats | grep flows
[v4_l3_flows]: 174
[v4_l2_flows]: 0
[v4_vp_wifi_flows]: 0
[v4_ds_flows]: 0
[v6_l3_flows]: 0
[v6_l2_flows]: 0
[v6_vp_wifi_flows]: 0
[v6_ds_flows]: 0

Author

galvesribeiro commented May 8, 2026 •

edited

Loading

@GhostNaix Thank you — and thanks for remind me about the TLS inspection/decryption which is handled by their EFG NextAI feature, not by Suricata. I've updated the post with more info relevant to how NextAI works and why this whole solution is messed up.

The curl test result, on the EFG with IPS enabled and 32,033 signatures loaded:

$ curl -s https://testmynids.org/uid/index.html
uid=0(root) gid=0(root) groups=0(root)

$ tail -100 /var/log/suricata/eve.json | grep -i "GPL ATTACK"
[empty]
$ tail -100 /var/log/suricata/fast.log | grep -i "GPL ATTACK"
[empty]
$ journalctl -u syslog-ng | grep -i "attack_response"
[empty]

No alert anywhere. The signature payload reached the test host through the EFG without detection. As you suspected, this is the TLS blind spot — Suricata in pcap mode sees ciphertext on the wire and can't match HTTP-body signatures against encrypted bytes.

On NextAI is where this would work and sustain your question on "why they did that?": packets enqueued to a RabbitMQ broker running on the EFG itself, dequeued by a proprietary SSL-inspection process, decrypted with a Ubiquiti-distributed CA cert, re-encrypted, re-enqueued, then forwarded — is significantly worse than the retroactive-ipset model we'd been discussing. RabbitMQ is an Erlang-based AMQP broker designed for inter-service messaging at millisecond timescales. Per-packet routing through AMQP imposes TCP framing, routing-key matching, persistence semantics, and Erlang VM scheduling decisions on every packet. That's fundamentally incompatible with a multi-Gbps data path.

Two consequences fall out of that:

First, NextAI's actual achievable throughput is well below what the wire would support, regardless of what the silicon could deliver — the AMQP broker is the bottleneck. The architectural choice itself caps the throughput.
Second, NextAI only inspects flows that meet several conditions simultaneously: (a) the client has the Ubiquiti CA installed; (b) the flow is routed through the NextAI pipeline; (c) the flow's volume fits within whatever RabbitMQ can sustain. The curl test above succeeded over HTTPS without certificate errors, which means NextAI evidently let the traffic pass without interception — I never used nor recommend anyone to use that crap. Only enabled it once to see the performance impact and how bad it was but disabled right after. That's not a misconfiguration on our end; it's NextAI's design. Any client without the Ubiquiti CA — BYOD, IoT, mobile devices, anything with cert pinning, anything Ubiquiti hasn't pre-imaged — bypasses NextAI inspection entirely.

So the IPS coverage on HTTPS is the intersection of: (clients with the CA installed) ∩ (flows routed through NextAI) ∩ (flows surviving the broker throughput). For most enterprise networks in 2026, that intersection is a minority of the traffic.

On the point about the 12 Gbps with IPS marketing claim — the writeup's framework already explains why. Their claim is "12 Gbps with IDS/IPS, measured with internet traffic." But internet traffic from LAN to WAN crosses a subnet boundary, hits NAT, and traverses the same single-core kernel-forwarding path the writeup documents for inter-VLAN — plus NAT mangling, plus PPPoE encapsulation if the WAN is PPPoE, plus all the iptables FORWARD chains, plus dpi-flow-stats and Suricata pcap copies. The 12 Gbps figure is only achievable as aggregate multi-flow throughput with TSO/checksum offload and CPU spread across cores via RSS. Single-stream LAN→WAN throughput with IPS enabled is bounded by the same per-core kernel-forwarding ceiling as inter-VLAN — typically 1-2 Gbps. The marketing number describes a benchmark methodology, not what users actually experience on real workloads (Veeam backup, large file transfer, CRM bulk operations).

I've added the 12 Gbps debunking to the TL;DR and added the TLS visibility test plus the NextAI architecture description to Section 4.5.

In other hand, Suricata 7.0+ on DPDK runs as a pipeline stage on the dataplane workers. With DPDK + VPP + Suricata-on-DPDK and TLS interception integrated as a pipeline stage (using QAT for offloaded crypto, or kernel TLS offload, or just CPU AES-NI on dedicated cores), the packet flow is: NIC RX → DPDK worker → VPP plugin chain → optional TLS decrypt → Suricata signature match → optional TLS encrypt → DPDK TX. No RabbitMQ. No kernel→userspace copies. No retroactive ipset. No 5-minute alert latency. The throughput overhead of TLS-aware inline IPS in this architecture is in the low single-digit percent on modern hardware, not the order-of-magnitude penalty of the EFG's current design.

This is what Marvell themselves publish as the reference architecture for the OCTEON CN9K silicon family the EFG runs on. It's not invention; it's integration. The writeup's central recommendation is that Ubiquiti adopt this reference architecture rather than continue stacking proprietary kernel modules and RabbitMQ-mediated decryption pipelines on top of a stock Linux 5.15 kernel network stack.
Thanks again for the NextAI context — it materially strengthens the writeup's architectural argument by showing that the EFG's TLS-aware inspection pipeline is even further from what Marvell's silicon was designed to do than the IDS/IPS pipeline alone suggests.

Author

galvesribeiro commented May 8, 2026

@mmx01 yeah - this is the only device they have any sort of acceleration enabled apparently, regardless of new or old platforms. It makes me believe that this came all pre-enabled on the BSP and reinforced my original thought that they just ship whatever the default BSP has.

scyto commented May 8, 2026

on my Beast both the 6 and the 8 files exist at these paths, adding here for reference only (and ues i realize there are only 2 actual files ;-) )

root@Home:~# find / suricata_ubios_high.yaml | grep suricata_ubios
/mnt/.rofs/usr/share/ubios-udapi-server/ips_6/config/suricata_ubios_high.yaml
/mnt/.rofs/usr/share/ubios-udapi-server/ips_8/config/suricata_ubios_high.yaml
/usr/share/ubios-udapi-server/ips_6/config/suricata_ubios_high.yaml
/usr/share/ubios-udapi-server/ips_8/config/suricata_ubios_high.yaml

athurdent commented May 9, 2026

@galvesribeiro did you run your tests on actual UDM Beast hardware, or is that some kind of simulation, or perhaps a modded device?
You found 32GB RAM and a 2.5 GHz CPU according to your table:

According to the tech specs from the UI site you could only have found 16GB RAM and a 2.1 GHz CPU

The below part doesn't add up for me, different result here when running top and looking at cpss-manager, so can you please show the actual values your calculations are based on?

I ran this through an AI and it assumed 2 scenarios:

Author

galvesribeiro commented May 9, 2026 •

edited

Loading

Yes, the device was a pre-release device. Although important clarify the irrelevance of this difference since the CPU in the production device has about 20% lower clock and the memory is irrelevant so your "astute reader" skills are not relevant in this case.

Regarding the numbers - 75% was a snapshot over a single "frame" per perf while the 6% was smoothed across multiple runs so yeah, I'm glad your AI can do math but again, irrelevant for the discussion.

That is all I can say.

athurdent commented May 9, 2026

I missed the note in the article stating it was a pre-release device you were comparing with.
Given the fact that Ubiquiti EA ToS forbid discussion of EA outside their forums, unsure about them appreciating someone running e.g. Claude Code on pre-release hardware likely covered by an NDA.
Anyways, you should clarify that you were not testing on production hardware there. Only Ubiquiti would know all of the differences between that specific pre-release model and a GA device, so the real-life results may well vary. As I said, my CPU usage regarding cpss-manager looks different.

Author

galvesribeiro commented May 9, 2026 •

edited

Loading

@athurdent

As proofed my statements were before with the UNASP8 on pre-release and GA hardware and sofware, and the fact that I didn't had any ToS or NDA when the tests were executed, I don't have any legal obligation to do anything. I'm aware of what I do and when I do but thank you for your attempts to as always fanboy on anything Ubiquiti related.

In regards to non-production hardware and production hardware don't worry - the tests were run later by one of the readers with similar results, so once again, what I said about pre-released hardware and software confirmed itself by itself.

Unless you have anything relevant about the problems detected by this post, or technical contributions like the one made by the other folks here, I suggest you stop replying. Don't few obligated to. Nobody wants you here. I'll be obligated to delete your messages since they are not relevant to the subject. I don't want this post to become the toxic environment Ubiquiti forums has become because of people like you.

Please refrain to post here. Fair warning.

GhostNaix commented May 9, 2026

@galvesribeiro Thank you again for your extremely detailed response. In regards to my question "Why they did that?" I was referring to why would they use ubnt-idsips-daemon instead of NFQUEUE. I apologize if it wasn't clear in the initial post.

I'd like to give the ubiquiti engineers the benefit of the doubt and say they had a particular and valid reason for this but given all this evidence I'm not so sure anymore.

I was also hoping that maybe the solution of using ubnt-idsips-daemon was to some how tied to there usage of Suricata with the TLS/SSL Inspection however given your findings, I now have to accept the possibility of Occams razor mentioned in my previous post.

As a result I deem the flaws with the EFG unacceptable for my use case and will postpone my upgrade.

mmx01 commented May 9, 2026 •

edited

Loading

As a result I deem the flaws with the EFG unacceptable for my use case and will postpone my upgrade.

I am holding off with move to UDM Beast also, it would be very bad experience to see it performs worse than 5x cheaper UCG Fiber I currently own. Hopefully either we see update and release of source code so that we could compile and load our own modules or the product gets fixed by the manufacturer (like IMO it should to meet advertised specs).

When I initially asked on reddit about lack of relevant inter vlan routing performance tests for UDM Beast, I was faar from knowing/finding out this thread and that it is actually not a new issue cascading from EFG-like approach. With this knowledge, it does not seem to me accidental at all that early release reviewers kind of missed to test/present inter vlan performance results.

Author

galvesribeiro commented May 9, 2026

@GhostNaix no need to apologise to anyone and I understood your question. Also, I think myself and many others are not here to shit on Ubiquiti's head. They have an awesome hardware and idea as I've tirelessly said. However that hardware is not (1) delivering what was advertised because of software and (2) we are not being blindness fanboys that says "everything is alright" for the sake of keep receiving free gear.

I've spent myself, for myself, and my companies hundreds of thousands of dollars in Ubiquiti gear the past couple of years. I was really invested on their ecosystem, and all I wanted, and had bet on, was for them to improve the products and deliver on the class or products they advertise and the target audiences. Proof of that, is that nothing I've show in this gist is new for me. It's been on Ubiquiti's hands for about a year thru a long support thread where I explain in much more details what I've said here. I just ran thru the investigation again, with current software/hardware and made it public here so people don't fall on the same mistake I did.

As said in multiple occasions on their own forums and on Discord, if they ever get this situation fixed, and actually fix their problems, I'm be more than glad to get back to buy their products again.

@mmx01 it is important to note that the doc here were very specific about what exactly is wrong and which situations those products are having issues. If your specific use case doesnt fit the requirements you will probably be happy with the EFG or UDMB. For example, if you have a previous UDM, or one of the smaller gateways and you run other apps like Protect, Talk, etc, you will definitively see a performance improvement on those apps on the UDMB. The complains I have here are specifically on the Networking part of it which for my use case, is the critical one. But even considering the Networking, if your network is based on small size traffic spread among many users, which means you have many streams, you will probably not feel those problems. But if you have the traffic pattern described here, then, you will suffer indeed.

mmx01 commented May 9, 2026 •

edited

Loading

@galvesribeiro , I was pointed here by someone on reddit as my use case is primarily single stream high throughput - most commonly - real time video ingestion, transfers for processing and AI inference. It matches your tests as far as I understand. I use it only for networking, no talk, protect or anything like this. There I have Blue Iris and other solutions.

If that would be without any security considerations my Mellanox SX6012 can do L3 inter vlan at 56Gbps line rate as I have both hosts on 2x56DACs, but it misses all L3 security features completely, not even supports vlan separation and that is a big problem for compliance of what I need (am required) to do. Also I cannot put all "actors" in the same vlan, they need separation.

For the moment I do Internet, IPSEC and some IOT/OT basics via UCG Fiber, heavier stuff routes via NGFW Sophos XG virtualized but that is fragmented architecture and hard to manage changes and threats. It works but it is more complicated than it should.

I looked at UDM Beast to consolidate everything on it (possibly with HA like with Sophos but in hardware) and simplify at 25Gbps IPS/IDS it would come as perfect fit and acceptable compromise with even reasonable price for that promise. Waiting idle for 8 hours versus 4 hours is a huge difference - and cost. Also one of my thoughts was if it could do HA active - active and possibly aggregate 50Gbps, that was just an idealistic thought (not an expectation) and probably in the best version would fall short of LACP which is already limited by hash mechanism and not just x2 performance. Yet it cannot deliver 25Gbps so the rest is history.

Before I experimented with Linux DPDK/VPP, Mikrotik CHR and all have good performance but actually not unified management especially when L3 routing touches any security features, be it isolation and/or inspection/detection. For IPS/IDS often these solutions require to offload to another 3rd party software/appliance which then need separate configuration/management. Integration is what UniFi does nicely in its UI and ecosystem seems interesting. I wanted this at scale with competent hardware.

Time to benefit should not be impacted by time to orchestrate :) but for me it is. I am small independent contractor so many things are out of reach for me to be frank, but UDMB comes close, if HW delivered on 25Gbps IPS/IDS promise a no brainer for me.

Author

galvesribeiro commented May 9, 2026 •

edited

Loading

@mmx01 I feel you. That is why many of us bought Ubiquiti and invested in the ecosystem. The "Unified" experience is really cool and nobody at that price point would get even close and I've acknowledged that.

However, unlike other say when they have 0 argument and still try to make some "pulled from my 4ss" point to stick: "here's a reason we're not paying 10 times the price for those 25G gateways (plus license and software support fees)" - that I don't believe. Because something is cheaper, doesn't mean they can advertise something and mislead people that it will actually do it. And by the way, we DO pay Ubiquiti support. Good luck trying to resolve a performance problem like those with the free support and they ask you to "run an air scan" or even the PRO/PAID support when they tell you that shutdown the power thru the PDU is not enough and you have to physically unplug the device. :)

The whole point here is performance, and how they are not achieving it. People are mislead to believe that rainbows unicorns and lollipops when using Ubiquiti. Believe me, it is not. And that is mostly because the brainless fanboys out there OR, even worse, the ones which don't want to arg against the product problems for the sake of keep receiving it for free, rush to battle anyone out there (I'm not the only one) that have no ties and is actually saying the truth about the products by doing proper research. I wish I had done that before and have not bet in the company so hard like I did expecting them to get better. But that is on me.

Can they get there? As I said, and proofed, yes, absolutely and I root for them. If that time comes as I said, I'll buy Ubiquiti again immediately for my companies but until them, I've returned 100k+ in purchases last month with a disaster that happened by the ECS-A switches + the UCK-E which they acknowledged that was a problem and yet havent fixed months in a row. I hope they do get better tho.

In regards to your environment yeah, I can imagine your situation. One of my companies is a Game Studio. We move A LOT of data at very high speeds and low latency all day. Every machine has a dual Mellanox 25G NIC and we are already quoting/designing a network upgrade to dual 100G since this is becoming a bottleneck. On that same company, we just dropped/returned a whole rack full of Ubiquiti gear + a bunch of gear in their boxes for the other two racks in our DR datacenter. We ended up moving the gateways to Mikrotik 100G gateways which actually can perform 100G without sweating and have the active active true HA behavior and doesn't suffer from the EFG/UXG-E problems and to Dell Switches. Unlike been said, it has cost way less the overall solution for this deployment, and I have 0 issues with it.

if HW delivered on 25Gbps IPS/IDS promise a no brainer for me.

I agree, but don't hold your breath on that. I'm sure the hardware can do it but the state it is now, even the GA hardware/software, you will not get ever close to that.

Also a friendly reminder: IPS in Ubiquiti space is not real IPS, it is misleading. It is retroactive blocking and as shown will be useless on HTTPS traffic unless you have the NeXT AI enabled (please for your sake don't). Meaning that if someone pass a legit HTTPS request with a simple SQL injection it will hit your database and have a return. Only the subsequent frames (i.e. next HTTP requests on the same session) maybe will be blocked. Ofc we expect that application developers are protected against it if they doing their work and applying defense-in-depth concepts on their engineering but there is always chance for mistakes and the gateway will not protect you in this case.

PS: If they had priced the EFG/UXG-E 10x more and/or had a subscription, I would still pay gladly if it worked as advertised.

GhostNaix commented May 10, 2026 •

edited

Loading

PS: If they had priced the EFG/UXG-E 10x more and/or had a subscription, I would still pay gladly if it worked as advertised.

I'm sure you know this but, they do have a subscription. Cybersecure they call it which is actually the Emerging Threats Pro ruleset which has twice the amount of signatures in comparison to the normal unpaid ruleset (Which is actually Emerging Threats Open) for Suricata, but unfortunately given the present issues, it's not working at its full potential.

Also a friendly reminder: IPS in Ubiquiti space is not real IPS, it is misleading. It is retroactive blocking and as shown will be useless on HTTPS traffic unless you have the NeXT AI enabled (please for your sake don't). Meaning that if someone pass a legit HTTPS request with a simple SQL injection it will hit your database and have a return.

@galvesribeiro Also correct me if I'm wrong but, according to your test data as demonstrated here:

The curl test result, on the EFG with IPS enabled and 32,033 signatures loaded:

$ curl -s https://testmynids.org/uid/index.html
uid=0(root) gid=0(root) groups=0(root)

$ tail -100 /var/log/suricata/eve.json | grep -i "GPL ATTACK"
[empty]
$ tail -100 /var/log/suricata/fast.log | grep -i "GPL ATTACK"
[empty]
$ journalctl -u syslog-ng | grep -i "attack_response"
[empty]

Assuming you enabled NeXT AI (Which I assume is the SSL/TLS Decryption feature + Their "AI") and Suricata during the test, the signatures made to catch this was enabled yet did not catch it during the test (Which I am very sure is in the Free Ruleset/ ET Open Ruleset, the signature that was supposed to be triggered has the SID of 2100498. I also assumed that you enabled all ruleset categories in the test). I again assumed that you added the sections listed below this in /usr/share/ubios-udapi-server/ips_6/config/suricata_ubios_high.yaml during the test as it is not there by default thus Suricata would not output to /var/log/suricata/fast.log by default.

outputs:
  # a line based alerts log similar to Snort's fast.log
  - fast:
      enabled: yes
      filename: /var/log/suricata/fast.log
      append: yes

Therefore, since currently the IPS depends on Suricata detecting this, wouldn't this demonstrate that even with NeXT AI enabled IPS is still useless in conditions when threats are hiding in https at the time of writing this?

On another note, does anyone know any other alternative solutions that can do 10 Gigabit/s intervlan routing with full inspection like IDS/IPS with application identification like unifi without a subscription? I'm looking at Sophos firewall home and opnsense but not sure if they can do 10 Gigabit/s intervlan routing with inspection. With opnsense I know theres Zenarmor and ntopng, however they both have a subscription, Zenarmor has limited policies for the home license use whilst I'm not even sure if ntopng can even block application identified traffic.

mmx01 commented May 10, 2026 •

edited

Loading

Sophos home edition has CPU restriction to 4 cores... they have removed RAM restriction but not CPU. If you have powerful enough cores it could perhaps touch that speed same as other issue single vs. multiple data stream but this restriction penalizes high density multi core CPUs with low(er) base clocks. Regular HW with this restriction will not get you there in single stream :(

Author

galvesribeiro commented May 10, 2026 •

edited

Loading

@GhostNaix yes to all assumptions. That requires to also kill (not restart) the suricata process so the watchdog start it again and load the config. I had struggled a bit to find out why I wasnt seeing the log until Claude told me why.

Regarding subscriptions yes, I had the paid support and CyberSecure in my original environment, not on this lab/test tho but for the point of the doc it wouldn’t make any difference. My point was to say that even theoretically assuming the device costed 10x more as someone suggested, I would buy it anyway if it worked. So the assertion people made is that it is cheap, therefore don’t expect it to work. This tautology doesn’t sustain itself. The question was never about money. I didn’t bought Ubiquiti because it was cheaper.

About the alternatives: I used to use BigIP F5 which had IDS/IPS license. But we move from it long time ago. I dont know other solutions which do inside itself. I’d suggest to use something like Mikrotik if you want performance and work as advertised and still secure firewall wise. They dont advertise nor have IPS/IDS but for my usecase, it is irrelevant to be honest. My traffic is filtered before it reaches the gateway/firewall by WAF.

Also if you have 3 public IPs at minimum (if you don’t want double NAT but it is fine with private IP as well), you can put any box that has IPS/IDS on you internet edge, like a DMZ, receive your ingress on it, do filtering/inspection and then forward to the gateway. This way your traffic can be filtered before enter/leave from/to internet, while your local traffic stays within your gateway boundaries which requires interVLAN at proper speeds. Yes, additional complexity but works just fine.

Another way is to have what we have. The traffic ingress comes from CloudFlare, which is filtered at the edge. Then, if an attack happen it never get to my infrastructure. You can use a tunnel with cloudflared or a simple public IPs rule allowing traffic only from cloudflare IP.

So all this to say that you don’t need IDS/IPS strictly on your network depending on your use case, let alone the ancient (and badly implemented) way of do TLS inspection which NeXT AI do.

GhostNaix commented May 10, 2026

Sophos home edition has CPU restriction to 4 cores... they have removed RAM restriction but not CPU. If you have powerful enough cores it could perhaps touch that speed same as other issue single vs. multiple data stream but this restriction penalizes high density multi core CPUs with low(er) base clocks. Regular HW with this restriction will not get you there in single stream :(

Ah, I see... well it doesn't seem there is a suitable solution then. By the way is it even possible to get a business license on a Sophos Firewall box that you built your own hardware or do you have to buy an appliance? I also assume that the recurring cost is also not suitable for a homelab customer?

GhostNaix commented May 10, 2026 •

edited

Loading

@galvesribeiro Ah I understand your point regarding subscriptions completely now. Also thanks alot for your insight.

For my use case I do deem inspection between VLANs necessary thus I so need IDS/IPS at the network level.

Author

galvesribeiro commented May 10, 2026

@GhostNaix you got most of the reasons why many companies do not use it nowadays already and why I don't like it. It is a security approach introduced many decades ago, where every company had their Squid proxy deployed to inspect and rule traffic of everything. As you said privacy (even in work environments) is a thing, and many countries it is unlawful to use such approaches nowadays.

By definition, it requires that local CA which is annoying to maintain, hard to distribute, etc. Ofc, if you have a MDM or Active Directory in place you can easily distribute the certificates using it policies so the user doesn't even know. Ubiquiti did a good job trying to simplify it if you use their Identity app on each machine.

However, as you said yourself, many applications break with it. Bank websites, anything that has SNI and/or validate the issuer will fail the handshake. Also, the majority of the malware and attackers know they may be in a monitored environment so they avoid the traditional HTTP POST to send data. They use raw TCP sockets, WebSockets (which after the initial POST is promoted to a regular socket) or worse, UDP frames, so the TLS inspection is basically useless.

I understand ofc that there are scenarios where you want to get that level of inspection but I think there are more modern alternatives. For example, endpoint security. Most of the modern security companies have agents running on the client machines which prevent anything wrong to even leave the machine to the network. That also have the cost of distribution but nothing needs to be decrypted or faked. It runs local, usually hooked at kernel space, inspecting both user and kernel space for malicious executions. I believe this it is much more effective than a MIDM approach which is only covering HTTPS traffic.

My point is: although I agree with you that there are some use cases that this may be legit, there are more modern alternatives nowadays, and this should not by any means drive the design of a gateway such that it can become useless as it stands. If you look closely on the problems I reported, you can see a pattern where all the problems "fit" together for the end goal, and big part of it is the traffic inspection one way or another. The user would be penalized whether they use the whole thing or not.

It would be better if Ubiquiti had spent time on developing an agent inside the Identity app (or something dedicated for that) which do the inspection locally based on selective rules on each machine. Leave the gateway alone on this and optimize the hell out of it. It is a win-win situation. Who needs inspection, enable it on the clients and push a policy (easily done for Windows - GPOs - and MacOS - profiles - nowadays, and doable on Linux - specialized policies). But again, as we saw for ourselves, they took the easy way apparently and used 90's tech to the problem and slapped a "AI" label on it.

Author

galvesribeiro commented May 10, 2026

Ah, I see... well it doesn't seem there is a suitable solution then. By the way is it even possible to get a business license on a Sophos Firewall box that you built your own hardware or do you have to buy an appliance? I also assume that the recurring cost is also not suitable for a homelab customer?

Never used Sophos in production. We had a quote from them many years ago but their features didn't fit with what we expected so we moved on.

GhostNaix commented May 10, 2026 •

edited

Loading

Ah thanks @galvesribeiro. For anyone curious about that response, I asked @galvesribeiro about why they were against TLS inspection and wanted to pick their brain about it. I then deleted it because I thought I was stepping on some toes but turns out I wasn't quick enough and they answered it anyway.

One more question for you @galvesribeiro are you willing to run future tests for future ubiquiti top end gear, say if they come out with like the EFG 2 because at this point we really can't take them for face value anymore?

Author

galvesribeiro commented May 10, 2026

Another approach to protect the servers in this case is to have proper policies. We have policies everywhere inside our kubernetes cluster so traffic only leaves or get in if it is white listed on a per pod basis, the pods are rootless, etc. So there are many other approaches which prevent those situations without requiring those legacy approaches to protection.

mmx01 commented May 10, 2026 •

edited

Loading

It is possible to get a license for VM deployment but it will be hard to justify the cost for just homelab.
https://docs.sophos.com/central/customer/help/en-us/LicensingGuide/FirewallLicenses/SFOSLicensingModel/index.html

You may want to write to them and state your case/ask options. Maybe... you never know without trying first. At minimum you would need base + network (individual subscriptions) but that is off topic here.

galvesribeiro/EFG-Broken.md

Why Your Ubiquiti EFG Can't Push 25 Gbps Inter-VLAN — and What's Actually Going On

Or: How I Reproduced the Problem on x86, Tried to Load the Missing Modules on the Real Device, and What That Tells Us About Ubiquiti's Kernel

TL;DR

A note on methodology and AI assistance

Table of Contents

1. The Problem

2. Test Environment

Host Machine ("skywalker")

Reference Device

Lab VM Topology on skywalker

3. Methodology

4. The Reference Run: Real EFG Diagnostics

4.1 — Hardware and Kernel

4.2 — The 5-Deep FORWARD Chain (Smoking Gun #1)

4.3 — No Flowtable. None. (Smoking Gun #2)

4.4 — Conntrack Sized for 10 Million, Currently 846 Used

4.5 — The Inspection Tax (Smoking Gun #3)

4.6 — 18 Cores Sitting Idle

4.7 — Per-VLAN Bridges Instead of VLAN-Aware Bridge

4.8 — Summary of EFG Diagnostic Findings

5. Reproducing the Bottleneck — virtio-net Test Matrix

Test 1 — MTU 9000, offloads on, no rules (best case baseline)

Test 2 — MTU 9000, offloads off

Test 3 — MTU 1500, offloads off (the EFG-realistic baseline)

Test 4 — Adding nf_conntrack module (no rules)

Test 5 — Simple ct rule

Test 6 — EFG-replica 5-chain ruleset (the headline bad number)

Test 7 — 8 parallel streams with EFG ruleset

Test A — Adding nftables flowtable (the magic config change)

Test B — Flowtable + offloads on, MTU 1500

virtio-net Test Summary

6. Closing the Loop — Real Silicon Test Matrix

6.1 — SR-IOV Setup

Test K1 — ConnectX VF, kernel forwarding, MTU 1500, offloads on, no rules

Test K2 — Same as K1, with EFG-style 5-chain ruleset

Test K3 — ConnectX VF, offloads off, no rules

Test K4 — ConnectX VF, offloads off, EFG-style rules

Real Silicon Test Summary

7. Userspace Dataplane — VPP/DPDK Comparison

Test V0 — VPP with virtio-net

Test V1 — VPP with ConnectX VF, offloads off on clients

Test V2 — VPP with ConnectX VF, offloads on the clients (the headline)

7.1 — Estimating VPP/DPDK Throughput on the Octeon Silicon

8. The PPPoE Bottleneck — A Related but Distinct Problem

8.1 — Why PPPoE Is Slow on Stock Linux Routers

8.2 — Direct Evidence on a Production EFG

8.2.1 — Multiple ksoftirqd threads pegged simultaneously

8.2.2 — Concurrent userspace load on the same cores

8.2.3 — Modules confirm pure software PPPoE path

8.2.4 — All hardware offloads disabled on ppp1, with [fixed] flags

8.2.5 — Confirmed: no flowtable module in this kernel build

8.2.6 — MTU discrepancy confirmed

8.3 — How PPPoE Integrates with DPDK Dataplanes

8.4 — The Fix Is Already in Mainline Linux (and DPDK)

8.5 — Estimated PPPoE Improvement

9. Cross-Product Confirmation: UDM Beast and UCG Fiber

9.1 — UDM Beast Hardware Identification

9.2 — The Switch ASIC Is Real and It Is Doing Work

9.3 — Switchdev Is Disabled Across All Interfaces

9.4 — Direct Evidence the L3 Path Is Software-Only

9.5 — The Same iptables Architecture as the EFG

9.6 — Drivers Loaded vs Drivers Possible

9.7 — Inter-VLAN Path on the UDM Beast

9.8 — What This Confirms About the Pattern

9.9 — But Ubiquiti Has Done Hardware-Accelerated Forwarding Elsewhere: The UCG Fiber

9.10 — What This Tells Us About the Pattern

10. Findings: The Architectural Failures

Finding 1: The kernel network stack on a single core has a ceiling around 5 Gbps single-stream when offloads are off, regardless of NIC

Finding 2: Hardware offloads (GRO/TSO/LRO) are the single highest-impact configuration variable

Finding 3: The 5-chain iptables FORWARD pattern costs roughly half your throughput when offloads are also off

Finding 4: nftables flowtable is the missing 3-7× single-stream multiplier

Finding 5: Conntrack helpers are loaded by default, and the per-packet cost is widely misunderstood

Finding 6: Multiple cores do not help single-flow forwarding in the kernel

Finding 7: DPDK on the same silicon delivers 10-25× the throughput, and the vendor ships full DPDK support

Finding 8: Userspace inspection processes contend with the forwarding core, with the contention pattern depending on per-process CPU pinning

Finding 9: Per-VLAN bridges instead of vlan-aware single bridge prevent kernel fast-path optimization

Finding 10: PPPoE WAN performance is bottlenecked by the same kernel stack, with additional encapsulation cost — and worse multi-core spread

Finding 11: The EFG's kernel is binary-incompatible with vanilla 5.15.72 despite identifying as such, and the safety net that would catch this is disabled

11. Recommended Fixes

8.2.4 — All hardware offloads disabled on ppp1, with `[fixed]` flags

13.5 — `tdts` and `t_miner`: Closed-Source Kernel Modules

GhostNaix commented May 8, 2026 •

edited

Loading

mmx01 commented May 8, 2026 •

edited

Loading

athurdent commented May 8, 2026 •

edited

Loading