@ElXreno
Created March 16, 2026 23:23
KubeSpan + Cilium Egress Gateway + BPF Masquerade: Making the 'Impossible' Combination Work

The Problem

Running a multi-site Kubernetes cluster on TalosOS with Cilium as the CNI, we needed:

  1. Inter-node encryption — all traffic between nodes encrypted (nodes communicate over public internet)
  2. Egress Gateway — specific pods' external traffic routed through a gateway node for geo-IP requirements
  3. Talos host firewall: NetworkDefaultActionConfig with ingress: block, for node-level security
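
Requirement 3 corresponds to a single Talos config document. A minimal sketch, assuming the standard NetworkDefaultActionConfig schema; individual ports are then opened with separate NetworkRuleConfig documents:

```yaml
# Talos config document: block all ingress traffic by default.
# Exceptions must be granted explicitly via NetworkRuleConfig documents.
apiVersion: v1alpha1
kind: NetworkDefaultActionConfig
ingress: block
```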

These three requirements created an "impossible triangle":

Cilium WireGuard (encryption.nodeEncryption: true) + Talos Firewall = Broken

Cilium installs CT --notrack iptables rules (in the raw table) for WireGuard-decrypted traffic (mark 0x0D00). This makes return packets UNTRACKED in the kernel's conntrack. Talos's nftables ingress chain has ct state established,related → accept, which doesn't match UNTRACKED packets. The cilium_host interface (where decrypted traffic enters the host stack) isn't in Talos's hardcoded interface whitelist (lo, siderolink, kubespan). Result: TCP SYN-ACK replies are silently dropped by the Talos firewall.

Confirmed with pwru kernel tracing:

nft_do_chain → sk_skb_reason_drop(SKB_DROP_REASON_NETFILTER_DROP)
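
For reference, the Cilium side of this broken combination looks roughly like the following. This is a sketch of the relevant Helm values only, assuming the standard chart key names; it is not a copy of the full values file.

```yaml
# Cilium Helm values for the combination that broke with the Talos firewall.
# Shown for reference only; do not use together with ingress: block.
encryption:
  enabled: true
  type: wireguard
  nodeEncryption: true   # extends encryption to host-to-host traffic
```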

KubeSpan + Cilium BPF Masquerade = TCP Breaks

KubeSpan (Talos's built-in WireGuard mesh) is incompatible with Cilium's bpf.masquerade: true. Cilium's BPF SNAT runs on eth0 egress, but KubeSpan's nftables OUTPUT chain intercepts packets and routes them through the kubespan WireGuard interface. Return traffic arrives on kubespan, not eth0, so Cilium's BPF reverse-SNAT never fires. TCP connections hang because SYN-ACK packets are never un-SNATed back to the pod IP. (siderolabs/talos#11235)

KubeSpan + Egress Gateway = Mutually Exclusive (supposedly)

Cilium Egress Gateway has a hard requirement for bpf.masquerade: true (validated in pkg/egressgateway/manager.go). KubeSpan requires disabling BPF masquerade. Therefore: KubeSpan and Egress Gateway cannot coexist... or so we thought.

The Solution

bpf.hostLegacyRouting: true breaks the deadlock:

# Cilium HelmRelease values
encryption:
  enabled: false              # KubeSpan handles encryption instead

bpf:
  masquerade: true            # Required for Egress Gateway — KEPT
  hostLegacyRouting: true     # Fixes KubeSpan + BPF masquerade conflict

egressGateway:
  enabled: true               # Works because bpf.masquerade is still true

bandwidthManager:
  bbr: false                  # BBR requires BPF host routing, incompatible with legacy
  enabled: true               # EDT bandwidth manager still works with CUBIC

# Talos machine config
machine:
  network:
    kubespan:
      enabled: true           # Encrypts all inter-node traffic

Why This Works

  1. bpf.hostLegacyRouting: true forces packets through the kernel networking stack instead of BPF direct-redirect between interfaces. KubeSpan's nftables rules and kernel conntrack can properly track connections — both SNAT and reverse-SNAT happen in the netfilter framework, not split between BPF and netfilter.

  2. bpf.masquerade: true is still set, satisfying Egress Gateway's hard requirement. The BPF masquerade SNAT still runs on eth0 egress for traffic that reaches eth0 (like Egress Gateway's VXLAN-tunneled traffic to the gateway node).

  3. Egress Gateway works because its traffic flow is: pod → cil_to_netdev on eth0 → BPF redirects via VXLAN to the gateway node → gateway node does SNAT. The VXLAN outer packet goes through KubeSpan (encrypted), but the Egress Gateway BPF logic runs before KubeSpan intercepts.

  4. KubeSpan's kubespan interface IS in Talos's hardcoded firewall whitelist — all traffic arriving on kubespan bypasses the nftables ingress chain entirely. No NOTRACK conflict.
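
To make point 3 concrete, an Egress Gateway policy of the kind this setup supports might look like the sketch below. The policy name, both label selectors, and the destination CIDR are hypothetical placeholders, not values from our cluster. No egressIP is set, so traffic should leave with the gateway node's own IP.

```yaml
apiVersion: cilium.io/v2
kind: CiliumEgressGatewayPolicy
metadata:
  name: geo-egress              # hypothetical name
spec:
  selectors:
    - podSelector:
        matchLabels:
          egress: geo           # hypothetical label on the pods to route
  destinationCIDRs:
    - "0.0.0.0/0"               # placeholder: all external destinations
  egressGateway:
    nodeSelector:
      matchLabels:
        egress-gateway: "true"  # hypothetical label on the gateway node
```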

Note on advertiseKubernetesNetworks

Do NOT enable advertiseKubernetesNetworks in KubeSpan config — it's not needed in VXLAN tunnel mode (Cilium handles pod routing via VXLAN overlay) and is explicitly unsupported with Cilium. The fix from siderolabs/talos#9043 is also not required in tunnel mode.
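
A sketch of the KubeSpan section with the option spelled out. We understand false to be the default, so the line is shown only to make the intent explicit:

```yaml
# Talos machine config
machine:
  network:
    kubespan:
      enabled: true
      advertiseKubernetesNetworks: false  # keep disabled; Cilium's VXLAN overlay handles pod routing
```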

Firewall Configuration

Add KubeSpan's WireGuard port to the Talos firewall rules (both controlplane and worker templates):

apiVersion: v1alpha1
kind: NetworkRuleConfig
name: kubespan-wireguard
portSelector:
  ports:
    - 51820
  protocol: udp
ingress:
  # Allow from all cluster node IPs
  - subnet: <node-ip>/32
  # ...

The rule for Cilium's WireGuard port (51871/UDP) can be removed, since Cilium encryption is now disabled.

What We Verified

| Check | Result |
|---|---|
| Cilium health | 12/12 reachable, all Node 1/1 |
| KubeSpan mesh | All peers UP |
| Worker-to-worker host TCP | OK (was broken with Cilium WireGuard + Talos firewall) |
| Egress Gateway | Working: pods exit via gateway node IP |
| Cilium connectivity test (ping intra/cross node) | Passed |
| Talos host firewall | Active and working |

The One Tradeoff

bandwidthManager.bbr: false is required because BBR needs BPF host routing, which is incompatible with hostLegacyRouting: true. The bandwidth manager still works with CUBIC congestion control. We expect the performance cost of legacy host routing to be small relative to the VXLAN + WireGuard overhead already present.

Root Cause Deep Dive

Finding this solution required kernel-level debugging with pwru:

  1. Forward path works: SYN → BPF redirect to cilium_wg0 → WireGuard encrypt → UDP to remote node → WireGuard decrypt → cil_from_wireguard → cilium_net/cilium_host → ip_rcv → tcp_v4_rcv → SYN accepted
  2. Return path breaks: SYN-ACK → WireGuard encrypt → UDP to originating node → WireGuard decrypt → enters via cilium_host → ip_rcv → nftables INPUT chain drops it (SKB_DROP_REASON_NETFILTER_DROP)
  3. Why nftables drops it: Cilium's CILIUM_PRE_raw chain sets CT --notrack on mark 0x0D00 packets. The SYN-ACK is UNTRACKED, not ESTABLISHED. Talos's ct state established,related rule doesn't match. No port-specific rule matches the SYN-ACK (destination port is ephemeral). Default policy: drop.
  4. Why cilium_host isn't whitelisted: Talos's interface whitelist is hardcoded to lo, siderolink, kubespan in nftables_chain_config.go. Not configurable via machine config.

Environment

  • TalosOS v1.12.5 (kernel 6.18.15)
  • Cilium v1.19.1
  • 12 nodes across 2 hosters, communicating over public internet
  • VXLAN tunnel mode + KubeSpan WireGuard encryption