- Build an eBPF-based implementation of Kubernetes Services (ClusterIP, NodePort, LoadBalancer) to replace the kube-proxy/iptables and CNI-based implementations of Kubernetes Services.
- The goal is not to "use as much eBPF as possible" but rather to use eBPF selectively and opportunistically, and to leverage standard kernel datapaths as much as possible unless there is a good reason to do otherwise.
- Since the iptables packages are being deprecated in the Linux kernel and RHEL, an implementation of kube-proxy that does not depend on iptables is necessary. See iptables deprecation.
- The primary design requirement is to retain the end-user experience for stability and debuggability when replacing the kube-proxy/iptables-based datapath. This requirement is more important than raw data-plane performance if that performance comes at the cost of stability, debuggability and familiarity for end users.
- A1: Write a completely new data path, including new connection tracking (conntrack), NAT and load-balancing functions/programs in eBPF.
- A2: Leverage the conntrack module and NAT tables from the Linux kernel, but use new eBPF tc/XDP programs to set these up.
- A3: Use socket-based load balancing and datapath techniques to bypass the kernel conntrack, netfilter and NAT datapaths.
Details of these approaches are documented separately, but very briefly: approaches A1 and A2 try to mirror the datapath logic of the iptables-based kube-proxy implementation without actually using iptables. Approach A3, in contrast, would rely on a socket-level L4 proxy function implemented in eBPF that terminates and re-initiates connections initiated by external clients to destination NodePorts.
Based on an analysis of the pros/cons of these options and the desired prioritization of user experience and stability, we are currently planning to use approach A2 for this work, although we continue to analyze approaches A1 and A3 and may opportunistically use some aspects of those approaches in the final implementation.
As a side note, the Cilium project uses a design similar to approach A3 for implementing ClusterIP services in eBPF, and a design similar to approach A1 for NodePort services. The eBPF helper functions needed for approach A2 have only recently been developed, and the current direction for this project is to use the early pre-GA versions of that infrastructure; these options were not available to the Cilium project. This also fits with our requirement to leverage the Linux kernel as much as possible and to focus on stability, debuggability and familiarity for end users. However, we will continue to track all three options and potential hybrid solutions that combine aspects of these different approaches, if that helps with our end-goal requirements and priorities.
Prototype a kube-proxy replacement implementation using KubeProxy-NG plus a BPF socket-connect-based datapath for ClusterIP services (approach A3), and a tc-bpf plus kernel conntrack/NAT-based implementation for NodePort services (approach A2). Since this phase will rely on new BPF helper functions that are not yet in any Linux distribution, the focus will be to confirm the viability of these approaches and to gather learning/experience for the Phase 2 implementation and eventual release. In Phase 1, we will leverage the Kube-Proxy NG (aka KPNG) project as the baseline controller for watching and processing K8s Services. However, this project will not be completely tied to KPNG or to a specific KPNG backend; in Phase 2 we will decide whether to continue with KPNG based on the upstream readiness of KPNG and of the appropriate KPNG backend at that time.
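To illustrate the socket-connect-based datapath planned for ClusterIP services (approach A3), the following is a minimal sketch of a cgroup/connect4 eBPF program that rewrites the destination of a connect() call from a ClusterIP to a selected backend pod. The map name (cluster_ip_services), struct layouts and single-backend lookup are illustrative assumptions and not part of this design; a real backend would maintain the full endpoint set and apply the configured load-balancing policy.

// SPDX-License-Identifier: GPL-2.0
// Hedged sketch only: map name, struct layouts and selection logic are illustrative.
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct svc_key {
    __be32 cluster_ip;
    __be16 port;                    /* service port, network byte order */
    __u8   proto;                   /* IPPROTO_TCP or IPPROTO_UDP */
    __u8   pad;
};

struct svc_backend {
    __be32 pod_ip;
    __be16 pod_port;                /* network byte order */
    __u16  pad;
};

/* Illustrative map: (ClusterIP, port, proto) -> one backend. A real
 * implementation would store the full endpoint set and pick among them. */
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 65536);
    __type(key, struct svc_key);
    __type(value, struct svc_backend);
} cluster_ip_services SEC(".maps");

SEC("cgroup/connect4")
int clusterip_connect4(struct bpf_sock_addr *ctx)
{
    struct svc_key key = {
        .cluster_ip = ctx->user_ip4,
        .port       = (__be16)ctx->user_port,  /* stored in network byte order */
        .proto      = ctx->protocol,
    };

    struct svc_backend *be = bpf_map_lookup_elem(&cluster_ip_services, &key);
    if (!be)
        return 1;                   /* not a ClusterIP service: allow as-is */

    /* Rewrite the connect() destination to the selected backend pod, so the
     * ClusterIP never appears on the wire for this connection. */
    ctx->user_ip4  = be->pod_ip;
    ctx->user_port = be->pod_port;
    return 1;                       /* 1 = allow the (rewritten) connect() */
}

char _license[] SEC("license") = "GPL";

Note that a connect-time rewrite covers TCP and connected UDP sockets; unconnected UDP traffic would additionally need cgroup sendmsg4/recvmsg4 hooks.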
The figures below illustrate the K8s NodePort service. A Kubernetes Service has been exposed externally on port 31000 of all nodes of a Kubernetes cluster. In the first figure, we have a single-node k8s cluster and all backend pods of this service are located on that one node. Traffic from external clients a.a.a.a and b.b.b.b is load balanced (and NAT'ed) to one of these backend pods. In the second figure, we have a multi-node k8s cluster and the backends are distributed across the nodes. In this case, traffic may come into the cluster via one node while the load-balancing decision picks a backend pod that resides on a different node, in which case this traffic needs to be re-routed back out to the right node in order to reach the selected backend pod. There are several additional details here related to load-balancing policy, handling of external traffic, return traffic etc., which we do not list here but which are documented elsewhere, including in the upstream Kubernetes documentation.
The figures below illustrate the Linux kernel and Netfilter datapath for traffic received from external clients and destined for Kubernetes NodePort services.
We can see that this traffic normally needs to go through the conntrack module in order to implement stateful tracking and NAT operations. For approach A2, we could choose to place eBPF programs at the XDP or tc-ingress hook points as shown in the figure. Initially we will use tc-ingress-based hooks and eventually add XDP-based (and possibly even socket-lookup-based) eBPF programs.
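As a point of reference, a tc-ingress eBPF program of this kind is typically attached via the clsact qdisc. The sketch below shows one way to do this from userspace with libbpf's bpf_tc_* APIs (available since libbpf v0.6); the object name, program name and interface are placeholders, not names defined by this project.

/* Hedged userspace sketch: attach a tc-ingress eBPF program using libbpf. */
#include <errno.h>
#include <net/if.h>
#include <bpf/libbpf.h>

int attach_nodeport_tc(const char *ifname)
{
    int ifindex = if_nametoindex(ifname);
    if (!ifindex)
        return -errno;

    /* "nodeport_tc.bpf.o" / "nodeport_ingress" are illustrative names. */
    struct bpf_object *obj = bpf_object__open_file("nodeport_tc.bpf.o", NULL);
    if (!obj || bpf_object__load(obj))
        return -1;

    struct bpf_program *prog =
        bpf_object__find_program_by_name(obj, "nodeport_ingress");
    if (!prog)
        return -1;

    /* Create (or reuse) the clsact qdisc, then attach at the ingress hook. */
    DECLARE_LIBBPF_OPTS(bpf_tc_hook, hook, .ifindex = ifindex,
                        .attach_point = BPF_TC_INGRESS);
    DECLARE_LIBBPF_OPTS(bpf_tc_opts, opts, .prog_fd = bpf_program__fd(prog));

    int err = bpf_tc_hook_create(&hook);
    if (err && err != -EEXIST)      /* -EEXIST: clsact already present, fine */
        return err;
    return bpf_tc_attach(&hook, &opts);
}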
With a tc-ingress-based approach, the basic logic of the tc eBPF program using approach A2 will be to intercept incoming traffic, check whether it is destined for a NodePort service, and check whether the Linux kernel has already created connection tracking state for this connection. If the traffic belongs to a new connection, the eBPF program will perform a load-balancing operation to select a backend and then call the kernel to create connection tracking entries as well as a NAT mapping for the selected backend. The new eBPF helper functions will be used to perform these updates in the Linux kernel. The packet will then be allowed to use the normal Linux kernel datapath and will get load balanced and DNAT'ed (and possibly SNAT'ed) according to the CT and NAT mapping entries set up by the tc-ingress eBPF program. Similarly, reverse NAT entries will also be set up so that return traffic for the same connection is matched to the same connection state and un-NAT'ed accordingly. An illustrative sketch of this program is included at the end of this section.
The new eBPF helper functions that will be used to implement this functionality include:
struct nf_conn *
bpf_xdp_ct_lookup(struct xdp_md *, struct bpf_sock_tuple *, u32,
                  struct bpf_ct_opts *, u32);

struct nf_conn___init *
bpf_xdp_ct_alloc(struct xdp_md *, struct bpf_sock_tuple *, u32,
                 struct bpf_ct_opts *, u32);

struct nf_conn *bpf_ct_insert_entry(struct nf_conn___init *);
void bpf_ct_release(struct nf_conn *);
void bpf_ct_set_timeout(struct nf_conn___init *, u32);
int bpf_ct_change_timeout(struct nf_conn *, u32);
int bpf_ct_set_status(const struct nf_conn___init *, u32);
int bpf_ct_set_nat_info(struct nf_conn___init *, union nf_inet_addr *,
                        __be16 *, enum nf_nat_manip_type);
Note: Review these and change the XDP versions to the skb/tc versions based on code updates in progress in the upstream kernel.
The logic of these functions is self-explanatory (CRUD of CT/connection tracking entries and NAT mappings in the Linux kernel). The current timeline is for most of these helpers to be available in kernel v6.0 and the remaining ones in kernel v6.1, both of which should GA before the end of CY 2022. Additional details will be added after the completion of Phase 1.
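As the illustrative sketch referenced earlier, the following shows, in hedged form, how the tc-ingress program for approach A2 might combine these helpers once the skb/tc variants are available. The kfunc declarations follow the prototypes listed in this document and may not match the final upstream signatures; the fixed NodePort value, the backend selection and the timeout are placeholders; and vmlinux.h is assumed to be generated from a kernel that provides the conntrack/NAT types.

// SPDX-License-Identifier: GPL-2.0
// Hedged sketch only: kfunc names/signatures, constants and selection logic are illustrative.
#include "vmlinux.h"                /* assumed to provide nf_conn, bpf_ct_opts, etc. */
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

#define TC_ACT_OK 0
#define ETH_P_IP  0x0800

/* skb/tc variants of the new conntrack kfuncs, declared per the prototypes
 * listed above; final upstream names and signatures may differ. */
extern struct nf_conn *
bpf_skb_ct_lookup(struct __sk_buff *skb, struct bpf_sock_tuple *tuple, u32 tuple__sz,
                  struct bpf_ct_opts *opts, u32 opts__sz) __ksym;
extern struct nf_conn___init *
bpf_skb_ct_alloc(struct __sk_buff *skb, struct bpf_sock_tuple *tuple, u32 tuple__sz,
                 struct bpf_ct_opts *opts, u32 opts__sz) __ksym;
extern struct nf_conn *bpf_ct_insert_entry(struct nf_conn___init *ct) __ksym;
extern void bpf_ct_release(struct nf_conn *ct) __ksym;
extern void bpf_ct_set_timeout(struct nf_conn___init *ct, u32 timeout) __ksym;
extern int bpf_ct_set_nat_info(struct nf_conn___init *ct, union nf_inet_addr *addr,
                               __be16 *port, enum nf_nat_manip_type manip) __ksym;

/* Placeholder backend selection: a real implementation would look up the
 * service's endpoints in a BPF map and apply the load-balancing policy. */
static __always_inline void pick_backend(union nf_inet_addr *addr, __be16 *port)
{
    addr->ip = bpf_htonl(0x0a000201);   /* 10.0.2.1: illustrative pod IP */
    *port    = bpf_htons(8080);         /* illustrative pod port */
}

SEC("tc")
int nodeport_ingress(struct __sk_buff *skb)
{
    void *data     = (void *)(long)skb->data;
    void *data_end = (void *)(long)skb->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end || eth->h_proto != bpf_htons(ETH_P_IP))
        return TC_ACT_OK;

    struct iphdr *iph = (void *)(eth + 1);
    if ((void *)(iph + 1) > data_end || iph->ihl != 5 || iph->protocol != IPPROTO_TCP)
        return TC_ACT_OK;               /* illustrative: TCP only, no IP options */

    struct tcphdr *tcph = (void *)(iph + 1);
    if ((void *)(tcph + 1) > data_end)
        return TC_ACT_OK;

    /* Placeholder NodePort check: the real program would consult a map of
     * configured NodePort services. */
    if (tcph->dest != bpf_htons(31000))
        return TC_ACT_OK;

    struct bpf_sock_tuple tuple = {
        .ipv4 = {
            .saddr = iph->saddr, .daddr = iph->daddr,
            .sport = tcph->source, .dport = tcph->dest,
        },
    };
    struct bpf_ct_opts opts = { .netns_id = -1, .l4proto = IPPROTO_TCP };

    /* Existing connection: the kernel CT/NAT state set up earlier handles it. */
    struct nf_conn *ct = bpf_skb_ct_lookup(skb, &tuple, sizeof(tuple.ipv4),
                                           &opts, sizeof(opts));
    if (ct) {
        bpf_ct_release(ct);
        return TC_ACT_OK;
    }

    /* New connection: create a CT entry with a DNAT mapping to the selected
     * backend and insert it, so the normal netfilter datapath translates this
     * packet, subsequent packets and the return traffic. */
    struct nf_conn___init *nct = bpf_skb_ct_alloc(skb, &tuple, sizeof(tuple.ipv4),
                                                  &opts, sizeof(opts));
    if (!nct)
        return TC_ACT_OK;

    union nf_inet_addr backend_addr = {};
    __be16 backend_port = 0;
    pick_backend(&backend_addr, &backend_port);
    bpf_ct_set_nat_info(nct, &backend_addr, &backend_port, NF_NAT_MANIP_DST);
    bpf_ct_set_timeout(nct, 30000);     /* illustrative initial timeout (ms) */

    ct = bpf_ct_insert_entry(nct);
    if (ct)
        bpf_ct_release(ct);
    return TC_ACT_OK;                   /* continue down the kernel datapath */
}

char _license[] SEC("license") = "GPL";

In the actual implementation, SNAT for traffic that must be re-routed to a backend pod on another node, handling of return and related traffic, and the eventual XDP variant would be layered on top of this basic skeleton.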