This is a short guide to deploying a three-node Kubernetes cluster using K3s, with kube-vip providing a highly available control plane and managing LoadBalancer Service resources, and finally Cilium as our CNI with the Egress Gateway feature enabled. We'll also lean heavily on Cilium's eBPF support by doing away with kube-proxy entirely, but note that this does come with some limitations.
First, let's set some common options for K3s:
export K3S_VERSION="v1.22.4+k3s1"
export K3S_OPTIONS="--flannel-backend=none --no-flannel --disable-kube-proxy --disable-network-policy --disable=servicelb"
I've got three VMs running openSUSE Leap deployed on vSphere - cilium{0..2}. Note that I use govc extensively during this article, and I'll be making use of k3sup to bootstrap my cluster:
k3sup install --cluster --ip $(govc vm.ip /42can/vm/cilium0) --user nick --local-path ~/.kube/cilium.yaml --context cilium --k3s-version $K3S_VERSION --k3s-extra-args "$K3S_OPTIONS"
k3sup join --ip $(govc vm.ip /42can/vm/cilium1) --server-ip $(govc vm.ip /42can/vm/cilium0) --server --server-user nick --user nick --k3s-version $K3S_VERSION --k3s-extra-args "$K3S_OPTIONS"
k3sup join --ip $(govc vm.ip /42can/vm/cilium2) --server-ip $(govc vm.ip /42can/vm/cilium0) --server --server-user nick --user nick --k3s-version $K3S_VERSION --k3s-extra-args "$K3S_OPTIONS"
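Since we told k3sup to write the kubeconfig to ~/.kube/cilium.yaml with a context named cilium, one way to point kubectl at the new cluster is:
% export KUBECONFIG=~/.kube/cilium.yaml
% kubectl config use-context cilium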
At this point nodes will be in NotReady status and no Pods will have started, as we have no functioning CNI:
% kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
cilium0 NotReady control-plane,etcd,master 2m53s v1.22.4+k3s1 192.168.20.49 <none> openSUSE Leap 15.3 5.3.18-57-default containerd://1.5.8-k3s1
cilium1 NotReady control-plane,etcd,master 65s v1.22.4+k3s1 192.168.20.23 <none> openSUSE Leap 15.3 5.3.18-57-default containerd://1.5.8-k3s1
cilium2 NotReady control-plane,etcd,master 24s v1.22.4+k3s1 192.168.20.119 <none> openSUSE Leap 15.3 5.3.18-57-default containerd://1.5.8-k3s1
% kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system coredns-85cb69466-n7z42 0/1 Pending 0 2m50s
kube-system helm-install-traefik--1-xv7q2 0/1 Pending 0 2m50s
kube-system helm-install-traefik-crd--1-w7w5w 0/1 Pending 0 2m50s
kube-system local-path-provisioner-64ffb68fd-mfqsr 0/1 Pending 0 2m50s
kube-system metrics-server-9cf544f65-kqbj4 0/1 Pending 0 2m50s
Before we go any further and install Cilium, we need a VIP for our control plane, and we'll provide this using kube-vip. Note that we run it as a static Pod (as opposed to a DaemonSet) since we don't yet have a functioning CNI. To do this, we need to drop https://kube-vip.io/manifests/rbac.yaml into /var/lib/rancher/k3s/server/manifests on each node (a loop for this is shown a little further down), and we also need to generate the manifest for our static Pod, customised slightly for our environment and for K3s. To create the manifest in my case, I ran these commands:
% alias kube-vip="docker run --network host --rm ghcr.io/kube-vip/kube-vip:v0.4.0"
% kube-vip manifest pod \
--interface eth0 \
--vip 192.168.20.200 \
--controlplane \
--services \
--arp \
--leaderElection > kube-vip.yaml
This created a kube-vip.yaml with the IP I want to use for my VIP (192.168.20.200) as well as the interface on my nodes to which this should be bound (eth0). The file then needs to be edited so that the hostPath path points to /etc/rancher/k3s/k3s.yaml instead of the default /etc/kubernetes/admin.conf, since the kubeconfig lives at a different path with K3s.
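For reference, the edited volume stanza of the generated manifest should end up looking something like this (an excerpt only; the exact structure and volume name may vary with the kube-vip version):
  volumes:
  - hostPath:
      path: /etc/rancher/k3s/k3s.yaml
    name: kubeconfig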
With those changes made, this needs to be copied into the default directory for static Pod manifests on all of our nodes, /var/lib/rancher/k3s/agent/pod-manifests:
% for node in cilium{0..2} ; do cat kube-vip.yaml | ssh $(govc vm.ip $node) 'cat - | sudo tee /var/lib/rancher/k3s/agent/pod-manifests/kube-vip.yaml' ; done
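The RBAC manifest mentioned earlier can be dropped into the K3s auto-deploy directory in much the same way (a sketch, assuming curl is available locally; the local filename is arbitrary):
% curl -sL https://kube-vip.io/manifests/rbac.yaml -o kube-vip-rbac.yaml
% for node in cilium{0..2} ; do cat kube-vip-rbac.yaml | ssh $(govc vm.ip $node) 'sudo tee /var/lib/rancher/k3s/server/manifests/kube-vip-rbac.yaml' ; done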
If everything's working, you should see the kube-vip Pods in the Running state and you should be able to ping your VIP:
% kubectl get pods -n kube-system | grep -i vip
kube-vip-cilium0 1/1 Running 0 107m
kube-vip-cilium1 1/1 Running 0 5m9s
kube-vip-cilium2 1/1 Running 0 2m15s
% ping -c 1 192.168.20.200
PING 192.168.20.200 (192.168.20.200) 56(84) bytes of data.
64 bytes from 192.168.20.200: icmp_seq=1 ttl=63 time=1.78 ms
--- 192.168.20.200 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 1.781/1.781/1.781/0.000 ms
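As an additional sanity check, the API server should also answer on the VIP - something like the following should return the version JSON (or a 401 if anonymous access has been disabled), which at least proves the VIP is routing to the control plane:
% curl -ks https://192.168.20.200:6443/version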
Now we can install Cilium via Helm, specifying the VIP for the service host to which Cilium should connect:
% helm install cilium cilium/cilium --version 1.10.5 \
--namespace kube-system \
--set kubeProxyReplacement=strict \
--set k8sServiceHost=192.168.20.200 \
--set k8sServicePort=6443 \
--set egressGateway.enabled=true \
--set bpf.masquerade=true \
--set hubble.relay.enabled=true \
--set hubble.ui.enabled=true
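Once the Cilium agents roll out, the nodes should flip to Ready and the Pending Pods from earlier should get scheduled. A quick way to watch for that:
% kubectl -n kube-system rollout status daemonset/cilium
% kubectl get nodes
% kubectl get pods -A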
The Traefik Ingress controller deployed as part of K3s creates a Service of type LoadBalancer. As we opted not to install Klipper, we need something else to handle these resources and to surface a VIP on our cluster's behalf. Given we're already using kube-vip for the control plane, we might as well use it here too. First, install the controller component:
kubectl apply -f https://raw.githubusercontent.com/kube-vip/kube-vip-cloud-provider/main/manifest/kube-vip-cloud-controller.yaml
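This runs as a single Pod in the kube-system namespace; you can keep an eye on it while we set up the address pool below:
% kubectl -n kube-system get pods | grep kube-vip-cloud-provider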
Now create a ConfigMap resource with a range from which kube-vip will allocate IP addresses by default:
apiVersion: v1
kind: ConfigMap
metadata:
  name: kubevip
  namespace: kube-system
data:
  range-global: 192.168.20.220-192.168.20.225
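Save that to a file and apply it (the filename is arbitrary):
% kubectl apply -f kubevip-configmap.yaml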
Once the kube-vip-cloud-provider-0 Pod in the kube-system namespace is Running, you should see that the LoadBalancer Service for Traefik now has an IP address allocated and is reachable from outside the cluster:
% kubectl get svc traefik -n kube-system
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
traefik LoadBalancer 10.43.21.231 192.168.20.220 80:31186/TCP,443:31184/TCP 16h
% curl 192.168.20.220
404 page not found
NB: The 404 is expected and a valid response - Traefik has nothing to route yet as we haven't defined any Ingress resources.
The example documentation for Cilium's Egress Gateway feature suggests creating a Deployment which launches a container somewhere in your cluster and plumbs an IP address onto an interface, with that IP becoming the nominated egress IP. However, we can make use of an existing IP address within our cluster, so let's use the IP that's been assigned to our Traefik LoadBalancer Service - 192.168.20.220 in my case.
To test this we need an external service running somewhere that we can connect to. I've spun up another VM, imaginatively titled test, and this machine has an IP of 192.168.20.70. I'm just going to launch NGINX via Docker:
% echo 'it works' > index.html
% docker run -d --name nginx -p 80:80 -v $(pwd):/usr/share/nginx/html nginx
% docker logs -f nginx
If we attempt to connect from a client in our cluster to this server, we should see something like the following:
% kubectl run tmp-shell --rm -i --tty --image nicolaka/netshoot -- /bin/bash
If you don't see a command prompt, try pressing enter.
bash-5.1# curl 192.168.20.70
it works
And from NGINX:
192.168.20.82 - - [07/Dec/2021:12:35:43 +0000] "GET / HTTP/1.1" 200 9 "-" "curl/7.80.0" "-"
Predictably, the source IP is that of the node on which my netshoot container is running (cilium0). Now let's add a CiliumEgressNATPolicy, which will put in place the configuration to set the source IP address to that of my Traefik LoadBalancer for any traffic destined for my external test server's IP:
% cat egress.yaml
apiVersion: cilium.io/v2alpha1
kind: CiliumEgressNATPolicy
metadata:
  name: egress-sample
spec:
  egress:
  - podSelector:
      matchLabels:
        io.kubernetes.pod.namespace: default
  destinationCIDRs:
  - 192.168.20.70/32
  egressSourceIP: "192.168.20.220"
% kubectl apply -f egress.yaml
ciliumegressnatpolicy.cilium.io/egress-sample created
Now let's run that curl command again from our netshoot container and observe what happens in NGINX:
bash-5.1# curl 192.168.20.70
it works
192.168.20.220 - - [07/Dec/2021:12:47:53 +0000] "GET / HTTP/1.1" 200 9 "-" "curl/7.80.0" "-"
There we can see that the source IP of the request is now the VIP assigned to our Traefik LoadBalancer.
We can further verify the NAT that's been put in place by running the cilium bpf egress list command from within one of our Cilium Pods:
% kubectl -n kube-system get pods -l k8s-app=cilium
NAME READY STATUS RESTARTS AGE
cilium-28gdd 1/1 Running 0 17h
cilium-n5wv7 1/1 Running 0 17h
cilium-vfvkd 1/1 Running 0 16h
% kubectl exec -it cilium-vfvkd -n kube-system -- cilium bpf egress list
SRC IP & DST CIDR EGRESS INFO
10.0.0.80 192.168.20.70/32 192.168.20.220 192.168.20.220
10.0.2.11 192.168.20.70/32 192.168.20.220 192.168.20.220
Very helpful 💯