This is a short guide to deploying a three-node Kubernetes cluster using K3s, with kube-vip providing a highly available control plane and managing LoadBalancer Service resources, and finally Cilium as our CNI with the Egress Gateway feature enabled. We'll also lean heavily on Cilium's eBPF support by doing away with kube-proxy entirely, but note that this does come with some limitations.
First, let's set some common options for K3s:
export K3S_VERSION="v1.22.4+k3s1"
export K3S_OPTIONS="--flannel-backend=none --no-flannel --disable-kube-proxy --disable-network-policy --disable=servicelb"
I've got three VMs running openSUSE Leap deployed on vSphere - cilium{0..2}. Note that I use govc extensively during this article, and I'll be making use of k3sup to bootstrap my cluster:
k3sup install --cluster --ip $(govc vm.ip /42can/vm/cilium0) --user nick --local-path ~/.kube/cilium.yaml --context cilium --k3s-version $K3S_VERSION --k3s-extra-args "$K3S_OPTIONS"
k3sup join --ip $(govc vm.ip /42can/vm/cilium1) --server-ip $(govc vm.ip /42can/vm/cilium0) --server --server-user nick --user nick --k3s-version $K3S_VERSION --k3s-extra-args "$K3S_OPTIONS"
k3sup join --ip $(govc vm.ip /42can/vm/cilium2) --server-ip $(govc vm.ip /42can/vm/cilium0) --server --server-user nick --user nick --k3s-version $K3S_VERSION --k3s-extra-args "$K3S_OPTIONS"
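Since we told k3sup to write the kubeconfig to ~/.kube/cilium.yaml with a context named cilium, one way to point kubectl at the new cluster is:
% export KUBECONFIG=~/.kube/cilium.yaml
% kubectl config use-context cilium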
At this point nodes will be in NotReady status and no Pods will have started, as we have no functioning CNI:
% kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
cilium0 NotReady control-plane,etcd,master 2m53s v1.22.4+k3s1 192.168.20.49 <none> openSUSE Leap 15.3 5.3.18-57-default containerd://1.5.8-k3s1
cilium1 NotReady control-plane,etcd,master 65s v1.22.4+k3s1 192.168.20.23 <none> openSUSE Leap 15.3 5.3.18-57-default containerd://1.5.8-k3s1
cilium2 NotReady control-plane,etcd,master 24s v1.22.4+k3s1 192.168.20.119 <none> openSUSE Leap 15.3 5.3.18-57-default containerd://1.5.8-k3s1
% kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system coredns-85cb69466-n7z42 0/1 Pending 0 2m50s
kube-system helm-install-traefik--1-xv7q2 0/1 Pending 0 2m50s
kube-system helm-install-traefik-crd--1-w7w5w 0/1 Pending 0 2m50s
kube-system local-path-provisioner-64ffb68fd-mfqsr 0/1 Pending 0 2m50s
kube-system metrics-server-9cf544f65-kqbj4 0/1 Pending 0 2m50s
Before we go any further and install Cilium, we need a VIP for our control plane, and we'll provide this using kube-vip. Note that we run it as a static Pod (as opposed to a DaemonSet) since we don't yet have a functioning CNI. To do this, we need to drop https://kube-vip.io/manifests/rbac.yaml into /var/lib/rancher/k3s/server/manifests on each node (a loop for this is shown a little further down), and we also need to generate the manifest for our static Pod, customised slightly for our environment and for K3s. To create the manifest in my case, I ran these commands:
% alias kube-vip="docker run --network host --rm ghcr.io/kube-vip/kube-vip:v0.4.0"
% kube-vip manifest pod \
--interface eth0 \
--vip 192.168.20.200 \
--controlplane \
--services \
--arp \
--leaderElection > kube-vip.yaml
This created a kube-vip.yaml with the IP I want to use for my VIP (192.168.20.200) as well as the interface on my nodes to which this should be bound (eth0). The file then needs to be edited so that the hostPath path points to /etc/rancher/k3s/k3s.yaml instead of the default /etc/kubernetes/admin.conf, since the kubeconfig lives at a different path with K3s.
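For reference, the edited volume stanza of the generated manifest should end up looking something like this (an excerpt only; the exact structure and volume name may vary with the kube-vip version):
  volumes:
  - hostPath:
      path: /etc/rancher/k3s/k3s.yaml
    name: kubeconfig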
With those changes made, this needs to be copied into the default directory for static Pod manifests on all of our nodes, /var/lib/rancher/k3s/agent/pod-manifests:
% for node in cilium{0..2} ; do cat kube-vip.yaml | ssh $(govc vm.ip $node) 'cat - | sudo tee /var/lib/rancher/k3s/agent/pod-manifests/kube-vip.yaml' ; done
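The RBAC manifest mentioned earlier can be dropped into the K3s auto-deploy directory in much the same way (a sketch, assuming curl is available locally; the local filename is arbitrary):
% curl -sL https://kube-vip.io/manifests/rbac.yaml -o kube-vip-rbac.yaml
% for node in cilium{0..2} ; do cat kube-vip-rbac.yaml | ssh $(govc vm.ip $node) 'sudo tee /var/lib/rancher/k3s/server/manifests/kube-vip-rbac.yaml' ; done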
If everything's working, you should see the kube-vip Pods in the Running state and you should be able to ping your VIP:
% kubectl get pods -n kube-system | grep -i vip
kube-vip-cilium0 1/1 Running 0 107m
kube-vip-cilium1 1/1 Running 0 5m9s
kube-vip-cilium2 1/1 Running 0 2m15s
% ping -c 1 192.168.20.200
PING 192.168.20.200 (192.168.20.200) 56(84) bytes of data.
64 bytes from 192.168.20.200: icmp_seq=1 ttl=63 time=1.78 ms
--- 192.168.20.200 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 1.781/1.781/1.781/0.000 ms
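As an additional sanity check, the API server should also answer on the VIP - something like the following should return the version JSON (or a 401 if anonymous access has been disabled), which at least proves the VIP is routing to the control plane:
% curl -ks https://192.168.20.200:6443/version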
Now we can install Cilium via Helm, specifying the VIP for the service host to which Cilium should connect:
% helm install cilium cilium/cilium --version 1.10.5 \
--namespace kube-system \
--set kubeProxyReplacement=strict \
--set k8sServiceHost=192.168.20.200 \
--set k8sServicePort=6443 \
--set egressGateway.enabled=true \
--set bpf.masquerade=true \
--set hubble.relay.enabled=true \
--set hubble.ui.enabled=true
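Once the Cilium agents roll out, the nodes should flip to Ready and the Pending Pods from earlier should get scheduled. A quick way to watch for that:
% kubectl -n kube-system rollout status daemonset/cilium
% kubectl get nodes
% kubectl get pods -A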
The Traefik Ingress controller deployed as part of K3s creates a Service of type LoadBalancer. As we opted not to install Klipper, we need something else to handle these resources and to surface a VIP on our cluster's behalf. Given we're already using kube-vip for the control plane, we might as well use it here too. First, install the controller component:
kubectl apply -f https://raw.githubusercontent.com/kube-vip/kube-vip-cloud-provider/main/manifest/kube-vip-cloud-controller.yaml
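This runs as a single Pod in the kube-system namespace; you can keep an eye on it while we set up the address pool below:
% kubectl -n kube-system get pods | grep kube-vip-cloud-provider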
Now create a ConfigMap resource with a range from which kube-vip will allocate IP addresses by default:
apiVersion: v1
kind: ConfigMap
metadata:
  name: kubevip
  namespace: kube-system
data:
  range-global: 192.168.20.220-192.168.20.225
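Save that to a file and apply it (the filename is arbitrary):
% kubectl apply -f kubevip-configmap.yaml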
Once the kube-vip-cloud-provider-0 Pod in the kube-system namespace is Running, you should see that the LoadBalancer Service for Traefik now has an IP address allocated and is reachable from outside the cluster:
% kubectl get svc traefik -n kube-system
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
traefik LoadBalancer 10.43.21.231 192.168.20.220 80:31186/TCP,443:31184/TCP 16h
% curl 192.168.20.220
404 page not found
NB: The 404 is expected and a valid response - Traefik has nothing to route yet as we haven't defined any Ingress resources.
The example documentation for Cilium's Egress Gateway feature suggests creating a Deployment which launches a container somewhere in your cluster and plumbs an IP address onto an interface, with that IP becoming the nominated egress IP. However, we can make use of an existing IP address within our cluster, so let's use the IP that's been assigned to our Traefik LoadBalancer Service - 192.168.20.220 in my case.
To test this we need an external service running somewhere that we can connect to. I've spun up another VM, imaginatively titled test, and this machine has an IP of 192.168.20.70. I'm just going to launch NGINX via Docker:
% echo 'it works' > index.html
% docker run -d --name nginx -p 80:80 -v $(pwd):/usr/share/nginx/html nginx
% docker logs -f nginx
If we attempt to connect from a client in our cluster to this server, we should see something like the following:
% kubectl run tmp-shell --rm -i --tty --image nicolaka/netshoot -- /bin/bash
If you don't see a command prompt, try pressing enter.
bash-5.1# curl 192.168.20.70
it works
And from NGINX:
192.168.20.82 - - [07/Dec/2021:12:35:43 +0000] "GET / HTTP/1.1" 200 9 "-" "curl/7.80.0" "-"
Predictably, the source IP is that of the node on which my netshoot container is running (cilium0). Now let's add a CiliumEgressNATPolicy, which will put in place the configuration to set the source IP address to that of my Traefik LoadBalancer for any traffic destined for my external test server's IP:
% cat egress.yaml
apiVersion: cilium.io/v2alpha1
kind: CiliumEgressNATPolicy
metadata:
  name: egress-sample
spec:
  egress:
  - podSelector:
      matchLabels:
        io.kubernetes.pod.namespace: default
  destinationCIDRs:
  - 192.168.20.70/32
  egressSourceIP: "192.168.20.220"
% kubectl apply -f egress.yaml
ciliumegressnatpolicy.cilium.io/egress-sample created
Now let's run that curl command again from our netshoot container and observe what happens in NGINX:
bash-5.1# curl 192.168.20.70
it works
192.168.20.220 - - [07/Dec/2021:12:47:53 +0000] "GET / HTTP/1.1" 200 9 "-" "curl/7.80.0" "-"
There we can see that the source IP of the request is now the VIP assigned to our Traefik LoadBalancer.
We can further verify the NAT that's been put in place by running the cilium bpf egress list command from within one of our Cilium Pods:
% kubectl -n kube-system get pods -l k8s-app=cilium
NAME READY STATUS RESTARTS AGE
cilium-28gdd 1/1 Running 0 17h
cilium-n5wv7 1/1 Running 0 17h
cilium-vfvkd 1/1 Running 0 16h
% kubectl exec -it cilium-vfvkd -n kube-system -- cilium bpf egress list
SRC IP & DST CIDR EGRESS INFO
10.0.0.80 192.168.20.70/32 192.168.20.220 192.168.20.220
10.0.2.11 192.168.20.70/32 192.168.20.220 192.168.20.220
Very helpful 💯