- Create a multinode cluster

```sh
kind create cluster --config config.yaml
```

```yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
- role: worker
```

- Run a fake loadbalancer controller
```sh
cd /home/aojea/go/src/github.com/aojea/networking-controllers/cmd/loadbalancer
go build .
kind get kubeconfig > kind.conf
./loadbalancer --kubeconfig kind.conf --iprange "10.111.111.0/24"
```

- Create a deployment with one pod and expose it with a LoadBalancer

```sh
kubectl apply -f https://gist.githubusercontent.com/aojea/369ad6a5d4cbb6b0fbcdd3dd909d9887/raw/0ecd0564c1db4c7103de07accce99da1c7bf91c3/loadbalancer.yaml
```

- Obtain the LoadBalancer assigned IP
```sh
kubectl get service lb-service -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
10.111.111.129
```

The loadbalancer has externalTrafficPolicy: Local; check that it is only reachable through the node where the pod is running.
```sh
kubectl get pods -l app=MyApp -o wide
NAME                               READY   STATUS    RESTARTS   AGE   IP           NODE           NOMINATED NODE   READINESS GATES
test-deployment-6f45696bd5-xlp96   1/1     Running   0          10m   10.244.2.2   kind-worker2   <none>           <none>
```

Get the nodes' IPs:
```sh
kubectl get nodes -o wide
NAME                 STATUS   ROLES                  AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE       KERNEL-VERSION            CONTAINER-RUNTIME
kind-control-plane   Ready    control-plane,master   34m   v1.21.2   172.18.0.4    <none>        Ubuntu 21.04   4.18.0-301.1.el8.x86_64   containerd://1.5.2
kind-worker          Ready    <none>                 33m   v1.21.2   172.18.0.3    <none>        Ubuntu 21.04   4.18.0-301.1.el8.x86_64   containerd://1.5.2
kind-worker2         Ready    <none>                 33m   v1.21.2   172.18.0.2    <none>        Ubuntu 21.04   4.18.0-301.1.el8.x86_64   containerd://1.5.2
```

- Emulate the external loadbalancer by installing a route on the host to the loadbalancer IP through the node where the pod is running
  example: the pod is on kind-worker2, whose node IP is 172.18.0.2, and the loadbalancer IP is 10.111.111.129

```sh
sudo ip route add 10.111.111.129 via 172.18.0.2
```

- check it preserves the source IP
echo -e "GET /clientip HTTP/1.1\nhost:myhost\n" | nc 10.111.111.129 80
HTTP/1.1 200 OK
Date: Thu, 15 Jul 2021 23:30:35 GMT
Content-Length: 16
Content-Type: text/plain; charset=utf-8
172.18.0.1:34028- and it doesn't work if we target a node without pods backing that service
```sh
sudo ip route del 10.111.111.129 via 172.18.0.2
sudo ip route add 10.111.111.129 via 172.18.0.3
echo -e "GET /clientip HTTP/1.1\nhost:myhost\n" | nc -v -w 3 10.111.111.129 80
Ncat: Version 7.70 ( https://nmap.org/ncat )
Ncat: Connection timed out.
```
- if we use externalTrafficPolicy: Cluster it works, but the source IP is not preserved
```sh
kubectl patch service lb-service -p '{"spec":{"externalTrafficPolicy":"Cluster"}}'
service/lb-service patched
echo -e "GET /clientip HTTP/1.1\nhost:myhost\n" | nc -v -w 3 10.111.111.129 80
Ncat: Version 7.70 ( https://nmap.org/ncat )
Ncat: Connected to 10.111.111.129:80.
HTTP/1.1 200 OK
Date: Thu, 15 Jul 2021 23:34:22 GMT
Content-Length: 16
Content-Type: text/plain; charset=utf-8

172.18.0.3:14569
Ncat: 36 bytes sent, 133 bytes received in 0.01 seconds.
```

- revert the service change and set externalTrafficPolicy back to Local
```sh
kubectl patch service lb-service -p '{"spec":{"externalTrafficPolicy":"Local"}}'
service/lb-service patched
echo -e "GET /clientip HTTP/1.1\nhost:myhost\n" | nc -v -w 3 10.111.111.129 80
Ncat: Version 7.70 ( https://nmap.org/ncat )
Ncat: Connection timed out.
```

- test the service from within a node without the pod
echo -e "GET /clientip HTTP/1.1\nhost:myhost\n" | nc -v -w 3 10.111.111.129 80
Connection to 10.111.111.129 80 port [tcp/http] succeeded!
HTTP/1.1 200 OK
Date: Thu, 15 Jul 2021 23:36:50 GMT
Content-Length: 16
Content-Type: text/plain; charset=utf-8We think it shoudn't work because we are in a node without pods however, since the traffic comes from a node within the cluster the traffic is considered internal and externalTrafficPolicy doesn't apply the reason is that kube-proxy installs some iptables rules that capture the traffic within the node.
This makes LoadBalancers hard to test, because it requires traffic to be sent from a node external to the cluster, and that is not easy to arrange: Kubernetes clusters tend to be isolated, usually exposing only the apiserver endpoint, and e2e tests should not assume direct connectivity from the e2e binary to the cluster nodes.
bypass iptables --> netkat

netkat works the same as netcat, but using raw sockets:
- netcat: https://en.wikipedia.org/wiki/Netcat
  - listen: opens a socket and binds it to listen
  - connect: opens a socket and connects
  - copies from the socket to stdout and from stdin to the socket (see the Go sketch after this list)
- raw sockets: https://man7.org/linux/man-pages/man7/raw.7.html, https://man7.org/linux/man-pages/man7/packet.7.html, https://upload.wikimedia.org/wikipedia/commons/3/37/Netfilter-packet-flow.svg
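To make the comparison concrete, here is the essence of netcat's connect mode as a minimal Go sketch: dial a TCP connection and copy stdin to the socket and the socket to stdout (the target address just reuses the LoadBalancer IP from the example above).

```go
// Minimal "netcat connect mode": open a TCP connection and splice
// stdin/stdout onto it.
package main

import (
	"io"
	"log"
	"net"
	"os"
)

func main() {
	conn, err := net.Dial("tcp", "10.111.111.129:80") // LoadBalancer IP from the example
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	go func() {
		io.Copy(conn, os.Stdin)          // stdin -> socket
		conn.(*net.TCPConn).CloseWrite() // signal EOF to the peer
	}()
	io.Copy(os.Stdout, conn) // socket -> stdout
}
```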
A packet socket receives a copy of all packets, but it allows us to attach a BPF filter. We don't need all packets, just the ones belonging to the netcat connection, and since we know some parts of the connection tuple we can filter on them. We use two hooks to add BPF filters:

- On the socket, we filter out the packets that we don't WANT.
- On ingress, we filter the packets that we WANT, so the host's kernel stack never sees them and doesn't close our connection.
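A sketch of the socket-side hook (illustrative, not netkat's actual code: it hardcodes TCP source port 80 as the known part of the tuple and assumes plain ethernet framing with no IP options): open the raw AF_PACKET socket and attach a classic BPF program with SO_ATTACH_FILTER, so the kernel only delivers the packets we want.

```go
// Open an AF_PACKET socket (receives a copy of every packet) and attach a
// classic BPF filter that keeps only IPv4/TCP packets with source port 80.
package main

import (
	"log"

	"golang.org/x/net/bpf"
	"golang.org/x/sys/unix"
)

// htons converts to network byte order for the socket() protocol argument.
func htons(i uint16) uint16 { return i<<8 | i>>8 }

func main() {
	fd, err := unix.Socket(unix.AF_PACKET, unix.SOCK_RAW, int(htons(unix.ETH_P_ALL)))
	if err != nil {
		log.Fatal(err)
	}
	defer unix.Close(fd)

	// Keep only IPv4/TCP packets with source port 80; drop everything else.
	prog, err := bpf.Assemble([]bpf.Instruction{
		bpf.LoadAbsolute{Off: 12, Size: 2},                           // ethertype
		bpf.JumpIf{Cond: bpf.JumpNotEqual, Val: 0x0800, SkipTrue: 5}, // not IPv4 -> drop
		bpf.LoadAbsolute{Off: 23, Size: 1},                           // IPv4 protocol field
		bpf.JumpIf{Cond: bpf.JumpNotEqual, Val: 6, SkipTrue: 3},      // not TCP -> drop
		bpf.LoadAbsolute{Off: 34, Size: 2},                           // TCP source port (assumes no IP options)
		bpf.JumpIf{Cond: bpf.JumpEqual, Val: 80, SkipFalse: 1},       // wrong port -> drop
		bpf.RetConstant{Val: 0xffff},                                 // accept packet
		bpf.RetConstant{Val: 0},                                      // drop packet
	})
	if err != nil {
		log.Fatal(err)
	}
	filter := make([]unix.SockFilter, 0, len(prog))
	for _, ins := range prog {
		filter = append(filter, unix.SockFilter{Code: ins.Op, Jt: ins.Jt, Jf: ins.Jf, K: ins.K})
	}
	fprog := unix.SockFprog{Len: uint16(len(filter)), Filter: &filter[0]}
	if err := unix.SetsockoptSockFprog(fd, unix.SOL_SOCKET, unix.SO_ATTACH_FILTER, &fprog); err != nil {
		log.Fatal(err)
	}

	// Everything read from the socket now belongs to our connection.
	buf := make([]byte, 65536)
	for {
		n, _, err := unix.Recvfrom(fd, buf, 0)
		if err != nil {
			log.Fatal(err)
		}
		log.Printf("frame of %d bytes matched the filter", n)
	}
}
```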
Once we receive the packets, we have raw frames (with ethernet headers). We are interested only in the stream of data carried over the connection, but since we are bypassing the kernel TCP/IP stack, we need something else to process the raw packets.
Solution: use a userspace TCP/IP stack, https://pkg.go.dev/gvisor.dev/gvisor/pkg/tcpip/stack
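A sketch of the idea (hedged: the gvisor tcpip API changes frequently, so the names and signatures below follow recent releases and may differ from older ones): build a stack with only IPv4 and TCP registered, wire it to the raw socket fd via the fdbased link endpoint, and dial through it with the gonet adapter, which hands back a standard net.Conn.

```go
// Sketch: a userspace TCP/IP stack (gvisor's pkg/tcpip) whose "wire" is the
// raw AF_PACKET socket. NOTE: the gvisor API moves fast; adjust for your version.
package main

import (
	"context"
	"fmt"
	"net"

	"gvisor.dev/gvisor/pkg/tcpip"
	"gvisor.dev/gvisor/pkg/tcpip/adapters/gonet"
	"gvisor.dev/gvisor/pkg/tcpip/header"
	"gvisor.dev/gvisor/pkg/tcpip/link/fdbased"
	"gvisor.dev/gvisor/pkg/tcpip/network/ipv4"
	"gvisor.dev/gvisor/pkg/tcpip/stack"
	"gvisor.dev/gvisor/pkg/tcpip/transport/tcp"
)

// dialUserspace dials dst:port with TCP handled entirely in userspace,
// reading and writing ethernet frames on the raw socket fd.
func dialUserspace(fd int, src, dst net.IP, port uint16) (net.Conn, error) {
	// Link endpoint backed by the AF_PACKET fd (frames include ethernet headers).
	ep, err := fdbased.New(&fdbased.Options{FDs: []int{fd}, MTU: 1500, EthernetHeader: true})
	if err != nil {
		return nil, err
	}

	// Stack with only IPv4 and TCP registered.
	s := stack.New(stack.Options{
		NetworkProtocols:   []stack.NetworkProtocolFactory{ipv4.NewProtocol},
		TransportProtocols: []stack.TransportProtocolFactory{tcp.NewProtocol},
	})
	if err := s.CreateNIC(1, ep); err != nil {
		return nil, fmt.Errorf("CreateNIC: %s", err)
	}

	// Use the host's source IP so the peer's replies match our connection.
	srcAddr := tcpip.AddrFromSlice(src.To4()) // older gvisor: tcpip.Address(src.To4())
	protoAddr := tcpip.ProtocolAddress{
		Protocol:          ipv4.ProtocolNumber,
		AddressWithPrefix: srcAddr.WithPrefix(),
	}
	if err := s.AddProtocolAddress(1, protoAddr, stack.AddressProperties{}); err != nil {
		return nil, fmt.Errorf("AddProtocolAddress: %s", err)
	}
	// Send everything out of our single NIC.
	s.SetRouteTable([]tcpip.Route{{Destination: header.IPv4EmptySubnet, NIC: 1}})

	// gonet adapts the userspace stack to the standard net.Conn interface.
	full := tcpip.FullAddress{NIC: 1, Addr: tcpip.AddrFromSlice(dst.To4()), Port: port}
	return gonet.DialContextTCP(context.Background(), s, full, ipv4.ProtocolNumber)
}
```

Once it returns, the rest of the program can treat the connection exactly like the one net.Dial returns in the first sketch.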
At this point we have all the pieces, we just need to glue them together. Solution: golang.

- obtain the connection details: source and destination IP and port, and the interface
- bpf2go generates the eBPF code (the eBPF code is parametrizable from golang)
- inject the BPF code into the raw socket and on the ingress interface (see the sketch after this list)
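And a sketch of the glue for the ingress side (the object file and program names here are hypothetical; netkat's real loader is the bpf2go-generated code): load a compiled eBPF program with cilium/ebpf and attach it on tc ingress with vishvananda/netlink, which is what produces the `filter {LinkIndex: ...}` log lines shown below.

```go
// Sketch: attach an eBPF program on tc ingress so the kernel stack never sees
// (and never RSTs) our connection's packets. Object/program names here are
// hypothetical; bpf2go generates a typed loader for the real thing.
package main

import (
	"log"

	"github.com/cilium/ebpf"
	"github.com/vishvananda/netlink"
	"golang.org/x/sys/unix"
)

func main() {
	// Load the compiled eBPF ELF (built from C, e.g. via bpf2go or clang).
	coll, err := ebpf.LoadCollection("ingress_filter.o") // hypothetical file
	if err != nil {
		log.Fatal(err)
	}
	prog, ok := coll.Programs["ingress_drop"] // hypothetical program name
	if !ok {
		log.Fatal("program not found in collection")
	}

	link, err := netlink.LinkByName("eth0")
	if err != nil {
		log.Fatal(err)
	}

	// The clsact qdisc provides the ingress hook point.
	qdisc := &netlink.GenericQdisc{
		QdiscAttrs: netlink.QdiscAttrs{
			LinkIndex: link.Attrs().Index,
			Handle:    netlink.MakeHandle(0xffff, 0),
			Parent:    netlink.HANDLE_CLSACT,
		},
		QdiscType: "clsact",
	}
	if err := netlink.QdiscAdd(qdisc); err != nil {
		log.Printf("QdiscAdd: %v (may already exist)", err)
	}

	// Attach the program as a direct-action BPF filter on ingress.
	filter := &netlink.BpfFilter{
		FilterAttrs: netlink.FilterAttrs{
			LinkIndex: link.Attrs().Index,
			Parent:    netlink.HANDLE_MIN_INGRESS,
			Priority:  1,
			Protocol:  unix.ETH_P_ALL,
		},
		Fd:           prog.FD(),
		Name:         "netkat-ingress",
		DirectAction: true,
	}
	if err := netlink.FilterAdd(filter); err != nil {
		log.Fatal(err)
	}
	log.Printf("filter %v", filter)
}
```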
As a result, we have a new version of netcat that bypasses iptables (it needs CAP_NET_RAW). It has a bonus: with the -d flag it also acts as a sniffer XD
```sh
sudo ./netkat -d -l 127.0.0.1 80
2021/07/16 01:58:08 Creating raw socket
2021/07/16 01:58:08 Adding ebpf ingress filter on interface lo
2021/07/16 01:58:08 filter {LinkIndex: 1, Handle: 0:1, Parent: ffff:fff2, Priority: 0, Protocol: 3}
2021/07/16 01:58:08 Creating user TCP/IP stack
2021/07/16 01:58:08 Listening on 127.0.0.1:80
I0716 01:58:31.206864   22926 sniffer.go:418] recv tcp 127.0.0.1:44644 -> 127.0.0.1:80 len:0 id:a5ee flags: S seqnum: 1321663292 ack: 0 win: 65495 xsum:0xfe30 options: {MSS:65495 WS:7 TS:true TSVal:2715804214 TSEcr:0 SACKPermitted:true}
I0716 01:58:31.206937   22926 sniffer.go:418] recv tcp 127.0.0.1:44644 -> 127.0.0.1:80 len:0 id:a5ee flags: S seqnum: 1321663292 ack: 0 win: 65495 xsum:0xfe30 options: {MSS:65495 WS:7 TS:true TSVal:2715804214 TSEcr:0 SACKPermitted:true}
I0716 01:58:32.239938   22926 sniffer.go:418] recv tcp 127.0.0.1:44644 -> 127.0.0.1:80 len:0 id:a5ef flags: S seqnum: 1321663292 ack: 0 win: 65495 xsum:0xfe30 options: {MSS:65495 WS:7 TS:true TSVal:2715805247 TSEcr:0 SACKPermitted:true}
I0716 01:58:32.240044   22926 sniffer.go:418] recv tcp 127.0.0.1:44644 -> 127.0.0.1:80 len:0 id:a5ef flags: S seqnum: 1321663292 ack: 0 win: 65495 xsum:0xfe30 options: {MSS:65495 WS:7 TS:true TSVal:2715805247 TSEcr:0 SACKPermitted:true}
I0716 01:58:34.287962   22926 sniffer.go:418] recv tcp 127.0.0.1:44644 -> 127.0.0.1:80 len:0 id:a5f0 flags: S seqnum: 1321663292 ack: 0 win: 65495 xsum:0xfe30 options: {MSS:65495 WS:7 TS:true TSVal:2715807295 TSEcr:0 SACKPermitted:true}
I0716 01:58:34.288096   22926 sniffer.go:418] recv tcp 127.0.0.1:44644 -> 127.0.0.1:80 len:0 id:a5f0 flags: S seqnum: 1321663292 ack: 0 win: 65495 xsum:0xfe30 options: {MSS:65495 WS:7 TS:true TSVal:2715807295 TSEcr:0 SACKPermitted:true}
^C2021/07/16 01:58:35 Exiting: received signal
2021/07/16 01:58:35 Done
```
- Run netkat in a pod (using hostNetwork); we are going to simulate an external client connecting from inside a node
```sh
kubectl apply -f https://gist.githubusercontent.com/aojea/952c82a58da625fbd9b8aca35f0e63f1/raw/d636a71ffe690633be448999d635199ef8724300/netkat.yaml
pod/netkat created
```

- Check where it is running
```sh
kubectl get pods -o wide
NAME                               READY   STATUS    RESTARTS   AGE   IP           NODE           NOMINATED NODE   READINESS GATES
netkat                             1/1     Running   0          9s    172.18.0.3   kind-worker    <none>           <none>
test-deployment-6f45696bd5-xlp96   1/1     Running   0          63m   10.244.2.2   kind-worker2   <none>           <none>
```

- log in to the container
```sh
kubectl exec -it netkat ash
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl kubectl exec [POD] -- [COMMAND] instead.
/ # ip route add 10.111.111.15 via 172.18.0.2
/ # echo -e "GET /clientip HTTP/1.1\nhost:myhost\n" | ./netkat 10.111.111.15 80
2021/07/16 01:06:54 routes {Ifindex: 18 Dst: 10.111.111.15/32 Src: 172.18.0.3 Gw: 172.18.0.2 Flags: [] Table: 254}
2021/07/16 01:06:54 Creating raw socket
2021/07/16 01:06:54 Adding ebpf ingress filter on interface eth0
2021/07/16 01:06:54 filter {LinkIndex: 18, Handle: 0:1, Parent: ffff:fff2, Priority: 0, Protocol: 3}
2021/07/16 01:06:54 Creating user TCP/IP stack
2021/07/16 01:06:54 Dialing ...
2021/07/16 01:06:54 Connection established
2021/07/16 01:06:54 Connection error: <nil>
HTTP/1.1 200 OK
Date: Fri, 16 Jul 2021 01:06:54 GMT
Content-Length: 16
Content-Type: text/plain; charset=utf-8

172.18.0.3:26913
```

We can see it preserves the source IP.
- check that if we target a different node it doesn't work, as expected, because the service has externalTrafficPolicy: Local and it now considers the connection external
```sh
/ # ip route del 10.111.111.15 via 172.18.0.3
/ # ip route add 10.111.111.15 via 172.18.0.4
/ # echo -e "GET /clientip HTTP/1.1\nhost:myhost\n" | ./netkat 10.111.111.15 80
2021/07/16 01:11:38 routes {Ifindex: 18 Dst: 10.111.111.15/32 Src: 172.18.0.2 Gw: 172.18.0.4 Flags: [] Table: 254}
2021/07/16 01:11:38 Creating raw socket
2021/07/16 01:11:38 Adding ebpf ingress filter on interface eth0
2021/07/16 01:11:38 filter {LinkIndex: 18, Handle: 0:1, Parent: ffff:fff2, Priority: 0, Protocol: 3}
2021/07/16 01:11:38 Creating user TCP/IP stack
2021/07/16 01:11:38 Dialing ...
2021/07/16 01:11:43 Dialing error: context deadline exceeded
2021/07/16 01:11:43 Can't connect to server: context deadline exceeded
```