Between June 24 and 25, the nodes in Tectonic clusters running on Azure automatically updated their OS from Container Linux 1353.8.0 to 1409.2.0. After this upgrade, requests to the nodes began to experience increased latency and a higher failure rate. Interestingly, we found that the size of an HTTP request played a role in determining whether requests to services running on the Kubernetes cluster succeeded. Setting the MTU of the client's network interface to 1370 made all requests succeed; incrementing the MTU to 1371 caused large HTTP requests to fail again. Additionally, enabling TCP MTU probing on the clients ensured that all requests succeeded, albeit with increased latency.
To identify the minimum set of circumstances needed to reproduce the issue, I ran several tests involving different network topologies. I found that the following elements were required to trigger the failure:
- VXLAN;
- path MTU discovery; and
- the Linux kernel version present in Container Linux 1409.2.0 or later.
The rest of this document details how to configure a minimal environment to reproduce the issue and test different kernels.
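For reference, the two client-side workarounds mentioned above can be sketched roughly as follows. This is only an illustration: the interface name (`eth0` by default), the probing mode `1`, and the `DRY_RUN` helper are my assumptions, not something taken from the affected clusters.

```shell
#!/bin/sh
# Sketch of the two client-side workarounds. Run as root on the client.
# Assumptions: interface name (default eth0) and probing mode 1.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Workaround 1: cap the client interface MTU at 1370.
clamp_mtu() {
  run ip link set dev "${1:-eth0}" mtu 1370
}

# Workaround 2: enable TCP MTU probing.
# 1 = probe once an ICMP black hole is detected, 2 = always probe.
enable_mtu_probing() {
  run sysctl -w net.ipv4.tcp_mtu_probing=1
}
```

Calling `DRY_RUN=1 clamp_mtu wlan0` only prints the command, which is handy for checking what would be changed before touching a real interface.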
Reproducing the VXLAN issues requires two Container Linux virtual machines (VMs) running in the Azure cloud and connected with a VXLAN. Note: for simplicity, the VMs can be placed in the same Azure virtual network and subnet; however, this is not strictly necessary. The setup is as follows:
- a Linux client, e.g. a laptop, makes an HTTP request to node 1
- node 1 NATs the request:
  - node 1 NATs the packet's destination to the IP of the container running on node 2
  - node 1 NATs the packet's source to the IP of its VXLAN interface
- node 1 forwards the request over the VXLAN to node 2
- node 2 receives the packets on its VXLAN interface (or, when the issue is triggered, does not) and routes them to the container
Visually:
Node 1 <---NAT---> <---VXLAN---> Node 2 <---> container
  ^
  |
  |
  v
Client
Launch two Container Linux VMs on Azure. Ensure both VMs are reachable over TCP ports 22 and 80 and over UDP port 8472, the default VXLAN port on Linux. Additionally, both VMs should be configured with public IP addresses. This exercise assumes the nodes share a subnet; however, as long as UDP port 8472 is reachable via the public IPs, this should not matter. Visually, the cluster should look like:
------------------------10.0.1.0/24-----------------------
|      Node 1                             Node 2         |
|10.0.1.4/24 (eth0)                 10.0.1.5/24 (eth0)   |
----------------------------------------------------------
SSH onto node 1 and create a unicast VXLAN interface:
ip link add vxlan0 type vxlan id 10 dev eth0 dstport 0
Configure the VXLAN connection to node 2:
bridge fdb add 00:00:00:00:00:00 dev vxlan0 dst 10.0.1.5
Bring up the VXLAN and give it an IP address:
ip addr add 10.0.0.1/24 dev vxlan0
ip link set up dev vxlan0
Repeat the same process on node 2, this time pointing the forwarding entry at node 1:
ip link add vxlan0 type vxlan id 10 dev eth0 dstport 0
bridge fdb add 00:00:00:00:00:00 dev vxlan0 dst 10.0.1.4
ip addr add 10.0.0.2/24 dev vxlan0
ip link set up dev vxlan0
Verify that the two VXLAN interfaces can communicate with simple pings. From node 1:
ping 10.0.0.2
From node 2:
ping 10.0.0.1
Visually, the cluster should now look like:
------------------------10.0.1.0/24-----------------------
|      Node 1                             Node 2         |
|10.0.1.4/24 (eth0)                 10.0.1.5/24 (eth0)   |
|10.0.0.1/24 (vxlan0) <---VXLAN---> 10.0.0.2/24 (vxlan0) |
----------------------------------------------------------
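Since the failure depends on packet size, it can be useful to also probe the overlay with a full-size ping that has the "don't fragment" bit set. The sketch below assumes an underlay MTU of 1500 on eth0 (not stated above, so check it with `ip link show eth0`) and an iputils `ping` that supports `-M do`. VXLAN over IPv4 adds 50 bytes of encapsulation, so a 1500-byte underlay gives a 1450-byte overlay MTU and a largest unfragmented ICMP payload of 1422 bytes.

```shell
#!/bin/sh
# Expected MTU of a VXLAN device over IPv4: the underlay MTU minus 50 bytes
# (20 outer IPv4 + 8 UDP + 8 VXLAN + 14 inner Ethernet).
vxlan_mtu() {
  echo $(( $1 - 50 ))
}

# Largest ICMP echo payload that fits in one IPv4 packet of the given MTU
# (20 bytes IPv4 header + 8 bytes ICMP header).
icmp_payload() {
  echo $(( $1 - 28 ))
}

# Ping PEER over the overlay with "don't fragment" set, using the largest
# payload that should fit for the given underlay MTU (assumed 1500).
probe_overlay() {
  peer=$1
  underlay_mtu=${2:-1500}
  size=$(icmp_payload "$(vxlan_mtu "$underlay_mtu")")
  ping -M do -c 3 -s "$size" "$peer"
}

# On node 1:  probe_overlay 10.0.0.2
# On node 2:  probe_overlay 10.0.0.1
```

If the small pings above succeed but this probe does not, that suggests the problem is tied to packet size rather than to basic VXLAN connectivity.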
For simplicity, run an nginx container on node 2 and record its IP address:
# node 2
CID=$(docker run -d nginx)
CIP=$(docker inspect --format '{{ .NetworkSettings.IPAddress }}' $CID)
Verify that the container is reachable via its IP address from node 2:
# node 2
curl $CIP
On node 1, configure a route to the nginx container:
# node 1
export CIP=<nginx-container-ip>
ip route add $CIP via 10.0.0.2
Verify that the container is reachable via its IP address from node 1:
# node 1
curl $CIP
In order for the container to be reachable by clients making requests to node 1, that node must forward requests to the nginx container running on node 2. On node 1, configure the following NAT and FORWARD rules:
# node 1
iptables -t nat -A PREROUTING -p tcp -i eth0 --dport 80 -j DNAT --to-destination $CIP:80
iptables -t nat -A POSTROUTING -o vxlan0 -j MASQUERADE
iptables -A FORWARD -s $CIP -j ACCEPT
iptables -A FORWARD -d $CIP -j ACCEPT
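One prerequisite these rules rely on but that is not spelled out above: node 1 only forwards packets between eth0 and vxlan0 if IPv4 forwarding is enabled in the kernel. Whether it is already enabled on a given node is something I am assuming needs checking, so here is a small hypothetical helper; the function name and the optional path argument are illustrative only.

```shell
#!/bin/sh
# Succeeds if IPv4 forwarding is enabled. The optional argument is the sysctl
# file to read, which defaults to the real kernel setting.
forwarding_enabled() {
  [ "$(cat "${1:-/proc/sys/net/ipv4/ip_forward}")" = "1" ]
}

# node 1, as root:
#   forwarding_enabled || sysctl -w net.ipv4.ip_forward=1
```

If even small requests from the client hang at this point, this setting is worth ruling out before suspecting the VXLAN itself.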
Finally, now that the infrastructure is suitably configured, we can test the connection from a client that has TCP MTU probing disabled. HTTP requests shorter than 1378 bytes should succeed. The following should work as expected:
# client
curl <node-1-public-ip>
HTTP requests longer than 1378 bytes should fail on kernels affected by the regression. The following request should fail:
# client
curl <node-1-public-ip> -H "Authorization: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
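Rather than pasting a very long header by hand, the oversized request can be generated. This is a rough sketch: the helper names, the filler length of 1400 bytes, and the 10-second timeout are my own choices, so adjust the length to land on either side of the threshold.

```shell
#!/bin/sh
# Print N copies of the letter "a".
pad() {
  head -c "$1" /dev/zero | tr '\0' 'a'
}

# Send a GET to HOST with an Authorization header of N filler bytes and print
# the HTTP status code. Give up after 10 seconds so that a stalled request
# shows up as a curl timeout error instead of hanging.
big_request() {
  host=$1
  n=${2:-1400}
  curl --max-time 10 -sS -o /dev/null -w '%{http_code}\n' \
    -H "Authorization: $(pad "$n")" "$host"
}

# client:
#   big_request <node-1-public-ip> 100
#   big_request <node-1-public-ip> 1400
```

Sweeping the second argument makes it straightforward to find the exact request size at which failures begin on a given kernel.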
@simongottschlag, the underlying issue here ended up being on Azure's side. It turns out that Azure was rolling out a new hypervisor version that was not correctly performing checksum offloading. The solution for us was to disable tx checksum offloading: https://github.com/coreos/tectonic-installer/blob/0ec6b27c6d4ba56f03eef6425f52292aec20cb1c/modules/ignition/resources/services/tx-off.service
This is an undesirable workaround, since checksumming in the guest OS will be slower than when it is done by the host's hardware, but it was the only way around the problem.
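For anyone who wants to try this by hand before adding a unit, disabling tx checksum offloading generally comes down to a single `ethtool` call. The sketch below is not copied from the linked service; the interface name `eth0` and the helper function are assumptions on my part.

```shell
#!/bin/sh
# Read `ethtool -k <dev>` output on stdin and print the tx checksumming state.
tx_csum_state() {
  awk '$1 == "tx-checksumming:" { print $2 }'
}

# On each node, as root (eth0 is an assumption):
#   ethtool -K eth0 tx off
#   ethtool -k eth0 | tx_csum_state
```

Note that a setting applied this way does not survive a reboot, which is presumably why we ship it as a service.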