@squat
Last active April 23, 2021 04:27
Debugging VXLAN issues on Azure

Azure VXLAN Issues

Introduction

Between June 24-25, the nodes in Tectonic clusters running on Azure automatically updated their OS from Container Linux 1353.8.0 to 1409.2.0. After this upgrade, requests to the nodes began to experience increased latency and a higher failure rate. Interestingly, we found that the size of an HTTP request played a role in determining whether requests to services running on the Kubernetes cluster succeeded. Setting the MTU of the client's network interface to 1370 made all requests succeed; raising the MTU to 1371 caused large HTTP requests to fail again. Additionally, enabling TCP MTU probing on the clients ensured that all requests succeeded, albeit with increased latency.

To identify the minimum set of circumstances needed to reproduce the issue, I ran several tests involving different network topologies. I found that the following elements were required to trigger the failure:

  • VXLAN;
  • path MTU discovery; and
  • the Linux kernel version present in Container Linux 1409.2.0 or later.

The rest of this document details how to configure a minimal environment to reproduce the issue and test different kernels.
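
The two client-side mitigations described above can be sketched as follows. This is a minimal sketch, not taken from the original investigation: the interface name `eth0`, the `APPLY` switch, and the `run` helper are assumptions for illustration. The value `1` for `net.ipv4.tcp_mtu_probing` is one of the two values that enable probing; the original does not say which one was used.

```shell
# client: the two mitigations from the introduction.
# IFACE, APPLY, and run are illustrative assumptions, not part of the original setup.
IFACE="${IFACE:-eth0}"

# Print each command by default; execute it only when APPLY=1 (requires root).
run() {
  if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

# Mitigation 1: cap the client interface MTU at the largest value that worked.
run ip link set dev "$IFACE" mtu 1370

# Mitigation 2: enable TCP MTU probing.
# 0 = disabled, 1 = enabled once an ICMP black hole is detected, 2 = always enabled.
run sysctl -w net.ipv4.tcp_mtu_probing=1
```

Printing the commands first makes it easy to review them before changing a live interface; rerun with `APPLY=1` as root to apply them.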

Topology

Reproducing the VXLAN issues requires two Container Linux virtual machines (VMs) running in the Azure cloud and connected by a VXLAN. Note: for simplicity, the VMs can be placed in the same Azure virtual network and subnet, but this is not strictly necessary. The setup is as follows:

  • a Linux client, e.g. a laptop, makes an HTTP request to node 1
  • node 1 NATs the request:
    • node 1 NATs the packet's destination to the IP of the container running on node 2
    • node 1 NATs the packet's source to the IP of its VXLAN interface
  • node 1 forwards the request over the VXLAN to node 2
  • node 2 either receives the packets on its VXLAN interface and routes them to the container, or does not receive them at all

Visually:

Node 1 <---NAT---> <---VXLAN---> Node 2 <---> container
  ^
  |
  |
  v
Client

Getting Started

Launch two Container Linux VMs on Azure. Ensure both VMs are reachable over TCP ports 22 and 80 and over UDP port 8472, the default VXLAN port on Linux. Additionally, both VMs should be configured with public IP addresses. This exercise assumes the nodes share a subnet; however, as long as UDP port 8472 is reachable via the public IPs, this should not matter. Visually, the cluster should look like:

------------------------10.0.1.0/24-----------------------
|       Node 1                             Node 2        |
|10.0.1.4/24 (eth0)                 10.0.1.5/24 (eth0)   |
----------------------------------------------------------

VXLAN Configuration

SSH onto node 1 and create a unicast VXLAN interface:

# node 1: dstport 0 selects the Linux default VXLAN port (UDP 8472)
ip link add vxlan0 type vxlan id 10 dev eth0 dstport 0

Configure the VXLAN connection to node 2 by adding a default (all-zeros MAC) forwarding entry that points at node 2's address:

bridge fdb add 00:00:00:00:00:00 dev vxlan0 dst 10.0.1.5

Bring up the VXLAN and give it an IP address:

ip addr add 10.0.0.1/24 dev vxlan0
ip link set up dev vxlan0

Repeat the same process for node 2:

# node 2
ip link add vxlan0 type vxlan id 10 dev eth0 dstport 0
bridge fdb add 00:00:00:00:00:00 dev vxlan0 dst 10.0.1.4
ip addr add 10.0.0.2/24 dev vxlan0
ip link set up dev vxlan0

Verify that the two VXLAN interfaces can communicate with simple pings. From node 1:

ping 10.0.0.2

From node 2:

ping 10.0.0.1

Visually, the cluster should now look like:

------------------------10.0.1.0/24-----------------------
|       Node 1                             Node 2        |
|10.0.1.4/24 (eth0)                 10.0.1.5/24 (eth0)   |
|10.0.0.1/24 (vxlan0) <---VXLAN---> 10.0.0.2/24 (vxlan0) |
----------------------------------------------------------
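
Because the failures depend on packet size, it can help to record the MTU of both interfaces before going further. The sketch below assumes an IPv4 underlay, where VXLAN encapsulation adds 50 bytes per packet (20 outer IPv4 + 8 UDP + 8 VXLAN + 14 inner Ethernet). The helper name `vxlan_mtu` and the 1500-byte example value are assumptions, so compare against whatever `eth0` actually reports.

```shell
# either node
# Largest inner MTU that fits in an underlay MTU of $1 bytes (IPv4 underlay assumed).
vxlan_mtu() { echo $(( $1 - 50 )); }

# Report the MTU of each interface, if it exists on this host.
for dev in eth0 vxlan0; do
  if [ -r "/sys/class/net/$dev/mtu" ]; then
    mtu=$(cat "/sys/class/net/$dev/mtu")
    echo "$dev mtu: $mtu"
  else
    echo "$dev: not present on this host"
  fi
done

# 1500 is an assumed example value, not a measurement from the original setup.
echo "largest inner MTU that fits a 1500-byte underlay: $(vxlan_mtu 1500)"
```

Running `ip -d link show vxlan0` additionally prints the VXLAN ID and destination port, which is a quick way to confirm that both nodes agree on them.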

Container Configuration

For simplicity, run a plain nginx container on node 2 and find its IP address:

# node 2
CID=$(docker run -d nginx)
CIP=$(docker inspect --format '{{ .NetworkSettings.IPAddress }}' "$CID")
echo "$CIP"

Verify that the container is reachable via its IP address from node 2:

# node 2
curl $CIP

On node 1, configure a route to the nginx container:

# node 1
export CIP=<nginx-container-ip>
ip route add $CIP via 10.0.0.2

Verify that the container is reachable via its IP address from node 1:

# node 1
curl $CIP

IPTables Configuration

In order for the container to be reachable by clients making requests to node 1, that node must forward requests to the nginx container running on node 2. On node 1, configure the following NAT and FORWARD rules:

# node 1
iptables -t nat -A PREROUTING -p tcp -i eth0 --dport 80 -j DNAT --to-destination $CIP:80
iptables -t nat -A POSTROUTING -o vxlan0 -j MASQUERADE
iptables -A FORWARD -s $CIP -j ACCEPT
iptables -A FORWARD -d $CIP -j ACCEPT
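
These rules only take effect if node 1 is allowed to forward packets between interfaces, which on Linux is controlled by `net.ipv4.ip_forward`. The original steps do not mention this setting, so it may already be enabled on the image. The check below is a hedged sketch; the helper name `forwarding_enabled` and its optional path argument are assumptions added for illustration.

```shell
# node 1
# Succeeds when the given sysctl file (default: the live IPv4 forwarding switch) contains 1.
forwarding_enabled() {
  [ "$(cat "${1:-/proc/sys/net/ipv4/ip_forward}" 2>/dev/null)" = "1" ]
}

if forwarding_enabled; then
  echo "IPv4 forwarding is enabled"
else
  echo "IPv4 forwarding is disabled; enable it with: sysctl -w net.ipv4.ip_forward=1"
fi
```

To confirm that the rules themselves were installed, `iptables -t nat -S` and `iptables -S FORWARD` list them in a form close to the commands used to add them.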

Test VXLAN

Finally, now that the infrastructure is suitably configured, we can test the connection from a client that has TCP MTU probing disabled. HTTP requests shorter than 1378 bytes should succeed. The following should work as expected:

# client
curl <node-1-public-ip>
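
To see which side of the tunnel loses the packets, node 2 can be watched while the client sends requests. The following is a sketch under assumptions not stated in the original: `tcpdump` is installed on node 2, the commands run as root, and the `capture` helper with its 20-packet and 30-second limits is made up for illustration.

```shell
# node 2
# capture <interface> <filter...>: stop after 20 packets or 30 seconds, whichever comes first.
capture() {
  dev=$1; shift
  timeout 30 tcpdump -n -c 20 -i "$dev" "$@"
}

if command -v tcpdump >/dev/null 2>&1 && [ "$(id -u)" -eq 0 ]; then
  capture eth0 udp port 8472 &    # outer VXLAN datagrams arriving from node 1
  capture vxlan0 tcp port 80 &    # inner HTTP segments after decapsulation
  wait
else
  echo "tcpdump and root privileges are required; nothing captured"
fi
```

If datagrams show up on `eth0` but the matching segments never appear on `vxlan0`, the loss is happening on node 2 during decapsulation; if nothing arrives on `eth0` either, the packets are lost before they reach node 2.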

HTTP requests longer than 1378 bytes should fail on kernels affected by the regression. The following request should fail:

# client
curl <node-1-public-ip> -H "Authorization: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
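
Typing a header of the right length by hand is error-prone, so the request size can instead be generated. The sketch below builds an `Authorization` value of a chosen length and, when a target is supplied, tries a few sizes on either side of the 1378-byte boundary. The `NODE1` variable, the `pad` helper, the chosen sizes, and the 5-second timeout are all assumptions for illustration. Note that the sizes are header-value lengths, so the full HTTP request is somewhat larger.

```shell
# client
# Emit a string of $1 'a' characters.
pad() { head -c "$1" /dev/zero | tr '\0' 'a'; }

# NODE1 is a hypothetical variable holding node 1's public IP; skip the requests if it is unset.
if [ -n "${NODE1:-}" ]; then
  for size in 1000 1200 1300 1400 1600; do
    # --max-time keeps curl from hanging when the request is silently dropped.
    if curl -s -o /dev/null --max-time 5 "http://$NODE1" -H "Authorization: $(pad "$size")"; then
      echo "header of $size bytes: ok"
    else
      echo "header of $size bytes: failed"
    fi
  done
else
  echo "set NODE1 to node 1's public IP to run the sweep"
fi
```

Narrowing the list of sizes around the first failing value gives the exact threshold for a particular client and kernel combination.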

CoaxVex commented Dec 18, 2019

I believe I just ran into this while troubleshooting networking issues with OpenShift 4 on Azure. Running "ethtool --offload eth0 tx off" on the node fixes it.
