This directory contains server logs and client logs from two runs: the first at 03:04 UTC and the second at 03:08 UTC. Additionally, there are pcap files from the server covering this period.
The first test, at 03:04, was performed with Endpoint Independent Mapping (EIM) enabled. The second, at 03:08 UTC, was with EIM disabled and Dynamic Port Allocation (DPA) enabled. The resets were only observed in the second test.
The test scenario consists of one server and multiple clients. The clients run as pods in a GKE cluster, and the logs were collected via stern. The server is running on an AWS VM, and sits behind a multi-AZ LB. Cross-AZ traffic is enabled, so we have 3 IPs fronting the LB, with one DNS name, and all three route to the same instance.
On startup, each client sends an introductory message to the server, which looks like this:
name:tcp-test-client-deployment-78cc89b4c7-wb249, source:10.64.0.89:49619 dest:54.72.175.147:12345
The name is the pod name, useful for tracking the source. The source is the Pod's IP and source port (the port can be set using an environment variable, but was left at the default of 0, i.e. "pick one at random", for this test). The destination is the IP that the client is connecting to, as determined by selecting one A record from the DNS listing (this is all done by the regular NSS system; nothing special in the test). After the first message, the client sends a message every second with an increasing counter:
ctr:0 from tcp-test-client-deployment-78cc89b4c7-pfwdr
The client does not read from the TCP connection, but if it detects a reset, it exits, and k8s will restart the pod.
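For reference, here is a minimal sketch of that client behaviour. This is not the actual test code: the SERVER_HOST, SERVER_PORT and LOCAL_PORT environment variable names are assumptions for illustration, and the default server name is just the LB DNS name from the client logs.

import os
import socket
import sys
import time

pod_name = os.environ.get("HOSTNAME", "unknown-pod")   # pod name, set by Kubernetes
server_host = os.environ.get("SERVER_HOST",            # assumed env var; default is the LB name
                             "tcp-test-4f4a4c45f030567f.elb.eu-west-1.amazonaws.com")
server_port = int(os.environ.get("SERVER_PORT", "12345"))   # assumed env var
local_port = int(os.environ.get("LOCAL_PORT", "0"))         # 0 = "pick one at random"

sock = socket.create_connection((server_host, server_port),
                                source_address=("", local_port))
src_ip, src_port = sock.getsockname()
dst_ip, dst_port = sock.getpeername()

# Introductory message with the addresses as the client sees them.
sock.sendall(f"name:{pod_name}, source:{src_ip}:{src_port} dest:{dst_ip}:{dst_port}".encode())

ctr = 0
while True:
    time.sleep(1)
    try:
        # One counter message per second. The client never reads, so a reset
        # only surfaces when a later send fails.
        sock.sendall(f"ctr:{ctr} from {pod_name}".encode())
    except (ConnectionResetError, BrokenPipeError):
        sys.exit(1)   # exit; Kubernetes restarts the pod
    ctr += 1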
The server listens for incoming connections, creates a new socket for each, and logs the messages it receives, in the form:
2024-06-05 03:08:38,593 __main__ INFO ('34.105.235.81', 1504) Received message: ctr:1 from tcp-test-client-deployment-78cc89b4c7-rn6bg
Where the IP and port are the source of the packet as seen by the server. In the case of Cloud NAT, this will be the NAT IP and port.
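A minimal sketch of the server side (again, not the actual test code, just an illustration that reproduces the log format above):

import logging
import socket
import threading

# Matches the log lines above; __name__ is "__main__" when run as a script.
logging.basicConfig(format="%(asctime)s %(name)s %(levelname)s %(message)s",
                    level=logging.INFO)
log = logging.getLogger(__name__)

def handle(conn, peer):
    with conn:
        while True:
            try:
                data = conn.recv(4096)
            except ConnectionResetError:
                log.warning("%s Connection reset", peer)
                return
            if not data:          # orderly close
                return
            log.info("%s Received message: %s", peer, data.decode(errors="replace"))

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
listener.bind(("0.0.0.0", 12345))
listener.listen()
while True:
    conn, peer = listener.accept()   # a new socket per client; peer is the NAT (ip, port)
    log.info("%s Accepted new connection", peer)
    threading.Thread(target=handle, args=(conn, peer), daemon=True).start()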
The actual test consists of starting with 1 pod and then, every 5 seconds, bringing up 3 more, until there are 80 running (technically 82); a rough sketch of this ramp-up follows the list below. The expectation is that all pods keep running. When the test fails, we see two things:
- Resets seen by the server while the client keeps running. The client would presumably have disconnected if it had detected the reset, but since it never reads, it does not. In the case of the real application (RabbitMQ), the heartbeat would have caught this.
- Resets detected by the client. In this case, a new pod comes up and attempts to reconnect.
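The ramp-up can be driven by something as simple as the loop below. This is a sketch, not the actual harness; the deployment name is inferred from the pod names in the logs.

import subprocess
import time

DEPLOYMENT = "tcp-test-client-deployment"   # inferred from the pod names

replicas = 1
while replicas < 80:
    subprocess.run(["kubectl", "scale", f"deployment/{DEPLOYMENT}",
                    f"--replicas={replicas}"], check=True)
    time.sleep(5)
    replicas += 3
# The last step lands on 82, matching the "technically 82" above.
subprocess.run(["kubectl", "scale", f"deployment/{DEPLOYMENT}",
                f"--replicas={replicas}"], check=True)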
Compare the number of connections seen on each NAT (IP, port) pair in the two runs. Without EIM:
$ grep 'name:' server_2024-06-05_03-08.log | awk '{print $5, $6}' | sort | uniq -c
3 ('34.105.235.81', 1504)
3 ('34.105.235.81', 1505)
3 ('34.105.235.81', 1506)
2 ('34.105.235.81', 1507)
1 ('34.105.235.81', 1508)
3 ('34.105.235.81', 2016)
3 ('34.105.235.81', 2017)
3 ('34.105.235.81', 2018)
2 ('34.105.235.81', 2019)
3 ('34.105.235.81', 2528)
3 ('34.105.235.81', 2529)
2 ('34.105.235.81', 2530)
2 ('34.105.235.81', 2531)
2 ('34.105.235.81', 2532)
1 ('34.105.235.81', 2533)
1 ('34.105.235.81', 2534)
With EIM (the same command against the 03:04 server log):
1 ('34.105.235.81', 1036)
1 ('34.105.235.81', 1037)
1 ('34.105.235.81', 1038)
1 ('34.105.235.81', 1039)
1 ('34.105.235.81', 1041)
1 ('34.105.235.81', 1042)
1 ('34.105.235.81', 1044)
1 ('34.105.235.81', 1045)
1 ('34.105.235.81', 1046)
1 ('34.105.235.81', 1048)
1 ('34.105.235.81', 1049)
1 ('34.105.235.81', 1050)
1 ('34.105.235.81', 1051)
1 ('34.105.235.81', 1052)
1 ('34.105.235.81', 1053)
1 ('34.105.235.81', 1055)
1 ('34.105.235.81', 1056)
1 ('34.105.235.81', 1057)
1 ('34.105.235.81', 1058)
1 ('34.105.235.81', 1060)
1 ('34.105.235.81', 1061)
1 ('34.105.235.81', 1062)
1 ('34.105.235.81', 1063)
1 ('34.105.235.81', 1553)
1 ('34.105.235.81', 1555)
1 ('34.105.235.81', 1556)
1 ('34.105.235.81', 1558)
1 ('34.105.235.81', 1559)
1 ('34.105.235.81', 1560)
1 ('34.105.235.81', 1562)
1 ('34.105.235.81', 1563)
1 ('34.105.235.81', 1564)
1 ('34.105.235.81', 1565)
1 ('34.105.235.81', 1566)
1 ('34.105.235.81', 1569)
1 ('34.105.235.81', 1570)
1 ('34.105.235.81', 1571)
1 ('34.105.235.81', 1572)
1 ('34.105.235.81', 1573)
1 ('34.105.235.81', 1575)
1 ('34.105.235.81', 1576)
1 ('34.105.235.81', 1577)
1 ('34.105.235.81', 1578)
1 ('34.105.235.81', 1580)
1 ('34.105.235.81', 1582)
1 ('34.105.235.81', 1583)
1 ('34.105.235.81', 2060)
1 ('34.105.235.81', 2061)
1 ('34.105.235.81', 2062)
1 ('34.105.235.81', 2063)
1 ('34.105.235.81', 2064)
1 ('34.105.235.81', 2065)
1 ('34.105.235.81', 2067)
1 ('34.105.235.81', 2068)
1 ('34.105.235.81', 2069)
1 ('34.105.235.81', 2070)
1 ('34.105.235.81', 2071)
1 ('34.105.235.81', 2072)
1 ('34.105.235.81', 2073)
1 ('34.105.235.81', 2074)
1 ('34.105.235.81', 2076)
1 ('34.105.235.81', 2077)
1 ('34.105.235.81', 2078)
1 ('34.105.235.81', 2079)
1 ('34.105.235.81', 2080)
1 ('34.105.235.81', 2081)
1 ('34.105.235.81', 2083)
1 ('34.105.235.81', 2084)
1 ('34.105.235.81', 2085)
1 ('34.105.235.81', 2086)
1 ('34.105.235.81', 2087)
1 ('34.105.235.81', 2088)
1 ('34.105.235.81', 2089)
1 ('34.105.235.81', 2090)
1 ('34.105.235.81', 2092)
1 ('34.105.235.81', 2093)
1 ('34.105.235.81', 2094)
1 ('34.105.235.81', 2095)
1 ('34.105.235.81', 2096)
1 ('34.105.235.81', 2097)
1 ('34.105.235.81', 2098)
1 ('34.105.235.81', 2100)
Drilling into NAT port 2528 from the run without EIM, the first client to appear on it is wb249:
2024-06-05 03:08:30,958 __main__ INFO ('34.105.235.81', 2528) Received message: name:tcp-test-client-deployment-78cc89b4c7-wb249, source:10.64.0.89:49619 dest:54.72.175.147:12345
And then we see wb249 send several messages:
2024-06-05 03:08:31,958 __main__ INFO ('34.105.235.81', 2528) Received message: ctr:0 from tcp-test-client-deployment-78cc89b4c7-wb249
2024-06-05 03:08:32,958 __main__ INFO ('34.105.235.81', 2528) Received message: ctr:1 from tcp-test-client-deployment-78cc89b4c7-wb249
2024-06-05 03:08:33,959 __main__ INFO ('34.105.235.81', 2528) Received message: ctr:2 from tcp-test-client-deployment-78cc89b4c7-wb249
2024-06-05 03:08:34,959 __main__ INFO ('34.105.235.81', 2528) Received message: ctr:3 from tcp-test-client-deployment-78cc89b4c7-wb249
2024-06-05 03:08:35,960 __main__ INFO ('34.105.235.81', 2528) Received message: ctr:4 from tcp-test-client-deployment-78cc89b4c7-wb249
2024-06-05 03:08:36,960 __main__ INFO ('34.105.235.81', 2528) Received message: ctr:5 from tcp-test-client-deployment-78cc89b4c7-wb249
2024-06-05 03:08:37,960 __main__ INFO ('34.105.235.81', 2528) Received message: ctr:6 from tcp-test-client-deployment-78cc89b4c7-wb249
2024-06-05 03:08:38,961 __main__ INFO ('34.105.235.81', 2528) Received message: ctr:7 from tcp-test-client-deployment-78cc89b4c7-wb249
2024-06-05 03:08:39,961 __main__ INFO ('34.105.235.81', 2528) Received message: ctr:8 from tcp-test-client-deployment-78cc89b4c7-wb249
2024-06-05 03:08:40,962 __main__ INFO ('34.105.235.81', 2528) Received message: ctr:9 from tcp-test-client-deployment-78cc89b4c7-wb249
And then there is a reset, and now pcfxb is using port 2528:
2024-06-05 03:08:41,700 __main__ WARNING ('34.105.235.81', 2528) Connection reset
2024-06-05 03:08:41,715 __main__ INFO ('34.105.235.81', 2528) Accepted new connection
2024-06-05 03:08:41,715 __main__ INFO ('34.105.235.81', 2528) Received message: name:tcp-test-client-deployment-78cc89b4c7-pcfxb, source:10.64.0.91:44951 dest:52.208.209.14:12345
...
2024-06-05 03:08:42,715 __main__ INFO ('34.105.235.81', 2528) Received message: ctr:0 from tcp-test-client-deployment-78cc89b4c7-pcfxb
2024-06-05 03:08:43,716 __main__ INFO ('34.105.235.81', 2528) Received message: ctr:1 from tcp-test-client-deployment-78cc89b4c7-pcfxb
There are no more messages from wb249 seen anywhere. On the client side, the stern logs show:
tcp-test-client-deployment-78cc89b4c7-wb249 tcp-test-client 2024-06-05 03:08:30,951 __main__ INFO 54.72.175.147:12345 connected (10.64.0.89:49619 to tcp-test-4f4a4c45f030567f.elb.eu-west-1.amazonaws.com:12345)
tcp-test-client-deployment-78cc89b4c7-wb249 tcp-test-client 2024-06-05 03:08:31,952 __main__ INFO 54.72.175.147:12345 Sent 0
tcp-test-client-deployment-78cc89b4c7-wb249 tcp-test-client 2024-06-05 03:08:32,952 __main__ INFO 54.72.175.147:12345 Sent 1
tcp-test-client-deployment-78cc89b4c7-wb249 tcp-test-client 2024-06-05 03:08:33,953 __main__ INFO 54.72.175.147:12345 Sent 2
tcp-test-client-deployment-78cc89b4c7-wb249 tcp-test-client 2024-06-05 03:08:34,953 __main__ INFO 54.72.175.147:12345 Sent 3
tcp-test-client-deployment-78cc89b4c7-wb249 tcp-test-client 2024-06-05 03:08:35,954 __main__ INFO 54.72.175.147:12345 Sent 4
tcp-test-client-deployment-78cc89b4c7-wb249 tcp-test-client 2024-06-05 03:08:36,954 __main__ INFO 54.72.175.147:12345 Sent 5
tcp-test-client-deployment-78cc89b4c7-pcfxb tcp-test-client 2024-06-05 03:08:41,709 __main__ INFO 52.208.209.14:12345 connected (10.64.0.91:44951 to tcp-test-4f4a4c45f030567f.elb.eu-west-1.amazonaws.com:12345)
tcp-test-client-deployment-78cc89b4c7-pcfxb tcp-test-client 2024-06-05 03:08:42,710 __main__ INFO 52.208.209.14:12345 Sent 0
tcp-test-client-deployment-78cc89b4c7-pcfxb tcp-test-client 2024-06-05 03:08:43,710 __main__ INFO 52.208.209.14:12345 Sent 1
Note that they have different source IPs and source ports (they are different pods), but Cloud NAT gives them the same NAT IP / port. From the introductory messages in the server log, we can also see that they were headed for different LB IPs (54.72.175.147:12345 and 52.208.209.14:12345).
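To find these collisions without eyeballing the log, something like the following works. It is a sketch that assumes the log lines look exactly like the excerpts above; it groups the server log by NAT (IP, port) and flags any pair that carries messages from more than one pod.

import re
from collections import defaultdict

pods_by_nat = defaultdict(set)
# Match the NAT (ip, port) tuple and the pod name on each "Received message" line.
pattern = re.compile(r"\('([\d.]+)', (\d+)\) Received message: .*(tcp-test-client-deployment-[\w-]+)")

with open("server_2024-06-05_03-08.log") as f:
    for line in f:
        m = pattern.search(line)
        if m:
            pods_by_nat[(m.group(1), int(m.group(2)))].add(m.group(3))

# Print every NAT (ip, port) pair that was shared by more than one pod.
for nat, pods in sorted(pods_by_nat.items()):
    if len(pods) > 1:
        print(nat, "->", ", ".join(sorted(pods)))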