I'm experiencing the weirdest internet issue. I use Comcast Xfinity in El Cerrito, CA. I have a new Arris SB6190 cable modem running firmware version 9.1.93N. A few days ago, I noticed that new tabs start taking a few seconds to load, and curl requests to example.com took ~5.1 seconds. However, if I specified -4 or -6 to curl, it resolved immediately. I immediately suspected DNS, but "host" resolved immediately as well.
So this was too weird to ignore. I set my resolv.conf to point to 8.8.8.8 (instead of my EdgeRouter X) and did:
time strace -f -T curl -s http://www.example.com > /dev/null 2> bad
time strace -f -T curl -4 -s http://www.example.com > /dev/null 2> good
Now if you look at the resulting logs, you'll see that curl starts a new thread to handle DNS resolution. In the good logs, the relevant bit looks like:
[pid 12438] socket(AF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP) = 3 <0.000005>
[pid 12438] connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("8.8.8.8")}, 16) = 0 <0.000008>
[pid 12438] poll([{fd=3, events=POLLOUT}], 1, 0) = 1 ([{fd=3, revents=POLLOUT}]) <0.000004>
[pid 12438] sendto(3, "v\203\1\0\0\1\0\0\0\0\0\0\3www\7example\3com\0\0\1\0"..., 33, MSG_NOSIGNAL, NULL, 0) = 33 <0.000018>
[pid 12438] poll([{fd=3, events=POLLIN}], 1, 5000 <unfinished ...>
[pid 12438] <... poll resumed> ) = 1 ([{fd=3, revents=POLLIN}]) <0.013715>
[pid 12438] ioctl(3, FIONREAD, [49]) = 0 <0.000005>
[pid 12438] recvfrom(3, "v\203\201\200\0\1\0\1\0\0\0\0\3www\7example\3com\0\0\1\0"..., 1024, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("8.8.8.8")}, [28->16]) = 49 <0.000005>
[pid 12438] close(3)
However, in the bad logs, you'll see:
[pid 12505] connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("8.8.8.8")}, 16) = 0 <0.000013>
[pid 12505] poll([{fd=3, events=POLLOUT}], 1, 0) = 1 ([{fd=3, revents=POLLOUT}]) <0.000007>
[pid 12505] sendmmsg(3, [{msg_hdr={msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\f\211\1\0\0\1\0\0\0\0\0\0\3www\7example\3com\0\0\1\0"..., iov_len=33}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CTRUNC|MSG_EOR|MSG_FIN|MSG_SYN|MSG_CONFIRM|MSG_BATCH|MSG_FASTOPEN|MSG_CMSG_CLOEXEC|0x1c20010}, msg_len=33}, {msg_hdr={msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base=">\2\1\0\0\1\0\0\0\0\0\0\3www\7example\3com\0\0\34\0"..., iov_len=33}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_PEEK|MSG_DONTROUTE|MSG_CTRUNC|MSG_TRUNC|MSG_DONTWAIT|MSG_EOR|MSG_FIN|MSG_SYN|MSG_CONFIRM|MSG_NOSIGNAL|0x720000}, msg_len=33}], 2, MSG_NOSIGNAL) = 2 <0.000030>
[pid 12505] poll([{fd=3, events=POLLIN}], 1, 5000 <unfinished ...>
[pid 12505] <... poll resumed> ) = 1 ([{fd=3, revents=POLLIN}]) <0.014262>
[pid 12505] ioctl(3, FIONREAD, [49]) = 0 <0.000005>
[pid 12505] recvfrom(3, "\f\211\201\200\0\1\0\1\0\0\0\0\3www\7example\3com\0\0\1\0"..., 2048, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("8.8.8.8")}, [28->16]) = 49 <0.000006>
[pid 12505] poll([{fd=3, events=POLLIN}], 1, 4985 <unfinished ...>
[pid 12505] <... poll resumed> ) = 0 (Timeout) <4.990060>
[pid 12505] poll([{fd=3, events=POLLOUT}], 1, 0) = 1 ([{fd=3, revents=POLLOUT}]) <0.000020>
[pid 12505] sendto(3, "\f\211\1\0\0\1\0\0\0\0\0\0\3www\7example\3com\0\0\1\0"..., 33, MSG_NOSIGNAL, NULL, 0) = 33 <0.000044>
[pid 12505] poll([{fd=3, events=POLLIN}], 1, 5000) = 1 ([{fd=3, revents=POLLIN}]) <0.013273>
[pid 12505] ioctl(3, FIONREAD, [49]) = 0 <0.000010>
[pid 12505] recvfrom(3, "\f\211\201\200\0\1\0\1\0\0\0\0\3www\7example\3com\0\0\1\0"..., 2048, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("8.8.8.8")}, [28->16]) = 49 <0.000010>
[pid 12505] poll([{fd=3, events=POLLOUT}], 1, 4986) = 1 ([{fd=3, revents=POLLOUT}]) <0.000009>
[pid 12505] sendto(3, ">\2\1\0\0\1\0\0\0\0\0\0\3www\7example\3com\0\0\34\0"..., 33, MSG_NOSIGNAL, NULL, 0) = 33 <0.000017>
[pid 12505] poll([{fd=3, events=POLLIN}], 1, 4986) = 1 ([{fd=3, revents=POLLIN}]) <0.012769>
[pid 12505] ioctl(3, FIONREAD, [61]) = 0 <0.000014>
[pid 12505] recvfrom(3, ">\2\201\200\0\1\0\1\0\0\0\0\3www\7example\3com\0\0\34\0"..., 65536, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("8.8.8.8")}, [28->16]) = 61 <0.000015>
[pid 12505] close(3) = 0 <0.000020>
To distill that, in the good logs, we send a single A request to the DNS server (sendto), wait for it to finish (poll), and receive the response (recvfrom). However, in the bad logs, we send two UDP DNS messages simultaneously using sendmmsg (A and AAAA). We wait for the response (poll), receive one response (recvfrom), then wait for the second response (poll). However, that wait times out after 5 seconds! Curl then retries the DNS lookup using two different sendto/poll/recvfrom sessions.
When I saw this, I hooked up tcpdump and started tracing the packets at the router. Here's what I see:
04:04:48.692397 IP (my-ip).36694 > 8.8.8.8.53: 61393+ A? www.example.com. (33)
04:04:48.692482 IP (my-ip).36694 > 8.8.8.8.53: 32898+ AAAA? www.example.com. (33)
04:04:48.704735 IP 8.8.8.8.53 > (my-ip).36694: 61393 1/0/0 A 93.184.216.34 (49)
04:04:53.697739 IP (my-ip).36694 > 8.8.8.8.53: 61393+ A? www.example.com. (33)
04:04:53.709948 IP 8.8.8.8.53 > (my-ip).36694: 61393 1/0/0 A 93.184.216.34 (49)
04:04:53.710464 IP (my-ip).36694 > 8.8.8.8.53: 32898+ AAAA? www.example.com. (33)
04:04:53.723453 IP 8.8.8.8.53 > (my-ip).36694: 32898 1/0/0 AAAA 2606:2800:220:1:248:1893:25c8:1946 (61)
Note the long 5 second pause between receiving the A response and when curl gives up and re-requests both DNS responses! This didn't show up when I used "host" because host requests the DNS responses sequentially:
04:11:52.478714 IP (my-ip).50855 > 8.8.8.8.53: 1377+ A? example.com. (29)
04:11:52.496369 IP 8.8.8.8.53 > (my-ip).50855: 1377 1/0/0 A 93.184.216.34 (45)
04:11:52.497173 IP (my-ip).46645 > 8.8.8.8.53: 18575+ AAAA? example.com. (29)
04:11:52.522772 IP 8.8.8.8.53 > (my-ip).46645: 18575 1/0/0 AAAA 2606:2800:220:1:248:1893:25c8:1946 (57)
04:11:52.523431 IP (my-ip).35832 > 8.8.8.8.53: 32079+ MX? example.com. (29)
04:11:52.537376 IP 8.8.8.8.53 > (my-ip).35832: 32079 0/1/0 (86)
I also verified that this doesn't happen when using TCP DNS. If you add
options use-vc
to your /etc/resolv.conf
, you'll force TCP DNS. If I do so, the problem goes away, and tcpdump shows:
20:39:45.842568 IP 192.168.60.227.59698 > 8.8.8.8.53: Flags [S], seq 2229041048, win 29200, options [mss 1460,sackOK,TS val 1176074 ecr 0,nop,wscale 7], length 0
20:39:45.854362 IP 8.8.8.8.53 > 192.168.60.227.59698: Flags [S.], seq 2440518197, ack 2229041049, win 28400, options [mss 1432,sackOK,TS val 317973603 ecr 1176074,nop,wscale 7], length 0
20:39:45.854380 IP 192.168.60.227.59698 > 8.8.8.8.53: Flags [.], ack 1, win 229, options [nop,nop,TS val 1176077 ecr 317973603], length 0
20:39:45.854451 IP 192.168.60.227.59698 > 8.8.8.8.53: Flags [P.], seq 1:71, ack 1, win 229, options [nop,nop,TS val 1176077 ecr 317973603], length 7044577+ A? www.example.com. (68)
20:39:46.075149 IP 192.168.60.227.59698 > 8.8.8.8.53: Flags [P.], seq 1:71, ack 1, win 229, options [nop,nop,TS val 1176144 ecr 317973603], length 7044577+ A? www.example.com. (68)
20:39:46.087335 IP 8.8.8.8.53 > 192.168.60.227.59698: Flags [.], ack 71, win 222, options [nop,nop,TS val 317973835 ecr 1176144], length 0
20:39:46.088185 IP 8.8.8.8.53 > 192.168.60.227.59698: Flags [P.], seq 1:52, ack 71, win 222, options [nop,nop,TS val 317973835 ecr 1176144], length 5144577 1/0/0 A 93.184.216.34 (49)
20:39:46.088276 IP 192.168.60.227.59698 > 8.8.8.8.53: Flags [.], ack 52, win 229, options [nop,nop,TS val 1176147 ecr 317973835], length 0
20:39:46.099590 IP 8.8.8.8.53 > 192.168.60.227.59698: Flags [P.], seq 52:115, ack 71, win 222, options [nop,nop,TS val 317973848 ecr 1176147], length 6314553 1/0/0 AAAA 2606:2800:220:1:248:1893:25c8:1946 (61)
20:39:46.099687 IP 192.168.60.227.59698 > 8.8.8.8.53: Flags [.], ack 115, win 229, options [nop,nop,TS val 1176151 ecr 317973848], length 0
20:39:46.099818 IP 192.168.60.227.59698 > 8.8.8.8.53: Flags [F.], seq 71, ack 115, win 229, options [nop,nop,TS val 1176151 ecr 317973848], length 0
20:39:46.116915 IP 8.8.8.8.53 > 192.168.60.227.59698: Flags [F.], seq 115, ack 72, win 222, options [nop,nop,TS val 317973865 ecr 1176151], length 0
20:39:46.116925 IP 192.168.60.227.59698 > 8.8.8.8.53: Flags [.], ack 116, win 229, options [nop,nop,TS val 1176156 ecr 317973865], length 0
That's just your standard SYN, SYN/ACK, ACK handshake, followed by two DNS requests (tcpdump calls them both A, but I think either one is AAAA or they both have both requests). Then the server responds with both responses and we close the connection.
Here are the strace logs, just for completeness:
[pid 31020] connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("8.8.8.8")}, 16 <unfinished ...>
[pid 31020] <... connect resumed> ) = 0 <0.013562>
[pid 31020] writev(3, [{iov_base="\0!", iov_len=2}, {iov_base="1\251\1\0\0\1\0\0\0\0\0\0\3www\7example\3com\0\0\1\0"..., iov_len=33}, {iov_base="\0!", iov_len=2}, {iov_base="\275\2\1\0\0\1\0\0\0\0\0\0\3www\7example\3com\0\0\34\0"..., iov_len=33}], 4) = 70 <0.000026>
[pid 31020] read(3, <unfinished ...>
[pid 31020] <... read resumed> "\0001", 2) = 2 <0.243082>
[pid 31020] read(3, "1\251\201\200\0\1\0\1\0\0\0\0\3www\7example\3com\0\0\1\0"..., 49) = 49 <0.000008>
[pid 31020] read(3, "\0=", 2) = 2 <0.012144>
[pid 31020] read(3, "\275\2\201\200\0\1\0\1\0\0\0\0\3www\7example\3com\0\0\34\0"..., 61) = 61 <0.000015>
[pid 31020] close(3) = 0 <0.000029>
I tried connecting my laptop directly to the modem and observed the exact same behavior. A power-cycle of the modem "fixed" it. However, I added some logging to my grafana dashboard to track "curl" times and alert me if they went over 5 seconds. Sure enough, it came back today.
There's nothing in my modem's event log since I replaced a bad RG-59 cable a few days ago.
What do I do now? This is clearly a bug in the modem firmware, and just as clearly I am not going to be able to convince Comcast or Arris of it. Where can I go to get help with this?