How to debug web resource routing behind a load-balancer

Assume we have a web server deployed on-prem behind an AWS Application Load Balancer. AWS's docs on how elastic load balancing works are the best source of info on balancing. Note that ELB requires at least two nodes, but ideally as many as there are availability zones in your region. London has 3, so let's go with that. What this means is that ELB is in fact 3 EC2 instances doing the load-balancing.
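
To confirm how many zones (and hence how many load-balancer nodes) are serving you, the elbv2 API will tell you directly; "my-alb" below is a placeholder name:

# List the availability zones our load balancer is deployed into ("my-alb" is a placeholder)
aws elbv2 describe-load-balancers --names my-alb \
  --query 'LoadBalancers[0].AvailabilityZones[].ZoneName' --output text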

Our setup:

  1. ELB in eu-west-2 (London) with 3 availability zones, each with an instance of our ELB.
  2. Web server serving our website on-prem, accessible via a Tailnet IP address.
  3. Each availability zone runs a Tailscale relay nano-instance, so that ELB may have access to our on-prem server via Tailnet.
  4. Each subnet (each availability zone gets its own subnet by default) has a routing table (defined via VPC console) that routes Tailnet 100.64.0.0/10 to a tailnet-relay instance deployed in that subnet.
  5. Target group, with our tailnet on-prem IP address as target, that allows traffic from ELB's security group (see the registration sketch after this list).
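
For item 5, registering an on-prem tailnet IP as a target looks roughly like the sketch below. The target group ARN, IP and port are placeholders; AvailabilityZone=all is how you tell the target group that the IP lives outside the VPC:

# Register the on-prem server's tailnet IP as a target (ARN, IP and port are placeholders)
aws elbv2 register-targets \
  --target-group-arn arn:aws:elasticloadbalancing:eu-west-2:123456789012:targetgroup/on-prem/abc123 \
  --targets Id=100.101.102.103,Port=443,AvailabilityZone=all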

Why do we need 3 and 4, when by default all 3 subnets have a default route that lets them access each other? Because AWS VPC routing tables don't let you route to an IP address. They only allow you to specify an instance or a network interface, but really either of these ends up pointing to a specific network interface, which can only exist in one subnet. This is the only reason we need 3 relays - one per subnet, with separate tables routing to our tailnet.
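
Concretely, items 3 and 4 boil down to something like the following, repeated per subnet (all IDs below are placeholders). The relay forwards packets that aren't addressed to itself, so its EC2 source/destination check must be disabled and IP forwarding enabled at the OS level:

# Route the whole tailnet CIDR to the relay instance living in this subnet
aws ec2 create-route --route-table-id rtb-0aaa1111 \
  --destination-cidr-block 100.64.0.0/10 --instance-id i-0bbb2222

# Let the relay forward traffic not addressed to it
aws ec2 modify-instance-attribute --instance-id i-0bbb2222 --no-source-dest-check

# On the relay itself: let the kernel route between the VPC and the tailnet
sudo sysctl -w net.ipv4.ip_forward=1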

Assuming the above setup was performed correctly, it should just work and e.g. curl -IkL example.com should succeed no matter how many times we run it. However, note that because we load-balance among 3 availability zones, what this really means is that we have 3 routes - 3 ways for a request to get to our on-prem resource. We can easily confirm that by checking:

dig example.com

# Which should respond with exactly 3 IP addresses (one per availability zone - which is to say one per datacenter location)

;; ANSWER SECTION:
example.com.		60	IN	A	18.134.50.86
example.com.		60	IN	A	3.11.23.187
example.com.		60	IN	A	35.176.164.170

If there is a problem with a single route, we'd expect our above curl -IkL example.com to succeed exactly twice and fail exactly once when we run it 3 times in a row, precisely because ELB defaults to round-robin routing: it simply cycles through all 3 routes (or IP addresses, or availability zones).
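
A quick way to observe that cycling in practice is to fire a few requests in a row and print which load-balancer node answered each one (standard curl -w variables; note that your resolver may cache a single answer, in which case the same IP will repeat):

# Three HEAD requests in a row; print the load-balancer IP that served each and the status code
for i in 1 2 3; do
  curl -skIL -o /dev/null -w 'hit %{remote_ip} -> HTTP %{http_code}\n' https://example.com
done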

Obviously, our load-balancer needs rules for ports 80 and 443 that match our host example.com (and possibly its subdomains) and forward HTTP requests over to our web server - the on-prem resource in our case.
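
If you'd rather double-check those rules from the command line than click through the console, the elbv2 API exposes them; the ARNs below are placeholders:

# List the listeners (ports 80/443) attached to our load balancer, then the rules on each
aws elbv2 describe-listeners --load-balancer-arn arn:aws:elasticloadbalancing:eu-west-2:123456789012:loadbalancer/app/my-alb/abc123
aws elbv2 describe-rules --listener-arn arn:aws:elasticloadbalancing:eu-west-2:123456789012:listener/app/my-alb/abc123/def456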

How do we investigate a case of an unavailable connection then? It usually shows up as a server connection time-out, i.e. a 504 or 503 HTTP error. First, it may be worth reading through the official AWS post on how to troubleshoot 503s, but if the issue is connectivity, firewalls and permissions, then our main tool is curl --resolve - see the curl manual and also this blog post.
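
Before reaching for curl, it is also worth asking the load balancer what it thinks of our target: an unhealthy state with a health-check timeout as the reason usually points at the same connectivity problem. The ARN below is a placeholder:

# Ask the target group how it currently sees our on-prem target
aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:eu-west-2:123456789012:targetgroup/on-prem/abc123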

Essentially, by supplying --resolve to curl we can force it to resolve our web resource to a specific IP, so e.g. for our 3 routes above we could force the request to take each route:

curl -IkL --resolve '*:443:35.176.164.170' https://example.com
curl -IkL --resolve '*:443:3.11.23.187' https://example.com
curl -IkL --resolve '*:443:18.134.50.86' https://example.com

Notice how we resolved against each load-balancer address we have serving us, effectively diverting our request to each of our 3 availability zones.
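
To avoid copy-pasting addresses by hand, the same three probes can be driven straight off the DNS answer (hostname form of --resolve shown here; example.com stands in for your domain):

# Probe every load-balancer address returned by DNS, one at a time
for ip in $(dig +short example.com); do
  echo "--- via $ip"
  curl -IkL --resolve "example.com:443:$ip" https://example.com
done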

The very same technique can be used to see if routing within your subnet or between subnets actually works as you expect. What you need is to spawn a tiny EC2 instance inside the subnet (or its neighbour in another availability zone) and then use the --resolve trick above, ping, traceroute, etc. Remember however that AWS VPC defaults to closing everything, so even hosts in the same subnet won't be able to talk to one another (even ICMP traffic, which is required for ping and traceroute, won't pass): you'll want to tweak your instances' security groups - most likely their respective inbound rules - to allow such traffic from the other hosts of interest.
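
As a rough sketch of those inbound tweaks (security group IDs are placeholders: sg-0debug222 is the scratch instance's group, sg-0relay111 the group of the host you're probing), allow ICMP plus whatever TCP ports you intend to test:

# Allow ICMP (ping/traceroute) and HTTPS probes from the debug instance's security group
aws ec2 authorize-security-group-ingress --group-id sg-0relay111 \
  --protocol icmp --port -1 --source-group sg-0debug222
aws ec2 authorize-security-group-ingress --group-id sg-0relay111 \
  --protocol tcp --port 443 --source-group sg-0debug222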
