What to expect in this doc:
- Traefik 2.0 has traffic mirroring functionality that should work on generic Kubernetes, but there are no good how-to guides; let this be the first.
- This is a how-to guide optimized for understanding.
- I'll cover setup, elaborate on useful background context, and include troubleshooting info.
- Test-driven development is best, so there will be a test that proves without a doubt that the setup works as intended.
- One nice thing about working at DoiT is that we're encouraged to learn and occasionally lab things out, and given some time to do so, to help go above and beyond when supporting our customers.
- This guide was the result of that practice and was made to help support a DoiT customer.
- You need to provision a Kubernetes cluster. I'm using a generic EKS cluster with default settings (e.g., it uses the legacy in-tree LB controller rather than the aws-load-balancer-controller add-on).
- You need to own (~$12) or have admin access to a DNS name (for easy HTTPS certificate generation). I'll be using the DNS name neoakris.dev.
- The instructions and configuration assume the DNS name neoakris.dev. If you use your own, you'll need to update the config files to your DNS name.
- Assumptions I'm making:
- You have access to Unix bash/zsh terminal
- Common CLI tools like docker, kubectl, and helm are pre-installed
- Your ~/.kube/config context is pointing to the right cluster
- Bash#
  code file.yaml
  ^-- That you understand this convention to mean "edit file.yaml". I'm using https://code.visualstudio.com/, which I configured to be runnable from the CLI by following a random guide on the internet that involved adding
  export PATH="$PATH:/Applications/Visual Studio Code.app/Contents/Resources/app/bin"
  to my Mac's ~/.zshrc file. You don't have to do this; you can replace the code in code file.yaml with vi file.yaml, nano file.yaml, etc.
- I assume you're doing this lab in an isolated sandbox AWS account
Don't try this lab in a staging/prod AWS account; ideally do it in an isolated sandbox AWS account rather than a shared dev account. That said, it's safe if you follow the directions. Remember it's a general best practice to run any random how-to guide from the internet in an isolated sandbox AWS account, for defense in depth reasons: it isolates the blast radius if something bad happens due to a mistake you made following the guide, or if you followed a poisoned how-to guide where someone tricks you into running an exploit payload.
See the step 4/5 security awareness notice for specific info on how to maximize safety.
- Review traefik.helm-values.yaml. It's 48 lines and contains useful background context as comments. It also has an annotation, applied to the Kubernetes service of type LB, that's specific to AWS; so if you want to use these steps outside of AWS, you may need to edit the annotations to provision a CSP L4 LB (Cloud Service Provider Layer 4 (TCP) Load Balancer).
- Install the helm chart; AWS users should be able to copy and paste as is.
helm repo add traefik https://traefik.github.io/charts
helm repo update
mkdir -p ~/traefik-lab
cd ~/traefik-lab
curl https://gist.githubusercontent.com/neoakris/8ce77dab88868de0f5206bc9c482cfab/raw/fc4d3317f2de3b3b5627aaa38dbd645f9c05bb6e/traefik.helm-values.yaml > traefik.helm-values.yaml
head traefik.helm-values.yaml
ls -la
helm upgrade --install traefik traefik/traefik --version 21.1.0 --values=traefik.helm-values.yaml --namespace=traefik --create-namespace=true
kubectl get pods -n=traefik
Background Context:
- Let's Encrypt is a non-profit Free Internet Certificate Authority.
- We'll use an interactive shell to provision the HTTPS cert using a generic methodology
- The wildcard cert we generate will work for https://neoakris.dev and https://*.neoakris.dev (but not nested subdomains like https://a.b.neoakris.dev, since a wildcard only matches one DNS label)
- You'll need to replace neoakris.dev with a domain name you own / control.
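To see the one-label wildcard behavior concretely, here's a quick local illustration that is not part of the lego flow; it uses a throwaway self-signed cert with example.com as a stand-in domain, and assumes OpenSSL 1.1.1+ for the -addext flag:

```shell
# Create a throwaway wildcard cert (example.com stands in for your real domain)
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -keyout /tmp/wild.key -out /tmp/wild.crt \
  -subj "/CN=*.example.com" \
  -addext "subjectAltName=DNS:example.com,DNS:*.example.com"
# A single-level subdomain matches the wildcard ...
openssl x509 -in /tmp/wild.crt -noout -checkhost dashboard.example.com
# ... but a nested subdomain does not
openssl x509 -in /tmp/wild.crt -noout -checkhost a.b.example.com
```

The same matching rules apply to the Let's Encrypt cert we're about to generate.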
Step 2 Instructions:
- Start the cert provisioning process
# [admin@workstation:~/traefik-lab]
mkdir -p ~/traefik-lab/cert
cd ~/traefik-lab/cert
docker run -it --entrypoint=/bin/sh --volume $HOME/traefik-lab/cert:/.lego/certificates docker.io/goacme/lego:latest
# [shell@dockerized-ACME-client:/]
lego --email "[email protected]" --domains="*.neoakris.dev" --dns "manual" run
- The terminal will say something along the lines of
...
Do you accept the TOS? Y/n
...
2023/03/16 15:05:58 [INFO] [*.neoakris.dev] acme: Obtaining bundled SAN certificate
2023/03/16 15:05:59 [INFO] [*.neoakris.dev] AuthURL: https://acme-v02.api.letsencrypt.org/acme/authz-v3/211356108927
2023/03/16 15:05:59 [INFO] [*.neoakris.dev] acme: use dns-01 solver
2023/03/16 15:05:59 [INFO] [*.neoakris.dev] acme: Preparing to solve DNS-01
lego: Please create the following TXT record in your neoakris.dev. zone:
_acme-challenge.neoakris.dev. 120 IN TXT "WKnGHot_TzrzkKIwMpLwrymEZr6m3ZyQEQsEcG5C4Bo"
lego: Press 'Enter' when you are done
- Manually update your authoritative DNS nameserver. In my case I went to my domain registrar domains.google.com, clicked on the neoakris.dev entry, verified it was configured to use Google's name servers rather than delegated ("Your domain is using Google Domains name servers"), then created a custom TXT record:
  _acme-challenge TXT 300 WKnGHot_TzrzkKIwMpLwrymEZr6m3ZyQEQsEcG5C4Bo
- Once done, I pressed Enter in the terminal that was waiting for human input. Within a minute the DNS update had finished propagating. (If it's slow, you can speed things up by pointing your laptop's DNS at a public resolver: 8.8.8.8 = Google DNS, 1.1.1.1 = Cloudflare DNS.)
The terminal will say something along the lines of
...
2023/03/16 15:23:13 [INFO] [*.neoakris.dev] acme: Waiting for DNS record propagation.
...
2023/03/16 15:23:20 [INFO] [*.neoakris.dev] The server validated our request
2023/03/16 15:23:20 [INFO] [*.neoakris.dev] acme: Cleaning DNS-01 challenge
lego: You can now remove this TXT record from your neoakris.dev. zone:
_acme-challenge.neoakris.dev. 120 IN TXT "..."
2023/03/16 15:23:20 [INFO] [*.neoakris.dev] acme: Validations succeeded; requesting certificates
2023/03/16 15:23:21 [INFO] [*.neoakris.dev] Server responded with a certificate.
- Use the cert files to generate and apply a kube secret containing an HTTPS wildcard cert
# [shell@dockerized-ACME-client:/]
exit
# [admin@workstation:~/traefik-lab/cert]
ls
# _.neoakris.dev.crt _.neoakris.dev.issuer.crt _.neoakris.dev.json _.neoakris.dev.key
cd ~/traefik-lab
export B64_CERT=$(base64 < ~/traefik-lab/cert/_.neoakris.dev.crt | tr -d '\n')
export B64_KEY=$(base64 < ~/traefik-lab/cert/_.neoakris.dev.key | tr -d '\n')
# ^-- tr -d '\n' strips GNU base64's line wrapping (macOS base64 doesn't wrap),
#     which would otherwise corrupt the single-line YAML values below
echo "$B64_CERT"
echo "$B64_KEY"
# ^-- Looking for a really long string of gibberish
# Basically a left-shifted smoke test before moving on
tee https-wildcard-cert.yaml << EOF
apiVersion: v1
kind: Secret
metadata:
  name: https-wildcard-cert # covers both https://neoakris.dev AND https://*.neoakris.dev
  namespace: traefik
type: kubernetes.io/tls
data:
  tls.crt: $B64_CERT
  tls.key: $B64_KEY
EOF
cat https-wildcard-cert.yaml
kubectl apply -f https-wildcard-cert.yaml
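A portability side note: GNU base64 (Linux) wraps its output at 76 characters, while macOS base64 does not, and a wrapped value would corrupt the single-line tls.crt/tls.key fields in the secret. A quick local demonstration of the safe pattern, using dummy data rather than your real cert:

```shell
# 200 bytes of dummy data, long enough to trigger GNU base64's 76-char line wrapping
DUMMY=$(head -c 200 /dev/zero | tr '\0' 'A')
# Piping through tr -d '\n' yields a single-line value on both Linux and macOS
B64=$(printf '%s' "$DUMMY" | base64 | tr -d '\n')
# Round-trip check: decoding must reproduce the input exactly
[ "$(printf '%s' "$B64" | base64 -d)" = "$DUMMY" ] && echo "round-trip ok"
```

If the kubectl apply above ever fails with a base64 or YAML parse error, a wrapped value is the first thing to check.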
- Grab the example file
- Modify it as needed in terms of DNS names
IMPORTANT:
Change the username and password using the method mentioned in the yaml comment. If you leave the defaults and someone logs in to your Traefik instance, a bad actor could in theory install a plugin and use it to privilege escalate to kubectl cluster-admin; and since your EKS cluster probably has some IAM rights in the AWS account it resides in, that could be escalated further into the AWS account. (This is why isolated sandbox/dev AWS accounts are a known best practice; I immediately shut down my cluster after going live with this for that reason.) Make sure you edit the username/password in the file before applying. As long as you do that, it's safe security-wise; it's extremely dangerous to apply without first editing the username/password to a secure value.
- kubectl apply
# [admin@workstation:~/traefik-lab]
curl https://gist.githubusercontent.com/neoakris/8ce77dab88868de0f5206bc9c482cfab/raw/fc4d3317f2de3b3b5627aaa38dbd645f9c05bb6e/traefik_dashboard_and_default_tls.yaml > traefik_dashboard_and_default_tls.yaml
head traefik_dashboard_and_default_tls.yaml
code traefik_dashboard_and_default_tls.yaml
# ^-- update username password and DNS name from neoakris.dev to yours as needed
# and read any notes about background contextual info useful to understanding
kubectl apply -f traefik_dashboard_and_default_tls.yaml
kubectl get services -n=traefik
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
traefik LoadBalancer 172.20.27.24 a279531e078e34a97ae9762df4c2ac52-888690101675fd0b.elb.us-east-2.amazonaws.com 80:30116/TCP,443:30321/TCP 175m
- I see my L4 LB has a CNAME of a279531e078e34a97ae9762df4c2ac52-888690101675fd0b.elb.us-east-2.amazonaws.com
- In domains.google.com (the spot where you update internet DNS is likely different for you) I added a custom record like this:
  *.neoakris.dev CNAME 300 a279531e078e34a97ae9762df4c2ac52-888690101675fd0b.elb.us-east-2.amazonaws.com
- Bash#
nslookup dashboard.neoakris.dev
Server: 8.8.8.8
Address: 8.8.8.8#53
Non-authoritative answer:
dashboard.neoakris.dev canonical name = a279531e078e34a97ae9762df4c2ac52-888690101675fd0b.elb.us-east-2.amazonaws.com.
Name: a279531e078e34a97ae9762df4c2ac52-888690101675fd0b.elb.us-east-2.amazonaws.com
Address: 3.141.154.55
Name: a279531e078e34a97ae9762df4c2ac52-888690101675fd0b.elb.us-east-2.amazonaws.com
Address: 3.136.170.240
- Since that looks good I visit the website to verify https://dashboard.neoakris.dev
- I see no HTTPS errors and I get prompted for a username and password
I copy and paste the values embedded in the earlier config yaml file
Username: et9B6fBZUYeDOmzvukiquYw5KrCqKy
Password: QREqS/DVPF6dye3/FE30UCS8S5pwID
(Chrome will cache them so the gibberish passwords won't be too annoying since I won't need to enter them every time.)
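If you'd rather generate your own credentials than hand-type gibberish, one common approach is an htpasswd-style user:hash line, which is the format Traefik's basicAuth middleware consumes. A sketch; admin and the password shown are placeholders, and the fixed salt is only there to make the output reproducible:

```shell
# Generate an htpasswd-style entry for Traefik's basicAuth middleware.
# apr1 (MD5-based) with a fixed salt is shown for reproducibility;
# prefer bcrypt (htpasswd -nbB) for real deployments.
USER=admin                              # placeholder username
PASS='replace-me-with-a-real-password'  # placeholder password
HASH=$(openssl passwd -apr1 -salt lab12345 "$PASS")
echo "${USER}:${HASH}"
```

Base64-encode the resulting line per the yaml comment's instructions before embedding it in the secret.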
You can expect to see something like this
- Memory-backed Redis makes the app stateful (this will be useful in validating traffic mirroring later)
- Redis is useful for testing, as we can easily factory reset the state by rebooting the Redis pod
- Otherwise it gives us statefulness with minimal dependencies / configuration.
# [admin@workstation:~/traefik-lab]
curl https://gist.githubusercontent.com/neoakris/8ce77dab88868de0f5206bc9c482cfab/raw/fc4d3317f2de3b3b5627aaa38dbd645f9c05bb6e/blue.helm-values.yaml > blue.helm-values.yaml
curl https://gist.githubusercontent.com/neoakris/8ce77dab88868de0f5206bc9c482cfab/raw/fc4d3317f2de3b3b5627aaa38dbd645f9c05bb6e/green.helm-values.yaml > green.helm-values.yaml
head green.helm-values.yaml
# Inspect values and update dns name as needed
code blue.helm-values.yaml
code green.helm-values.yaml
helm upgrade --install podinfo oci://ghcr.io/stefanprodan/charts/podinfo --values=blue.helm-values.yaml --namespace blue --create-namespace=true
helm upgrade --install podinfo oci://ghcr.io/stefanprodan/charts/podinfo --values=green.helm-values.yaml --namespace green --create-namespace=true
The 2 websites https://green.neoakris.dev and https://blue.neoakris.dev look like
# [admin@workstation:~/traefik-lab]
curl https://gist.githubusercontent.com/neoakris/8ce77dab88868de0f5206bc9c482cfab/raw/fc4d3317f2de3b3b5627aaa38dbd645f9c05bb6e/green-with-traffic-mirroring-to-blue.yaml > green-with-traffic-mirroring-to-blue.yaml
# Inspect file / edit DNS names as needed. The only object that should
# need to be updated is the IngressRoute custom resource object near the end.
code green-with-traffic-mirroring-to-blue.yaml
kubectl apply -f green-with-traffic-mirroring-to-blue.yaml
https://green-with-traffic-mirroring-to-blue.neoakris.dev/
Now shows the green website (and behind the scenes is mirroring incoming traffic to the blue website)
Notes of interest:
- One of the great things about Traefik is that it's written in Go, a memory-safe language, so it rarely has CVEs. That means there's less risk, relative to alternative options, when implementing the pattern of configure it once, then don't update for a really long time.
- Traefik 2.0 is finicky in terms of Kubernetes:
  - lack of solid Kubernetes-specific docs
  - poor UX (User Experience) when configuring Traefik using Kubernetes CRs (custom resources): the lack of custom resource config validation results in a poor feedback loop when config is invalid. If your syntax is slightly off, or you're missing a value, a Traefik-specific Kubernetes custom resource object's config might not get loaded into Traefik, which you can tell by observing the Traefik dashboard between changes. There are some edge cases where you'd need to reboot the Traefik pod for removed Kubernetes objects to be removed from Traefik.
- Traefik 3.0 is working to improve the Kubernetes UX (User Experience)
https://traefik.io/blog/traefik-proxy-3-0-scope-beta-program-and-the-first-feature-drop/
Explanation of the YAML config file:
- DNS exists at multiple levels.
- Inner Cluster DNS: are DNS names resolvable only by pods / workloads running in the cluster.
- LAN/VPC DNS: are DNS names that are resolvable only by VMs on the LAN / in the VPC.
- Internet DNS: are DNS names resolvable by any machine on the internet.
- pods can resolve all 3
- This solution leverages Kubernetes services of type ExternalName
  - Kubernetes services generate inner cluster DNS names, usually of the form $SERVICE_NAME.$NAMESPACE_NAME.svc.cluster.local
  - ExternalName services are similar to DNS CNAMEs; think of them as DNS-based redirects
- primary-route service in the mirror namespace:
  creates inner cluster DNS name primary-route.mirror (short for the FQDN primary-route.mirror.svc.cluster.local)
  that redirects to podinfo.green.svc.cluster.local (an inner cluster fully qualified domain name, but it could easily be updated to a public internet DNS name)
- mirror-route service in the mirror namespace:
  creates inner cluster DNS name mirror-route.mirror (short for the FQDN mirror-route.mirror.svc.cluster.local)
  that redirects to podinfo.blue.svc.cluster.local (likewise an inner cluster FQDN that could point to a public internet DNS name)
- The reason services of type ExternalName are used is so this how-to guide acts as an exemplar (ideal example) showing the most flexible implementation option:
  - This config allows us to mirror traffic to kube services in other namespaces
  - This config also allows us to mirror traffic to generic DNS names on the internet, such as one hosted on an external EKS cluster (you could mirror traffic to the DNS name of a staging cluster, for example)
- The TraefikService (a Kubernetes custom resource that allows you to configure traffic mirroring in Traefik) seems to have a limitation where it only allows mirroring to traffic within the same namespace where the TraefikService exists. (Using a DNS redirect, via a service of type ExternalName, works around this limitation.)
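For concreteness, here's a minimal sketch of what the two ExternalName services could look like. This is reconstructed from the description above; the actual definitions live in green-with-traffic-mirroring-to-blue.yaml from the gist and may differ in detail:

```yaml
# DNS-based redirects: resolving primary-route.mirror / mirror-route.mirror
# returns CNAME-style answers pointing at the podinfo services in other namespaces
apiVersion: v1
kind: Service
metadata:
  name: primary-route
  namespace: mirror
spec:
  type: ExternalName
  externalName: podinfo.green.svc.cluster.local
---
apiVersion: v1
kind: Service
metadata:
  name: mirror-route
  namespace: mirror
spec:
  type: ExternalName
  externalName: podinfo.blue.svc.cluster.local
```

Swapping either externalName for a public internet DNS name is what would let you mirror traffic out of the cluster entirely.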
- Pointing out an interesting oddity
- The TraefikService named primary-route-with-mirror points to port 9898
kubectl get service -n=mirror
# NAME            TYPE           CLUSTER-IP   EXTERNAL-IP                       PORT(S)   AGE
# mirror-route    ExternalName   <none>       podinfo.blue.svc.cluster.local    <none>    9h
# primary-route   ExternalName   <none>       podinfo.green.svc.cluster.local   <none>    9h
# ^-- I gave them port none, b/c the yaml value wouldn't be respected anyways; it's the service they
#     point to that decides the port listened on
#
# The following can be used to verify the ExternalName kubernetes services do in fact work
kubectl run -it curl --image=docker.io/curlimages/curl -- sh &
kubectl exec -it curl -- curl primary-route.mirror:9898
kubectl exec -it curl -- curl mirror-route.mirror:9898
# ^-- you'll get feedback that they work
kubectl delete pod curl
- What is a traefik traffic mirror? / the nature of a traffic mirror? How to interpret the mirrors part of the yaml?
apiVersion: traefik.containo.us/v1alpha1
kind: TraefikService
metadata:
  name: primary-route-with-mirror
  namespace: mirror
spec:
  mirroring:
    kind: Service
    name: primary-route
    port: 9898
    mirrors:
      - kind: Service
        name: mirror-route
        port: 9898
        percent: 100
- ^-- Note the original YAML has more comments, for this section I want to elaborate on how to interpret the mirrors list.
- There can only be 1 primary service that a mirror points to. The primary service has bidirectional communication with the client, so when you run
  curl https://green-with-traffic-mirroring-to-blue.neoakris.dev/
  it's ONLY the primary service (green.neoakris.dev) that responds back to the client.
- You can have multiple mirrors (- denotes a YAML array list); the mirrors get unidirectional communication with the calling client. So when the client runs
  curl https://green-with-traffic-mirroring-to-blue.neoakris.dev/
  the mirror (blue.neoakris.dev) will receive the traffic generated by the client, but blue won't be allowed to respond back to the client. (This should make sense, as blue represents WIP code that we want to test, so you wouldn't want potentially buggy code responding back to client users.)
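One related knob worth knowing: the percent field in each mirrors entry. 100 mirrors every request; a lower value samples a fraction of them. A sketch of that fragment with a hypothetical 10% value (same schema as the TraefikService shown earlier):

```yaml
    mirrors:
      - kind: Service
        name: mirror-route
        port: 9898
        percent: 10  # mirror roughly 10% of incoming requests to blue
```

For this lab we keep it at 100 so the mirroring test below is deterministic.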
- Set Variable to correct test value
export DOMAIN=neoakris.dev
echo $DOMAIN
- Add Data to stateful endpoints
curl -X POST -d "green persistence test value" https://green.$DOMAIN/cache/test-key
curl -X POST -d "blue persistence test value" https://blue.$DOMAIN/cache/test-key
- Fetch Data from stateful endpoints
curl https://green.$DOMAIN/cache/test-key
(Returns: green persistence test value)
curl https://blue.$DOMAIN/cache/test-key
(Returns: blue persistence test value)
- Use Mirror to push data to both stateful endpoints at the same time
curl -X POST -d "mirrored data push" https://green-with-traffic-mirroring-to-blue.$DOMAIN/cache/mirror-test
- Fetch Data to validate if mirror worked
curl https://green.$DOMAIN/cache/mirror-test
(Returns: mirrored data push)
curl https://blue.$DOMAIN/cache/mirror-test
(Returns: mirrored data push)