Writing this for search engine indexing more than anything. TL;DR: my Kubernetes cluster DNS was messed up. Had to rebuild my cluster.
I recently ran into an issue with my Kubernetes cluster not being able to run a container I was building. The container was a relatively simple Go program: do a GET request, parse the JSON, insert into a remote database.
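For context, a minimal sketch of the kind of program this was. The endpoint, struct fields, driver, and connection string below are placeholders, not the real ones:

```go
package main

import (
	"database/sql"
	"encoding/json"
	"log"
	"net/http"

	_ "github.com/lib/pq" // hypothetical Postgres driver; the real program may use something else
)

// Item stands in for whatever the API actually returns.
type Item struct {
	ID   string `json:"id"`
	Name string `json:"name"`
}

func main() {
	// HTTPS GET against the remote API; this is the call that panicked
	// with the x509 error when run inside the cluster.
	resp, err := http.Get("https://api.pathofexile.com/some-endpoint") // endpoint is a placeholder
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var items []Item
	if err := json.NewDecoder(resp.Body).Decode(&items); err != nil {
		log.Fatal(err)
	}

	// Insert into a remote database; the connection string is a placeholder.
	db, err := sql.Open("postgres", "postgres://user:pass@db.example.com/mydb?sslmode=require")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	for _, it := range items {
		if _, err := db.Exec("INSERT INTO items (id, name) VALUES ($1, $2)", it.ID, it.Name); err != nil {
			log.Fatal(err)
		}
	}
}
```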
When running in my cluster (self-rolled high availability, if that matters), the program panicked on an x509 error, specifically: x509: certificate is valid for pfSense-5d68c9b017846, not api.pathofexile.com
(for indexing) x509: certificate is valid for a, not b x509: certificate is valid for x, not y
My first instinct was that the container I was using didn't have the CA certificates necessary, and as such didn't actually trust the site. That seemed plausible because I was building it using the stripped-down container method.
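If that had been the problem, the usual fix is to copy the CA bundle into the final image. A rough sketch of what that looks like for a scratch-based Go build (the image tags and build flags are illustrative, not my actual Dockerfile):

```dockerfile
# build stage
FROM golang:1.21 AS build
WORKDIR /src
COPY . .
RUN CGO_ENABLED=0 go build -o /app .

# final stage: stripped-down image, so CA certs must be copied in explicitly
FROM scratch
COPY --from=build /etc/ssl/certs/ca-certificates.crt /etc/ssl/certs/ca-certificates.crt
COPY --from=build /app /app
ENTRYPOINT ["/app"]
```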
What I SHOULD have done first, however, was test from my local machine. That would have told me a lot, including that the container worked perfectly fine when NOT in my cluster.
Step 2 should have been to run it with Docker on one of my Kubernetes nodes, but not IN the cluster itself. That would have told me whether the node's network setup was broken, or the cluster itself was.
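Something like the following covers both checks (the image name and node hostname are placeholders):

```sh
# step 1: run the image locally, completely outside the cluster
docker run --rm registry.example.com/poe-ingest:latest

# step 2: run the same image with plain Docker on a cluster node,
# still outside Kubernetes, to separate node networking from cluster networking
ssh node1 docker run --rm registry.example.com/poe-ingest:latest
```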
Step 3, which I did eventually get to, was to get into the running container and test the connection manually. This is normally done with kubectl exec. https://kubernetes.io/docs/tasks/debug-application-cluster/get-shell-running-container/
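The pod name here is a placeholder:

```sh
# find the pod, then open a shell inside it
kubectl get pods
kubectl exec -it my-pod -- /bin/sh

# a truly stripped-down image may not ship a shell at all;
# in that case an ephemeral debug container is an option
kubectl debug -it my-pod --image=busybox
```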
Once inside, ping the destination site. When I did, the hostname resolved to my own public IP, presumably my pfSense box answering on that address, which would explain the pfSense certificate in the original x509 error. This tells me that, for some reason, DNS is messed up when the container is deployed inside the cluster.
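Roughly the checks I ran inside the container (the hostname is the real one from the error; what tooling you actually have available depends on the image):

```sh
# which nameserver is the pod actually using?
cat /etc/resolv.conf

# what does that resolver say the name points to?
nslookup api.pathofexile.com

# ping shows the resolved address; in my case it was my own public IP
ping -c 3 api.pathofexile.com
```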
Sanity checked by changing /etc/resolv.conf (echo nameserver 1.1.1.1 > /etc/resolv.conf if you can't edit it live). After that, not only the ping, but the program itself worked EXACTLY as intended.
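If you would rather not edit files inside a running pod, Kubernetes can do the same override declaratively. A sketch, assuming you just want one test pod pointed at a public resolver (the pod name and image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: dns-test
spec:
  dnsPolicy: "None"            # skip cluster DNS entirely for this pod
  dnsConfig:
    nameservers:
      - 1.1.1.1
  containers:
    - name: app
      image: registry.example.com/poe-ingest:latest
```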
Unfortunately, this is where my actual problem SOLVING ended, as I got no further in figuring out HOW the DNS was broken. I decided at that point that when I deployed my cluster I had managed to mess up the DNS somehow. As such, the easiest solution (given the nodes were VMs) was to redeploy the cluster. Less of a scalpel and more of a hammer.
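If it happens again, the first things I would look at are the cluster DNS components themselves. These commands assume a standard CoreDNS setup; labels and names may differ on other installs:

```sh
# is the cluster DNS deployment healthy?
kubectl -n kube-system get pods -l k8s-app=kube-dns

# what is CoreDNS forwarding to? the Corefile lives in this ConfigMap
kubectl -n kube-system get configmap coredns -o yaml

# anything suspicious in the CoreDNS logs?
kubectl -n kube-system logs -l k8s-app=kube-dns
```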
Should I run into the same issue I will amend this gist.