If containers in your Rancher 1.6 environment are having trouble communicating with each other, ipsec may be the cause. In this article I will go over common troubleshooting steps and procedures to correct the problem.
- Exec into one of your ipsec-router containers and run the following ipsec test:
for i in `curl -s rancher-metadata/latest/self/service/containers/ | cut -f1 -d=` ; do ping -c2 `curl -s rancher-metadata/latest/self/service/containers/$i/primary_ip` ; done
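When the service has many containers, the raw ping output is hard to scan. The sketch below wraps the same test in a small helper that prints one result per address and a final tally. The function name `check_peers` and the `PING_CMD` override are hypothetical names introduced here for illustration, and the `-W2` timeout flag is an assumption about the ping build inside the container.

```shell
#!/bin/sh
# Hypothetical helper: reads one IP address per line on stdin, pings each one,
# and prints a per-address result followed by a tally.
# PING_CMD is an assumed override hook so the loop can be exercised without a network.
check_peers() {
    ok=0
    fail=0
    while read -r ip; do
        # Skip blank lines so a stray newline from the metadata call is harmless.
        [ -n "$ip" ] || continue
        if ${PING_CMD:-ping -c2 -W2} "$ip" </dev/null >/dev/null 2>&1; then
            ok=$((ok + 1))
            echo "OK   $ip"
        else
            fail=$((fail + 1))
            echo "FAIL $ip"
        fi
    done
    echo "reachable=$ok unreachable=$fail"
}

# Usage inside an ipsec-router container (same metadata calls as the one-liner above):
#   for i in $(curl -s rancher-metadata/latest/self/service/containers/ | cut -f1 -d=); do
#       curl -s rancher-metadata/latest/self/service/containers/$i/primary_ip; echo
#   done | check_peers
```

A high `unreachable` count relative to `reachable` corresponds to the "all or a majority not responding" case.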
If all or most of the containers are not responding, there is likely an issue with ipsec that needs to be addressed. Usually, ipsec issues occur because metadata is having trouble getting in sync. To confirm this, check your metadata logs (Infrastructure stacks > network-services > metadata) and look at the "Download and reload in" time. If it is hovering around 10 seconds or more, this is most likely your problem. We generally want this value to be 1-2 seconds. Below is a sample of what this looks like.
The metadata container is a database that runs on every host in an environment. Infrastructure containers on each host rely on their local metadata database for the information they need to run correctly. The data that metadata retrieves is serialized, so if metadata detects that its copy is out of date, it will fetch the data again until it is in sync. On a system where each download and reload takes 10 seconds, the metadata container will be stuck in a perpetual loop of never having the correct data. As a result, infrastructure containers on that host will not work as expected.
IPsec usually has issues when there are more than 50 hosts in an environment. Rancher's official recommendation is that you have no more than 50 hosts in an environment. If you need more capacity, we recommend scaling your hosts vertically or creating a separate environment. If you are still having issues, or cannot scale down your environment right away, you can try increasing the CPU allowance for metadata.
To check the metadata CPU settings, go to Infrastructure stacks and click network-services. In network-services, click "Up to date" in the top right corner, then select the latest template version in the drop-down menu to reveal the settings. You should see settings similar to the screenshot below.
The number on the left is the CPU period, the value that represents one full CPU core. The number on the right is the CPU quota, which sets how much CPU metadata is allowed to use. By default, metadata is only allowed to use 1/2 of a core. In larger environments you can increase this value to correct ipsec issues.
For example, to go from 1/2 core to 2 cores, you could change the above CPU quota from 200000 to 800000. Once you save the changes, the containers will go through a rolling upgrade. This can take a while, depending on how overloaded your environment is and how many hosts are in it. Once the rolling upgrade is complete, test your ipsec connectivity again to ensure that it is working as expected.
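The quota arithmetic can be sketched as a small helper: quota = period × number of cores. The function name `cpu_quota` is hypothetical, and the period of 400000 is only inferred from the figures above (200000 being half a core), so check the value shown in your own settings before relying on it.

```shell
#!/bin/sh
# Hypothetical helper: cpu_quota PERIOD CORES
# Prints the CPU quota that allows CORES cores, given that PERIOD represents one full core.
cpu_quota() {
    period=$1
    cores=$2
    # awk handles fractional core counts such as 0.5
    awk -v p="$period" -v c="$cores" 'BEGIN { printf "%d\n", p * c }'
}

# Assumed period of 400000, inferred from "200000 is half a core".
cpu_quota 400000 0.5   # the default half-core allowance
cpu_quota 400000 2     # the two-core value used in the example
```

Under that assumed period, the two calls print 200000 and 800000, matching the default and the increased value described above.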