If you are experiencing random crashing of your Rancher 2.x pods or Docker container, Rancher support will sometimes ask you to take a goroutine dump or a memory dump. Below are the commands you need to run inside of your Rancher container or pod.
# exec into the Rancher pod or container first
mkdir dumps
curl localhost:6060/debug/pprof/goroutine -o dumps/goroutine
curl localhost:6060/debug/pprof/heap -o dumps/heap
curl localhost:6060/debug/pprof/threadcreate -o dumps/threadcreate
curl localhost:6060/debug/pprof/block -o dumps/block
curl localhost:6060/debug/pprof/mutex -o dumps/mutex
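If Rancher is running as a pod, the sketch below shows one way to get in and pull the dumps back out; the namespace, label, and pod name are assumptions, so adjust them to your deployment.
# find the Rancher pod and exec into it (cattle-system is the usual namespace)
kubectl -n cattle-system get pods -l app=rancher
kubectl -n cattle-system exec -it rancher-5d4f8c9b7-abcde -- sh
# after collecting the dumps, copy them back to your workstation
kubectl cp cattle-system/rancher-5d4f8c9b7-abcde:dumps ./rancher-dumps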
06.03.2020 - added zsh function that works for WSL
04.21.2020 - updated macOS code to work with zsh and improved instructions.
01.02.2020 - added Windows Subsystem for Linux and broke out each OS into its own section for easy copy and paste
12.10.2019 - added support for Rancher 1.6 tar.gz files (requires gtar on mac)
12.06.2019 - made command lazier by not requiring user to paste the IP.
Description
Quick bash function to make my life easier when SSHing into Rancher nodes. Make sure to update your default web browser download directory by modifying line 2 of the script. On macOS, install GNU findutils first: brew install findutils
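The script itself is not included here, but a minimal sketch of the idea looks like this. The download directory, the layout of the downloaded key bundle, and the config.json IPAddress field are all assumptions about what Rancher's node key bundles contain, so treat it as illustration rather than the actual script.
# hypothetical sketch: ssh into the node whose key bundle was downloaded most recently
rssh() {
  local dl_dir="$HOME/Downloads"   # line 2: change to your browser's download directory
  # newest zip in the download dir (-printf is GNU find; on macOS use gfind from findutils)
  local bundle=$(find "$dl_dir" -maxdepth 1 -name '*.zip' -printf '%T@ %p\n' | sort -n | tail -1 | cut -d' ' -f2-)
  local dir=$(mktemp -d)
  unzip -q "$bundle" -d "$dir"
  # assumption: the docker-machine config.json in the bundle carries the node's IP
  local ip=$(grep -o '"IPAddress": *"[^"]*"' "$dir"/*/config.json | cut -d'"' -f4)
  ssh -o StrictHostKeyChecking=no -i "$dir"/*/id_rsa root@"$ip"
}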
Why is my requested memory unit now set to 'm' and what does it mean?
Below I'm archiving a slightly modified quote from Alena Prokharchyk in this Gist, as it contains useful information that I would like to reference later.
Memory resources in Kubernetes are measured in bytes and can be expressed as an integer with one of these suffixes: E, P, T, G, M, K (decimal suffixes) or Ei, Pi, Ti, Gi, Mi, Ki (binary suffixes, more commonly used for memory), or with no suffix at all (https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#meaning-of-memory). Lowercase "m" is not a recommended suffix for memory.
Cluster.requested.memory is the sum of the corresponding node.requested.memory values, and node.requested.memory is the sum of requested.memory across all of the pods scheduled on that node.
While most of the nodes report memory with a binary suffix, e.g. "memory": "2630Mi", two of them report a suffix of "m", which is the Kubernetes milli quantity suffix (one-thousandth of a byte).
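To see which pod requests a node's total is built from, kubectl will show both the per-pod requests and the node totals; the node name below is a placeholder.
# per-pod requests ("Non-terminated Pods") and node totals ("Allocated resources")
kubectl describe node worker-1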
Restoring RKE cluster with incorrect or missing rkestate file
Overview
When using RKE 0.2.0 and newer, if you have restored a cluster with the incorrect rkestate file, you will end up in a state where your infrastructure pods will not start. This includes all pods in kube-system, cattle-system and ingress-nginx. Because these core pods are not starting, all of your workload pods will be unable to function correctly. If you find yourself in this situation, you can use the directions below to fix the cluster.
Recovery
Delete all service-account-token secrets in kube-system, cattle-system and ingress-nginx namespaces.
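A sketch of that step, assuming kubectl can still reach the API; the field selector narrows the deletion to service-account tokens only.
for ns in kube-system cattle-system ingress-nginx; do
  kubectl -n "$ns" get secrets \
    --field-selector type=kubernetes.io/service-account-token -o name \
    | xargs -r kubectl -n "$ns" delete
done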
If you are experiencing issues with containers communicating to each other in your Rancher 1.6 environment, your ipsec might be having some issues. In this article I will go over common troubleshooting steps and procedures to correct the problem.
Exec into one of your ipsec-router containers and run the following connectivity test:
for i in `curl -s rancher-metadata/latest/self/service/containers/ | cut -f1 -d=`; do ping -c2 `curl -s rancher-metadata/latest/self/service/containers/$i/primary_ip`; done
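A hypothetical variant of the same loop that reports only the containers that fail, which is easier to read against a large service:
for i in `curl -s rancher-metadata/latest/self/service/containers/ | cut -f1 -d=`; do
  ip=`curl -s rancher-metadata/latest/self/service/containers/$i/primary_ip`
  ping -c2 -W2 "$ip" > /dev/null 2>&1 || echo "FAILED: $i ($ip)"
done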
If the kube-apiserver is in a restart loop, it is possible that one of the etcd servers it is trying to connect to is no longer reachable. It should simply move on to the next etcd server, but in some rare cases it does not. In those situations, you need to remove the unreachable etcd servers from its startup options, as shown below.
Get the runlike command for kube-apiserver with the following command:
docker run --rm -v /var/run/docker.sock:/var/run/docker.sock axeal/runlike kube-apiserver
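From there, a sketch of the fix; the etcd member IP and the apiserver.sh filename are assumptions for illustration.
# save the runlike output, then drop the dead member from --etcd-servers
docker run --rm -v /var/run/docker.sock:/var/run/docker.sock axeal/runlike kube-apiserver > apiserver.sh
sed -i 's#,https://172.16.0.12:2379##' apiserver.sh   # remove the unreachable etcd server
# replace the container using the edited command
docker stop kube-apiserver && docker rename kube-apiserver kube-apiserver-old
bash apiserver.sh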
If your etcd logs start showing messages like the following, your storage might be too slow for etcd or the server might be doing too much for etcd to operate properly.
2019-08-11 23:27:04.344948 W | etcdserver: read-only range request "key:\"/registry/services/specs/default/kubernetes\" " with result "range_response_count:1 size:293" took too long (1.530802357s) to execute
If your storage is really slow, you will even see alerts in your monitoring system. What can you do to verify the performance of your storage? And if the storage is not performing correctly, how can you fix it? After researching this, I found an IBM article that covers the topic extensively, and their findings on how to test were very helpful. The biggest factor is your storage latency: if it is not well below 10ms at the 99th percentile, you will see warnings in the etcd logs. We can test this with a tool called fio, which I will outline below.
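The IBM test boils down to a single fio run against the disk that backs etcd's data directory; the 2300-byte block size approximates the size of a typical etcd write.
mkdir test-data
fio --rw=write --ioengine=sync --fdatasync=1 --directory=test-data --size=22m --bs=2300 --name=etcd-perf
# in the fdatasync section of the output, check the 99.00th percentile;
# it should be comfortably under 10ms (10000 usec)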
In a perfect world our clusters would never experience a complete and total failure where data from all nodes is unrecoverable. Unfortunately this scenario is very possible and has happened before. In this article I will outline how to best prepare your environment for recovery in situations like this.
Situation: Employee A accidentally deletes all of the VMs for a production cluster after testing his latest script. How do you recover?
Option A: Keep VM snapshots of all of the nodes so that you can just restore them if they are deleted.
Option B: Manually bootstrap a new controlplane and etcd node to match one of the original nodes that were deleted.
In this article, I'm going to focus on Option B. In order to bootstrap a controlplane/etcd node, you will need an etcd snapshot, the Kubernetes certificates, and the runlike commands from the core Kubernetes components. If you prepare ahead of time for something like this, you can save a lot of time when it comes time to recover.
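As a starting point, here is a sketch of what preparing ahead of time might look like when run periodically on an existing controlplane/etcd node. The backup path is an assumption, the certificate path is the RKE default, and it reuses the axeal/runlike image from above.
# hypothetical DR prep: stash the certificates and runlike commands for the core components
mkdir -p /root/dr-backup
cp -a /etc/kubernetes/ssl /root/dr-backup/ssl
for c in etcd kube-apiserver kube-controller-manager kube-scheduler kubelet kube-proxy; do
  docker run --rm -v /var/run/docker.sock:/var/run/docker.sock \
    axeal/runlike "$c" > "/root/dr-backup/runlike-$c.sh"
done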