Tejeev Tejeev

  • Rancher
  • Colorado
@azadkuh
azadkuh / vim-cheatsheet.md
Last active November 8, 2024 09:05
vim / vimdiff cheatsheet - essential commands

Vim cheat sheet

Starting Vim

vim [file1] [file2] ...

@patrick0057
patrick0057 / README.md
Last active September 25, 2020 13:08
Major disaster preparation and recovery

Major disaster preparation and recovery

In a perfect world, our clusters would never experience a total failure in which data from all nodes is unrecoverable. Unfortunately, this scenario is entirely possible and has happened before. In this article I will outline how best to prepare your environment for recovery in situations like this.

Situation: Employee A accidentally deletes all of the VMs for a production cluster after testing his latest script. How do you recover?

Option A: Keep VM snapshots of all of the nodes so that you can just restore them if they are deleted.

Option B: Manually bootstrap a new controlplane and etcd node to match one of the original nodes that were deleted.

In this article, I'm going to focus on Option B. To bootstrap a controlplane,etcd node, you will need an etcd snapshot, the Kubernetes certificates, and the runlike commands from the core Kubernetes components. If you prepare ahead of time for something like this, you can save a lot of time when it comes

@patrick0057
patrick0057 / README.md
Last active June 17, 2023 10:05
etcd performance testing and optimization

etcd performance testing and optimization

If your etcd logs start showing messages like the following, your storage might be too slow for etcd or the server might be doing too much for etcd to operate properly.

2019-08-11 23:27:04.344948 W | etcdserver: read-only range request "key:\"/registry/services/specs/default/kubernetes\" " with result "range_response_count:1 size:293" took too long (1.530802357s) to execute
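
To see how slow the flagged requests actually were, it can help to pull just the durations out of these warnings. The following is a minimal sketch, not part of the original article: `slow_requests` is a hypothetical helper name, and it assumes the warning keeps the "took too long (<duration>)" shape shown in the log line above.

```shell
# Hypothetical helper: read etcd log lines on stdin and print only the
# duration reported by each "took too long" warning.
slow_requests() {
  grep 'took too long' | sed -E 's/.*took too long \(([^)]*)\).*/\1/'
}

# Sample line copied from the log excerpt above.
sample='2019-08-11 23:27:04.344948 W | etcdserver: read-only range request "key:\"/registry/services/specs/default/kubernetes\" " with result "range_response_count:1 size:293" took too long (1.530802357s) to execute'

printf '%s\n' "$sample" | slow_requests   # prints 1.530802357s
```

On a real node you would pipe the etcd logs into the helper instead of the sample, for example `docker logs etcd 2>&1 | slow_requests`, assuming etcd runs as a container with that name.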

If your storage is really slow, you will even see it throwing alerts in your monitoring system. What can you do to verify the performance of your storage? If the storage is not performing correctly, how can you fix it? After researching this, I found an IBM article that covers the topic extensively. Their findings on how to test were very helpful. The biggest factor is your storage latency. If it is not well below 10ms in the 99th percentile, you will see warnings in the etcd logs. We can test this with a tool called fio, which I will outline below.
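
The 10ms rule of thumb is easy to turn into a quick check once you have a 99th percentile latency figure. This is only a sketch under stated assumptions: `check_p99_ms` and `THRESHOLD_MS` are hypothetical names, the input is assumed to already be in milliseconds, and the number itself has to come from your own measurement (such as the fio test).

```shell
# Threshold taken from the text: p99 latency should stay below 10 ms.
THRESHOLD_MS=10

# Hypothetical helper: classify a p99 latency given in milliseconds.
# awk is used because plain shell arithmetic cannot compare decimals.
check_p99_ms() {
  awk -v p99="$1" -v limit="$THRESHOLD_MS" 'BEGIN {
    if (p99 + 0 < limit + 0) print "ok"; else print "too slow"
  }'
}

check_p99_ms 2.3    # prints "ok"
check_p99_ms 15.8   # prints "too slow"
```

Since the text asks for latency "well below" 10ms, a value just under the threshold passes this check but should still be treated with suspicion.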

Testing etcd per

@mattmattox
mattmattox / rke_recovery.sh
Last active May 28, 2024 22:09
Recovering cluster.yml and cluster.rkestate from kubeconfig
#!/bin/bash
echo "Building cluster_recovery.yml..."
echo "Working on Nodes..."
# Start the recovery file with the top-level nodes key.
echo 'nodes:' > cluster_recovery.yml
# Read the node list from the full-cluster-state configmap, convert it to YAML,
# indent it, and rename camelCase keys to the snake_case names used in cluster.yml.
kubectl --kubeconfig kube_config_cluster.yml -n kube-system get configmap full-cluster-state -o json | jq -r .data.\"full-cluster-state\" | jq -r .desiredState.rkeConfig.nodes | yq r - | sed 's/^/ /' | \
sed -e 's/internalAddress/internal_address/g' | \
sed -e 's/hostnameOverride/hostname_override/g' | \
sed -e 's/sshKeyPath/ssh_key_path/g' >> cluster_recovery.yml
echo "" >> cluster_recovery.yml
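
The key renaming done by the sed stages above can be tried on its own, without a cluster. This is an illustrative sketch rather than part of the original script: `to_cluster_yml_keys` is a hypothetical helper name and the node entry is made-up sample data.

```shell
# Hypothetical helper: apply the same three renames as the script,
# combined into a single sed call.
to_cluster_yml_keys() {
  sed -e 's/internalAddress/internal_address/g' \
      -e 's/hostnameOverride/hostname_override/g' \
      -e 's/sshKeyPath/ssh_key_path/g'
}

# Made-up node entry; the addresses, hostname, and key path are placeholders.
printf '%s\n' \
  '- address: 203.0.113.10' \
  '  internalAddress: 10.0.0.10' \
  '  hostnameOverride: node-1' \
  '  sshKeyPath: ~/.ssh/id_rsa' | to_cluster_yml_keys
```

The output keeps the structure of the input and only changes the three key names to `internal_address`, `hostname_override`, and `ssh_key_path`.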