IPMI Troubleshooting

Infinite Loading of Supermicro BMC

Firmware memory corruption caused by a forced shutdown of the server.
- This actually happened with a Samsung PM983 NVMe SSD.

Back up your ETCD data to a safe location.
Open the etcd.env file on one of your ETCD cluster nodes and set the values below.
- ETCD_FORCE_NEW_CLUSTER=true
- ETCD_INITIAL_CLUSTER=(the existing value with the broken nodes removed)
Restart the etcd service.
Check whether the etcd service is running.
- Check whether the broken nodes have been removed from the member list.
Remove the ETCD_FORCE_NEW_CLUSTER flag and restart the etcd service again.
Wait a few minutes and check whether your Kubernetes cluster has recovered.
- Restarting kubelet is recommended: it will recover broken core K8S services.
- Restarting your provisioning services is recommended.
- Rebooting the nodes will resolve most container-related issues.
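
The etcd.env edit described above can be scripted. The following is a minimal sketch, not a tested recovery tool: the helper names prune_initial_cluster and force_new_cluster are hypothetical, and it assumes that ETCD_INITIAL_CLUSTER sits on a single unquoted line of comma-separated name=peer-url pairs and that GNU sed is available.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and the env-file layout are assumptions.

# Print an ETCD_INITIAL_CLUSTER value with the named members removed.
# $1: comma-separated "name=peer-url" list; remaining args: member names to drop.
prune_initial_cluster() {
  local cluster="$1"; shift
  local entry name drop keep out=""
  local IFS=','
  for entry in $cluster; do
    name="${entry%%=*}"
    keep=1
    for drop in "$@"; do
      if [ "$name" = "$drop" ]; then keep=0; fi
    done
    if [ "$keep" -eq 1 ]; then out="${out:+$out,}$entry"; fi
  done
  printf '%s\n' "$out"
}

# Rewrite an etcd env file in place: prune the broken members and make sure
# ETCD_FORCE_NEW_CLUSTER=true is present. A .bak copy is kept next to the file.
# $1: path to the env file; remaining args: member names to drop.
force_new_cluster() {
  local env_file="$1"; shift
  local current pruned
  # The trailing "=" keeps ETCD_INITIAL_CLUSTER_STATE/_TOKEN lines untouched.
  current="$(sed -n 's/^ETCD_INITIAL_CLUSTER=//p' "$env_file")"
  pruned="$(prune_initial_cluster "$current" "$@")"
  cp "$env_file" "$env_file.bak"
  sed -i "s|^ETCD_INITIAL_CLUSTER=.*|ETCD_INITIAL_CLUSTER=$pruned|" "$env_file"
  grep -q '^ETCD_FORCE_NEW_CLUSTER=' "$env_file" \
    || echo 'ETCD_FORCE_NEW_CLUSTER=true' >> "$env_file"
}
```

A call might look like `force_new_cluster /etc/etcd.env [broken-node-name]`, followed by restarting the etcd service and checking `etcdctl member list`. The path /etc/etcd.env is an assumption here; use whatever location your installation keeps the file in.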

Back up your data to a safe location.
- ETCD: /opt/etcd/ /etc/etcd /etc/etcd.env
- Control Plane: /etc/kubernetes /var/lib/kubelet
- Rook Ceph: /var/lib/rook
Drain the nodes
Reinstall the OS
- Rook Ceph: DO NOT WIPE THE DATA VOLUME
Restore the data and reinstall K8S
Uncordon the nodes
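
The backup step can be sketched as a small helper. This is an illustration under assumptions rather than a complete backup procedure: the function name backup_paths and the archive naming scheme are hypothetical, and it expects absolute paths.

```shell
#!/usr/bin/env bash
# Sketch only: backup_paths and the archive naming are illustrative assumptions.

# Archive every existing path into directory $1 as one .tar.gz per path.
# Missing paths are reported on stderr and skipped.
backup_paths() {
  local dest="$1"; shift
  local path name
  mkdir -p "$dest"
  for path in "$@"; do
    if [ ! -e "$path" ]; then
      echo "skip (not found): $path" >&2
      continue
    fi
    # Derive the archive name from the path: /etc/etcd.env -> etc_etcd.env
    name="$(printf '%s' "${path#/}" | tr '/' '_')"
    name="${name%_}"
    # -p preserves permissions; -C / stores paths relative to the root.
    tar -czpf "$dest/$name.tar.gz" -C / "${path#/}"
  done
}
```

For the paths listed above, a call might look like `backup_paths [backup-dir] /opt/etcd/ /etc/etcd /etc/etcd.env /etc/kubernetes /var/lib/kubelet /var/lib/rook`. Draining and bringing a node back are usually done with `kubectl drain [node-name] --ignore-daemonsets --delete-emptydir-data` and `kubectl uncordon [node-name]`; whether those drain flags suit your workloads is something to check before running them.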

Add an ETCD node to the existing Kubernetes ETCD cluster.
- etcdctl member add [new-node-name] --peer-urls=https://[new-node-ip]:2380
- You may need to pass certificate files to authenticate the command, as below:
  - --cacert /etc/etcd/ssl/ca.pem
  - --cert /etc/etcd/ssl/admin-[old-node-k8s-name].pem
  - --key /etc/etcd/ssl/admin-[old-node-k8s-name]-key.pem
Update /etc/kubernetes/manifests/kube-apiserver.yaml.
- --etcd-servers=https://[new-node-ip]:2379
- The Kubernetes manifest directory may differ (e.g. with Kubespray).
Restart the kubelet service.
- systemctl restart kubelet.service
Wait a few seconds and check that the K8S cluster is running.
Remove the old ETCD node from your cluster.
- etcdctl member remove [old-node-id]
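
Because the certificate flags have to be repeated on every etcdctl call, a small wrapper can help. The sketch below is an assumption-laden illustration: etcdctl_secure, DRY_RUN, ETCD_SSL_DIR and ETCD_ENDPOINTS are hypothetical names, the default endpoint https://127.0.0.1:2379 is a guess that may not match your cluster, and the etcdctl v3 API is assumed.

```shell
#!/usr/bin/env bash
# Sketch only: etcdctl_secure and its environment switches are hypothetical.
# The certificate layout follows the paths listed above; etcdctl v3 is assumed.

# Run (or, with DRY_RUN=1, only print) etcdctl with TLS client flags.
# $1: Kubernetes name of the node whose admin certificate is used;
# remaining args: the etcdctl subcommand and its arguments.
etcdctl_secure() {
  local node="$1"; shift
  local ssl_dir="${ETCD_SSL_DIR:-/etc/etcd/ssl}"
  local endpoints="${ETCD_ENDPOINTS:-https://127.0.0.1:2379}"
  # Prepend the binary and the TLS flags to the caller's arguments.
  set -- etcdctl \
    --endpoints "$endpoints" \
    --cacert "$ssl_dir/ca.pem" \
    --cert "$ssl_dir/admin-${node}.pem" \
    --key "$ssl_dir/admin-${node}-key.pem" \
    "$@"
  if [ "${DRY_RUN:-0}" = 1 ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}
```

With this in place, the add and remove steps might be run as `etcdctl_secure [old-node-k8s-name] member add [new-node-name] --peer-urls=https://[new-node-ip]:2380`, then `etcdctl_secure [old-node-k8s-name] member list` to look up the ID, and finally `etcdctl_secure [old-node-k8s-name] member remove [old-node-id]`. Setting DRY_RUN=1 first prints the full command so it can be reviewed before anything touches the cluster.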