Troubleshooting

Intro

The incident management steps I keep in mind when I am on call and get an alert are:

Verify the issue
Triage
Communicate and escalate if needed
Mitigate
Troubleshoot
Postmortem

General troubleshooting and debugging techniques:

Do not make things worse (e.g., don’t randomly change things you are not familiar with, and know which changes are hard to roll back).
Communicate with the team. Take notes as you go of what you see and what you change. A chat medium like Slack works well because it also keeps a timeline (have backup ways of communicating as well). Communicate intent and be specific (e.g., “going to restart the x database on host y”). Acknowledge other people’s messages. Make sure everybody knows who has the controls so people don’t step on each other. Usually you want one person leading the troubleshooting and making the changes, with other people supporting by checking things and communicating with Customer Support and other teams.
Try to divide the problem space, ideally in half. You don’t need to start in a strictly systematic way if you have strong historical indicators of where the problems have been.
Try what has worked before. (But if you have to fix the exact same issue more than once or twice, that is a strong indicator of poor engineering practices.)
Run first the tests that are fast and still give relevant information.
If the quick initial tests fail, it’s often a good idea to pause, step back, and restart debugging more systematically, testing more basic assumptions and validating your mental model of how things are supposed to work with other people.
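
For the note-taking advice above, a minimal sketch of a timestamped notes helper. The function name `note` and the `INCIDENT_LOG` variable are made up for illustration; it complements a chat timeline rather than replacing it.

```shell
# Hypothetical helper: append a UTC-timestamped line to an incident notes file.
# INCIDENT_LOG is an assumed variable name; point it wherever suits you.
INCIDENT_LOG="${INCIDENT_LOG:-$HOME/incident-notes.log}"

note() {
    # All arguments become the note text; the timestamp is ISO 8601 in UTC.
    printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >> "$INCIDENT_LOG"
}
```

Usage: `note "going to restart the x database on host y"`, then paste the file into the postmortem timeline.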

Linux server overview

Review:

# load:
uptime

# what the server does (listening sockets, process tree):
netstat -tlpn # package net-tools
ps auxf

# memory:
# vmstat
# r: runnable (running or waiting to run in queue)
# b: uninterruptible sleep (D in ps)
vmstat # summary
vmstat -w 1 5 # wide output, every 1 sec, 5 reports (first report is the average since boot)
vmstat -s # summary memory stats

free -m
grep -i oom /var/log/messages # or /var/log/syslog

# CPU:
top

# package sysstat:
mpstat -P ALL # cpu balance

lscpu

pidstat
pidstat 1
pidstat -p $pid

# disk:
vmstat -d
df -h
df -i
iostat -xz 1

# biggest directories under / (size in MB, subdirectories not included, same filesystem only):
du -mxS / |sort -n|tail -10

# network:
# package sysstat
sar -n DEV 1 # network throughput
sar -n TCP,ETCP 1 # TCP stats (also: ss -s)

# distro:
cat /etc/debian_version
lsb_release -a # apt-get install lsb-release

# boot
dmesg |tail
last -a
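
The load average from uptime is easier to judge relative to the number of CPUs. A small sketch, assuming Linux (/proc/loadavg and nproc); the function names are hypothetical. Keep in mind that on Linux the load average also counts tasks in uninterruptible sleep, so a high ratio can point to I/O as well as CPU.

```shell
# Sketch: relate the 1-minute load average to the number of CPUs.
# load_ratio and current_load_ratio are hypothetical helper names.
load_ratio() {
    # $1 = load average, $2 = CPU count; prints load per CPU with two decimals.
    awk -v load="$1" -v cpus="$2" 'BEGIN { printf "%.2f\n", load / cpus }'
}

current_load_ratio() {
    # First field of /proc/loadavg is the 1-minute load; nproc counts available CPUs.
    load_ratio "$(cut -d ' ' -f 1 /proc/loadavg)" "$(nproc)"
}
```

A ratio well above 1 for a sustained period is a hint to look at the r and b columns of vmstat next.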

Logging

journalctl
journalctl -n 20 --no-pager -u nginx # last lines for a specific unit
journalctl --since yesterday --until "1 hour ago" # also accepts timestamps such as 2022-12-24 or 08:00
journalctl -k # kernel messages, dmesg
journalctl -p err # priority by name or number, 0 (emerg) to 7 (debug); err is 3

dmesg | tail
tail /var/log/messages # or /var/log/syslog, /var/log/kern.log
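
Since the log file name depends on the distro, a sketch that searches whichever of the usual files exist for OOM-killer messages. `oom_lines` is a hypothetical helper name, and the pattern is deliberately broad, so expect some unrelated matches.

```shell
# Sketch: grep the usual log files (or the ones passed as arguments) for OOM messages.
oom_lines() {
    if [ "$#" -eq 0 ]; then
        set -- /var/log/messages /var/log/syslog /var/log/kern.log
    fi
    for f in "$@"; do
        # Skip files that are absent or unreadable on this distro.
        [ -r "$f" ] || continue
        grep -i -E -H 'out of memory|oom' "$f"
    done
    # A file without matches is not an error for this helper.
    return 0
}
```

On hosts that only log to the journal, `journalctl -k | grep -i oom` covers the same ground.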

systemd

systemctl # same as systemctl list-units
systemctl cat <service> # shows location and contents of config file for <service>
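systemctl list-unit-files # shows state, including masked units (won't start until you run systemctl unmask)
systemctl reload <service> # ask the service to reload its own configuration
systemctl daemon-reload # make systemd re-read unit files after editing or installing them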

systemctl --failed 
systemd-analyze # startup time, append 'blame' for breakdown
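
A sketch that combines the commands above: list the failed units and print the last journal lines for each. The function names are hypothetical; `--no-legend` and `--plain` are used so the unit name is the first column.

```shell
# Sketch: show recent logs for every failed unit.
unit_names() {
    # Reads `systemctl --failed --no-legend --plain` output on stdin, prints unit names.
    awk '{ print $1 }'
}

show_failed_logs() {
    for unit in $(systemctl --failed --no-legend --plain | unit_names); do
        echo "== $unit =="
        journalctl -u "$unit" -n 20 --no-pager
    done
}
```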

Filesystems and volumes

fdisk -l
df -lT # -l local, -T type
lsblk -f # filesystem
file -s /dev/hda1
blkid /dev/hda1

mount
cat /etc/fstab

fsck.ext4 -p /dev/sda1  # check and fix, if dirty (run on an unmounted filesystem)

xfs_repair -n /dev/sda  # scan
xfs_repair /dev/sda     # scan and fix
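
A full filesystem can mean no free space or no free inodes, so both are worth checking. A sketch that filters `df -P` output by usage percentage; `over_threshold` is a hypothetical helper, and it assumes mount points without spaces.

```shell
# Sketch: print mount points whose use% is at or above a limit.
over_threshold() {
    # $1 = percentage limit; reads `df -P` style output on stdin.
    # Column 5 is the use percentage, column 6 the mount point.
    awk -v limit="$1" 'NR > 1 {
        pct = $5
        sub(/%/, "", pct)
        if (pct + 0 >= limit + 0) print $6, $5
    }'
}
```

Usage: `df -P -l | over_threshold 90` for space and `df -P -i -l | over_threshold 90` for inodes.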

Networking

ss -s
netstat -s
netstat -i
ip -s link
ifconfig
lsof -i
sar -n DEV

ip route
netstat -r

iptables -L
iptables -t nat -L # NAT table; not shown by a plain -L

curl options:

curl -v
curl -I # header info
curl -L # follow location
curl -O # download original name

NIC configuration (RHEL/CentOS):

# /etc/sysconfig/network-scripts/ifcfg-eth0
DEVICE="eth0"
BOOTPROTO="dhcp"

Kernel

uname -a
sysctl -a

strace

strace -p $pid        # running program
strace -c $program    # run & summary
strace -e trace=write # filter

Also info under /proc/$pid/

SSL/TLS

openssl x509 -in /path/to/server/certificate -text
openssl s_client -connect example.com:443
	HEAD / HTTP/1.1
	Host: example.com

openssl s_client -connect example.com:443 -servername example.com -showcerts </dev/null | openssl x509 -text -noout
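
Expired certificates are a common cause of TLS incidents. A sketch of an expiry check built on `openssl x509 -checkend`; the function names are hypothetical. Note that an unreadable file also makes `cert_expires_within` succeed, so check the path first.

```shell
# Sketch: succeed (exit 0) if the PEM certificate in file $1 expires within $2 days.
cert_expires_within() {
    # -checkend N exits 0 when the certificate is still valid N seconds from now,
    # so the result is inverted here.
    ! openssl x509 -in "$1" -noout -checkend "$(( $2 * 86400 ))" >/dev/null
}

# Print the notAfter date of the certificate in file $1.
cert_end_date() {
    openssl x509 -in "$1" -noout -enddate
}
```

For a remote server, the same `-enddate` option works on the s_client output: `openssl s_client -connect example.com:443 -servername example.com </dev/null 2>/dev/null | openssl x509 -noout -enddate`.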

cgroups

ulimits for a login session are set in /etc/security/limits.conf. To see them for a given user:

su - username -c 'ulimit -a'

cat /proc/cgroups
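
The limits of an already running process can differ from what limits.conf says, and they can be read from /proc. A sketch, assuming the usual layout of /proc/<pid>/limits (limit name, soft limit, hard limit, units); `soft_limit` is a hypothetical helper.

```shell
# Sketch: print the soft value of one limit from /proc/<pid>/limits content on stdin.
soft_limit() {
    # $1 = limit name exactly as in the first column, e.g. "Max open files".
    awk -v name="$1" 'index($0, name) == 1 {
        rest = substr($0, length(name) + 1)
        split(rest, f, " ")
        print f[1]
    }'
}
```

Usage: `soft_limit "Max open files" < /proc/$pid/limits`. The cgroup the process belongs to is in /proc/$pid/cgroup.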

Docker

docker ps -a
docker stats --all

docker logs <container>
docker inspect  <container>
docker diff <container>     # files changed
docker top <container>

# update memory/CPU settings of a running container (see docker update --help):
docker update -m 10M -c 2 <container> # -c is --cpu-shares (a relative weight); use --cpus for a hard CPU limit

# override entrypoint or command:
docker run -it --entrypoint /bin/bash <image>
docker run -it <image> /bin/bash

Kubernetes

kubectl cluster-info
kubectl get events --sort-by=.metadata.creationTimestamp
kubectl get pods --show-labels -o wide
kubectl top node my-node
kubectl api-resources
kubectl explain pods

kubectl rollout history deployment/frontend
kubectl rollout undo deployment/frontend --to-revision=3
kubectl rollout restart deployment/frontend

kubectl logs mypod --since 2m
kubectl logs mypod --previous

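# CrashLoopBackOff: the container keeps exiting (e.g. image with a bad CMD).
# An image that can't be pulled shows up as ImagePullBackOff instead.
# To investigate, deploy the image with a sleep command as its CMD.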

kubectl describe ingress myingress
kubectl port-forward svc/my-service 5000 

kubeval my-invalid.yaml
kubectl diff -f ./my-manifest.yaml

# https://kubernetes.io/docs/tasks/debug-application-cluster/debug-running-pod/

kubectl debug -it yourpod --image=busybox:1.28 --target=yourpod # --target is a container name inside the pod
kubectl debug myapp -it --image=ubuntu --share-processes --copy-to=myapp-debug
kubectl debug myapp -it --copy-to=myapp-debug -- sh
kubectl debug node/mynode -it --image=ubuntu
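
To spot the pods worth debugging first, a sketch that filters `kubectl get pods --no-headers` output for pods that are not Running or not fully ready. `unhealthy_pods` is a hypothetical helper; it assumes the default single-namespace columns (NAME READY STATUS RESTARTS AGE), so it does not fit `-A` or `-o wide` output.

```shell
# Sketch: list pods that are not Running, or Running but not fully ready.
unhealthy_pods() {
    # Reads `kubectl get pods --no-headers` output on stdin.
    awk '
        $3 == "Completed" { next }      # finished jobs are expected to be 0/1
        {
            split($2, ready, "/")       # READY column looks like 1/2
            if ($3 != "Running" || ready[1] != ready[2])
                print $1, $3, "restarts=" $4
        }'
}
```

Usage: `kubectl get pods --no-headers | unhealthy_pods`, then `kubectl describe pod` and `kubectl logs --previous` on whatever it prints.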

DNS

host example.com

From dnsutils package:

dig +short example.com
dig @ns_ip example.com # query a specific nameserver

nslookup example.com # resolves and tells you what DNS server you are using

Critical files:

/etc/nsswitch.conf # order of resolving
/etc/resolv.conf   # nameservers
/etc/hosts         # hard-coded hostname-ip maps
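
Because of these files, what an application resolves can differ from what the DNS server answers. A sketch that shows both sides; the function names are hypothetical, and `getent` is assumed to be available (it is part of glibc).

```shell
# What applications see: follows /etc/nsswitch.conf, so /etc/hosts entries count.
resolve_nss() {
    getent hosts "$1" | awk '{ print $1 }'
}

# What DNS answers: dig ignores /etc/hosts. Optional $2 = nameserver to query.
resolve_dns() {
    dig +short ${2:+"@$2"} "$1" A
}
```

If `resolve_nss example.com` and `resolve_dns example.com` disagree, a stale /etc/hosts entry or a local caching resolver is a likely suspect.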

Applications

nginx

nginx -t # test configuration
nginx -T # test and dump configuration

etcd

etcdctl get --prefix --keys-only /
etcdctl get "" --prefix=true --keys-only
etcdctl endpoint status --write-out=table  # json to get version
etcdctl member list --write-out=table 
etcdctl alarm list

curl http://127.0.0.1:2379/health

grep "[CE] |" etcd.log
grep "apply entries took too long" etcd.log

etcdctl compact $revision # discards key history older than the given revision (a revision number, not the etcd version)
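
The /health endpoint above is easy to wire into a quick check. A sketch that assumes the response body is JSON containing "health":"true" when the member is healthy; `etcd_is_healthy` is a hypothetical helper.

```shell
# Sketch: exit status 0 if the /health response body on stdin reports healthy.
etcd_is_healthy() {
    grep -Eq '"health"[[:space:]]*:[[:space:]]*"true"'
}
```

Usage: `curl -s http://127.0.0.1:2379/health | etcd_is_healthy && echo ok || echo unhealthy`.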

fduran/Troubleshooting.md