The incident management steps I have in mind when I am on call and get an alert are:
- Verify the issue
- Triage
- Communicate and escalate if needed
- Mitigate
- Troubleshoot
- Postmortem
As a general troubleshooting or debugging technique:
- Do not make things worse (e.g., don’t randomly change things you are not familiar with, and know which changes are hard to roll back).
- Communicate with the team. Take notes as you go of what you see and what you change. A chat medium like Slack is good because it also keeps a timeline, but have a backup way of communicating in case it is unavailable. Communicate intent and be specific (e.g., “going to restart the x database in y host”). Acknowledge other people’s messages. Make sure everybody knows who has the controls so people don’t step on each other. Usually you want one person leading the troubleshooting and making the changes, with other people supporting by checking things and communicating with Customer Support and other teams.
- Try to divide the problem space, ideally in half. You don’t need to start in a strictly systematic way if you have strong historical indicators of where the problems have been.
- Try what has worked before. (But if you have to fix the exact same issue more than once or twice, that is a strong indicator of poor engineering practices.)
- Run first the tests that are quick to do and still give relevant information.
- If the quick initial tests fail, it’s often a good idea to pause, step back and restart debugging in a more systematic fashion, testing more basic assumptions and validating your mental model of how things are supposed to work with other people.
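The note-taking advice above gets easier with a tiny shell helper. This is only a sketch: the function name `note` and the `INCIDENT_LOG` variable are made up for illustration and are not part of any tool.

```shell
# Hypothetical helper: append a UTC-timestamped line to an incident log.
# INCIDENT_LOG is an assumed variable; it defaults to a file in the current directory.
note() {
    printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >> "${INCIDENT_LOG:-./incident-notes.log}"
}

# Usage, stating intent before acting:
#   note "going to restart the x database in y host"
#   note "restart done, error rate back to normal"
```

The resulting file can be pasted into the chat channel or the postmortem as a ready-made timeline.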
# load:
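A load average only means something relative to the number of CPUs. The sketch below assumes a Linux host with `/proc/loadavg` and `nproc`; the function name `load_per_cpu` is made up here.

```shell
# Print the 1-minute load average divided by the number of available CPUs.
# On Linux the load counts runnable and uninterruptible tasks, so a value
# that stays above 1.0 means more such tasks than CPUs.
load_per_cpu() {
    cpus=$(nproc)
    awk -v cpus="$cpus" '{ printf "%.2f\n", $1 / cpus }' /proc/loadavg
}

load_per_cpu
# For the raw 1, 5 and 15 minute averages: uptime
```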
# what it does
netstat -tlpn # package net-tools
ps auxf
# memory:
# vmstat
# r: runnable (running or waiting to run in queue)
# b: uninterruptible sleep (D in ps)
vmstat # summary
vmstat -w 1 5 # every 1 sec, 5 reports, wide output (first line is the average since boot)
vmstat -s # summary memory stats
free -m
grep -i oom /var/log/messages # or /var/log/syslog, depending on the distro
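For a quick number to paste into the incident channel, `MemAvailable` in `/proc/meminfo` is usually a better signal than the "free" column. A minimal sketch, assuming a Linux kernel that exposes `MemAvailable`; the function name is made up.

```shell
# Print MemAvailable as a percentage of MemTotal.
# Reads /proc/meminfo by default; pass another file in the same format to override.
mem_available_pct() {
    awk '/^MemTotal:/ { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) printf "%d\n", avail * 100 / total }' "${1:-/proc/meminfo}"
}

mem_available_pct
```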
# CPU:
# package sysstat:
mpstat -P ALL # cpu balance
pidstat 1
pidstat -p $pid
# disk:
vmstat -d
df -h
df -i
iostat -xz 1
# directories in / holding the most data directly in them (MB, subdirectories not counted, one filesystem):
du -mxS / |sort -n|tail -10
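To rank individual files rather than per-directory totals, `find` can print sizes directly. A minimal sketch, assuming GNU find (for `-printf`); the function name `biggest_files` is made up here.

```shell
# List the N largest regular files under a directory without crossing
# filesystem boundaries (-xdev). Sizes are in bytes, largest last.
biggest_files() {
    dir=${1:-/}
    count=${2:-10}
    find "$dir" -xdev -type f -printf '%s\t%p\n' 2>/dev/null | sort -n | tail -n "$count"
}

# Usage: biggest_files /var 5
```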
# network:
# package sysstat
sar -n DEV 1 # network throughput
sar -n TCP,ETCP 1 # TCP stats (also: ss -s)
# distro:
cat /etc/debian_version
lsb_release -a # apt-get install lsb-release
# boot
dmesg |tail
last -a
journalctl -n 20 --no-pager -u nginx # last lines for a specific unit
journalctl --since yesterday --until "1 hour ago" # also accepts dates and times such as 2022-12-24 or 08:00
journalctl -k # kernel messages, like dmesg
journalctl -p err # priority err and more severe; numeric levels 0 to 7 also work
dmesg | tail
tail /var/log/messages # or /var/log/syslog, /var/log/kern.log, depending on the distro
systemctl # same as systemctl list-units
systemctl cat <service> # shows location and contents of the unit file for <service>
systemctl list-unit-files # lists unit files and their state (enabled, disabled, masked; masked units won't start until unmasked)
systemctl reload <unit> # ask the service to reload its configuration (after editing unit files, run systemctl daemon-reload)
systemctl --failed
systemd-analyze # startup time, append 'blame' for breakdown
fdisk -l
df -lT # -l local, -T type
lsblk -f # filesystem
file -s /dev/hda1
blkid /dev/hda1
cat /etc/fstab
fsck.ext4 -p /dev/sda1 # check and fix, if dirty
xfs_repair -n /dev/sda # scan
xfs_repair /dev/sda # scan and fix
ss -s
netstat -s
netstat -i
ip -s link
lsof -i
sar -n DEV
ip route
netstat -r
iptables -L
iptables -t nat -L # NAT rules are not shown by a plain -L
curl options:
curl -v
curl -I # header info
curl -L # follow location
curl -O # save the download under its remote file name
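When a request is slow, curl's `-w` write-out variables help split the time between DNS, TCP connect, TLS and the server itself. A sketch, with a made-up function name; the URL in the usage line is a placeholder.

```shell
# Print status code and timings for one request. All timings are in seconds
# and cumulative from the start of the request, so connect includes dns, etc.
curl_timing() {
    curl -sS -o /dev/null \
        -w 'code=%{http_code} dns=%{time_namelookup} connect=%{time_connect} tls=%{time_appconnect} ttfb=%{time_starttransfer} total=%{time_total}\n' \
        "$1"
}

# Usage: curl_timing https://example.com/
```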
uname -a
sysctl -a
strace -p $pid # running program
strace -c $program # run & summary
strace -e trace=write # filter
Also info under /proc/$pid/
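As an example of what `/proc/$pid/` offers, the sketch below pulls a few fields from `status` and counts open file descriptors. The function name `proc_summary` is made up; reading another user's `fd` directory typically needs root.

```shell
# Summarise a process from /proc: name, state, thread count and open file descriptors.
# Defaults to the current shell ($$).
proc_summary() {
    pid=${1:-$$}
    grep -E '^(Name|State|Threads):' "/proc/$pid/status"
    printf 'OpenFDs:\t%s\n' "$(ls "/proc/$pid/fd" | wc -l)"
}

proc_summary
```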
openssl x509 -in /path/to/server/certificate -text
openssl s_client -connect <host>:<port>
openssl s_client -connect <host>:<port> -servername <hostname> -showcerts </dev/null | openssl x509 -text -noout
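Expired certificates are a common cause of alerts, and `openssl x509` can check the expiry directly. A minimal sketch for a PEM file on disk; the function name `cert_expiry` and the 30-day default are assumptions for illustration.

```shell
# Print the notAfter date of a PEM certificate, then use -checkend to test
# whether it expires within the next N days (default 30).
cert_expiry() {
    cert=$1
    days=${2:-30}
    openssl x509 -in "$cert" -noout -enddate
    if openssl x509 -in "$cert" -noout -checkend $((days * 86400)) >/dev/null; then
        echo "ok: valid for more than $days days"
    else
        echo "warning: expires within $days days"
    fi
}

# Usage: cert_expiry /path/to/server/certificate 14
```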
Limits (ulimit) for login sessions are set in /etc/security/limits.conf. Check a user's effective limits with:
su - username -c 'ulimit -a'
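A process that is already running keeps the limits it started with, so it is worth checking `/proc/<pid>/limits` as well. A sketch with a made-up function name:

```shell
# Show the soft and hard "open files" limit of a running process.
# Defaults to the current shell; pass a PID to inspect a daemon instead.
open_files_limit() {
    awk '/^Max open files/ { print "soft=" $4, "hard=" $5 }' "/proc/${1:-$$}/limits"
}

open_files_limit
```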
cat /proc/cgroups
docker ps -a
docker stats --all
docker logs <container>
docker inspect <container>
docker diff <container> # files changed
docker top <container>
docker update --help # update memory/CPU settings of a running container
docker update -m 10M -c 2 <container> # -m: memory limit, -c: CPU shares (relative weight, not a CPU count)
# override entrypoint or command:
docker run -it --entrypoint /bin/bash <image>
docker run -it <image> /bin/bash
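When a container keeps dying, `docker inspect` with a Go template gives the exit code and whether it was OOM-killed without scrolling through the full JSON. A sketch; the function name is made up, the template fields are standard `docker inspect` output.

```shell
# Print status, exit code, OOM-kill flag and restart count for one container.
container_exit_info() {
    docker inspect --format '{{.State.Status}} exit={{.State.ExitCode}} oom={{.State.OOMKilled}} restarts={{.RestartCount}}' "$1"
}

# Usage: container_exit_info <container>
```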
kubectl cluster-info
kubectl get events --sort-by=.metadata.creationTimestamp
kubectl get pods --show-labels -o wide
kubectl top node my-node
kubectl api-resources
kubectl explain pods
kubectl rollout history deployment/frontend
kubectl rollout undo deployment/frontend --to-revision=3
kubectl rollout restart deployment/frontend
kubectl logs mypod --since 2m
kubectl logs mypod --previous
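# CrashLoopBackOff: the container starts and keeps exiting, e.g. an image with a bad CMD (an image that can't be pulled shows up as ImagePullBackOff instead).
# To debug, deploy the image with a sleep command so the container stays up.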
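One way to run the image with a sleep command is `kubectl run` with `--command`, which replaces the image's entrypoint. A sketch only: the function name, the default pod name and the image in the usage lines are hypothetical.

```shell
# Hypothetical helper: start IMAGE as a pod named NAME with its command replaced
# by `sleep`, so the container stays up for inspection instead of crash-looping.
sleep_pod() {
    image=$1
    name=${2:-debug-sleep}
    kubectl run "$name" --image="$image" --restart=Never --command -- sleep 3600
}

# Usage:
#   sleep_pod myregistry/myapp:broken myapp-sleep
#   kubectl exec -it myapp-sleep -- sh      # assumes the image ships a shell
#   kubectl delete pod myapp-sleep
```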
kubectl describe ingress myingress
kubectl port-forward svc/my-service 5000
kubeval my-invalid.yaml
kubectl diff -f ./my-manifest.yaml
kubectl debug -it yourpod --image=busybox:1.28 --target=yourpod # ephemeral container; --target is the name of the container to attach to
kubectl debug myapp -it --image=ubuntu --share-processes --copy-to=myapp-debug
kubectl debug myapp -it --copy-to=myapp-debug -- sh
kubectl debug node/mynode -it --image=ubuntu
From dnsutils: dig +short, dig @ns_ip. Plain dig resolves a name and, in its SERVER line, tells you which DNS server answered.
Critical files:
/etc/nsswitch.conf # order of resolving
/etc/resolv.conf # nameservers
/etc/hosts # hard-coded hostname-ip maps
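Because `dig` only queries DNS, it can disagree with what applications see. `getent hosts` resolves through the NSS order in `/etc/nsswitch.conf`, so `/etc/hosts` entries are honoured. A small sketch with a made-up function name:

```shell
# Resolve a name the way most libc-based applications do (hosts file first
# or DNS first, depending on /etc/nsswitch.conf).
resolve_like_libc() {
    getent hosts "$1"
}

resolve_like_libc localhost
```

If this and `dig +short` return different answers, look at `/etc/hosts` and the `hosts:` line in `/etc/nsswitch.conf`.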
Test configuration: nginx -t
Test and dump config: nginx -T
etcdctl get --prefix --keys-only /
etcdctl get "" --prefix=true --keys-only
etcdctl endpoint status --write-out=table # use --write-out=json to get the revision
etcdctl member list --write-out=table
etcdctl alarm list
grep "[CE] |" etcd.log # critical (C) and error (E) lines in the older etcd log format
grep "apply entries took too long" etcd.log
etcdctl compact $revision # discard history older than this revision
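The value to pass to `etcdctl compact` is the current revision, which can be pulled out of the JSON status output. A sketch assuming `etcdctl` is already configured (endpoints, TLS flags); the function name `current_revision` is made up.

```shell
# Extract the first "revision" value from `etcdctl endpoint status --write-out=json`
# output read on stdin.
current_revision() {
    grep -Eo '"revision":[0-9]+' | head -n 1 | grep -Eo '[0-9]+'
}

# Usage against a live cluster:
#   rev=$(etcdctl endpoint status --write-out=json | current_revision)
#   etcdctl compact "$rev"
#   etcdctl defrag          # release the space freed by compaction
#   etcdctl alarm disarm    # clear a NOSPACE alarm once space is available
```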