This document is a guide to Linux systems troubleshooting, with a slant towards CoreOS Container Linux.
- TCP: Transmission Control Protocol, a network protocol for reliable, error-checked in-order data transmission
- UDP: User Datagram Protocol, a network protocol for unreliable low-latency data transmission
- TLS: Transport Layer Security, an encryption protocol layered over TCP. Successor to SSL (Secure Sockets Layer).
- container: A mechanism for isolating processes from the rest of the system and restricting the resources which processes can use.
- cron: Classically, a daemon that triggered other processes on defined repeating schedules. In modern systems, a generic term for any scheduled trigger. Named after the word chronos, which means time in Greek.
- daemon: A process that runs in the background. Pronouned "demon" or "daymon". Named after the Maxwell's Daemon thought experiment from physics.
- flag/option: A modifer passed to a command preceded by one or two dashes. Examples:
command -a -b
,command -ab
,command --help
- fork: To create a clone.
- kernel: The core software of the operating system which directly manages the hardware.
- manpage: A command's documentation. Short for manual page. On most Linux systems you can access a manpage by running
man <command or topic>
. You can also googleman <topic>
to find them online. - process: An instance of a running program.
- shell: A command line program which provides a scriptable interactive computing environment. The two most common shells are POSIX Shell (
sh
), a simple and older shell, and Bourne Again Shell (bash
), which is relatively newer. - socket: An endpoint that allows processes to communicate with each other. Unix domain sockets allow communication between processes on the same computer. Network sockets allow communication between processes on the same or different computers.
- userspace/userland: The part of the system which is not the kernel.
If you aren't familiar with the fundamentals of *nix command lines, you should learn the following:
- How to run commands in a shell (e.g. Bash)
- Common commands (manpages are your friend!):
cat
head
/tail
ls
pwd
mkdir
cd
pushd
/popd
touch
cp
mv
rm
grep
sed
jq
find
scp
/rsync
tar
gzip
/unzip
vi
/vim
(vimtutor
can help)make
- The differences between short flags, long flags and arguments and how to use them to change the behavior of programs
- How to use pipes (
|
) to chain commands together - How to use wildcards/globs (
*
) to select multiple files in a single command - How to use basic regular expressions to pattern-match text
- Basic control structures:
if; then; elif; fi
,for; do; done
,while; do; done
,until; do; done
- How to write simple Bash scripts using variables
- How the shell environment works (environment variables, TTYs,
$PATH
,export
)
Without these skills the rest of this document will be close to impossible to apply in practice!
systemd is the primary process on most modern Linux systems. It is the first process that starts after the kernel boot and sets up the rest of the system. You can interact with systemd using the command systemctl
.
systemd is configured using unit files, each of which describe a piece of the system (such as a mounted drive/volume, a process, a cron timer...). There are a few places unit files are read from but the main one operators deal with is /etc/systemd/system/
. Units have names like foo.service
(defines a daemon) or foo.timer
(defines a cron trigger).
systemctl
(no arguments): Display all of the units and their statuses.systemctl list-units
: Display units which are either currently active, are pending or have failed.systemctl list-units --all
: Display all units.systemctl list-timers
: Display timer units. Timer units are triggers for starting other units automatically on a repeating schedule.sudo systemctl start <unit>
: Start the given unit, if it is not already active.sudo systemctl enable <unit>
: Configure the given unit to start automatically whenever the system boots.sudo systemctl disable <unit>
: Configure the given unit to NOT start automatically on boot.sudo systemctl restart <unit>
: Stop the given unit if it is active, and then start it.systemctl status <unit>
: Show the given unit's status.systemctl cat <unit>
: Show the definition of a unit.sudo systemctl daemon-reload
: If you edit a unit file, you need to run this command to load the change.systemctl reboot
: Reboot the system.
systemd has excellent manpages.
systemd sends the logs for both the kernel and all of the userspace processes it manages to the system journal. You can query the journal with journalctl
.
You can type journalctl
, which shows all of the logs from every process since the system last booted. You can run journalctl -f
to follow the logs live, which often feels like that scene from The Matrix where the guy is watching the lines of code flow down the screen.
To view the logs for a specific unit, run journalctl -u <unit>
. You can combine the follow and unit options to follow the logs for a specific unit with journalctl -fu <unit>
. This is a very handy command you can easily remember because you are saying "Eff You!" to whatever broken piece of software is currently ruining your life.
See the excellent journalctl manpage. There are many options to modify the output, filter on various parameters and show logs from previous boot cycles.
Processes are running instances of software. Every process has a unique PID (Process ID). Processes do NOT have unique names! Processes can spawn other processes, which is referred to as a parent process spawning child processes. When a process creates a copy of itself, it is said to fork. The relationships between all processes on the system makes up the process tree.
ps
(process snapshot) shows the processes currently running.
By default, ps
only shows the processes running under your current shell, which is not very useful. To see every process running on the system, run ps -e
.
By default, ps
only shows the name of the process. To see the full command arguments, run ps -f
By default, ps
displays a table. To see a hierarchical representation of the process tree, run ps -H
You can combine all of these commands with ps -He
or ps -Hef
to see a complete snapshot of the process tree.
If want to see a specific process only, run ps -p <PID>
.
If you know the name of some processes and want to get their PIDs, you can use pgrep
to search for them by name or other attributes.
You can send signals to processes to trigger behaviors.
To send a process to a signal, run kill -s <signal name> <PID>
There are quite a few signals for historical purposes, but only four are commomly used by operators:
SIGINT
(Interupt): Signals that a program should be interupted. This is the signal sent when you hit Ctrl+c on a foreground process. Most programs stop immediately, but you can write a program to gracefully shut down or even ignore this signal instead.SIGHUP
(Hang Up): Signals that a program should reload without stopping. Commonly used with load balancers and other processes designed to rarely restart. The name is a historical artifact from back int eh days of remote terminals and mainframes, and was later repurposed.SIGTERM
(Terminate): Signals that a program should gracefully and promptly stop. Most programs stop quickly, but you can write a program to slowly shut down or even ignore this signal instead.SIGKILL
(Kill): Forcefully and immediately halts a program. Cannot be ignored.
strace
can be used to show the kernel syscalls a userspace process is making. When you have a bizzare issue that seems impossible to debug, strace
is your guardian angel.
strace <command> <args>
: Run a program as a child process of strace.strace -p <PID>
: Start tracing another process which is already running.strace -c [...]
: Produce a histogram of syscall metrics. Useful for performance profiling. See also-C
,-S
See strace.io for a quick cheatsheet.
You can send an ICMP Type 8 Echo packet (aka a ping):
ping <address>
This doesn't really tell you anything useful, though. Lots of systems will allow ICMP Type 8 traffic but block or have issues with other protocols. It's like checking if your favorite restaurant is open by checking if the road to their building is still there.
ip
is the userland interface for the kernel networking capabilities.
ip link
: Show layer 1 and 2 interfaces (MAC addresses)ip neighbor
: Show ARP neighborsip address
: Show layer 1, 2 and 3 interfaces (IP addreses)ip route
: Show routing table
Some people use ifconfig
, arp
and route
for these. Those people are wrong.
ss
(socket search) queries the sockets on a system.
ss -ltnp
: Display listening TCP sockets by number and the processes bound to them.ss -lunp
: Display listening UDP sockets by number and the processes bound to them.ss -xp
: Display Unix Domain Sockets and the processes bound to them.
nmap
is a security scanner that can be used to probe systems. It's the l33t haxx0r warez you see in tons of Hollywood movies. Sometimes people will use telnet
for TCP probing, but telnet is for n00bs!
nmap -sTU <address>
: Scan all TCP and UDP port on an address. If security is doing their job, this will be noticed!nmap -sT -p <port> <address>
: Probe a specific TCP port.nmap -sU -p <port> <address>
: Probe a specific UDP port.
If the probe hangs for a long time, there's probably a firewall dropping your packets. If it returns quickly with the state "closed" there's probably nothing listening.
Many networks will detect the port scans/probes and mask them. You can sometimes work around this by using -Pn
to skip host discovery, which is a little stealthier.
nmap -Pn -p <port> <address>
: Stealth-probe a specific TCP port.
nmap
is an amazing tool. It's also got spoofing/stealth/evasion/reconaissance/offensive capabilities... it just happens to be a nice troubleshooting tool on the side.
Sometimes the best way to debug a really weird problem is to record the network traffic (a packet capture) and inspect it in Wireshark.
You can take a packet cature using tcpdump
:
# Capture all traffic through the etho0 interface and save it to a file
tcpdump -i etho0 -w /tmp/$(date +%s).pcap
You can then transfer the .pcap
file to a laptop and open it in Wireshark to analyse the traffic. Note that encrypted traffic requires the encryption keys to view, and certain encryption algorithms (such as Perfect Forward Secrecy TLS ciphers) are not possible to decrypt even if you have the key.
Further reading:
- tcpdump practical examples
- Wireshark documentation:
dig
is super useful for troubleshooting DNS.
# General form
dig @<resolver> <record>
dig @1.1.1.1 www.example.com
# Omit the resolver to use the system's configured resolvers
dig www.example.com
# Don't need the full output?
dig +short www.example.com
Advice about troubleshooting network problems:
It's not DNS.
There's no way it's DNS.
It was DNS.
This site explains how DNS works.
curl
can be used to troubleshoot HTTP/S connections.
# General form
curl -X <method> -i <url>
# GET request with response headers
curl -X GET -i https://www.example.com
# JSON payload
curl -X POST -H "Content-Type:application/json" --data '{"payload": "foobar"}' https://www.example.com
If it's available, HTTPie (http
) is way better:
# General form
http <method> <address> <payload>
# GET request with response headers
http -h get https://www.example.com
# GET request following redirects
http -f get google.com
# JSON payload
http post https://www.example.com payload=foobar
# URL parameters
http get https://www.example.com param==foobar
stat
: Display the status of a file. This has some useful output like ownership, permissions and modification timestamps.lsof
: List open files. Since everything is a file in *nix, this can also query sockets and all kinds of system resourcesmount
: Show mounted filesystemsdf -h
: Show disk usage by filesystemdf -i
: Show inode usage by filesystemdu -ch
: Show disk usage by file
Docker is a popular interface for sharing images (portable filesystems and environments that include everything an application needs to run) and launching containers from those images.
docker login
: Authenticate to a Docker registrydocker pull <image>
: Download a Docker image from a registrydocker push <image>
: Upload a Docker image to a registrydocker images
: Show Docker images present on systemdocker run -it <image>
: Launch a container in interactive modedocker attach <container id>
: Attach interactively to a running containerdocker run -it --rm alpine
: Launch a throwaway Alpine Linux container (handy for troubleshooting!)docker ps
: Show running containersdocker ps -a
: Show all containersdocker stats
: Show resource utilization by containerdocker logs <container id>
: Show logs from a containerdocker stop <container id>
: Stop a containerdocker kill <container id>
: Likedocker stop
but angrier