- Who is causing the load (e.g., PID, process name, UID, IP address)?
- Why is the load called (code path, stack trace, flame graph)?
- What is the load (IOPS, throughput, type)?
- How is the load changing over time (per-interval summaries)?
- Start by examining the highest level.
- Examine next-level details.
- Pick the most interesting breakdown or clue.
- If the problem is unsolved, go back to step 2.
For every resource, check:
- Utilization
- Saturation
- Errors
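For the CPU resource, one reasonable mapping (the tool choices here are suggestions, not the only option):
- Utilization: mpstat -P ALL 1 (per-CPU busy time)
- Saturation: vmstat 1 (an "r" value higher than the CPU count), runqlat
- Errors: dmesg | tail (e.g., thermal or machine-check events)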
- uptime
- dmesg | tail
- vmstat 1
r: The number of processes running on CPU or waiting for a turn. This provides a better signal than load averages for determining CPU saturation, as it does not include I/O. To interpret: an "r" value greater than the CPU count indicates saturation.
free: Free memory, in Kbytes. If there are too many digits to count, you probably have enough free memory. The free -m command, included in Section 3.3.7, better explains the state of free memory.
si and so: Swap-ins and swap-outs. If these are non-zero, you’re out of memory. These are only in use if swap devices are configured.
us, sy, id, wa, and st: These are breakdowns of CPU time, on average, across all CPUs. They are user time, system time (kernel), idle, wait I/O, and stolen time (by other guests, or, with Xen, the guest’s own isolated driver domain).
- mpstat -P ALL 1
This command prints per-CPU time broken down into states. It can expose single-threaded bottlenecks: one CPU pinned at 100% user time while the others sit idle is the classic signature.
Also look out for high %iowait time, which can be explored with disk I/O tools, and high %sys time, which can be explored with syscall and kernel tracing, as well as CPU profiling.
- pidstat 1
- iostat -xz 1
- free -m
- sar -n DEV 1
- sar -n TCP,ETCP 1
- top
# Attach an ephemeral debug container that shares the target container's namespaces:
kubectl debug -n ${NAMESPACE} -it ${POD} --image=debian:10.9-slim --target=${CONTAINER}
# sysstat provides mpstat/iostat/pidstat/sar; procps provides vmstat/free/top:
apt update && apt install -y htop sysstat procps
- execsnoop
- opensnoop
- ext4slower (or btrfs*, xfs*, zfs*)
- biolatency
- biosnoop
- cachestat
- tcpconnect
- tcpaccept
- tcpretrans
- runqlat
- profile
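Typical invocations for a few of these (flags as in their standard BCC versions; the thresholds and intervals below are arbitrary examples):
execsnoop -t                 # trace new processes, with timestamps
ext4slower 1                 # trace ext4 operations slower than 1 ms
biolatency -mT 1 5           # block I/O latency histograms in ms, five 1s intervals
runqlat 1 10                 # scheduler latency histograms, ten 1s intervals
profile -F 99 10             # sample stacks at 99 Hz for 10 seconds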
We can use flamegraph directly from flamegraph-rs.
E.g.:
flamegraph -o /tmp/timer_flamegraph.svg -- timer 10s
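Here flamegraph is cargo-flamegraph from the flamegraph-rs project, which wraps perf record on Linux. A typical setup, assuming a Rust toolchain is installed:
cargo install flamegraph
echo -1 | sudo tee /proc/sys/kernel/perf_event_paranoid   # allow perf for non-root users, or run as root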
execsnoop(8) is a BCC and bpftrace tool that traces new process execution system-wide. It can find issues of short-lived processes that consume CPU resources and can also be used to debug software execution, including application start scripts.
exitsnoop(8) is a BCC tool that traces when processes exit, showing their age and exit reason. The age is the time from process creation to termination, and includes time both on and off CPU. Like execsnoop(8), exitsnoop(8) can help debug issues of short-lived processes, providing different information to help understand this type of workload.
runqlat(8) is a BCC and bpftrace tool for measuring CPU scheduler latency, often called run queue latency (even when no longer implemented using run queues). It is useful for identifying and quantifying issues of CPU saturation, where there is more demand for CPU resources than they can service. The metric measured by runqlat(8) is the time each thread (task) spends waiting for its turn on CPU.
runqlen(8) is a BCC and bpftrace tool for sampling the length of the CPU run queues, counting how many tasks are waiting their turn, and presenting this as a linear histogram. This can be used to further characterize issues of run queue latency or as a cheaper approximation.
runqslower(8) is a BCC tool that lists instances of run queue latency exceeding a configurable threshold and shows the process that suffered the latency and its duration.
cpudist(8) is a BCC tool for showing the distribution of on-CPU time for each thread wakeup. This can be used to help characterize CPU workloads, providing details for later tuning and design decisions.
cpufreq(8) samples the CPU frequency and shows it as a system-wide histogram, with per-process name histograms. This only works for CPU scaling governors that change the frequency, such as powersave, and can be used to determine the clock speed at which your applications are running.
Flame graphs are visualizations of stack traces that can help you quickly understand profile(8) output. They were introduced in Chapter 2.
To support flame graphs, profile(8) can produce output in folded format using -f: Stack traces are printed on one line, with functions separated by semicolons. For example, writing a 30-second profile to an out.stacks01 file and including kernel annotations (-a):
cd /tmp
git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph
profile -af 30 > out.stacks01
./flamegraph.pl --color=java < out.stacks01 > out.svg
offcputime(8) is a BCC and bpftrace tool to summarize time spent by threads blocked and off CPU, showing stack traces to explain why. For CPU analysis, this tool explains why threads are not running on a CPU. It’s a counterpart to profile(8); between them, they show the entire time spent by threads on the system: on-CPU time with profile(8) and off-CPU time with offcputime(8).
cd /tmp
git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph
offcputime -fKu 5 > out.offcputime01.txt
./flamegraph.pl --hash --bgcolors=blue --title="Off-CPU Time Flame Graph" \
< out.offcputime01.txt > out.offcputime01.svg
syscount(8) is a BCC and bpftrace tool for counting system calls system-wide. It is included in this chapter because it can be a starting point for investigating cases of high system CPU time.
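Typical invocations (flags per the standard BCC version of syscount(8); availability may vary by release):
syscount -i 5 -T 10      # every 5 seconds, print the top 10 syscalls
syscount -P              # count by process instead of by syscall
syscount -L              # also total the time spent in each syscall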
argdist(8) and trace(8), introduced in Chapter 4, are BCC tools that can examine events in custom ways. As a follow-on from syscount(8), if a syscall was found to be called frequently, you can use these tools to examine it in more detail.
For example, the read(2) syscall was frequent in the previous syscount(8) output. You can use argdist(8) to summarize its arguments and return value by instrumenting either the syscall tracepoint or its kernel functions. For the tracepoint, you need to find the argument names, which the BCC tool tplist(8) prints out with the -v option:
# tplist -v syscalls:sys_enter_read
syscalls:sys_enter_read
int __syscall_nr;
unsigned int fd;
char * buf;
size_t count;
argdist -H 't:syscalls:sys_enter_read():int:args->count'
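trace(8) can also print per-event details from the same tracepoint; a sketch, where the 1 MB threshold is an arbitrary example:
trace 't:syscalls:sys_enter_read (args->count > 1048576) "fd=%d count=%d", args->fd, args->count'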
funccount(8), introduced in Chapter 4, is a BCC tool that can frequency-count functions and other events. It can be used to provide more context for software CPU usage, showing which functions are called and how frequently. profile(8) may be able to show that a function is hot on CPU, but it can’t explain why: whether the function is slow, or whether it was simply called millions of times per second.
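For example, counting kernel VFS and TCP function calls (the patterns are illustrative; any function glob works):
funccount 'vfs_*'        # count VFS calls until Ctrl-C
funccount -i 1 'tcp_*'   # per-second counts of kernel TCP functions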
softirqs(8) is a BCC tool that shows the time spent servicing soft IRQs (soft interrupts). The system-wide time in soft interrupts is readily available from different tools. For example, mpstat(1) shows it as %soft. There is also /proc/softirqs to show counts of soft IRQ events. The BCC softirqs(8) tool differs in that it can show time per soft IRQ rather than event count.
hardirqs(8) is a BCC tool that shows time spent servicing hard IRQs (hard interrupts). The system-wide time in hard interrupts is readily available from different tools. For example, mpstat(1) shows it as %irq. There is also /proc/interrupts to show counts of hard IRQ events. The BCC hardirqs(8) tool differs in that it can show time per hard IRQ rather than event count.
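Both tools accept optional interval and count arguments, for example:
softirqs 10 1        # one 10-second summary of soft IRQ service time
hardirqs 10 1        # the same for hard IRQs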
smpcalls(8) is a bpftrace tool to trace and summarize time in the SMP call functions (also known as cross calls). These are a way for one CPU to run functions on other CPUs, including all other CPUs, which can become an expensive activity on large multi-processor systems (for example, a 36-CPU server).
llcstat(8) is a BCC tool that uses PMCs to show last-level cache (LLC) miss rates and hit ratios by process. PMCs are introduced in Chapter 2.
Tool | Type | Description
---|---|---
dmesg | Kernel log | OOM killer event details
swapon | Kernel statistics | Swap device usage
free | Kernel statistics | System-wide memory usage
ps | Kernel statistics | Process statistics, including memory usage*
pmap | Kernel statistics | Process memory usage by segment
vmstat | Kernel statistics | Various statistics, including memory
sar | Kernel statistics | Can show page fault and page scanner rates*
perf | Software events, hardware statistics, hardware sampling | Memory-related PMC statistics and event sampling
*ps output:
- %MEM: The percentage of the system’s physical memory in use by this process
- VSZ: Virtual memory size
- RSS: Resident set size: the total physical memory in use by this process
*sar: sar -B 1 reports paging statistics, including page faults (fault/s, majflt/s) and page scanner rates (pgscank/s, pgscand/s)
heaptrack: a memory profiler for Linux
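A typical heaptrack workflow (the output file naming is approximate and varies by version):
heaptrack ./myapp                          # record; writes heaptrack.myapp.<pid>.gz
heaptrack_print heaptrack.myapp.<pid>.gz   # text report; heaptrack_gui for the GUI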
Count major faults by process:
bpftrace -e 'software:major-faults:1 { @[comm, pid] = count(); }'
Print each major fault as it occurs:
bpftrace -e 'software:major-faults:1 { printf("%s %d\n", comm, pid); }'
Tool | Type | Description
---|---|---
df | Kernel statistics | Filesystem usage and capacity
mount | Kernel statistics | Mounted filesystems and mount options
strace | Syscall tracing | Traces filesystem syscalls (high per-event overhead)
perf | Tracing, profiling | Filesystem tracepoints and CPU profiling of I/O paths
fatrace | Tracing | Specialized tracer that uses the Linux fanotify API (file access notify)
opensnoop(8) was shown in Chapters 1 and 4, and is provided by BCC and bpftrace. It traces file opens and is useful for discovering the location of data files, log files, and configuration files. It can also discover performance problems caused by frequent opens, or help troubleshoot issues caused by missing files.
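Typical invocations (flags as in the standard BCC version):
opensnoop -T         # include timestamps
opensnoop -x         # only failed opens: useful for missing-file issues
opensnoop -p 1234    # trace a single PID (1234 is an example)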
biolatency(8) is a BCC and bpftrace tool to show block I/O device latency as a histogram. The term device latency refers to the time from issuing a request to the device, to when it completes, including time spent queued in the operating system.
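Useful options (per the standard BCC version):
biolatency -D 1 5    # separate histograms per disk, five 1-second intervals
biolatency -Q        # include time the I/O spent queued in the OS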
biosnoop(8) is a BCC and bpftrace tool that prints a one-line summary for each disk I/O.
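For example (the -Q flag is per the standard BCC version):
biosnoop             # one line per I/O: time, comm, PID, disk, bytes, latency
biosnoop -Q          # also show OS queued time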