I wrote this down after I responded to a page today (a holiday) because it would've been a decent pairing opportunity for a couple of new people on my team. Second best is that people can read what I did afterwards and ask me any questions. And then I realized that there's nothing PagerDuty-specific or confidential in here, so I may as well share it wider. It's hardly an epic incident, but it's a good example of "doing the work", I think. I borrowed the "write down what you learned" approach from Julia "b0rk" Evans. It's a fantastic practice.
The PagerDuty incident: "Disk will be full in 12 hours. device:/dev/nvme0n1p1, host:stg-nomadusw2-client-..."
(Note for non-PD readers: We run Nomad where others might run Kubernetes.)
Here's the process I went through.
- Noticed that the usual `docker system prune -a -f` didn't resolve it
- Tried `docker system prune -a -f` again and it cleared up 0B

Learned: It's not stale docker image layers.
- Looked at `df -h | grep dev` and saw `/mnt` was 77% full. (We bind-mount various filesystems under /mnt. I don't love it.)
- Figured it's probably `/var`, from previous experience
- Did `cd /var; du -sh * | sort -h` (the `h` in `du` means human-readable sizes, e.g. "4.9G"; the `h` in `sort` sorts those sizes numerically):
```
0      lock
0      run
4.0K   crash
4.0K   local
4.0K   opt
16K    tmp
20K    vault
36K    snap
244K   mail
264K   consul
796K   spool
1.9M   backups
75M    cache
84M    awslogs
318M   chef
4.9G   log
128G   lib
```
Learned: It's `/var/lib`.

- Did `du -sh * | sort -h` in `/var/lib`, and so forth, until I narrowed it down to `/var/lib/docker/overlay2`

Learned: It's docker overlay2 layers, though I previously learned it's not stale image layers.
- Is it active images? No, `docker images` shows nothing near 70+ GB.

Learned: It's not a docker image.
- Kept descending into the filesystem with `du -sh * | sort -h`.
- Wished I had `ncdu`, which makes this much easier. `apt install ncdu`, because hey, why not? We can throw this host away afterwards, I'm responding to a (minor) incident, and I have the freedom to install diagnostic tools by hand.
- Tracked it down to: `/var/lib/docker/overlay2/face4015.../merged/opt/kafka_2.13-2.8.0/logs`

Learned: It's some Kafka-related logs in the layer face4015...
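The repeated "`du -sh * | sort -h`, descend into the biggest entry, repeat" dance can be sketched as a tiny helper. `find_hog` is a hypothetical function I'm writing down here, not something we ran during the incident:

```shell
# find_hog DIR: repeatedly descend into the largest entry under DIR
# (by du), stopping when the largest entry is no longer a directory,
# then print the directory we ended up in.
find_hog() {
  dir=$1
  while :; do
    # largest immediate child, human-sorted; empty if none are readable
    next=$(du -sh "$dir"/* 2>/dev/null | sort -h | tail -n 1 | cut -f2)
    [ -n "$next" ] && [ -d "$next" ] || break
    dir=$next
  done
  printf '%s\n' "$dir"
}

# find_hog /var   # on the incident host this would walk down
#                 # toward /var/lib/docker/overlay2/...
```

`ncdu` is still nicer in practice, because it caches sizes and lets you walk back up.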
Problem: How do I find out which container owns that layer?

- Stack Overflow had nothing going in that direction, only container-to-layer.
- Idea: Explore the rest of the filesystem under `overlay2/face4015.../merged`.
- Discovered `/run.sh` under that directory
- Ran `docker ps | grep run.sh`, which output (roughly):

```
454f4d73bc17   nomadic-mirrormaker-datadog:5b17   "/run.sh"   27 hours ago   Up 27 hours (healthy)   8125/udp, 8126/tcp   datadog-cec12ae6-9fd9-bfed-7e39-2aeacc448b81
c90e6050ec7c   nomadic-mirrormaker:5b17           "/run.sh"   27 hours ago   Up 27 hours             26946->1099/tcp, 26946->1099/udp
```

Learned: It's one of those two "nomadic-mirrormaker" containers, whatever that is.
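There is a more direct route from layer to container that I only worked out later: `docker inspect` exposes each container's overlay2 merged directory as `.GraphDriver.Data.MergedDir`, so you can loop over running containers and match that path against the layer directory you found on disk. A sketch, assuming the overlay2 storage driver (the helper function names are mine):

```shell
# match_layer MERGED_DIR LAYER_ID: true if MERGED_DIR is the overlay2
# merged dir for LAYER_ID, i.e. .../overlay2/LAYER_ID/merged
match_layer() {
  case $1 in
    */"$2"/merged) return 0 ;;
    *) return 1 ;;
  esac
}

# layer_owner LAYER_ID: print names of running containers whose merged
# dir belongs to LAYER_ID (needs a docker host, so untested here)
layer_owner() {
  docker ps -q | while read -r id; do
    merged=$(docker inspect --format '{{.GraphDriver.Data.MergedDir}}' "$id")
    if match_layer "$merged" "$1"; then
      docker inspect --format '{{.Name}}' "$id"
    fi
  done
}
```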
- Guess: It's probably not the datadog sidecar.
- Noted for later: Why is there a datadog sidecar at all? Containers can reach the datadog agent on the host.
- Looked up `nomadic-mirrormaker` in the Nomad UI
- Noticed that the "owner" tag on the job is "dbre"

Learned: It's owned by the DBRE team.
Resolution: Left a note in that team's Slack channel. I'll reschedule the job if things fill up, which will force it to restart with a brand-new container and thus no logs yet. Also added a note about using datadog sidecars.
Update: I have just learned about `docker ps --size`, about 30 minutes too late! But I know it for next time now.
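A sketch of how I might use it next time: `docker ps --size` adds a SIZE column (the writable-layer size, plus a virtual total), and with `--format` you can pull out just the name and size and sort them. The `by_size` helper is my own invention, not a docker feature:

```shell
# by_size: read "name<TAB>size (virtual ...)" lines on stdin and print
# "size<TAB>name" sorted smallest-to-largest, so the disk hog comes last.
by_size() {
  awk -F'\t' '{
    split($2, s, " ")     # "70GB (virtual 71GB)" -> s[1] = "70GB"
    sub(/B$/, "", s[1])   # "70GB" -> "70G", a suffix sort -h understands
    print s[1] "\t" $1
  }' | sort -h
}

# Usage on a docker host:
# docker ps --size --format '{{.Names}}\t{{.Size}}' | by_size
```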
--
By the way, `docker system prune -a -f` is pronounced "docker system prune as fuck". You're welcome.