I wrote this down after I responded to a page today (a holiday) because it would've been a decent pairing opportunity for a couple of new people on my team. Second best is that people can read what I did afterwards and ask me any questions. And then I realized that there's nothing PagerDuty-specific or confidential in here, so I may as well share it wider. It's hardly an epic incident, but it's a good example of "doing the work", I think. I borrowed the "write down what you learned" approach from Julia "b0rk" Evans. It's a fantastic practice.
The PagerDuty incident: "Disk will be full in 12 hours. device:/dev/nvme0n1p1, host:stg-nomadusw2-client-..."
(Note for non-PD readers: We run Nomad where others might run Kubernetes.)
Here's the process I went through.
- Noticed that the usual `docker system prune -a -f` didn't resolve it
- Tried `docker system prune -a -f` again and it cleared up 0B

Learned: It's not stale docker image layers.
- Looked at `df -h | grep dev` and saw `/mnt` was 77% full. (We bind-mount various filesystems under /mnt. I don't love it.)
- Figured it's probably `/var`, from previous experience
- Did `cd /var; du -sh * | sort -h` (the `h` in `du` means human-readable sizes, e.g. "4.9G"; the `h` in `sort` sorts those sizes numerically):
```
0      lock
0      run
4.0K   crash
4.0K   local
4.0K   opt
16K    tmp
20K    vault
36K    snap
244K   mail
264K   consul
796K   spool
1.9M   backups
75M    cache
84M    awslogs
318M   chef
4.9G   log
128G   lib
```
Learned: It's `/var/lib`.

- Did `du -sh * | sort -h` in `/var/lib`, and so forth, until I narrowed it down to `/var/lib/docker/overlay2`

Learned: It's docker overlay2 layers, though I previously learned it's not stale image layers.
- Is it active images? No, `docker images` shows nothing near 70+ GB.

Learned: It's not a docker image.
- Kept descending into the filesystem with `du -sh * | sort -h`.
- Wished I had `ncdu`, which makes this much easier. `apt install ncdu`, because hey, why not? We can throw this host away afterwards, I'm responding to a (minor) incident, and I have the freedom to install diagnostic tools by hand.
- Tracked it down to: `/var/lib/docker/overlay2/face4015.../merged/opt/kafka_2.13-2.8.0/logs`

Learned: It's some Kafka-related logs in the layer face4015...
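The repeated "`du -sh * | sort -h`, descend into the biggest entry, repeat" dance can be sketched as a tiny helper. `find_hog` is a hypothetical function I'm writing down here, not something we ran during the incident:

```shell
# find_hog DIR: repeatedly descend into the largest entry under DIR
# (by du), stopping when the largest entry is no longer a directory,
# then print the directory we ended up in.
find_hog() {
  dir=$1
  while :; do
    # largest immediate child, human-sorted; empty if none are readable
    next=$(du -sh "$dir"/* 2>/dev/null | sort -h | tail -n 1 | cut -f2)
    [ -n "$next" ] && [ -d "$next" ] || break
    dir=$next
  done
  printf '%s\n' "$dir"
}

# find_hog /var   # on the incident host this would walk down
#                 # toward /var/lib/docker/overlay2/...
```

`ncdu` is still nicer in practice, because it caches sizes and lets you walk back up.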
Problem: How do I find out which container owns that layer?

- Stack Overflow had nothing going in that direction, only container-to-layer.
- Idea: Explore the rest of the filesystem under `overlay2/face4015.../merged`.
- Discovered `/run.sh` under that directory
- Ran `docker ps | grep run.sh`, which output (roughly):

```
454f4d73bc17   nomadic-mirrormaker-datadog:5b17   "/run.sh"   27 hours ago   Up 27 hours (healthy)   8125/udp, 8126/tcp   datadog-cec12ae6-9fd9-bfed-7e39-2aeacc448b81
c90e6050ec7c   nomadic-mirrormaker:5b17           "/run.sh"   27 hours ago   Up 27 hours             26946->1099/tcp, 26946->1099/udp
```

Learned: It's one of those two "nomadic-mirrormaker" containers, whatever that is.
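There is a more direct route from layer to container that I only worked out later: `docker inspect` exposes each container's overlay2 merged directory as `.GraphDriver.Data.MergedDir`, so you can loop over running containers and match that path against the layer directory you found on disk. A sketch, assuming the overlay2 storage driver (the helper function names are mine):

```shell
# match_layer MERGED_DIR LAYER_ID: true if MERGED_DIR is the overlay2
# merged dir for LAYER_ID, i.e. .../overlay2/LAYER_ID/merged
match_layer() {
  case $1 in
    */"$2"/merged) return 0 ;;
    *) return 1 ;;
  esac
}

# layer_owner LAYER_ID: print names of running containers whose merged
# dir belongs to LAYER_ID (needs a docker host, so untested here)
layer_owner() {
  docker ps -q | while read -r id; do
    merged=$(docker inspect --format '{{.GraphDriver.Data.MergedDir}}' "$id")
    if match_layer "$merged" "$1"; then
      docker inspect --format '{{.Name}}' "$id"
    fi
  done
}
```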
- Guess: It's probably not the datadog sidecar.
- Noted for later: Why is there a datadog sidecar at all? Containers can reach the datadog agent on the host.
- Looked up `nomadic-mirrormaker` in the Nomad UI
- Noticed that the "owner" tag on the job is "dbre"

Learned: It's owned by the DBRE team.
Resolution: Left a note in that team's Slack channel. I'll reschedule the job if things fill up, which will force it to restart with a brand-new container and thus no logs yet. Also added a note about using datadog sidecars.
Update: I have just learned about `docker ps --size`, about 30 minutes too late! But I know it for next time now.
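A sketch of how I might use it next time: `docker ps --size` adds a SIZE column (the writable-layer size, plus a virtual total), and with `--format` you can pull out just the name and size and sort them. The `by_size` helper is my own invention, not a docker feature:

```shell
# by_size: read "name<TAB>size (virtual ...)" lines on stdin and print
# "size<TAB>name" sorted smallest-to-largest, so the disk hog comes last.
by_size() {
  awk -F'\t' '{
    split($2, s, " ")     # "70GB (virtual 71GB)" -> s[1] = "70GB"
    sub(/B$/, "", s[1])   # "70GB" -> "70G", a suffix sort -h understands
    print s[1] "\t" $1
  }' | sort -h
}

# Usage on a docker host:
# docker ps --size --format '{{.Names}}\t{{.Size}}' | by_size
```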
--
By the way, `docker system prune -a -f` is pronounced "docker system prune as fuck". You're welcome.