I wrote this down after responding to a page today (a holiday), because it would have been a decent pairing opportunity for a couple of new people on my team. The second-best option is for people to read what I did afterwards and ask me questions. Then I realized there's nothing PagerDuty-specific or confidential in here, so I may as well share it more widely. It's hardly an epic incident, but I think it's a good example of "doing the work". I borrowed the "write down what you learned" approach from Julia "b0rk" Evans. It's a fantastic practice.
The PagerDuty incident: "Disk will be full in 12 hours. device:/dev/nvme0n1p1, host:stg-nomadusw2-client-..."
(Note for non-PD readers: We run Nomad where others might run Kubernetes.)
Here's the process I went through.
- Noticed that the usual `docker system prune -a -f` didn't resolve it
- Tried `docker system prune -a -f` and it cleared up 0B
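
When a prune reclaims nothing, a natural next question is where the space actually went. Here is a minimal sketch of that check, not a record of what I ran during the incident. Both function names are hypothetical helpers made up for illustration, and checking the root mount (`/`) is an assumption, since the alert names a device rather than a mount point.

```shell
#!/bin/sh
# Hypothetical helper (illustration only): reads `docker system prune`
# output on stdin and succeeds only if the summary line says that
# nothing was reclaimed.
prune_reclaimed_nothing() {
  grep -q '^Total reclaimed space: 0B[[:space:]]*$'
}

# Hypothetical follow-up: look at disk usage from two angles.
# Assumes the filling device is mounted at / (not confirmed by the alert).
show_disk_usage() {
  df -h /            # filesystem-level usage for the root mount
  docker system df   # Docker's own accounting: images, containers, volumes, build cache
}
```

One way to use it would be `docker system prune -a -f | prune_reclaimed_nothing && show_disk_usage`. If `docker system df` reports far less than `df` does, the space is probably being used by something Docker doesn't track.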