On 2018-08-11, from 01:36 to 04:13 UTC, chitter.xyz suffered an outage.
all times UTC
- 01:36 - the network goes down for an indeterminate length of time
- 01:XX - ImageMagick convert processes start hoarding memory
- 01:XX - system goes under memory pressure
- 01:51 - mastodon app server serves its last request
- 01:54 - netdata sets its own OOM score very high, so it is the only thing that gets killed, over and over. it gets restarted 10 minutes later every time and killed again soon after
- 02:33 - OOM-killer finally kills a convert process
- 02:33 - journald crashes from SIGABRT, restarts
- 02:35 - systemd restarts mastodon app server
- 03:5X - codl wakes up and finds caddy is not accepting connections. nothing notable in caddy's logs
- 04:13 - caddy is restarted. everything comes back up immediately
It's not clear whether the network outage caused convert to go wild, but it seems likely that it did.
Restarting netdata over and over is absurd. What's more, netdata proved invaluable in investigating this after the fact, although one hour of history was not nearly enough. Will look into making it not raise its OOM score, and increasing how much history it keeps around. It's ok if some of it gets swapped out.
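One way to check what a netdata process is actually running with, before and after changing anything, is to read its values out of /proc. This is only a sketch: it assumes a Linux /proc layout and that the process name shows up as "netdata" in its `comm` file, and the helper names are made up for illustration. Under systemd, the adjustment itself would typically be set with the `OOMScoreAdjust=` directive in the unit or a drop-in.

```python
from pathlib import Path


def find_pids_by_name(name, proc_root="/proc"):
    """Return the sorted PIDs whose /proc/<pid>/comm equals `name`."""
    pids = []
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # skip /proc/self, /proc/meminfo, etc.
        try:
            comm = (entry / "comm").read_text().strip()
        except OSError:
            continue  # process exited while we were looking
        if comm == name:
            pids.append(int(entry.name))
    return sorted(pids)


def read_oom_values(pid, proc_root="/proc"):
    """Return (oom_score, oom_score_adj) for `pid`, or None if it is gone."""
    base = Path(proc_root) / str(pid)
    try:
        score = int((base / "oom_score").read_text().strip())
        adj = int((base / "oom_score_adj").read_text().strip())
    except OSError:
        return None
    return score, adj


if __name__ == "__main__":
    # "netdata" is an assumed process name; adjust for your setup.
    for pid in find_pids_by_name("netdata"):
        print(pid, read_oom_values(pid))
```

A high `oom_score_adj` next to a modest memory footprint would confirm that the adjustment, not actual usage, is what makes the process the first pick.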
It seems caddy broke under memory pressure, possibly at 01:51. It stopped listening but didn't crash, so it was not restarted. Will look into systemd's watchdog facilities.
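systemd's watchdog (`WatchdogSec=`) relies on the service itself sending keep-alive notifications, so an external probe is another option for a server that is alive but no longer reachable. Below is a rough sketch of such a probe. The host, port, and failure threshold are assumptions, and what to do on failure (for example restarting the service) is deliberately left out. A plain TCP connect only shows that something is listening; an actual HTTPS request would be a stronger check.

```python
import socket


def is_accepting(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and unreachable hosts.
        return False


def needs_restart(probe_results, threshold=3):
    """Return True if the last `threshold` probes all failed.

    Requiring several consecutive failures avoids restarting on a single blip.
    """
    recent = probe_results[-threshold:]
    return len(recent) == threshold and not any(recent)


if __name__ == "__main__":
    # Assumed values: the web server listening on localhost:443.
    print("accepting connections:", is_accepting("127.0.0.1", 443))
```

Run from a timer every minute or so, keeping the last few results, something like this would have flagged the dead listener long before 03:5X.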
Not sure why journald crashed but it didn't lose any logs in the process and that's very impressive. it even logged itself crashing :o
- ImageMagick: image processing library. mastodon uses its convert tool to scale images down and make thumbnails
- Mastodon app server: the bit of mastodon that replies to users' requests, as opposed to the bits that run in the background
- OOM Killer: the bit of linux that, when the system is completely out of free memory, picks a process to kill to hopefully get back to a working system
- OOM Score: a score given to every process based on a dozen metrics like how much memory it and its children are using, how long it has been running, which user is running it, etc. when the OOM Killer runs, the process with the highest OOM Score is killed. the OOM score of a process can be manually adjusted up or down for things that are less or more critical
- Netdata: real-time monitoring software. it is very good and very thorough but it does use a lot of memory
- systemd: supervisor software. it launches and monitors services and does a heap of useful things, and less useful things. people love to hate it
- journald: systemd component that keeps track of logs generated by services, as well as system logs
- codl: that's me!
- caddy: https server. in our setup, it proxies requests to the mastodon app server
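To make the OOM Score entry more concrete, here is a toy model of how a manual adjustment can override actual memory use. It is a deliberate simplification, not the kernel's real calculation: the score is taken to be memory use in thousandths of total memory plus the adjustment, clamped to 0..1000. All the numbers are invented for illustration, including the adjustment of 1000 for netdata.

```python
def oom_score(mem_permille, oom_score_adj=0):
    """Toy score: memory share (0..1000) plus adjustment, clamped to 0..1000."""
    return max(0, min(1000, mem_permille + oom_score_adj))


def pick_victim(processes):
    """Return the name of the process with the highest toy OOM score."""
    return max(processes, key=lambda name: oom_score(*processes[name]))


# name -> (memory use in thousandths of total, manual adjustment)
# These figures are made up; they are not measurements from the outage.
processes = {
    "convert": (600, 0),
    "mastodon": (200, 0),
    "netdata": (50, 1000),
}

print(pick_victim(processes))  # netdata
```

With the adjustment set back to 0 in this model, convert becomes the first pick instead, which is the outcome we would have wanted at 01:54.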