Last active
May 13, 2022 18:09
user watchdog script for flooding system log errors
If you have a failure mode that can be diagnosed by a high rate of messages flooding the system log,
you can use the script below in conjunction with the watchdog service to detect the condition
and reboot if it persists. This is a pretty big hammer, but potentially better than just rebooting on
loss of network or some other arbitrary liveness test.
Add a 'test-binary' line to /etc/watchdog.conf pointing to this script (it will be treated as a V0 test script
with no corresponding repair script). Also, set 'interval' to something like 20 seconds (it must at least exceed
the sample period in the script).
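As a rough sketch, the two settings described above might look like this in /etc/watchdog.conf. The install path /usr/local/bin/logflood.sh is an assumption for illustration; use wherever you actually saved the script, and make sure it is executable.

```
# /etc/watchdog.conf (fragment)
# Hypothetical install location for the script; adjust to your system.
test-binary = /usr/local/bin/logflood.sh
# Seconds between watchdog checks; larger than the script's 3-second sample.
interval = 20
```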
You can convince yourself it's working (once the watchdog service has been restarted with the above config)
by monitoring /var/log/watchdog/test-bin.stdout as well as the watchdog.service log while manually flooding
the log yourself:
while true; do
    logger 'log storm'
done
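logflood.sh:

#!/usr/bin/bash
# Exit code returned to watchdog when flooding is detected.
readonly EUSERVALUE=246
# Length of the sampling window, in seconds.
SAMPLE_SEC=3
# Count the lines appended to syslog during the window. "-n 0" keeps the
# pre-existing last line out of the count; timeout exits non-zero when it
# stops tail, so "|| true" masks that.
NUM_MSGS=$( (timeout "$SAMPLE_SEC" tail -f -n 0 /var/log/syslog || true) | wc -l)
# Integer division: messages per second, with any fraction dropped.
MEASURED_RATE=$((NUM_MSGS / SAMPLE_SEC))
THRESHOLD_RATE=200
if [[ $MEASURED_RATE -gt $THRESHOLD_RATE ]]; then
    echo "log is getting flooded at a rate of $MEASURED_RATE messages per second"
    exit $EUSERVALUE
fi
#echo "log looks fine; carry on"
exit 0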
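To pick a sensible THRESHOLD_RATE without touching the real log, it can help to try the rate arithmetic on made-up counts. The sketch below pulls the same two steps (integer division, then a strict greater-than comparison) into helper functions. The names measured_rate and is_flooded are hypothetical and not part of logflood.sh, and the message counts are invented for illustration.

```shell
#!/bin/sh
# Hypothetical helpers mirroring the arithmetic in logflood.sh.

# Print messages per second using integer division (fraction is dropped).
# $1 = number of messages seen, $2 = sample window in seconds
measured_rate() {
    echo $(( $1 / $2 ))
}

# Succeed (return 0) only when the rate is strictly above the threshold.
# $1 = measured rate, $2 = threshold rate
is_flooded() {
    [ "$1" -gt "$2" ]
}

# Example: 900 messages in a 3-second window against a threshold of 200.
rate=$(measured_rate 900 3)
if is_flooded "$rate" 200; then
    echo "flooded at $rate messages per second"
else
    echo "ok at $rate messages per second"
fi
```

Note that the comparison is strict and the division truncates, so 602 messages in 3 seconds gives a rate of 200 and does not trip a threshold of 200.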