@portante
Last active September 12, 2019 16:22
A simple bash script that checks all of the running fluentd pods and displays those with more than two on-disk queue files (long queues) or with queue files older than 2 minutes. It relies on pssh (https://github.com/lilydjwg/pssh) to query the nodes in parallel.
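The flagging rule in the description can be sketched as a small predicate. This is only an illustration: the function name `needs_attention` is hypothetical and does not appear in the script, and the thresholds (2 files, 120 seconds) are taken from the description.

```shell
#!/bin/bash
# Hypothetical helper, not part of the gist: succeed when a queue is "long"
# (more than 2 files) or "slow" (oldest file more than 120 seconds old).
#   $1: number of queue files
#   $2: age of the oldest queue file, in seconds
needs_attention() {
    local count=$1 age=$2
    (( count > 2 || age > 120 ))
}

# Example: three queue files, the oldest 45 seconds old.
if needs_attention 3 45; then
    echo "flagged"
fi
```

A node sitting exactly at the thresholds (2 files, 120 seconds) is not flagged, matching the strict "greater than" comparisons used in the script.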
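#!/bin/bash

# Remove the scratch directory when the script exits.
function finish {
    rm -rf "$TMPDIR"
}
trap finish EXIT

TMPDIR=$(mktemp -d)
mkdir "$TMPDIR/output"

# Counters for the summary line.
let num=0      # nodes with more than two output_tag queue files
let slow=0     # nodes whose oldest output_tag queue file is over 2 minutes old
let onum=0     # same as num, for retry_es queue files
let oslow=0    # same as slow, for retry_es queue files
let err=0      # nodes that pssh did not report as [SUCCESS]

# Build the host list ($1 is passed to ohi's -c option), then list each
# node's fluentd queue directory in parallel; pssh writes one file per host
# under $TMPDIR/output.
ohi --v3 --list -c "$1" | awk '{ print "root@" $1 }' > "$TMPDIR/hosts"
pssh -h "$TMPDIR/hosts" -o "$TMPDIR/output" -t 30 "if [ -d /var/lib/fluentd ]; then ls -lt /var/lib/fluentd; fi" | sort -k 4 > "$TMPDIR/pssh.log" 2>&1
let tsnow=$(date --utc "+%s")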
# First pass: tally the findings so the summary can be printed first.
while read -r line || [[ -n "$line" ]]; do
    # Each pssh status line carries the result in field 3 and the host in field 4.
    status=$(echo "$line" | awk '{ print $3 }')
    node=$(echo "$line" | awk '{ print $4 }')
    if [ "$status" = "[SUCCESS]" ]; then
        let count=$(grep -c output_tag "$TMPDIR/output/${node}")
        if [ $count -gt 0 ]; then
            if [ $count -gt 2 ]; then
                let num=num+1
            fi
            # ls -lt lists newest first, so the last match is the oldest file.
            let tsapp=$(date --utc --date="$(grep output_tag "$TMPDIR/output/${node}" | tail -n 1 | awk '{print $6 " " $7 " " $8}')" "+%s")
            let diff=tsnow-tsapp
            if [ $diff -gt 120 ]; then
                let slow=slow+1
            fi
        fi
        let ocount=$(grep -c retry_es "$TMPDIR/output/${node}")
        if [ $ocount -gt 0 ]; then
            if [ $ocount -gt 2 ]; then
                let onum=onum+1
            fi
            let tsrtr=$(date --utc --date="$(grep retry_es "$TMPDIR/output/${node}" | tail -n 1 | awk '{print $6 " " $7 " " $8}')" "+%s")
            let diff=tsnow-tsrtr
            if [ $diff -gt 120 ]; then
                let oslow=oslow+1
            fi
        fi
    else
        echo "** Failure communicating with ${node#*@}:"
        cat "$TMPDIR/output/${node}"
        let err=err+1
    fi
done < "$TMPDIR/pssh.log"

# Summary: long (slow) output_tag queues, long (slow) retry_es queues, errors.
echo "$num ($slow), $onum ($oslow) ($err errors)"
# Second pass: print every node that exceeds a count or age threshold.
while read -r line || [[ -n "$line" ]]; do
    status=$(echo "$line" | awk '{ print $3 }')
    node=$(echo "$line" | awk '{ print $4 }')
    if [ "$status" = "[SUCCESS]" ]; then
        let count=$(grep -c output_tag "$TMPDIR/output/${node}")
        if [ $count -gt 0 ]; then
            let tsapp=$(date --utc --date="$(grep output_tag "$TMPDIR/output/${node}" | tail -n 1 | awk '{print $6 " " $7 " " $8}')" "+%s")
            let diff=tsnow-tsapp
        else
            let diff=0
        fi
        let ocount=$(grep -c retry_es "$TMPDIR/output/${node}")
        if [ $ocount -gt 0 ]; then
            let tsrtr=$(date --utc --date="$(grep retry_es "$TMPDIR/output/${node}" | tail -n 1 | awk '{print $6 " " $7 " " $8}')" "+%s")
            let odiff=tsnow-tsrtr
        else
            let odiff=0
        fi
        # Output format: host, output_tag count (age), retry_es count (age).
        if [[ $count -gt 2 || $ocount -gt 2 || $diff -gt 120 || $odiff -gt 120 ]]; then
            echo "${node#*@} $count ($diff) $ocount ($odiff)"
        fi
    fi
done < "$TMPDIR/pssh.log"
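Both passes of the script repeat the same count-and-age computation for the output_tag and retry_es files. One way to factor it out is sketched below. The function name `queue_stats` and its argument order are hypothetical (nothing like it exists in the gist), and, like the script itself, it assumes GNU date and an `ls -lt` long listing with the modification date in fields 6 to 8.

```shell
#!/bin/bash
# Hypothetical helper, not part of the gist: print "<count> <age-in-seconds>"
# for the queue files in a saved `ls -lt` listing whose names match a pattern.
#   $1: file holding `ls -lt` output (newest entry first)
#   $2: pattern identifying the queue files, e.g. output_tag or retry_es
#   $3: current time, in seconds since the epoch
queue_stats() {
    local listing=$1 pattern=$2 now=$3
    local count oldest ts

    count=$(grep -c -- "$pattern" "$listing" 2>/dev/null)
    count=${count:-0}   # treat an unreadable listing as "no queue files"
    if [ "$count" -eq 0 ]; then
        echo "0 0"
        return
    fi
    # ls -lt sorts newest first, so the last match is the oldest file.
    oldest=$(grep -- "$pattern" "$listing" | tail -n 1 | awk '{ print $6 " " $7 " " $8 }')
    ts=$(date --utc --date="$oldest" "+%s")   # GNU date, as in the script
    echo "$count $((now - ts))"
}
```

Inside either loop, a call such as `read -r count diff < <(queue_stats "$TMPDIR/output/$node" output_tag "$tsnow")` would then stand in for the separate grep and date lines.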
MitchRolo commented Sep 21, 2018

Hi, re: the "ohi" command on line 14. I am not familiar with this command. Can you advise where it can be found? Thanks.
