@rgarner
Last active August 29, 2015 14:03
Find CDN failures
##
# Find CDN output files (cdn*.tsv) that contain invalid UTF-8 byte sequences.
# Fix them by dropping every line that contains an invalid byte.
#
# NB: This requires an iconv that supports --byte-subst (e.g. GNU libiconv)
#
FAILURES=utf8_failures
FAILURE_STRING=__WUBWUB__
echo "Finding failures"
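# Clean up the failure list if interrupted
trap 'rm -f "$FAILURES"; exit' SIGHUP SIGINT SIGTERM
# Start from an empty list; -f avoids an error when no previous list exists
rm -f "$FAILURES"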
# Find duff files by attempting to convert.
# Take stderr failures and pipe them to ./$FAILURES
for f in cdn*.tsv
do
# -f UTF8 makes iconv read the input as UTF-8 regardless of the current locale
iconv -f UTF8 -t UTF8 "$f" 2>&1 >/dev/null | grep -Po "cdn.*tsv" >> "$FAILURES"
done
echo "Fixing failures"
# Look at the failures and substitute a failure string
# Discard lines with failures
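while IFS= read -r f
do
# Replace each invalid byte with the failure string...
iconv --byte-subst="$FAILURE_STRING" -f UTF8 -t UTF8 "$f" > "$f.new"
# ...then keep only the lines that do not contain it
mawk "\$0 !~ /$FAILURE_STRING/" "$f.new" > "$f" && rm "$f.new"
done < "$FAILURES"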
rm $FAILURES
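
The script above depends on `--byte-subst`, which not every iconv provides. As a rough sketch of the same find-then-drop idea using only flags common to most iconv builds, the following checks each line on its own. The function names `is_valid_utf8` and `drop_invalid_lines` and the demo data are made up for this example and are not part of the original script. It starts one iconv process per line, so it is likely to be far slower on large files.

```shell
#!/bin/sh
# Sketch only: the function names and demo data are hypothetical.

# Succeeds (exit 0) if the whole file converts cleanly as UTF-8.
is_valid_utf8() {
  iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# Copy stdin to stdout, dropping lines that are not valid UTF-8.
# Runs one iconv per line, so it is slow on large inputs.
drop_invalid_lines() {
  while IFS= read -r line
  do
    if printf '%s\n' "$line" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
    then
      printf '%s\n' "$line"
    fi
  done
}

# Demo on a throwaway file; octal \377 is a byte that never occurs in UTF-8.
demo=$(mktemp)
printf 'good\tline\nbad\377\tline\nalso\tgood\n' > "$demo"
if ! is_valid_utf8 "$demo"
then
  drop_invalid_lines < "$demo" > "$demo.new"
  cat "$demo.new"
fi
rm -f "$demo" "$demo.new"
```

In the demo only the middle line contains an invalid byte, so the two other lines should be the ones that survive.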