@berdosi
Last active December 2, 2017 19:32
List the words from a handful of HTML files by frequency.
cat *.html | sed -e 's/<[^>]\+>/ /g' -e 's/[ \t]\+/\n/g' -e 's/[-\.\\(\\):,;0-9+|]//g' | sort -f | uniq -ci | sort -n
cat *.html |                        # all the files' contents
sed -e 's/<[^>]\+>/ /g' |           # replace tags with a space (assumes no tag spans a line break)
sed -e 's/[ \t]\+/\n/g' |           # put each run of tabs and spaces on a line break (GNU sed)
sed -e 's/[^a-z]//gi' -e '/^$/d' |  # in its own sed so line breaks survive: keep ASCII letters only, drop empty lines
sort -f |                           # sort ignoring case so uniq sees duplicates next to each other
uniq -ci |                          # show occurrence counts, case insensitive
sort -n                             # sort by count, least frequent first
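The pipeline above relies on GNU sed extensions (`\+`, `\n` in the replacement, the `i` flag). A minimal sketch of a more portable variant follows, assuming that only ASCII letters count as part of a word and that lowercasing everything is acceptable. The function name `word_freq` is hypothetical and not part of the original gist; it swaps the sed word-splitting for `tr`.

```shell
# word_freq (hypothetical helper): reads HTML on stdin, prints "count word"
# lines, least frequent first. Assumes words consist of ASCII letters only.
word_freq() {
  sed -e 's/<[^>]*>/ /g' |      # replace tags with a space (tags must not span lines)
  tr -cs 'A-Za-z' '\n' |        # turn every run of non-letters into a single line break
  tr 'A-Z' 'a-z' |              # lowercase, so counting is case insensitive
  sed -e '/^$/d' |              # drop the empty line left by leading non-letters
  sort |                        # group identical words together for uniq
  uniq -c |                     # prefix each word with its occurrence count
  sort -n |                     # order by count, ascending
  awk '{ print $1, $2 }'        # normalise uniq's padding to "count word"
}
```

For example, `printf '<p>Hello, <b>hello</b> world!</p>\n' | word_freq` prints `1 world` followed by `2 hello`. To process files as in the gist, run `cat *.html | word_freq`.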