Last active
December 2, 2017 19:32
List the words from a handful of HTML files by frequency.
cat *.html | sed -e 's/<[^>]\+>//g' -e 's/[ \t]\+/\n/g' -e 's/[-\.\\(\\):,;0-9+|]//g' | sort -f | uniq -ci | sort -n
cat *.html |                 # all the files' contents
sed -e 's/<[^>]\+>//g' |     # without tags (assumes they don't contain line breaks)
sed -e 's/[ \t]\+/\n/g' |    # replace runs of tabs and spaces with line breaks (GNU sed)
sed -e 's/[^a-z]//gi' |      # remove all non-letters; a separate sed call, so the new line breaks are kept
grep . |                     # drop lines that are now empty
sort -f |                    # sort, ignoring case, to make uniq work
uniq -ci |                   # show occurrence counts, case insensitive
sort -n                      # sort by count, ascending
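
The pipeline above can also be wrapped in a small shell function so it can be reused on any set of files. What follows is only a sketch under a few assumptions: the name `word_freq` is hypothetical and not part of the gist, case is folded with `tr` rather than with `uniq -i`, the most frequent words are printed first, and `tr` replaces the GNU-specific `sed` expressions so that only POSIX options are needed. As in the original, tags are assumed to open and close on the same line.

```shell
# word_freq FILE...: print "count word" lines, most frequent first.
# Hypothetical helper name; a sketch, not part of the original gist.
word_freq() {
  cat -- "$@" |
    sed -e 's/<[^>]*>//g' |     # strip tags that open and close on one line
    tr -s ' \t' '\n\n' |        # runs of spaces and tabs become one line break
    tr -cd 'A-Za-z\n' |         # delete everything except ASCII letters and line breaks
    tr 'A-Z' 'a-z' |            # fold to lower case so "The" and "the" count together
    grep . |                    # drop empty lines
    sort |                      # bring equal words together for uniq
    uniq -c |                   # prefix each distinct word with its count
    sort -rn                    # highest count first
}
```

A typical call would be `word_freq *.html | head -n 20` to see the twenty most frequent words.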