@schluppeck
Created September 16, 2015 16:01
demo downloading a text file of a play from command line and then some simple analysis using UNIX command line tools.
# a simple list of UNIX / shell commands to download the Shakespeare play "King
# Richard III" in text format and do a quick demo of text commands.
#
# it uses the command curl to download and grep, tr, wc, sort, uniq etc.
#
# make a temporary directory somewhere
cd ~
mkdir grepdemo
cd ~/grepdemo
# get the text file from gutenberg.org and redirect into a text file:
# url is: http://www.gutenberg.org/cache/epub/1768/pg1768.txt
# the > richard-iii.txt "redirects" the output into a text file.
# (-L tells curl to follow redirects, in case the server answers with one.)
curl -L http://www.gutenberg.org/cache/epub/1768/pg1768.txt > richard-iii.txt
# now the file is there - look at it
more richard-iii.txt
# how many words, lines? use word count program, wc
wc richard-iii.txt
# results are the line, word and byte (character) counts, in that order
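# (added sketch) wc can also report one count at a time: -l for lines, -w for
# words, -c for bytes. a tiny sample makes this easy to check by hand. the
# sample text and the variable names below are made up for illustration, and
# tr -d ' ' only strips the padding that some versions of wc print.
sample='to be or not to be
that is the question'
nlines=$(printf '%s\n' "$sample" | wc -l | tr -d ' ')
nwords=$(printf '%s\n' "$sample" | wc -w | tr -d ' ')
nbytes=$(printf '%s\n' "$sample" | wc -c | tr -d ' ')
echo "lines: $nlines  words: $nwords  bytes: $nbytes"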
# to display the text file use cat
cat richard-iii.txt
# by PIPING the output of this command into the next unix one, we can now
# combine it with tr (e.g. to turn all the whitespace characters into newlines!)
# the quotes around '[:space:]' stop the shell from treating it as a filename pattern.
cat richard-iii.txt | tr '[:space:]' '\n'
# now build up a longer pipeline, one step per |
# -> grep -v drops lines that match "^\s*$", i.e. lines that are empty or only whitespace
# -> sort them
# -> grep keeps lines that are exactly "to": start of line, then "to", then end of line
# -> wc -l counts the number of lines in the result
cat richard-iii.txt | tr '[:space:]' '\n' | grep -v "^\s*$" | sort | grep "^to$" | wc -l
# this should give the number of occurrences of "to" in the play
# (lowercase only, and without any "to" that has punctuation attached, like "to,")
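# (added sketch, not part of the original demo) the count above misses "To" and
# "to," because grep compares the whole line exactly. one possible fix is to
# lowercase the text and delete punctuation first, then let grep -c do the
# counting. the sample sentence and the variable name to_count are made up.
to_count=$(printf 'To be, or not to be. To die, to sleep;\n' |
    tr '[:upper:]' '[:lower:]' |   # To -> to
    tr -d '[:punct:]' |            # drop , . ; etc.
    tr '[:space:]' '\n' |          # one word per line
    grep -c '^to$')                # count lines that are exactly "to"
echo "occurrences of to: $to_count"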
# see also this thread on Unix & Linux Stack Exchange for more inspiration!
# http://unix.stackexchange.com/questions/39039/
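# e.g. count how often each word occurs and list the most frequent first:
# uniq -c prefixes each distinct line with its count, sort -bnr orders by that number, descending.
echo "Lorem ipsum dolor sit sit amet." | tr '[:space:]' '\n' | grep -v "^\s*$" | sort | uniq -c | sort -bnr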
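# (added sketch) the same word-frequency idea wrapped in a small shell function,
# so it can be pointed at any text file. the function name topwords and its two
# arguments are made up for this demo. it also lowercases and strips punctuation,
# which the one-liner above does not do.
topwords () {
    # $1 = text file, $2 = how many words to show (defaults to 10)
    tr '[:upper:]' '[:lower:]' < "$1" |
        tr -d '[:punct:]' |
        tr '[:space:]' '\n' |
        grep -v '^$' |
        sort | uniq -c | sort -bnr |
        head -n "${2:-10}"
}
# usage, once the play has been downloaded:
#   topwords richard-iii.txt 20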