Created September 16, 2015 16:01
Demo: download the text of a play from the command line, then do some simple analysis with UNIX command-line tools.
# A short list of UNIX / shell commands that download the Shakespeare play
# "King Richard III" in plain-text format and give a quick demo of text tools.
#
# It uses curl for the download, then grep, tr, wc, sort, uniq, etc.
#
# make a temporary directory somewhere
cd ~
mkdir grepdemo
cd ~/grepdemo
# get the text file from gutenberg.org and redirect the output into a text file.
# url is: http://www.gutenberg.org/cache/epub/1768/pg1768.txt
# the "> richard-iii.txt" part redirects curl's output into that file;
# -L tells curl to follow redirects, in case the server sends one.
curl -L http://www.gutenberg.org/cache/epub/1768/pg1768.txt > richard-iii.txt
# now the file is there - look at it (space = next page, q = quit)
more richard-iii.txt
# how many lines and words? use the word count program, wc
wc richard-iii.txt
# the results are printed in this order: line, word and byte counts
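# A small sketch, not part of the original demo: wc can also print each count
# on its own with the standard options -l (lines), -w (words) and -c (bytes).
# The three-line sample below is typed in by hand (an assumption made so that
# the example does not depend on the download).
sample='Now is the winter
of our discontent
made glorious summer'
printf '%s\n' "$sample" | wc -l   # number of lines
printf '%s\n' "$sample" | wc -w   # number of words
printf '%s\n' "$sample" | wc -c   # number of bytes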
# to display the whole text file use cat
cat richard-iii.txt
# by PIPING (|) the output of this command into the next one, we can
# combine it with tr (e.g. to turn all whitespace characters into newlines!)
# the quotes stop the shell from treating [:space:] as a filename pattern
cat richard-iii.txt | tr '[:space:]' '\n'
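# A possible variation (an assumption, not used in the original demo): with the
# standard -s option, tr "squeezes" runs of the same output character into one,
# so several spaces in a row give a single newline instead of empty lines.
# The sample text is made up for illustration.
printf 'to be   or  not\n' | tr -s '[:space:]' '\n'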
# now extend the pipeline, one step at a time:
# -> drop lines matching "^[[:space:]]*$", i.e. empty or whitespace-only lines
# -> sort them
# -> keep only the lines that consist of exactly "to"
# -> count the number of lines in the result
cat richard-iii.txt | tr '[:space:]' '\n' | grep -v '^[[:space:]]*$' | sort | grep '^to$' | wc -l
# this should be the number of occurrences of the word "to" in the play
# (lower case and exact matches only: "To" and "to," are not counted)
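# A minimal sketch of a more forgiving count, assuming we want to ignore case
# and punctuation. count_word is a hypothetical helper name, not part of the
# original demo. It reads text on standard input and takes the word (in lower
# case) as its argument. Caveat: apostrophes also split words, so "to't"
# would count as "to" plus "t".
count_word() {
    tr -cs '[:alpha:]' '\n' |          # every run of non-letters becomes one newline
        tr '[:upper:]' '[:lower:]' |   # fold everything to lower case
        grep -c -x -F -- "$1"          # count lines that are exactly the word
}
# try it on a made-up sample line:
printf 'To be, or not to be: that is the question.\n' | count_word to
# for the play itself, one could run: count_word to < richard-iii.txt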
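# see also this thread on Unix & Linux Stack Exchange for more inspiration!
# http://unix.stackexchange.com/questions/39039/
# a word-frequency table: uniq -c counts repeated adjacent lines (hence the
# first sort), and sort -bnr puts the largest counts first
echo "Lorem ipsum dolor sit sit amet." | tr '[:space:]' '\n' | grep -v '^[[:space:]]*$' | sort | uniq -c | sort -bnr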
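# A small sketch of why the input is sorted before uniq: uniq only merges
# lines that sit next to each other, so repeats that are not adjacent are
# counted separately. The three sample words are made up for illustration.
printf 'sit\namet\nsit\n' | uniq -c          # "sit" is listed twice, each with count 1
printf 'sit\namet\nsit\n' | sort | uniq -c   # "sit" is listed once, with count 2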