This is a quick example showing how to use regexes to find tri-grams in Shakespeare...well, 570,872 of them, anyway, if we do some basic filtering of non-dialogue.
Though tokenization and n-gram generation should typically be done with a proper natural language processing framework, it's possible to do them in a jiffy from the command line, using standard Unix tools and ack, the better-than-grep utility.
As Wikipedia says:
In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech.
This exercise shows how to build tri-grams from Shakespeare, and it's easier seen than explained, so keep on reading. For example, the tri-grams of "to be or not to be" are: "to be or", "be or not", "or not to", and "not to be". For practical purposes, n-grams are a useful way to determine statistically common (or rare) phrases in a given block of text, in a more specific way than simple word counts.
You may have seen Google Books' interactive n-gram viewer:
If you're unfamiliar with n-grams, a great place to start is this book excerpt from Peter Norvig. That excerpt is linked to Norvig's page about ngrams, which contains datasets and other real-world exercises.
n-grams are pretty ubiquitous in language analysis and are a common part of NLP frameworks. So the fun of this walkthrough is seeing how it can be done from the command line with standard Unix tooling, which is much quicker for experimenting than jumping into iPython or RStudio.
It's something I just discovered myself after digging around with the ack tool and remembering a basic concept about regex lookaheads.
The ack tool allows the full use of Perl-compatible regexes. And it has an --output flag, which allows you to output capture groups:
$ echo "Nov 9, 2014" | ack '(\d{4})' --output 'The year is $1'
The year is 2014
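The --output template can reference multiple capture groups, in any order. For instance, here's a contrived example that flips a date around:

$ echo "Nov 9, 2014" | ack '(\w+) (\d+)' --output '$2 $1'
9 Nov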
But how do we use regexes to create n-grams? We use the zero-width property of lookaheads:
$ echo "do re me fa so la ti do" |
ack '(\w+) (?=(\w+) (\w+))' --output '$1 $2 $3'
The output:
do re me
re me fa
me fa so
fa so la
so la ti
la ti do
Looks pretty good!
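To see why the lookahead matters, compare what happens when the regex consumes all three words per match. Matches can't overlap, so each new match starts three words later, and the trailing "ti do" is dropped entirely because only two words remain:

$ echo "do re me fa so la ti do" |
   ack '(\w+) (\w+) (\w+)' --output '$1 $2 $3'
do re me
fa so la

With the lookahead, only the first word is consumed per match, so the regex engine resumes at the very next word and every word gets to start a tri-gram.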
Now for a corpus: the following commands download a plain-text archive of Shakespeare's works and unpack it into a working directory:

mkdir -p 'tempshakespeare' && cd tempshakespeare
curl -s 'http://stash.compciv.org/scrapespeare/matty.shakespeare.tar.gz' \
| tar xvz
The unpacking process creates a tree structure like this:
├── README
├── comedies
│   ├── allswellthatendswell
│   ├── asyoulikeit
│   ├── comedyoferrors
│   ├── cymbeline
│   ├── loveslabourslost
│   ├── measureforemeasure
│   ├── merchantofvenice
│   ├── merrywivesofwindsor
│   ├── midsummersnightsdream
│   ├── muchadoaboutnothing
│   ├── periclesprinceoftyre
│   ├── tamingoftheshrew
│   ├── tempest
│   ├── troilusandcressida
│   ├── twelfthnight
│   ├── twogentlemenofverona
│   └── winterstale
├── glossary
├── histories
│   ├── 1kinghenryiv
│   ├── 1kinghenryvi
│   ├── 2kinghenryiv
│   ├── 2kinghenryvi
│   ├── 3kinghenryvi
│   ├── kinghenryv
│   ├── kinghenryviii
│   ├── kingjohn
│   ├── kingrichardii
│   └── kingrichardiii
├── poetry
│   ├── loverscomplaint
│   ├── rapeoflucrece
│   ├── sonnets
│   ├── various
│   └── venusandadonis
└── tragedies
    ├── antonyandcleopatra
    ├── coriolanus
    ├── hamlet
    ├── juliuscaesar
    ├── kinglear
    ├── macbeth
    ├── othello
    ├── romeoandjuliet
    ├── timonofathens
    └── titusandronicus
With the corpus unpacked, here's the entire tri-gram-counting pipeline, run from inside tempshakespeare:

cat */* |
# translate to lowercase
tr '[:upper:]' '[:lower:]' |
# change newlines and tabs to space characters
tr '\t\n' ' ' |
# delete all non-letters/spaces/apostrophes/numbers
sed -E "s/[^a-z0-9 ']+//g" |
# tokenize, and use lookahead+capture to perform 0-width matching
ack '(\S+) +(?=(\S+) +(\S+))' --output '$1 $2 $3' |
# sort, then unique count, then reverse sort numerically
sort | uniq -c | sort -rn
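That final sort | uniq -c | sort -rn idiom produces a frequency count of every unique tri-gram, biggest first. The full list is enormous, so to peek at just the leaders, tack a head onto the end:

sort | uniq -c | sort -rn | head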
The top results: because this was a grep over the raw text, many of the top tri-grams are character names and scene headings (though honestly, I had never heard of Sir Toby Belch):
297 king henry vi
247 i pray you
217 i will not
188 king henry v
185 king richard iii
175 act iv scene
172 sir toby belch
160 i do not
157 i know not
154 act iii scene
146 act ii scene
142 i am a
140 i am not
We can refine the process by filtering for just text that is dialogue:
$ cat tragedies/hamlet | ack '^(?:[A-Z]+.*?)?\t(?: *\[Aside\] *)?([A-Z][a-z ]+.+)'
Uh...I'm not even going to try to fully explain that regex, except that it has to do with how dialogue starts either a single tab away from the beginning of a line, or a tab away from a speaker's all-caps name, which itself always begins at the start of the line. And then we have to ignore the "[Aside]" that sometimes starts a block of dialogue.
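Here's a rough demonstration of what it does and doesn't match, using made-up lines (hypothetical examples, not verbatim text from the corpus) in the same tab-delimited format:

$ printf 'HAMLET\tWords, words, words.\n\tThat is the question.\nACT II\n' |
   ack '^(?:[A-Z]+.*?)?\t(?: *\[Aside\] *)?([A-Z][a-z ]+.+)' --output '$1'
Words, words, words.
That is the question.

The speaker-prefixed line and the tab-indented continuation both match; the all-caps "ACT II" heading, which has no tab, does not.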
Meh, it's good enough for a command-line exploration...parsing Shakespeare properly is a job for a real scripting environment. Using that regex pattern to filter out non-dialogue text:
cat */* |
# just capture dialogue, aside from '[Aside]'
# takes advantage of the fact that this text uses tabs to separate dialogue
# from speaker
ack '^(?:[A-Z]+.*?)?\t(?: *\[Aside\] *)?([A-Z][a-z ]+.+)' --output '$1' |
# translate to lowercase
tr '[:upper:]' '[:lower:]' |
# change newlines and tabs to space characters
tr '\t\n' ' ' |
# delete all non-letters/spaces/apostrophes/numbers
sed -E "s/[^a-z0-9 ']+//g" |
# tokenize, and use lookahead+capture to perform 0-width matching
ack '(\S+) +(?=(\S+) +(\S+))' --output '$1 $2 $3' |
# sort, then unique count, then reverse sort numerically
sort | uniq -c | sort -rn
And now we have more pertinent results that aren't things like "Some Duke's Name". On my laptop, it takes about half a minute to generate and then sort and group the tri-grams. Not bad!
207 i pray you
180 i will not
143 my lord i
137 i do not
131 i know not
115 my good lord
112 i am not
107 this is the
105 and i will
103 the duke of
103 i am a
100 i would not
95 my lord of
93 there is no
91 that i have
87 it is a
81 i have a
80 that i am
80 good my lord
79 it is not
75 my lord and
73 i thank you
73 a room in
72 i will be
71 it is the
71 and all the
68 what's the matter
68 thou art a
68 i pray thee
68 i have done
66 as i am
65 if it be
63 you my lord
63 what is the
62 my lord the
62 and in the
61 i beseech you
Note: Peter Norvig has a prepped Shakespeare file stripped of all the non-dialogue, easy for the tokenizing. But the point of quick tokenizing/n-gramming is to be able to do it on any text corpus of your choosing: it's good to get good at processing text, if you want to do unique text analyses specific to your work and research.