After reading Darius Kazemi's post, "Aphorism detection for fun but definitely not profit", I wanted in. I've built a number of text-focused bots, but none that did anything more advanced than tokenizing text and feeding n-grams into Markov chains. I have some experience with NLP in Python, so I thought it would be fun to port his approach to Python.
The essence of Darius's algorithm is:
- Read in the corpus
- Tokenize corpus into sentences
- Filter out sentences that match a few basic patterns
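
To make the steps listed so far concrete, here is a minimal Python sketch. It is not Darius's code: the function names, the naive regex sentence splitter, and the two exclusion patterns are all assumptions chosen for illustration, and a real port would likely use a proper sentence tokenizer and the patterns from the original post.

```python
import re

# Hypothetical exclusion patterns, for illustration only.
# They are NOT the patterns used in the original post.
EXCLUDE_PATTERNS = [
    re.compile(r"\d"),       # sentence contains a digit
    re.compile(r"^[^A-Z]"),  # sentence does not start with a capital letter
]


def read_corpus(path):
    """Read the whole corpus file into a single string."""
    with open(path, encoding="utf-8") as f:
        return f.read()


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace.

    A stand-in for a real sentence tokenizer; it will mis-split
    abbreviations such as 'Dr. Smith'.
    """
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def filter_sentences(sentences, patterns=EXCLUDE_PATTERNS):
    """Drop every sentence that matches at least one exclusion pattern."""
    return [s for s in sentences if not any(p.search(s) for p in patterns)]


def candidate_sentences(corpus_text):
    """Run the tokenize and filter steps over an in-memory corpus."""
    return filter_sentences(split_sentences(corpus_text))
```

With a corpus on disk, `candidate_sentences(read_corpus("corpus.txt"))` would chain the three steps together; `corpus.txt` is a placeholder filename. Keeping the patterns in a plain list makes it easy to swap in different filters without touching the rest of the pipeline.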