Topics to cover

Possible analysis

All the books on Project Gutenberg (http://www.gutenberg.org)
- Could be very good to work with for full-text searching
- word counting and frequency analysis?
All of Wikipedia's content
- Most commonly appearing 'red links' (links to topics that do not have articles yet)
Common Crawl Corpus (http://aws.amazon.com/datasets/41740)
- ...search engine?
Marvel Universe Social Graph (http://aws.amazon.com/datasets/5621954952932508)
- Social graph dataset would be very conducive to graph searching and Neo4j
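
The word counting and frequency analysis idea noted for the Project Gutenberg books could start out as something like the sketch below. It is only a minimal illustration: the function name word_frequencies and the sample sentence are made up for this example, and the tokenizer is a deliberately crude regular expression, not a recommendation for real book text.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Count how often each lowercased word appears in the text."""
    # Crude tokenizer (an assumption for this sketch): runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

# Hypothetical sample text standing in for the contents of a book.
sample = "It was the best of times, it was the worst of times"
frequencies = word_frequencies(sample)

# Print the five most frequent words with their counts.
for word, count in frequencies.most_common(5):
    print(word, count)
```

For a full book, the same function could be fed the contents of a downloaded plain-text file instead of the sample string.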

Raw

There's this small company called Google; you probably haven't heard of it. Their search indexing was built on their MapReduce ideas, and they pioneered a lot of what search looks like today.
Facebook
Palantir
GitHub
Twitter
...

Sometimes obvious: search engines
Sometimes not obvious: trends, page access times, etc.
Companies are able to respond to, and be proactive about, things they might not have even known about

We're going to be answering different questions about datasets using some of these techniques.
How similar are songs based on audio features and song metadata?
Are there any interesting relationships that we can find from the Marvel Universe social graph?
Maybe we'll build a search engine or two and be the next Google
Map out the most popular Wikipedia pages over the past couple of months

Raw

Full-text search looks through every word in the document vs. searching only through metadata
Can be done through serial scanning on small numbers of documents
- scans the content directly
- grep does this
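
A serial scan like the one described above can be sketched in a few lines. This is only an illustration of the idea, not how grep is implemented: the serial_scan function and the tiny documents dictionary are hypothetical, and matching is a plain case-insensitive substring test.

```python
def serial_scan(documents, term):
    """Return the names of documents whose text contains the term.

    Every document's full content is read on every search, so the cost
    grows with the total amount of text.
    """
    term = term.lower()
    return [name for name, text in documents.items() if term in text.lower()]

# Hypothetical in-memory "documents" for the sake of the example.
documents = {
    "a.txt": "Call me Ishmael.",
    "b.txt": "It was the best of times.",
    "c.txt": "The best laid plans.",
}

matches = serial_scan(documents, "best")
print(matches)
```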
What do you do with a potentially large number of documents?
- index then search
- search works with the index instead of the document content
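
The index-then-search approach can be sketched with a small inverted index that maps each word to the set of documents containing it. This is a minimal sketch under simple assumptions: build_index, search, and the sample documents are made-up names for illustration, the tokenizer is a crude regular expression, and only exact single-word lookups are supported.

```python
import re
from collections import defaultdict

def build_index(documents):
    """Build an inverted index: word -> set of document names containing it."""
    index = defaultdict(set)
    for name, text in documents.items():
        # Crude tokenizer (an assumption): lowercase runs of letters, digits, apostrophes.
        for word in re.findall(r"[a-z0-9']+", text.lower()):
            index[word].add(name)
    return index

def search(index, term):
    """Look the term up in the index without touching the document content."""
    return sorted(index.get(term.lower(), set()))

# Hypothetical in-memory "documents" for the sake of the example.
documents = {
    "a.txt": "Call me Ishmael.",
    "b.txt": "It was the best of times.",
    "c.txt": "The best laid plans.",
}

index = build_index(documents)   # done once, up front
results = search(index, "best")  # each search is a dictionary lookup
print(results)
```

The documents are read only once, when the index is built. After that, each search costs a lookup in the index regardless of how much text was indexed, which is the trade-off that makes indexing worthwhile for large collections.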