@tomelm
Last active December 20, 2015 16:29
bigdata student org

Topics to cover

  • Full text search
  • Map Reduce
  • Graph searching and traversal
  • Machine Learning/statistics
  • NLP? (for text)

Possible analysis

  • word counts
  • basic statistics
    • averages
    • standard deviations
    • regression
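
A rough sketch of what these analyses could look like on tiny inputs, using only the Python standard library. The sample text and numbers are made up for illustration, the helper names (`word_counts`, `linear_fit`) are hypothetical, and on real datasets this work would be spread across many machines rather than run in one process.

```python
from collections import Counter
import statistics


def word_counts(text):
    """Count lowercase, whitespace-separated words in a string."""
    return Counter(text.lower().split())


def linear_fit(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up sample data, for illustration only.
counts = word_counts("big data is big")
values = [2, 4, 4, 4, 5, 5, 7, 9]
average = statistics.mean(values)
spread = statistics.pstdev(values)  # population standard deviation
slope, intercept = linear_fit([1, 2, 3, 4], [3, 5, 7, 9])
```

The word count is the same computation that usually serves as the first map reduce exercise, so it may be a reasonable bridge into the later meetings.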

Datasets

(http://aws.amazon.com/datasets)

Software

  • ElasticSearch/Solr/Lucene
  • Hadoop/HDFS
  • Neo4j

Schedule

  • First meeting: 9/12, intro
  • Second meeting: 9/26, intro to full text search
  • Third meeting: 10/10, full text search workshop
  • Fourth meeting: 10/24, map reduce
  • Fifth meeting: 11/7, map reduce
  • Sixth meeting: 11/21, map reduce
  • Seventh meeting: 12/5, last meeting? questions? graphs?

Intro Meeting

What is 'big data'?

  • it's a buzz word
  • it's a generic term for many different ideas and tools
  • it's a way to work with large datasets
  • it's a way to get information from seemingly informationless data

How big is big?

  • (usually) too big to do on one machine
  • too big to do in under x seconds

Who uses big data?

  • Everyone. Literally everyone.

What about use cases?

  • There's this small company called Google, you probably haven't heard of it. Their search indexing was built on their map reduce ideas, and they pioneered a lot of modern web search
  • Facebook
  • Palantir
  • GitHub
  • Twitter
  • ...

Why do we need it? / What's the point?

  • Sometimes obvious: search engines
  • Sometimes not obvious: trends, page access times, etc
  • Companies can respond to, and be proactive about, things they might not have even known about

More specifically, what is it?

  • Lots of things: (small list)
    • full text search
    • map reduce
    • graph search
    • data science, analysis, and statistics
    • machine learning

So, what are we doing?

  • We're going to be answering different questions about datasets using some of these techniques.
  • How similar are songs based on audio features and song metadata?
  • Are there any interesting relationships that we can find from the Marvel Universe social graph?
  • Maybe we'll build a search engine or two and be the next Google
  • Map out the most popular Wikipedia pages over the past couple months

Full Text Search

  • Searches through every word in the document vs. searching through metadata
  • Can be done through serial scanning on small numbers of documents
    • scans the content directly
    • grep does this
  • what do you do with a potentially large number of documents?
    • index then search
    • search works with the index instead of the document content
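
The two approaches above can be sketched in a few lines of Python. This is a toy illustration, not how ElasticSearch, Solr, or Lucene are implemented: the documents are invented, the function names (`serial_scan`, `build_index`, `search`) are hypothetical, and tokenizing is assumed to be a plain lowercase split on whitespace.

```python
from collections import defaultdict

# Invented example documents, keyed by a document id.
docs = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "quick dog tricks",
}


def serial_scan(docs, term):
    """Read every document's content on each query, the way grep does."""
    return sorted(
        doc_id for doc_id, text in docs.items()
        if term in text.lower().split()
    )


def build_index(docs):
    """Build an inverted index: each word maps to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every word in the query, using only the index."""
    words = query.lower().split()
    if not words:
        return []
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return sorted(result)


index = build_index(docs)
```

The scan touches every document on every query, while the index is built once and each query only looks up the words it contains, which is why indexing pays off as the number of documents grows.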