Created
May 22, 2014 09:52
Big data list
| <div class="entry-content"> | |
| <p>“<em><strong>Big-data</strong></em>” is one of the most inflated buzzwords of recent years. Technologies born to handle huge datasets and overcome the limits of earlier products are gaining popularity outside the research environment. The following list is meant as a reference to this world. It’s still incomplete and always will be.</p> | |
| <p><strong>Framework</strong></p> | |
| <ul> | |
| <li><a href="http://hadoop.apache.org/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://hadoop.apache.org']);"><strong>Apache Hadoop</strong></a>: framework for distributed processing. Integrates MapReduce (parallel processing), YARN (job scheduling) and HDFS (distributed file system)</li> | |
| </ul> | |
| <p><strong>Distributed Programming</strong></p> | |
| <ul> | |
| <li><a href="http://databricks.github.io/simr/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://databricks.github.io']);"><strong>AMPLab SIMR</strong></a>: run Spark on Hadoop MapReduce v1</li> | |
| <li><a href="http://incubator.apache.org/projects/datafu.html" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://incubator.apache.org']);"><strong>Apache DataFu</strong></a>: collection of user-defined functions for Hadoop and Pig developed by LinkedIn</li> | |
| <li><a href="http://gora.apache.org/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://gora.apache.org']);"><strong>Apache Gora</strong></a>: framework for in-memory data model and persistence</li> | |
| <li><a href="http://hama.apache.org/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://hama.apache.org']);"><strong>Apache Hama</strong></a>: BSP (Bulk Synchronous Parallel) computing framework</li> | |
| <li><a href="http://wiki.apache.org/hadoop/MapReduce/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://wiki.apache.org']);"><strong>Apache MapReduce</strong></a>: programming model for processing large data sets with a parallel, distributed algorithm on a cluster</li> | |
| <li><a href="https://pig.apache.org/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://pig.apache.org']);"><strong>Apache Pig</strong></a>: high level language to express data analysis programs for Hadoop</li> | |
| <li><a href="http://incubator.apache.org/s4/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://incubator.apache.org']);"><strong>Apache S4</strong></a>: framework for stream processing, implementation of S4</li> | |
| <li><a href="http://spark.incubator.apache.org/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://spark.incubator.apache.org']);"><strong>Apache Spark</strong></a>: framework for in-memory cluster computing</li> | |
| <li><a href="http://spark.incubator.apache.org/docs/0.7.3/streaming-programming-guide.html" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://spark.incubator.apache.org']);"><strong>Apache Spark Streaming</strong></a>: framework for stream processing, part of Spark</li> | |
| <li><a href="http://storm-project.net/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://storm-project.net']);"><strong>Apache Storm</strong></a>: framework for stream processing by Twitter, also available on YARN</li> | |
| <li><a href="http://tez.incubator.apache.org/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://tez.incubator.apache.org']);"><strong>Apache Tez</strong></a>: application framework for executing a complex DAG (directed acyclic graph) of tasks, built on YARN</li> | |
| <li><a href="https://incubator.apache.org/projects/twill.html" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://incubator.apache.org']);"><strong>Apache Twill</strong></a>: abstraction over YARN that reduces the complexity of developing distributed applications</li> | |
| <li><a href="http://cascalog.org/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://cascalog.org']);"><strong>Cascalog</strong></a>: data processing and querying library</li> | |
| <li><a href="http://vldbarc.org/pvldb/vldb2010/pvldb_vol3/I08.pdf" onclick="javascript:_gaq.push(['_trackEvent','download','http://vldbarc.org/pvldb/vldb2010/pvldb_vol3/I08.pdf']);"><strong>Cheetah</strong></a>: High Performance, Custom Data Warehouse on Top of MapReduce</li> | |
| <li><a href="http://www.cascading.org/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://www.cascading.org']);"><strong>Concurrent Cascading</strong></a>: framework for data management/analytics on Hadoop</li> | |
| <li><a href="https://github.com/damballa/parkour" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://github.com']);"><strong>Damballa Parkour</strong></a>: MapReduce library for Clojure</li> | |
| <li><strong>Datasalt Pangool</strong>: alternative MapReduce paradigm</li> | |
| <li><strong>Facebook Corona</strong>: Hadoop enhancement which removes single point of failure</li> | |
| <li><a href="http://peregrine_mapreduce.bitbucket.org/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://peregrine_mapreduce.bitbucket.org']);"><strong>Facebook Peregrine</strong></a>: Map Reduce framework</li> | |
| <li><strong>Facebook Scuba</strong>: distributed in-memory datastore</li> | |
| <li><strong>Google MapReduce</strong>: map reduce framework</li> | |
| <li><strong>Google MillWheel</strong>: fault tolerant stream processing framework</li> | |
| <li><a href="http://db.cs.yale.edu/hadoopdb/hadoopdb.html" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://db.cs.yale.edu']);"><strong>HadoopDB</strong></a>: hybrid of MapReduce and DBMS</li> | |
| <li><a href="https://code.google.com/p/jaql/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://code.google.com']);"><strong>JAQL</strong></a>: declarative programming language for working with structured, semi-structured and unstructured data</li> | |
| <li><a href="http://druid.io/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://druid.io']);"><strong>Metamarkets Druid</strong></a>: framework for real-time analysis of large datasets</li> | |
| <li><a href="https://github.com/Netflix/PigPen" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://github.com']);"><strong>Netflix PigPen</strong></a>: map-reduce for Clojure which compiles to Apache Pig</li> | |
| <li><a href="http://discoproject.org/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://discoproject.org']);"><strong>Nokia Disco</strong></a>: MapReduce framework developed by Nokia</li> | |
| <li><a href="http://pydoop.sourceforge.net/docs/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://pydoop.sourceforge.net']);"><strong>Pydoop</strong></a>: Python MapReduce and HDFS API for Hadoop</li> | |
| <li><a href="http://stratosphere.eu/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://stratosphere.eu']);"><strong>Stratosphere</strong></a>: general purpose cluster computing framework</li> | |
| <li><a href="https://github.com/twitter/scalding" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://github.com']);"><strong>Twitter Scalding</strong></a>: Scala library for Map Reduce jobs, built on Cascading</li> | |
| <li><a href="https://github.com/twitter/summingbird" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://github.com']);"><strong>Twitter Summingbird</strong></a>: Streaming MapReduce with Scalding and Storm, by Twitter</li> | |
| </ul> | |
| <p><strong>Distributed Filesystem</strong></p> | |
| <ul> | |
| <li><a href="http://hadoop.apache.org/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://hadoop.apache.org']);"><strong>Apache HDFS</strong></a>: a way to store large files across multiple machines</li> | |
| <li><a href="http://ceph.com/ceph-storage/file-system/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://ceph.com']);"><strong>Ceph Filesystem</strong></a>: software storage platform</li> | |
| <li><a href="https://www.facebook.com/note.php?note_id=76191543919" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://www.facebook.com']);"><strong>Facebook Haystack</strong></a>: object storage system</li> | |
| <li><strong>Google Colossus</strong>: distributed filesystem (GFS2)</li> | |
| <li><strong>Google GFS</strong>: distributed filesystem</li> | |
| <li><strong>Google Megastore</strong>: scalable, highly available storage</li> | |
| <li><a href="http://www.gridgain.org/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://www.gridgain.org']);"><strong>GridGain</strong></a>: GGFS, Hadoop compliant in-memory file system</li> | |
| <li><a href="http://wiki.lustre.org/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://wiki.lustre.org']);"><strong>Lustre file system</strong></a>: high-performance distributed filesystem</li> | |
| <li><a href="https://www.quantcast.com/engineering/qfs/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://www.quantcast.com']);"><strong>Quantcast File System QFS</strong></a>: open-source distributed file system</li> | |
| <li><a href="http://www.gluster.org/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://www.gluster.org']);"><strong>Red Hat GlusterFS</strong></a>: scale-out network-attached storage file system</li> | |
| <li><a href="http://tachyon-project.org/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://tachyon-project.org']);"><strong>Tachyon</strong></a>: reliable file sharing at memory speed across cluster frameworks</li> | |
| </ul> | |
| <p><strong>Column Data Model</strong></p> | |
| <ul> | |
| <li><a href="http://accumulo.apache.org/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://accumulo.apache.org']);"><strong>Apache Accumulo</strong></a>: distributed key/value store, built on Hadoop</li> | |
| <li><a href="http://cassandra.apache.org/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://cassandra.apache.org']);"><strong>Apache Cassandra</strong></a>: column-oriented distributed datastore, inspired by BigTable</li> | |
| <li><a href="http://hbase.apache.org/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://hbase.apache.org']);"><strong>Apache HBase</strong></a>: column-oriented distributed datastore, inspired by BigTable</li> | |
| <li><a href="http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/bigtable-osdi06.pdf" onclick="javascript:_gaq.push(['_trackEvent','download','http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/bigtable-osdi06.pdf']);"><strong>Google BigTable</strong></a>: column-oriented distributed datastore</li> | |
| <li><a href="https://developers.google.com/datastore/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://developers.google.com']);"><strong>Google Cloud Datastore</strong></a>: fully managed, schemaless database for storing non-relational data over BigTable</li> | |
| <li><a href="http://hypertable.org/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://hypertable.org']);"><strong>Hypertable</strong></a>: column-oriented distributed datastore, inspired by BigTable</li> | |
| <li><a href="http://infinidb.co/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://infinidb.co']);"><strong>InfiniDB</strong></a>: accessed through a MySQL interface; uses massively parallel processing to parallelize queries</li> | |
| <li><a href="http://parquet.io/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://parquet.io']);"><strong>Parquet</strong></a>: columnar storage format for Hadoop.</li> | |
| </ul> | |
| <p><strong>Document Data Model</strong></p> | |
| <ul> | |
| <li><a href="http://comsysto.github.io/jumbodb/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://comsysto.github.io']);"><strong>jumboDB</strong></a>: document oriented datastore over Hadoop</li> | |
| <li><a href="http://data.linkedin.com/projects/espresso" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://data.linkedin.com']);"><strong>LinkedIn Espresso</strong></a>: horizontally scalable document-oriented NoSQL data store</li> | |
| <li><a href="http://www.mongodb.org/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://www.mongodb.org']);"><strong>MongoDB</strong></a>: Document-oriented database system</li> | |
| <li><a href="http://www.rethinkdb.com/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://www.rethinkdb.com']);"><strong>RethinkDB</strong></a>: document database that supports queries like table joins and group by</li> | |
| </ul> | |
| <p><strong>Key-value Data Model</strong></p> | |
| <ul> | |
| <li><a href="http://aws.amazon.com/dynamodb/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://aws.amazon.com']);"><strong>Amazon DynamoDB</strong></a>: distributed key/value store, implementation of Dynamo paper</li> | |
| <li><a href="https://github.com/nathanmarz/elephantdb" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://github.com']);"><strong>ElephantDB</strong></a>: Distributed database specialized in exporting data from Hadoop</li> | |
| <li><a href="http://geteventstore.com" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://geteventstore.com']);"><strong>EventStore</strong></a>: distributed time series database</li> | |
| <li><a href="http://www.project-voldemort.com/voldemort/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://www.project-voldemort.com']);"><strong>LinkedIn Voldemort</strong></a>: distributed key/value storage system</li> | |
| <li><a href="http://opentsdb.net" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://opentsdb.net']);"><strong>OpenTSDB</strong></a>: distributed time series database on top of HBase</li> | |
| <li><a href="http://redis.io" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://redis.io']);"><strong>Redis DataBase</strong></a>: in-memory key-value datastore</li> | |
| <li><a href="http://rocksdb.org/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://rocksdb.org']);"><strong>RocksDB</strong></a>: embeddable persistent key-value store for fast storage based on LevelDB</li> | |
| <li><a href="https://github.com/twitter/storehaus" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://github.com']);"><strong>Storehaus</strong></a>: library to work with asynchronous key value stores, by Twitter</li> | |
| </ul> | |
| <p><strong>Graph Data Model</strong></p> | |
| <ul> | |
| <li><a href="http://giraph.apache.org/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://giraph.apache.org']);"><strong>Apache Giraph</strong></a>: implementation of Pregel, based on Hadoop</li> | |
| <li><a href="http://spark.incubator.apache.org/docs/0.7.3/bagel-programming-guide.html" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://spark.incubator.apache.org']);"><strong>Apache Spark Bagel</strong></a>: implementation of Pregel, part of Spark</li> | |
| <li><a href="https://www.arangodb.org/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://www.arangodb.org']);"><strong>ArangoDB</strong></a>: multi-model distributed database</li> | |
| <li><strong>Google Pregel</strong>: graph processing framework</li> | |
| <li><a href="https://amplab.cs.berkeley.edu/publication/graphx-grades/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://amplab.cs.berkeley.edu']);"><strong>GraphX</strong></a>: resilient Distributed Graph System on Spark</li> | |
| <li><strong>Intel GraphBuilder</strong>: tools to construct large-scale graphs on top of Hadoop</li> | |
| <li><a href="http://www.neo4j.org/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://www.neo4j.org']);"><strong>Neo4j</strong></a>: graph database written entirely in Java</li> | |
| <li><a href="http://www.orientechnologies.com/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://www.orientechnologies.com']);"><strong>OrientDB</strong></a>: document and graph database</li> | |
| <li><a href="https://github.com/xslogic/phoebus" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://github.com']);"><strong>Phoebus</strong></a>: framework for large scale graph processing</li> | |
| <li><a href="http://thinkaurelius.github.io/titan/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://thinkaurelius.github.io']);"><strong>Titan</strong></a>: distributed graph database, built over Cassandra</li> | |
| <li><a href="https://github.com/twitter/flockdb" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://github.com']);"><strong>Twitter FlockDB</strong></a>: distributed graph database</li> | |
| </ul> | |
| <p><strong>NewSQL Databases</strong></p> | |
| <ul> | |
| <li><a href="http://aws.amazon.com/redshift/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://aws.amazon.com']);"><strong>Amazon RedShift</strong></a>: data warehouse service, based on PostgreSQL</li> | |
| <li><a href="http://probcomp.csail.mit.edu/bayesdb/index.html" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://probcomp.csail.mit.edu']);"><strong>BayesDB</strong></a>: statistic oriented SQL database</li> | |
| <li><a href="https://foundationdb.com/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://foundationdb.com']);"><strong>FoundationDB</strong></a>: distributed database, inspired by F1</li> | |
| <li><strong>Google F1</strong>: distributed SQL database built on Spanner</li> | |
| <li><strong>Google Spanner</strong>: globally distributed semi-relational database</li> | |
| <li><strong>Haeinsa</strong>: linearly scalable multi-row, multi-table transaction library for HBase based on Percolator</li> | |
| <li><strong>HandlerSocket</strong>: NoSQL plugin for MySQL/MariaDB</li> | |
| <li><a href="http://www.infinisql.org/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://www.infinisql.org']);"><strong>InfiniSQL</strong></a>: infinitely scalable RDBMS</li> | |
| <li><strong>InfluxDB</strong>: distributed time series database</li> | |
| <li><a href="http://www.memsql.com/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://www.memsql.com']);"><strong>MemSQL</strong></a>: in-memory SQL database with optimized columnar storage on flash</li> | |
| <li><a href="http://www.nuodb.com/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://www.nuodb.com']);"><strong>NuoDB</strong></a>: SQL/ACID compliant distributed database</li> | |
| <li><a href="http://senseidb.com/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://senseidb.com']);"><strong>SenseiDB</strong></a>: distributed, realtime, semi-structured database</li> | |
| <li><a href="http://skydb.io/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://skydb.io']);"><strong>Sky</strong></a>: database used for flexible, high performance analysis of behavioral data</li> | |
| <li><a href="http://www.symmetricds.org/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://www.symmetricds.org']);"><strong>SymmetricDS</strong></a>: open source software for both file and database synchronization</li> | |
| </ul> | |
| <p><strong>SQL-like processing</strong></p> | |
| <ul> | |
| <li><a href="https://github.com/amplab/shark/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://github.com']);"><strong>AMPLAB Shark</strong></a>: data warehouse system for Spark</li> | |
| <li><a href="http://incubator.apache.org/drill/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://incubator.apache.org']);"><strong>Apache Drill</strong></a>: framework for interactive analysis, inspired by Dremel</li> | |
| <li><a href="http://hive.apache.org/docs/hcat_r0.5.0/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://hive.apache.org']);"><strong>Apache HCatalog</strong></a>: table and storage management layer for Hadoop</li> | |
| <li><a href="http://hive.apache.org/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://hive.apache.org']);"><strong>Apache Hive</strong></a>: SQL-like data warehouse system for Hadoop</li> | |
| <li><a href="http://phoenix.incubator.apache.org/index.html" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://phoenix.incubator.apache.org']);"><strong>Apache Phoenix</strong></a>: SQL skin over HBase</li> | |
| <li><a href="http://blinkdb.org/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://blinkdb.org']);"><strong>BlinkDB</strong></a>: massively parallel, approximate query engine</li> | |
| <li><a href="http://www.cloudera.com/content/cloudera/en/products/cdh/impala.html" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://www.cloudera.com']);"><strong>Cloudera Impala</strong></a>: framework for interactive analysis, inspired by Dremel</li> | |
| <li><a href="http://www.cascading.org/lingual/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://www.cascading.org']);"><strong>Concurrent Lingual</strong></a>: SQL-like query language for Cascading</li> | |
| <li><strong>Datasalt Splout SQL</strong>: full SQL query engine for big datasets</li> | |
| <li><a href="http://prestodb.io/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://prestodb.io']);"><strong>Facebook PrestoDB</strong></a>: distributed SQL query engine</li> | |
| <li><a href="http://research.google.com/pubs/pub36632.html" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://research.google.com']);"><strong>Google BigQuery</strong></a>: framework for interactive analysis, implementation of Dremel</li> | |
| <li><a href="http://www.gopivotal.com/pivotal-products/data/pivotal-hd" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://www.gopivotal.com']);"><strong>Pivotal HAWQ</strong></a>: SQL-like data warehouse system for Hadoop</li> | |
| <li><a href="http://databricks.com/blog/2014/03/26/Spark-SQL-manipulating-structured-data-using-Spark.html" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://databricks.com']);"><strong>SparkSQL</strong></a>: Manipulating Structured Data Using Spark</li> | |
| <li><a href="http://hortonworks.com/labs/stinger/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://hortonworks.com']);"><strong>Stinger</strong></a>: interactive query for Hive</li> | |
| <li><a href="http://tajo.incubator.apache.org/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://tajo.incubator.apache.org']);"><strong>Tajo</strong></a>: distributed data warehouse system on Hadoop</li> | |
| </ul> | |
| <p><strong>Data Ingestion</strong></p> | |
| <ul> | |
| <li><a href="http://aws.amazon.com/kinesis/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://aws.amazon.com']);"><strong>Amazon Kinesis</strong></a>: real-time processing of streaming data at massive scale</li> | |
| <li><a href="http://incubator.apache.org/chukwa/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://incubator.apache.org']);"><strong>Apache Chukwa</strong></a>: data collection system</li> | |
| <li><a href="http://flume.apache.org/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://flume.apache.org']);"><strong>Apache Flume</strong></a>: service to manage large amount of log data</li> | |
| <li><a href="http://kafka.apache.org/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://kafka.apache.org']);"><strong>Apache Kafka</strong></a>: distributed publish-subscribe messaging system</li> | |
| <li><a href="http://samza.incubator.apache.org/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://samza.incubator.apache.org']);"><strong>Apache Samza</strong></a>: stream processing framework, based on Kafka and YARN</li> | |
| <li><a href="http://sqoop.apache.org/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://sqoop.apache.org']);"><strong>Apache Sqoop</strong></a>: tool to transfer data between Hadoop and a structured datastore</li> | |
| <li><strong>Cloudera Morphline</strong>: framework that helps with ETL into Solr, HBase and HDFS</li> | |
| <li><a href="https://github.com/facebook/scribe" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://github.com']);"><strong>Facebook Scribe</strong></a>: streamed log data aggregator</li> | |
| <li><a href="http://fluentd.org/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://fluentd.org']);"><strong>Fluentd</strong></a>: tool to collect events and logs</li> | |
| <li><strong>HIHO</strong>: framework for connecting disparate data sources with Hadoop</li> | |
| <li><a href="http://robey.github.io/kestrel/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://robey.github.io']);"><strong>Kestrel</strong></a>: distributed message queue system</li> | |
| <li><a href="http://data.linkedin.com/projects/databus" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://data.linkedin.com']);"><strong>LinkedIn Databus</strong></a>: stream of change capture events for a database</li> | |
| <li><a href="https://github.com/linkedin/kamikaze" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://github.com']);"><strong>LinkedIn Kamikaze</strong></a>: utility package for compressing sorted integer arrays</li> | |
| <li><a href="https://github.com/linkedin/white-elephant" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://github.com']);"><strong>LinkedIn White Elephant</strong></a>: log aggregator and dashboard</li> | |
| <li><strong>Netflix Suro</strong>: log aggregator like Storm and Samza, based on Chukwa</li> | |
| </ul> | |
| <p><strong>Service Programming</strong></p> | |
| <ul> | |
| <li><strong>Akka Toolkit</strong>:</li> | |
| <li><a href="http://avro.apache.org/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://avro.apache.org']);"><strong>Apache Avro</strong></a>: data serialization system</li> | |
| <li><strong>Apache Curator</strong>: Java libraries for Apache ZooKeeper</li> | |
| <li><strong>Apache Karaf</strong>: OSGi runtime that runs on top of any OSGi framework</li> | |
| <li><a href="http://thrift.apache.org//" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://thrift.apache.org']);"><strong>Apache Thrift</strong></a>: framework to build binary protocols</li> | |
| <li><a href="http://zookeeper.apache.org/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://zookeeper.apache.org']);"><strong>Apache ZooKeeper</strong></a>: centralized service for process management</li> | |
| <li><a href="http://research.google.com/archive/chubby.html" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://research.google.com']);"><strong>Google Chubby</strong></a>: a lock service for loosely-coupled distributed systems</li> | |
| <li><a href="http://data.linkedin.com/opensource/norbert" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://data.linkedin.com']);"><strong>LinkedIn Norbert</strong></a>: cluster manager</li> | |
| <li><a href="http://www.open-mpi.org/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://www.open-mpi.org']);"><strong>OpenMPI</strong></a>: message passing framework</li> | |
| <li><a href="http://www.serfdom.io/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://www.serfdom.io']);"><strong>Serf</strong></a>: decentralized solution for service discovery and orchestration</li> | |
| <li><a href="https://github.com/spring-projects/spring-xd" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://github.com']);"><strong>Spring XD</strong></a>:</li> | |
| <li><a href="https://github.com/kevinweil/elephant-bird" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://github.com']);"><strong>Twitter Elephant Bird</strong></a>: libraries for working with LZOP-compressed data</li> | |
| <li><strong>Twitter Finagle</strong>: asynchronous network stack for the JVM</li> | |
| </ul> | |
| <p><strong>Scheduling</strong></p> | |
| <ul> | |
| <li><a href="http://falcon.incubator.apache.org/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://falcon.incubator.apache.org']);"><strong>Apache Falcon</strong></a>: data management framework</li> | |
| <li><a href="http://oozie.apache.org/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://oozie.apache.org']);"><strong>Apache Oozie</strong></a>: workflow job scheduler</li> | |
| <li><a href="http://airbnb.github.io/chronos/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://airbnb.github.io']);"><strong>Chronos</strong></a>: distributed and fault-tolerant scheduler</li> | |
| <li><a href="http://azkaban.github.io/azkaban2/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://azkaban.github.io']);"><strong>LinkedIn Azkaban</strong></a>: batch workflow job scheduler</li> | |
| </ul> | |
| <p><strong>Machine Learning</strong></p> | |
| <ul> | |
| <li><a href="http://mahout.apache.org/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://mahout.apache.org']);"><strong>Apache Mahout</strong></a>: machine learning library for Hadoop</li> | |
| <li><a href="https://github.com/cloudera/oryx" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://github.com']);"><strong>Cloudera Oryx</strong></a>: real-time large-scale machine learning</li> | |
| <li><a href="http://www.cascading.org/pattern/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://www.cascading.org']);"><strong>Concurrent Pattern</strong></a>: machine learning library for Cascading</li> | |
| <li><a href="http://www.etcml.com/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://www.etcml.com']);"><strong>etcML</strong></a>: text classification with machine learning</li> | |
| <li><a href="http://0xdata.github.io/h2o/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://0xdata.github.io']);"><strong>H2O</strong></a>: statistical, machine learning and math runtime for Hadoop</li> | |
| <li><a href="http://www.mlbase.org/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://www.mlbase.org']);"><strong>MLbase</strong></a>: distributed machine learning libraries for the BDAS stack</li> | |
| <li><a href="http://prediction.io/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://prediction.io']);"><strong>PredictionIO</strong></a>: machine learning server built on Hadoop, Mahout and Cascading</li> | |
| <li><a href="http://spark.apache.org/docs/0.9.0/mllib-guide.html" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://spark.apache.org']);"><strong>Spark MLlib</strong></a>: a Spark implementation of some common machine learning (ML) functionality</li> | |
| <li><a href="https://github.com/JohnLangford/vowpal_wabbit/wiki" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://github.com']);"><strong>Vowpal Wabbit</strong></a>: learning system sponsored by Microsoft and Yahoo!</li> | |
| <li><strong>WEKA</strong>: suite of machine learning software</li> | |
| </ul> | |
| <p><strong>Benchmarking</strong></p> | |
| <ul> | |
| <li><a href="https://issues.apache.org/jira/browse/MAPREDUCE-3561" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://issues.apache.org']);"><strong>Apache Hadoop Benchmarking</strong></a>: micro-benchmarks for testing Hadoop performance</li> | |
| <li><a href="https://github.com/SWIMProjectUCB/SWIM/wiki" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://github.com']);"><strong>Berkeley SWIM Benchmark</strong></a>: real-world big data workload benchmark</li> | |
| <li><strong>Intel HiBench</strong>: a Hadoop benchmark suite</li> | |
| <li><a href="https://issues.apache.org/jira/browse/MAPREDUCE-5116" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://issues.apache.org']);"><strong>PUMA Benchmarking</strong></a>: benchmark suite for MapReduce applications</li> | |
| <li><strong>Yahoo Gridmix3</strong>: Hadoop cluster benchmarking from the Yahoo engineering team</li> | |
| </ul> | |
| <p><strong>Security</strong></p> | |
| <ul> | |
| <li><strong>Apache Knox Gateway</strong>: single point of secure access for Hadoop clusters</li> | |
| <li><strong>Apache Sentry</strong>: security module for data stored in Hadoop</li> | |
| </ul> | |
| <p><strong>System Deployment</strong></p> | |
| <ul> | |
| <li><a href="http://ambari.apache.org/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://ambari.apache.org']);"><strong>Apache Ambari</strong></a>: operational framework for Hadoop management</li> | |
| <li><a href="http://bigtop.apache.org//" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://bigtop.apache.org']);"><strong>Apache Bigtop</strong></a>: system deployment framework for the Hadoop ecosystem</li> | |
| <li><a href="http://helix.apache.org/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://helix.apache.org']);"><strong>Apache Helix</strong></a>: cluster management framework</li> | |
| <li><a href="http://mesos.apache.org/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://mesos.apache.org']);"><strong>Apache Mesos</strong></a>: cluster manager</li> | |
| <li><a href="http://whirr.apache.org/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://whirr.apache.org']);"><strong>Apache Whirr</strong></a>: set of libraries for running cloud services</li> | |
| <li><strong>Brooklyn</strong>: library that simplifies application deployment and management</li> | |
| <li><a href="http://buildoop.github.io/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://buildoop.github.io']);"><strong>Buildoop</strong></a>: similar to Apache Bigtop, based on the Groovy language</li> | |
| <li><strong>Cloudera HUE</strong>: web application for interacting with Hadoop</li> | |
| <li><strong>Facebook Prism</strong>: multi-datacenter replication system</li> | |
| <li><a href="http://www.wired.com/wiredenterprise/2013/03/google-borg-twitter-mesos/all/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://www.wired.com']);"><strong>Google Borg</strong></a>: job scheduling and monitoring system</li> | |
| <li><a href="https://www.youtube.com/watch?v=0ZFMlO98Jkc" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://www.youtube.com']);"><strong>Google Omega</strong></a>: job scheduling and monitoring system</li> | |
| <li><a href="http://hortonworks.com/blog/introducing-hoya-hbase-on-yarn/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://hortonworks.com']);"><strong>Hortonworks HOYA</strong></a>: application that can deploy HBase cluster on YARN</li> | |
| <li><strong>Marathon</strong>: Mesos framework for long-running services</li> | |
| </ul> | |
| <p><strong>Applications</strong></p> | |
| <ul> | |
| <li><strong>Apache Kiji</strong>: framework to collect and analyze data in real-time, based on HBase</li> | |
| <li><strong>Apache Nutch</strong>: open source web crawler</li> | |
| <li><strong>Apache OODT</strong>: capturing, processing and sharing of data for NASA’s scientific archives</li> | |
| <li><a href="https://tika.apache.org/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://tika.apache.org']);"><strong>Apache Tika</strong></a>: content analysis toolkit</li> | |
| <li><strong>Eclipse BIRT</strong>: Eclipse-based reporting system</li> | |
| <li><strong>HIPI Library</strong>: API for performing image processing tasks on Hadoop’s MapReduce</li> | |
| <li><a href="http://www.splunk.com/download/hunk" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://www.splunk.com']);"><strong>Hunk</strong></a>: Splunk analytics for Hadoop</li> | |
| <li><strong>Jedox Palo</strong>: customisable Business Intelligence platform</li> | |
| <li><a href="http://madlib.net/community/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://madlib.net']);"><strong>MADlib</strong></a>: data-processing library of an RDBMS to analyze data</li> | |
| <li><a><strong>PivotalR</strong></a>: R on Pivotal HD / HAWQ and PostgreSQL</li> | |
| <li><strong>SpagoBI</strong>:</li> | |
| <li><a href="http://amplab-extras.github.io/SparkR-pkg/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://amplab-extras.github.io']);"><strong>SparkR</strong></a>: R frontend for Spark</li> | |
| <li><a href="http://www.splunk.com/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://www.splunk.com']);"><strong>Splunk</strong></a>: analyzer for machine-generated data</li> | |
| <li><strong>Talend</strong>:</li> | |
| </ul> | |
| <p><strong>Search engine and framework</strong></p> | |
| <ul> | |
| <li><a href="http://lucene.apache.org/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://lucene.apache.org']);"><strong>Apache Lucene</strong></a>: Search engine library</li> | |
| <li><a href="http://lucene.apache.org/solr/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://lucene.apache.org']);"><strong>Apache Solr</strong></a>: Search platform for Apache Lucene</li> | |
| <li><a href="http://www.elasticsearch.org/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://www.elasticsearch.org']);"><strong>ElasticSearch</strong></a>: Search and analytics engine based on Apache Lucene</li> | |
| <li><strong>Facebook Unicorn</strong>: social graph search platform</li> | |
| <li><a href="http://googleblog.blogspot.it/2010/06/our-new-search-index-caffeine.html" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://googleblog.blogspot.it']);"><strong>Google Caffeine</strong></a>: continuous indexing system</li> | |
| <li><a href="http://research.google.com/pubs/pub36726.html" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://research.google.com']);"><strong>Google Percolator</strong></a>: continuous indexing system</li> | |
| <li><a><strong>TeraGoogle</strong></a>: large search index</li> | |
| <li><a href="https://blogs.apache.org/hbase/entry/coprocessor_introduction" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://blogs.apache.org']);"><strong>HBase Coprocessor</strong></a>: implementation of Percolator, part of HBase</li> | |
| <li><a href="http://sphinxsearch.com/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://sphinxsearch.com']);"><strong>Sphinx Search Server</strong></a>: fulltext search engine</li> | |
| </ul> | |
| <p><strong>MySQL forks and evolutions</strong></p> | |
| <ul> | |
| <li><a href="http://aws.amazon.com/rds/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://aws.amazon.com']);"><strong>Amazon RDS</strong></a>: MySQL databases in Amazon’s cloud</li> | |
| <li><strong>Drizzle</strong>: evolution of MySQL 6.0</li> | |
| <li><a href="https://developers.google.com/cloud-sql/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://developers.google.com']);"><strong>Google Cloud SQL</strong></a>: MySQL databases in Google’s cloud</li> | |
| <li><a href="https://mariadb.org/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://mariadb.org']);"><strong>MariaDB</strong></a>: enhanced, drop-in replacement for MySQL</li> | |
| <li><a href="http://www.mysql.com/products/cluster/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://www.mysql.com']);"><strong>MySQL Cluster</strong></a>: MySQL implementation using NDB Cluster storage engine</li> | |
| <li><a href="http://www.percona.com/software/percona-server" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://www.percona.com']);"><strong>Percona Server</strong></a>: enhanced, drop-in replacement for MySQL</li> | |
| <li><a href="https://github.com/renecannao/proxysql" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://github.com']);"><strong>ProxySQL</strong></a>: High Performance Proxy for MySQL</li> | |
| <li><strong>TokuDB</strong>: storage engine for MySQL and MariaDB</li> | |
| </ul> | |
| <p><strong>Memcached forks and evolutions</strong></p> | |
| <ul> | |
| <li><a href="https://www.facebook.com/notes/facebook-engineering/mcdipper-a-key-value-cache-for-flash-storage/10151347090423920" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://www.facebook.com']);"><strong>Facebook McDipper</strong></a>: key/value cache for flash storage</li> | |
| <li><a href="https://www.facebook.com/notes/facebook-engineering/scaling-memcache-at-facebook/10151411410803920" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://www.facebook.com']);"><strong>Facebook Memcached</strong></a>: fork of Memcache</li> | |
| <li><a href="https://github.com/twitter/twemproxy" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://github.com']);"><strong>Twemproxy</strong></a>: A fast, light-weight proxy for memcached and redis</li> | |
| <li><a href="https://github.com/twitter/fatcache" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://github.com']);"><strong>Twitter Fatcache</strong></a>: key/value cache for flash storage</li> | |
| <li><a href="https://github.com/twitter/twemcache" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://github.com']);"><strong>Twitter Twemcache</strong></a>: fork of Memcache</li> | |
| </ul> | |
| <p><strong>Embedded Databases</strong></p> | |
| <ul> | |
| <li><a href="http://www.oracle.com/us/products/database/berkeley-db/overview/index.html" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://www.oracle.com']);"><strong>BerkeleyDB</strong></a>: a software library that provides a high-performance embedded database for key/value data</li> | |
| <li><a href="https://code.google.com/p/leveldb/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://code.google.com']);"><strong>LevelDB</strong></a>: a fast key-value storage library written at Google that provides an ordered mapping from string keys to string values.</li> | |
| <li><a href="http://symas.com/mdb/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://symas.com']);"><strong>LMDB</strong></a>: ultra-fast, ultra-compact key-value embedded data store developed by Symas</li> | |
| </ul> | |
| <p>This page was built by merging the <a href="http://hadoopecosystemtable.github.io/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://hadoopecosystemtable.github.io']);">Hadoop Ecosystem Table</a> (by <a href="http://es.linkedin.com/in/javiroman/" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://es.linkedin.com']);">Javi Roman</a> and <a href="https://github.com/hadoopecosystemtable/hadoopecosystemtable.github.io/graphs/contributors" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://github.com']);">other contributors</a>) with a previous version of this list. The extended version is available here: <strong><a href="http://bigdata.andreamostosi.name">http://bigdata.andreamostosi.name</a></strong></p> | |
| <p><strong>Interesting papers 2001 – 2010</strong></p> | |
| <ul> | |
| <li>2003 - <strong>Google</strong> - <a href="http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/gfs-sosp2003.pdf" onclick="javascript:_gaq.push(['_trackEvent','download','http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/gfs-sosp2003.pdf']);">The <strong>Google File System</strong></a></li> | |
| <li>2004 - <strong>Google</strong> - <a href="http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/mapreduce-osdi04.pdf" onclick="javascript:_gaq.push(['_trackEvent','download','http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/mapreduce-osdi04.pdf']);"><strong>MapReduce</strong>: Simplified Data Processing on Large Clusters</a></li> | |
| <li>2006 - <strong>Google</strong> - <a href="http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/bigtable-osdi06.pdf" onclick="javascript:_gaq.push(['_trackEvent','download','http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/bigtable-osdi06.pdf']);"><strong>Bigtable</strong>: A Distributed Storage System for Structured Data</a></li> | |
| <li>2006 – <strong>Google</strong> - <a href="http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/chubby-osdi06.pdf" onclick="javascript:_gaq.push(['_trackEvent','download','http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/chubby-osdi06.pdf']);">The <strong>Chubby</strong> lock service for loosely-coupled distributed systems</a></li> | |
| <li>2007 – <strong>Amazon</strong> - <a href="http://www.read.seas.harvard.edu/~kohler/class/cs239-w08/decandia07dynamo.pdf" onclick="javascript:_gaq.push(['_trackEvent','download','http://www.read.seas.harvard.edu/~kohler/class/cs239-w08/decandia07dynamo.pdf']);"><strong>Dynamo</strong>: Amazon’s Highly Available Key-value Store</a></li> | |
| <li>2007 - <a href="http://www.cs.utexas.edu/users/lorenzo/corsi/cs380d/papers/paper2-1.pdf" onclick="javascript:_gaq.push(['_trackEvent','download','http://www.cs.utexas.edu/users/lorenzo/corsi/cs380d/papers/paper2-1.pdf']);">Paxos Made Live – An Engineering Perspective</a>, describes <strong>Chubby</strong></li> | |
| <li>2008 – <strong>AMPLab</strong> - <a href="http://www.cca08.org/papers/Paper-13-Ariel-Rabkin.pdf" onclick="javascript:_gaq.push(['_trackEvent','download','http://www.cca08.org/papers/Paper-13-Ariel-Rabkin.pdf']);"><strong>Chukwa</strong>: A large-scale monitoring system</a></li> | |
| <li>2009 - <a href="http://www.vldb.org/pvldb/2/vldb09-861.pdf" onclick="javascript:_gaq.push(['_trackEvent','download','http://www.vldb.org/pvldb/2/vldb09-861.pdf']);"><strong>HadoopDB</strong>: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads</a></li> | |
| <li>2010 – <strong>Yahoo</strong> - <a href="http://www.4lunas.org/pub/2010-s4.pdf" onclick="javascript:_gaq.push(['_trackEvent','download','http://www.4lunas.org/pub/2010-s4.pdf']);"><strong>S4</strong>: Distributed Stream Computing Platform</a></li> | |
| <li>2010 - <strong>Google</strong> - <a href="http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//pubs/archive/36632.pdf" onclick="javascript:_gaq.push(['_trackEvent','download','http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//pubs/archive/36632.pdf']);"><strong>Dremel</strong>: Interactive Analysis of Web-Scale Datasets</a></li> | |
| <li>2010 - <strong>Google</strong> - <a href="http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//pubs/archive/36726.pdf" onclick="javascript:_gaq.push(['_trackEvent','download','http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//pubs/archive/36726.pdf']);">Large-scale Incremental Processing Using Distributed Transactions and Notifications</a> (basis of <strong>Percolator</strong> and <strong>Caffeine</strong>)</li> | |
| <li>2010 - <strong>Google</strong> - <a href="http://kowshik.github.io/JPregel/pregel_paper.pdf" onclick="javascript:_gaq.push(['_trackEvent','download','http://kowshik.github.io/JPregel/pregel_paper.pdf']);"><strong>Pregel</strong>: A System for Large-Scale Graph Processing</a></li> | |
| <li>2010 – <strong>Google</strong> - <a href="http://static.googleusercontent.com/media/research.google.com/en/us/university/relations/facultysummit2010/storage_architecture_and_challenges.pdf" onclick="javascript:_gaq.push(['_trackEvent','download','http://static.googleusercontent.com/media/research.google.com/en/us/university/relations/facultysummit2010/storage_architecture_and_challenges.pdf']);">Storage Architecture and Challenges </a></li> | |
| <li>2010 – <strong>AMPLab</strong> - <a href="https://amplab.cs.berkeley.edu/wp-content/uploads/2011/06/Spark-Cluster-Computing-with-Working-Sets.pdf" onclick="javascript:_gaq.push(['_trackEvent','download','http://amplab.cs.berkeley.edu/wp-content/uploads/2011/06/Spark-Cluster-Computing-with-Working-Sets.pdf']);"><strong>Spark</strong>: Cluster Computing with Working Sets</a></li> | |
| <li>2010 – <strong>Facebook</strong> - <a href="https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Beaver.pdf" onclick="javascript:_gaq.push(['_trackEvent','download','http://www.usenix.org/legacy/event/osdi10/tech/full_papers/Beaver.pdf']);">Finding a needle in <strong>Haystack</strong>: Facebook’s photo storage</a></li> | |
| </ul> | |
| <p><strong>Interesting papers 2011 – 2012</strong></p> | |
| <ul> | |
| <li>2011 – <strong>Google</strong> - <a href="http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//pubs/archive/36971.pdf" onclick="javascript:_gaq.push(['_trackEvent','download','http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//pubs/archive/36971.pdf']);"><strong>Megastore</strong>: Providing Scalable, Highly Available Storage for Interactive Services</a></li> | |
| <li>2011 – <strong>AMPLab</strong> - <a href="https://amplab.cs.berkeley.edu/wp-content/uploads/2011/06/Mesos-A-Platform-for-Fine-Grained-Resource-Sharing-in-the-Data-Center.pdf" onclick="javascript:_gaq.push(['_trackEvent','download','http://amplab.cs.berkeley.edu/wp-content/uploads/2011/06/Mesos-A-Platform-for-Fine-Grained-Resource-Sharing-in-the-Data-Center.pdf']);"><strong>Mesos</strong>: A Platform for Fine-Grained Resource Sharing in the Data Center</a></li> | |
| <li>2011 – <strong>AMPLab</strong> - <a href="https://amplab.cs.berkeley.edu/wp-content/uploads/2011/06/euro118-ananthanarayanan.pdf" onclick="javascript:_gaq.push(['_trackEvent','download','http://amplab.cs.berkeley.edu/wp-content/uploads/2011/06/euro118-ananthanarayanan.pdf']);"><strong>Scarlett</strong>: Coping with Skewed Popularity Content in MapReduce Clusters</a></li> | |
| <li>2012 – <strong>Google</strong> - <a href="http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/spanner-osdi2012.pdf" onclick="javascript:_gaq.push(['_trackEvent','download','http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/spanner-osdi2012.pdf']);"><strong>Spanner</strong>: Google’s Globally-Distributed Database</a> (also describes <strong>Colossus</strong>)</li> | |
| <li>2012 – <strong>Google</strong> - <a href="http://vldb.org/pvldb/vol5/p1436_alexanderhall_vldb2012.pdf" onclick="javascript:_gaq.push(['_trackEvent','download','http://vldb.org/pvldb/vol5/p1436_alexanderhall_vldb2012.pdf']);">Processing a trillion cells per mouse click</a> (basis of <strong>PowerDrill</strong>)</li> | |
| <li>2013 – <strong>AMPLab</strong> - <a href="http://arxiv.org/pdf/1203.5485.pdf" onclick="javascript:_gaq.push(['_trackEvent','download','http://arxiv.org/pdf/1203.5485.pdf']);"><strong>BlinkDB</strong>: Queries with Bounded Errors and Bounded Response Times on Very Large Data</a></li> | |
| <li>2012 – <strong>Microsoft</strong> - <a href="http://research.microsoft.com/pubs/178045/ppaoxs-paper29.pdf" onclick="javascript:_gaq.push(['_trackEvent','download','http://research.microsoft.com/pubs/178045/ppaoxs-paper29.pdf']);">Paxos Made Parallel</a> (basis of <strong>Tribble</strong>)</li> | |
| <li>2011 – <strong>Microsoft</strong> - <a href="https://www.usenix.org/legacy/event/nsdi11/tech/full_papers/Bolosky.pdf" onclick="javascript:_gaq.push(['_trackEvent','download','http://www.usenix.org/legacy/event/nsdi11/tech/full_papers/Bolosky.pdf']);">Paxos Replicated State Machines as the Basis of a High-Performance Data Store</a></li> | |
| <li>2012 – <strong>AMPLab</strong> - <a href="https://amplab.cs.berkeley.edu/wp-content/uploads/2012/03/mod482-xin1.pdf" onclick="javascript:_gaq.push(['_trackEvent','download','http://amplab.cs.berkeley.edu/wp-content/uploads/2012/03/mod482-xin1.pdf']);"><strong>Shark</strong>: Fast Data Analysis Using Coarse-grained Distributed Memory</a></li> | |
| <li>2012 – <strong>AMPLab</strong> - <a href="https://www.usenix.org/system/files/login/articles/zaharia.pdf" onclick="javascript:_gaq.push(['_trackEvent','download','http://www.usenix.org/system/files/login/articles/zaharia.pdf']);">Fast and Interactive Analytics over Hadoop Data with <strong>Spark</strong></a></li> | |
| <li>2012 – <strong>AMPLab</strong> - <a href="https://amplab.cs.berkeley.edu/wp-content/uploads/2013/04/blinkdb_vldb12_demo.pdf" onclick="javascript:_gaq.push(['_trackEvent','download','http://amplab.cs.berkeley.edu/wp-content/uploads/2013/04/blinkdb_vldb12_demo.pdf']);">Blink and It’s Done: Interactive Queries on Very Large Data</a></li> | |
| </ul> | |
| <p><strong>Interesting papers 2013 – present</strong></p> | |
| <ul> | |
| <li>2013 – <strong>Facebook</strong> – <a href="https://www.usenix.org/system/files/conference/nsdi13/nsdi13-final170_update.pdf" onclick="javascript:_gaq.push(['_trackEvent','download','http://www.usenix.org/system/files/conference/nsdi13/nsdi13-final170_update.pdf']);">Scaling Memcache at Facebook</a></li> | |
| <li>2013 - <strong>Facebook</strong> - <a href="http://db.disi.unitn.eu/pages/VLDBProgram/pdf/industry/p871-curtiss.pdf" onclick="javascript:_gaq.push(['_trackEvent','download','http://db.disi.unitn.eu/pages/VLDBProgram/pdf/industry/p871-curtiss.pdf']);"><strong>Unicorn</strong>: A System for Searching the Social Graph</a></li> | |
| <li>2013 - <strong>Facebook</strong> - <a href="http://db.disi.unitn.eu/pages/VLDBProgram/pdf/industry/p767-wiener.pdf" onclick="javascript:_gaq.push(['_trackEvent','download','http://db.disi.unitn.eu/pages/VLDBProgram/pdf/industry/p767-wiener.pdf']);"><strong>Scuba</strong>: Diving into Data at Facebook</a></li> | |
| <li>2013 – <strong>Google</strong> - <a href="http://db.disi.unitn.eu/pages/VLDBProgram/pdf/industry/p734-akidau.pdf" onclick="javascript:_gaq.push(['_trackEvent','download','http://db.disi.unitn.eu/pages/VLDBProgram/pdf/industry/p734-akidau.pdf']);"><strong>MillWheel</strong>: Fault-Tolerant Stream Processing at Internet Scale</a></li> | |
| <li>2013 - <strong>Google</strong> - <a href="http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/pubs/archive/41344.pdf" onclick="javascript:_gaq.push(['_trackEvent','download','http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/pubs/archive/41344.pdf']);"><strong>F1</strong>: A Distributed SQL Database That Scales</a></li> | |
| <li>2013 - <strong>Google</strong> - <a href="http://db.disi.unitn.eu/pages/VLDBProgram/pdf/industry/p764-rae.pdf" onclick="javascript:_gaq.push(['_trackEvent','download','http://db.disi.unitn.eu/pages/VLDBProgram/pdf/industry/p764-rae.pdf']);">Online, Asynchronous Schema Change in <strong>F1</strong></a></li> | |
| <li>2013 – <strong>Metamarkets</strong> - <a href="http://static.druid.io/docs/druid.pdf" onclick="javascript:_gaq.push(['_trackEvent','download','http://static.druid.io/docs/druid.pdf']);"><strong>Druid</strong>: A Real-time Analytical Data Store</a></li> | |
| <li>2013 – <strong>Microsoft</strong> - <a href="http://research.microsoft.com/pubs/200169/now-vldb.pdf" onclick="javascript:_gaq.push(['_trackEvent','download','http://research.microsoft.com/pubs/200169/now-vldb.pdf']);">Scalable Progressive Analytics on Big Data in the Cloud</a></li> | |
| <li>2013 – <strong>Google</strong> - <a href="http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//pubs/archive/40671.pdf" onclick="javascript:_gaq.push(['_trackEvent','download','http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//pubs/archive/40671.pdf']);"><strong>HyperLogLog</strong> in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm</a></li> | |
| <li>2013 – <strong>AMPLab</strong> - <a href="https://amplab.cs.berkeley.edu/wp-content/uploads/2013/05/grades-graphx_with_fonts.pdf" onclick="javascript:_gaq.push(['_trackEvent','download','http://amplab.cs.berkeley.edu/wp-content/uploads/2013/05/grades-graphx_with_fonts.pdf']);"><strong>GraphX</strong>: A Resilient Distributed Graph System on Spark</a></li> | |
| <li>2013 – <strong>AMPLab</strong> - <a href="https://amplab.cs.berkeley.edu/wp-content/uploads/2013/02/shark_sigmod2013.pdf" onclick="javascript:_gaq.push(['_trackEvent','download','http://amplab.cs.berkeley.edu/wp-content/uploads/2013/02/shark_sigmod2013.pdf']);"><strong>Shark</strong>: SQL and Rich Analytics at Scale</a></li> | |
| <li>2013 – <strong>AMPLab</strong> - <a href="https://amplab.cs.berkeley.edu/wp-content/uploads/2013/01/dmx1.pdf" onclick="javascript:_gaq.push(['_trackEvent','download','http://amplab.cs.berkeley.edu/wp-content/uploads/2013/01/dmx1.pdf']);"><strong>MLbase</strong>: A Distributed Machine-learning System</a></li> | |
| <li>2013 – <strong>AMPLab</strong> - <a href="https://amplab.cs.berkeley.edu/wp-content/uploads/2013/03/eurosys13-paper83.pdf" onclick="javascript:_gaq.push(['_trackEvent','download','http://amplab.cs.berkeley.edu/wp-content/uploads/2013/03/eurosys13-paper83.pdf']);"><strong>Presto</strong>: Distributed Machine Learning and Graph Processing with Sparse Matrices</a></li> | |
| </ul> | |
| <p><strong>Interesting books</strong></p> | |
| <ul> | |
| <li><a href="http://www.amazon.com/Big-Data-Glossary-Pete-Warden/dp/1449314597" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','http://www.amazon.com']);"><strong>Big Data Glossary</strong> (Warden – O’Reilly)</a></li> | |
| </ul> | |