adrianp · December 15, 2015 20:19
diff --git a/Big List of Big Data b/Big List of Big Data
 This work-in-progress summarizes the way-too-many BigData(tm) technologies.
 This is by no means an in-depth description, but a very short summary so that
 I know where to look.


 1. Databases:
 * DynamoDB - aws.amazon.com/dynamodb/ - Amazon AWS integration, MapReduce
 * MongoDB - mongodb.org/ - JSON-style document database, SQL-like queries + MapReduce
 * Riak - basho.com/riak/ - Key-Value storage, MapReduce
 * CouchDB - couchdb.apache.org/ - JSON document storage, JavaScript Queries + MapReduce
 * Redis - redis.io/ - Key-Value storage, Pub/Sub messaging
 * HBase - hbase.apache.org/ - Bigtable-like capabilities on top of Hadoop and HDFS
 * Cassandra - cassandra.apache.org/ - Bigtable-like, SQL-like queries + MapReduce
 * Hypertable - hypertable.org/ - Bigtable-like, SQL-like queries + MapReduce, strong commercial support
 * Accumulo - accumulo.apache.org/ - Key-Value storage, Bigtable+Hadoop+HDFS
 * Neo4j - neo4j.org/ - Graph database
 * Couchbase - couchbase.com/ - Document-oriented, querying + MapReduce
 * VoltDB - voltdb.com/ - OLTP/real-time processing database by Stonebraker, proprietary
 * Scalaris - code.google.com/p/scalaris/ - Key-Value storage
 * Voldemort - project-voldemort.com/ - Key-Value storage, used at LinkedIn
 * MemcacheDB - memcachedb.org/ - Key-Value storage based on Memcached
 * VelocityDB - velocitydb.com/ - Object and Graph DB, Key-Value support
 * ElephantDB - github.com/nathanmarz/elephantdb/ - Database specialized in exporting key-value data from Hadoop

 Questions: Why does Apache have so many identical projects?


 2. Data analysis:
 * elasticsearch - elasticsearch.org/ - Distributed RESTful search and analytics on top of Lucene, Memcached, JSON
 * Hadoop + HDFS - hadoop.apache.org/ - MapReduce implementation
 * Hive - hive.apache.org/ - Data warehouse over Hadoop
 * Mahout - mahout.apache.org/ - Scalable ML
 * Pig - pig.apache.org/ - Uses Pig Latin to produce sequences of MapReduce jobs (for Hadoop)
 * D3.js - d3js.org/ - JavaScript library for visualizing data
 * R - r-project.org/ - Statistics
 * Julia - julialang.org/ - Potential replacement for R
 * Drill - incubator.apache.org/drill/ - Big data analysis based on Google Dremel
 * Gremlin - github.com/tinkerpop/gremlin/ - Graph analysis
 * Giraph - giraph.apache.org/ - Graph analysis
 * InfiniteGraph - objectivity.com/infinitegraph/ - Graph analysis, commercial
 * Golden Orb - goldenorbos.org/ - Graph analysis using Google Pregel on top of Hadoop
 * JethroData - jethrodata.com/ - Data analysis on top of Hadoop, commercial
 * Spark - spark-project.org/ - Project that aims to extend/improve Hadoop and move beyond MapReduce
 * HStreaming - hstreaming.com/ - Real time and batch processing workflow over Hadoop and HDFS, commercial


 3. Real time processing:
 * DBToaster - dbtoaster.org/ - Creates processing engines from SQL queries
 * Storm - storm-project.net/ - MapReduce over real time data
 * Trident - engineering.twitter.com/2012/08/trident-high-level-abstraction-for.html/ - Elegant abstraction for defining Storm topologies
 * Squall - github.com/epfldata/squall/ - SQL over Storm
 * SAP Hana - http://www.sap.com/solutions/technology/in-memory-computing-platform/hana/overview/index.epx/ - In-memory DB and stream processing, commercial
 * Esper - esper.codehaus.org/ - CEP, Java and .NET, open source (GPL) with commercial licensing


 4. Infrastructure:
 * ZooKeeper - zookeeper.apache.org/ - Distributed coordination
 * ZeroMQ - zeromq.org/ - Message transport layer
 * RabbitMQ - rabbitmq.com/ - Message transport layer
 * Kafka - kafka.apache.org/ - Publish/Subscribe messaging system
 * S4 - incubator.apache.org/s4/ - Real time processing infrastructure
 * Kestrel - github.com/robey/kestrel/ - Message transport layer
 * Ganglia - ganglia.sourceforge.net/ - Monitoring
 * OpenStack - openstack.org/ - Open source software for building clouds
 * Cloud Foundry - cloudfoundry.com/ - Deployment solution


 5. Resources:
 * Database comparison - http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis/
 * More comprehensive NoSQL list - http://nosql-database.org/
 * Big Data Right Now: Five Trendy Open Source Technologies (10.2012) - http://techcrunch.com/2012/10/27/big-data-right-now-five-trendy-open-source-technologies/?goback=%2Egde_4332669_member_225815227/
 * SQL is what’s next for Hadoop: Here’s who’s doing it (02.2013) - http://gigaom.com/2013/02/21/sql-is-whats-next-for-hadoop-heres-whos-doing-it/
 * Wikipedia, ofc: http://en.wikipedia.org/wiki/NoSQL
 * Nathan Marz (Storm developer) on beating the CAP theorem (as this is controversial, make sure to read the comments also): http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html