BKK Web Meetup - Big Data Open Panel, July 8th 2014 - Resources

Following are some resources referred to during the discussion or afterward in one-on-one chats, plus a few that I thought might be of interest for further reading. Like the panel itself, these are mostly technical, with a few of interest to analyst types. I too wish that we had a bit more coverage of business case studies.

Nathan Marz and James Warren (Manning Publications)

A wealth of insight on building an architecture that employs multiple data stores in order to serve differing latency requirements and hedge against the risks and weaknesses of any single system. Marz initially developed Apache Storm at BackType, later acquired by Twitter. The book is still in "Early Access Edition" status, but all chapters are now available. If you're not technical but you'd like an introduction to the technical considerations, Marz's blog post that led to the book is a briefer overview.
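
To make the shape of the idea concrete: a lambda architecture answers queries by merging a periodically rebuilt batch view with a realtime view covering events the batch hasn't processed yet. A minimal Python sketch of that merge, with all names invented for illustration:

```python
# Toy lambda-architecture serving layer: queries merge a batch view
# (rebuilt from the immutable master dataset) with a realtime view
# (maintained incrementally by the speed layer).

batch_view = {}     # e.g. precomputed by a nightly MapReduce job
realtime_view = {}  # e.g. maintained by a Storm topology

def record_event(key, value):
    """Speed layer: fold one new event into the realtime view."""
    realtime_view[key] = realtime_view.get(key, 0) + value

def rebuild_batch_view(master_dataset):
    """Batch layer: recompute the view from scratch from all events."""
    batch_view.clear()
    for key, value in master_dataset:
        batch_view[key] = batch_view.get(key, 0) + value
    realtime_view.clear()  # simplification: assume the rebuild covered everything

def query(key):
    """Serving layer: merge the two views to answer a query."""
    return batch_view.get(key, 0) + realtime_view.get(key, 0)
```

The point of the split is that the batch layer stays simple and recomputable while the speed layer only has to absorb the small window of recent data.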

Jay Kreps

Kreps, an engineer at LinkedIn and lead developer of Apache Kafka, responds with counterpoints to Marz's "lambda architecture". Fair criticisms, though you should still absolutely read the Marz book if you're tasked with designing a system like this.
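
His core counterpoint, roughly paraphrased: instead of maintaining separate batch and speed codebases, keep raw events in a replayable log and reprocess by running a second instance of the same stream job from the beginning of the log, then swap the output in. A hand-wavy sketch of that idea (the names are mine, not his):

```python
def reprocess(log, apply_event):
    """Replay the full event log through the same processing function
    used for live traffic, building a fresh output table; once it has
    caught up, it replaces the old serving table."""
    new_table = {}
    for event in log:  # replay from offset 0
        apply_event(new_table, event)
    return new_table
```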

Jay Kreps

If you're an engineer I think this post will fascinate you; if you're not, you probably won't slog through it :-) It covers a lot of ground: approaching system design in terms of event streams, data-flow abstractions that unify batch and stream processing, and "data integration" to make data available to every system within an organization. A good reminder that "Big Data" is often a matter of solving surprisingly simple problems under very demanding scale and latency requirements.
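
The post's central abstraction is the append-only log: writers append records, and every consumer tracks its own offset into the log, which is what lets batch and streaming consumers share one source of data. A toy version, purely for illustration:

```python
class Log:
    """A toy append-only log in the spirit of the post: a 'batch' job
    is just a reader that starts from offset 0, while a streaming job
    is a reader that keeps chasing the tail."""

    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1  # the new record's offset

    def read_from(self, offset):
        """Return (records, next_offset) for everything at/after offset."""
        return self.records[offset:], len(self.records)

log = Log()
log.append({"user": 1, "action": "click"})
log.append({"user": 2, "action": "purchase"})
history, tail = log.read_from(0)   # batch consumer: replay everything
new, tail = log.read_from(tail)    # stream consumer: poll for what's new
```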

?? (thanks Jean Jordaan for sharing)

I mentioned that HDFS, the Hadoop Distributed File System, will likely live a longer life than the MapReduce divide-and-conquer processing framework that is the other banner component of Hadoop. Projects like Cascading and Scalding have been building higher-level tools atop M/R for some time, and Spark is now rapidly gaining momentum as an entirely new and broader approach. This article gives some perspective on the evolution.
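
To make the contrast concrete, here is the canonical word count in Spark's Python API; what classic MapReduce expresses as a mapper, a reducer, and a pile of job plumbing collapses into a few chained transformations. A sketch assuming a local Spark installation and an input file named words.txt:

```python
from pyspark import SparkContext

sc = SparkContext("local", "wordcount")

counts = (sc.textFile("words.txt")               # one record per line
            .flatMap(lambda line: line.split())  # the "map" phase
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))    # the "reduce" phase

for word, count in counts.collect():
    print(word, count)
```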

Libraries for the R statistical programming language to run parallel cluster computations using the Hadoop infrastructure, and to interact with data stored on HDFS or HBase.
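
Those libraries are R-specific (rmr2 for MapReduce, rhdfs and rhbase for storage access), but the pattern they wrap is the same one Hadoop Streaming exposes to any language: a mapper reading lines from stdin and emitting tab-separated key/value pairs, and a reducer aggregating the sorted output. For illustration here in Python rather than R:

```python
#!/usr/bin/env python
# mapper.py: Hadoop Streaming mapper, emits "word<TAB>1" per word.
import sys

for line in sys.stdin:
    for word in line.split():
        print("%s\t%d" % (word, 1))
```

```python
#!/usr/bin/env python
# reducer.py: input arrives sorted by key; sum the counts per word.
import sys

current_word, total = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word != current_word:
        if current_word is not None:
            print("%s\t%d" % (current_word, total))
        current_word, total = word, 0
    total += int(count)

if current_word is not None:
    print("%s\t%d" % (current_word, total))
```

These would be submitted with the hadoop-streaming JAR that ships with a Hadoop distribution, passing mapper.py and reducer.py as the map and reduce commands.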

A simple RHadoop demo from the Hortonworks Hadoop distribution, using a single-system virtual machine for easy installation and testing. Most other distributions (Cloudera, MapR, Pivotal) also offer a ready-to-use downloadable virtual machine image for trying out Hadoop and its supporting tools. It illustrates a bit of Hive as well, the tool for running data warehouse queries over HDFS data using a SQL dialect.
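
As a small taste of what Hive's SQL dialect looks like over files already sitting in HDFS, a sketch driven from Python via the hive command-line client; the table, columns, and HDFS path are all invented for the example:

```python
import subprocess

# Declare a table over existing tab-delimited files in HDFS, then run
# an aggregate query; `hive -e` executes a quoted HiveQL string.
hiveql = """
CREATE EXTERNAL TABLE IF NOT EXISTS pageviews (ts STRING, url STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
    LOCATION '/data/pageviews';
SELECT url, COUNT(*) AS hits FROM pageviews GROUP BY url
    ORDER BY hits DESC LIMIT 10;
"""
subprocess.run(["hive", "-e", hiveql], check=True)
```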

HBase is a column-family database, modeled on Google's Bigtable and using HDFS for its storage, which can be suitable for online transaction processing workloads (random-access data retrieval, not sequential batch processing). Its data model is comparable to Apache Cassandra's, which does not build upon HDFS but does offer capable Hadoop integration. This talk is enjoyable in particular for NoSQL geeks. There used to be a better presentation/recording of this talk available on Cloudera's site behind an email collection wall, but it seems to be linkrot now :-(
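
To give a feel for that data model (rows addressed by a byte-string key, with cells grouped into column families), here is a sketch using the happybase Python client; the table and column names are invented, and it assumes an HBase Thrift server running on localhost:

```python
import happybase

connection = happybase.Connection("localhost")  # HBase Thrift gateway
table = connection.table("users")               # hypothetical table

# Random-access writes and reads by row key, the OLTP-ish access
# pattern described above. "info" is a column family.
table.put(b"user42", {b"info:name": b"Somchai", b"info:city": b"Bangkok"})
print(table.row(b"user42")[b"info:name"])

# Scans over a contiguous range of row keys are the sequential counterpart.
for key, data in table.scan(row_prefix=b"user"):
    print(key, data)
```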

Hadoop has the far greater mindshare, but this alternative map/reduce implementation and distributed filesystem originally developed at Nokia serves to show that tools can be developed beyond the Java Virtual Machine ecosystem. Disco's distributed guts are built in Erlang, and it exposes map/reduce APIs in Python.
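
Disco's Python face is pleasantly small. Its word-count example, adapted from the project's documentation (so treat the details as approximate for whatever version you run), looks like this:

```python
from disco.core import Job, result_iterator

def map(line, params):
    # Map phase: emit (word, 1) for every word in an input line.
    for word in line.split():
        yield word, 1

def reduce(iter, params):
    # Reduce phase: group the sorted pairs by word and sum the counts.
    from disco.util import kvgroup
    for word, counts in kvgroup(sorted(iter)):
        yield word, sum(counts)

if __name__ == "__main__":
    job = Job().run(input=["http://discoproject.org/media/text/chekhov.txt"],
                    map=map, reduce=reduce)
    for word, count in result_iterator(job.wait(show=True)):
        print(word, count)
```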
