BKK Web Meetup - Big Data Open Panel, July 8th 2014 - Resources

Following are some resources referred to during the discussion or afterward in one-on-one chats, plus a few that I thought might be of interest for further reading. Like the panel itself, these are mostly technical, with a few of interest to analyst types. I too wish that we had a bit more coverage of business case studies.

Nathan Marz and James Warren (Manning Publications)

A wealth of insight on building an architecture that employs multiple data stores in order to serve differing latency requirements and hedge against the risks and weaknesses of any single system. Marz initially developed Apache Storm at BackType, later acquired by Twitter. The book is still in "Early Access Edition" status, but all chapters are now available. If you're not technical but you'd like an introduction to the technical considerations, Marz's blog post that led to the book is a briefer overview.
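
To make the shape of the idea concrete: a lambda architecture answers queries by merging a periodically rebuilt batch view with a realtime view covering events the batch hasn't processed yet. A minimal Python sketch of that merge, with all names invented for illustration:

```python
# Toy lambda-architecture serving layer: queries merge a batch view
# (rebuilt from the immutable master dataset) with a realtime view
# (maintained incrementally by the speed layer).

batch_view = {}     # e.g. precomputed by a nightly MapReduce job
realtime_view = {}  # e.g. maintained by a Storm topology

def record_event(key, value):
    """Speed layer: fold one new event into the realtime view."""
    realtime_view[key] = realtime_view.get(key, 0) + value

def rebuild_batch_view(master_dataset):
    """Batch layer: recompute the view from scratch from all events."""
    batch_view.clear()
    for key, value in master_dataset:
        batch_view[key] = batch_view.get(key, 0) + value
    realtime_view.clear()  # simplification: assume the rebuild covered everything

def query(key):
    """Serving layer: merge the two views to answer a query."""
    return batch_view.get(key, 0) + realtime_view.get(key, 0)
```

The point of the split is that the batch layer stays simple and recomputable while the speed layer only has to absorb the small window of recent data.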

Jay Kreps

Kreps, an engineer at LinkedIn and lead developer of Apache Kafka, responds with counterpoints to Marz's "lambda architecture". Fair criticisms, though you should still absolutely read the Marz book if you're tasked with designing a system like this.
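
His core counterpoint, roughly paraphrased: instead of maintaining separate batch and speed codebases, keep raw events in a replayable log and reprocess by running a second instance of the same stream job from the beginning of the log, then swap the output in. A hand-wavy sketch of that idea (the names are mine, not his):

```python
def reprocess(log, apply_event):
    """Replay the full event log through the same processing function
    used for live traffic, building a fresh output table; once it has
    caught up, it replaces the old serving table."""
    new_table = {}
    for event in log:  # replay from offset 0
        apply_event(new_table, event)
    return new_table
```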

Jay Kreps

If you're an engineer I think this post will fascinate you; if you're not, you probably won't slog through it :-) It covers a lot of ground: approaching system design in terms of event streams, data-flow abstractions that unify batch and stream processing, and "data integration" to make data available to every system within an organization. A good reminder that "Big Data" is often a matter of solving surprisingly simple problems under very demanding scale and latency requirements.
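
The post's central abstraction is the append-only log: writers append records, and every consumer tracks its own offset into the log, which is what lets batch and streaming consumers share one source of data. A toy version, purely for illustration:

```python
class Log:
    """A toy append-only log in the spirit of the post: a 'batch' job
    is just a reader that starts from offset 0, while a streaming job
    is a reader that keeps chasing the tail."""

    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1  # the new record's offset

    def read_from(self, offset):
        """Return (records, next_offset) for everything at/after offset."""
        return self.records[offset:], len(self.records)

log = Log()
log.append({"user": 1, "action": "click"})
log.append({"user": 2, "action": "purchase"})
history, tail = log.read_from(0)   # batch consumer: replay everything
new, tail = log.read_from(tail)    # stream consumer: poll for what's new
```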

?? (thanks Jean Jordaan for sharing)

I mentioned that HDFS, the Hadoop Distributed File System, will likely live a longer life than the MapReduce divide-and-conquer processing framework that is the other banner component of Hadoop. Projects like Cascading and Scalding have been building higher-level tools atop M/R for some time, and Spark is now rapidly gaining momentum as an entirely new and broader approach. This article gives some perspective on the evolution.
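
To make the contrast concrete, here is the canonical word count in Spark's Python API; what classic MapReduce expresses as a mapper, a reducer, and a pile of job plumbing collapses into a few chained transformations. A sketch assuming a local Spark installation and an input file named words.txt:

```python
from pyspark import SparkContext

sc = SparkContext("local", "wordcount")

counts = (sc.textFile("words.txt")               # one record per line
            .flatMap(lambda line: line.split())  # the "map" phase
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))    # the "reduce" phase

for word, count in counts.collect():
    print(word, count)
```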

Libraries for the R statistical programming language to run parallel cluster computations using the Hadoop infrastructure, and to interact with data stored on HDFS or HBase.
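
Those libraries are R-specific (rmr2 for MapReduce, rhdfs and rhbase for storage access), but the pattern they wrap is the same one Hadoop Streaming exposes to any language: a mapper reading lines from stdin and emitting tab-separated key/value pairs, and a reducer aggregating the sorted output. For illustration here in Python rather than R:

```python
#!/usr/bin/env python
# mapper.py: Hadoop Streaming mapper, emits "word<TAB>1" per word.
import sys

for line in sys.stdin:
    for word in line.split():
        print("%s\t%d" % (word, 1))
```

```python
#!/usr/bin/env python
# reducer.py: input arrives sorted by key; sum the counts per word.
import sys

current_word, total = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word != current_word:
        if current_word is not None:
            print("%s\t%d" % (current_word, total))
        current_word, total = word, 0
    total += int(count)

if current_word is not None:
    print("%s\t%d" % (current_word, total))
```

These would be submitted with the hadoop-streaming JAR that ships with a Hadoop distribution, passing mapper.py and reducer.py as the map and reduce commands.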

A simple RHadoop demo from the Hortonworks Hadoop distribution, using a single-system virtual machine for easy installation and testing. Most other distributions (Cloudera, MapR, Pivotal) also offer a ready-to-use downloadable virtual machine image for trying out Hadoop and its supporting tools. It illustrates a bit of Hive as well, the tool for running data warehouse queries over HDFS data using a SQL dialect.
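
As a small taste of what Hive's SQL dialect looks like over files already sitting in HDFS, a sketch driven from Python via the hive command-line client; the table, columns, and HDFS path are all invented for the example:

```python
import subprocess

# Declare a table over existing tab-delimited files in HDFS, then run
# an aggregate query; `hive -e` executes a quoted HiveQL string.
hiveql = """
CREATE EXTERNAL TABLE IF NOT EXISTS pageviews (ts STRING, url STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
    LOCATION '/data/pageviews';
SELECT url, COUNT(*) AS hits FROM pageviews GROUP BY url
    ORDER BY hits DESC LIMIT 10;
"""
subprocess.run(["hive", "-e", hiveql], check=True)
```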

HBase is a column-family database, modeled on Google's Bigtable and using HDFS for its storage, which can be suitable for online transaction processing workloads (random-access data retrieval, not sequential batch processing). Its data model is comparable to Apache Cassandra's, which does not build upon HDFS but does offer capable Hadoop integration. This talk is enjoyable in particular for NoSQL geeks. There used to be a better presentation/recording of this talk available on Cloudera's site behind an email collection wall, but it seems to be linkrot now :-(
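
To give a feel for that data model (rows addressed by a byte-string key, with cells grouped into column families), here is a sketch using the happybase Python client; the table and column names are invented, and it assumes an HBase Thrift server running on localhost:

```python
import happybase

connection = happybase.Connection("localhost")  # HBase Thrift gateway
table = connection.table("users")               # hypothetical table

# Random-access writes and reads by row key, the OLTP-ish access
# pattern described above. "info" is a column family.
table.put(b"user42", {b"info:name": b"Somchai", b"info:city": b"Bangkok"})
print(table.row(b"user42")[b"info:name"])

# Scans over a contiguous range of row keys are the sequential counterpart.
for key, data in table.scan(row_prefix=b"user"):
    print(key, data)
```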

Hadoop has the far greater mindshare, but this alternative map/reduce implementation and distributed filesystem originally developed at Nokia serves to show that tools can be developed beyond the Java Virtual Machine ecosystem. Disco's distributed guts are built in Erlang, and it exposes map/reduce APIs in Python.
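
Disco's Python face is pleasantly small. Its word-count example, adapted from the project's documentation (so treat the details as approximate for whatever version you run), looks like this:

```python
from disco.core import Job, result_iterator

def map(line, params):
    # Map phase: emit (word, 1) for every word in an input line.
    for word in line.split():
        yield word, 1

def reduce(iter, params):
    # Reduce phase: group the sorted pairs by word and sum the counts.
    from disco.util import kvgroup
    for word, counts in kvgroup(sorted(iter)):
        yield word, sum(counts)

if __name__ == "__main__":
    job = Job().run(input=["http://discoproject.org/media/text/chekhov.txt"],
                    map=map, reduce=reduce)
    for word, count in result_iterator(job.wait(show=True)):
        print(word, count)
```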
