This originated from @jboner's tweet (https://twitter.com/jboner/status/588806186667024385).

I was going to email @benjchristensen, but @paulrpayne suggested this may not be the right way to conclude our participation in a Twitter thread about Lambda architecture semantics, stream processing, and data partitioning. Here are some of my thoughts on this topic, as well as my experience building and running such services.
The Lambda architecture's core concept is that ingested/incoming events/messages/datums are forwarded to two different layers: one (the speed layer) buffers them as-is, or with little processing/transformation, while the other (the batch layer) persists them on disk. Frequently, depending on the context and needs, background tasks perform I/O- and compute-intensive transformations and store the results in a batch-layer datastore (e.g. rollups, aggregates for some dimensions and ranges, etc.).

The idea is that incoming queries are executed against both the speed layer, which buffers the (usually raw) data, and the batch layer, and the outputs are merged to produce a single materialized value/response.
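
To make that merge step concrete, here is a minimal sketch in Python; the hourly-rollup batch store, the in-memory speed buffer, and the count query are made-up illustrations, not any particular system's API:

```python
# Hypothetical stores: a batch view of precomputed hourly rollups,
# and a speed buffer holding recent raw events (as timestamps here).
from datetime import datetime

batch_store = {}    # {hour_bucket: count}, rebuilt offline by batch jobs
speed_buffer = []   # raw event timestamps not yet folded into the batch view

def query_count(start: datetime, end: datetime, batch_horizon: datetime) -> int:
    """Answer a count query by merging the batch view (data older than
    the batch horizon) with the speed layer (the recent tail)."""
    # 1. Everything up to the batch horizon comes from the rollups.
    batch_part = sum(
        count for bucket, count in batch_store.items()
        if start <= bucket < min(end, batch_horizon)
    )
    # 2. The recent tail is a scan over the raw in-memory buffer.
    speed_part = sum(
        1 for event_time in speed_buffer
        if max(start, batch_horizon) <= event_time < end
    )
    # 3. Merge into a single materialized response.
    return batch_part + speed_part
```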
There is nothing particularly novel here, except that it now has a name ('Lambda architecture'), and that it has been gaining popularity.

Ben suggested that you don't really need multiple distinct systems to execute queries; a good streaming infrastructure should be able to do it, regardless of your data aggregation and storage strategies. He is right, of course.
I was arguing that, while you can definitely compute a response by processing all data from [time 0, now] for every new request (caching not discussed in this context), it can potentially be expensive in terms of latency and the resources needed to pull it off.

That is, to be able to execute a query that needs to access a (say) multi-TB dataset in 'real time', you'd need to partition it across many shards/nodes, so that each kernel can process a subset of the [time 0, now] span in parallel (optimally, local to the node that holds the data). You'd probably need to employ fanout strategies, backup requests with request cancellations (see Jeff Dean's "Achieving Rapid Response Times in Large Online Services") in order to deal with slow nodes that can stall everything else, state checkpoints with fast recovery to deal with kernel failures, forward-chain data flows, and maybe even super-nodes (see: how Skype works) or an aggregation-node hierarchy, and so on.
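
For illustration, here is a minimal sketch of the backup-request idea from Dean's talk; fetch_from_replica, the replica list, and the hedge delay are all hypothetical placeholders:

```python
import concurrent.futures

# Typically set near a high percentile (e.g. p95) of shard latency,
# so backups only fire for stragglers.
HEDGE_DELAY_SECS = 0.010

def fetch_from_replica(replica, query):
    """Placeholder for the real RPC to one shard; assumed to return
    that shard's partial result for the query."""
    raise NotImplementedError

def hedged_query(query, replicas, pool: concurrent.futures.ThreadPoolExecutor):
    primary = pool.submit(fetch_from_replica, replicas[0], query)
    try:
        # Fast path: the primary answers before the hedge delay expires.
        return primary.result(timeout=HEDGE_DELAY_SECS)
    except concurrent.futures.TimeoutError:
        # Slow path: issue a backup request and take the first response.
        backup = pool.submit(fetch_from_replica, replicas[1], query)
        done, pending = concurrent.futures.wait(
            [primary, backup],
            return_when=concurrent.futures.FIRST_COMPLETED)
        for f in pending:
            f.cancel()  # best-effort cancellation of the straggler
        return next(iter(done)).result()
```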
I believe many of those problems are solved in popular OSS stream processing systems, though I haven't gotten a chance to study them properly yet.
The tradeoff in this case is that the more you rely on offline processing and transformation, the less compute and data access time you'll need: fewer resources, lower latency. On the other hand, if you want to change your processing logic, you'd have to rebuild all that aggregated data and/or create and maintain even more offline transformations.
Contrast this with doing away with offline transformations and always computing the query response for the whole span of [time 0, now]: you can change the logic effortlessly and there is no need to maintain any parallel datasets; it's simple and nice, but you need to do so much more to hit your latency targets, and chances are good that you may not be able to do it anyway unless you secure more hardware. So, considering the pros and cons, you do what's right.
One of our services generates GBs of events daily, and we need to be able to generate real-time reports based on those events across multiple dimensions; the queries are often very complex. We almost never use OSS or third-party/proprietary software (we build everything in-house, but that's a talk for another time ;)), and so we have our own infrastructure for this sort of thing (see earlier for some design characteristics of that system).
Initially, we relied on data partitioning and would use many kernels to access the data, and we were able to run the query in sub-second time, hitting our goal. As the dataset expanded, though, it was time to decide whether we wanted to buy more hardware (we generally don't like that alternative) or consider offline transformation schemes (see earlier) so that we'd need far less compute and data access time. We switched to a Lambda-like architecture and it has worked fine so far. But it's still one system: the same kernels that process the buffered/recent events also process the transformed data (aggregations) -- there are no distinct systems for different types of data.
As @mike_acton said in his 'Data-Oriented Design and C++' talk (https://www.youtube.com/watch?v=rX0ItVEVjHc):

- The purpose of all programs, and all parts of those programs, is to transform data from one form to another.
- If you don't understand the data, you don't understand the problem.
- Conversely, understand the problem by understanding the data.
- Different problems require different solutions.
- If you have different data, you have a different problem.
- If you don't understand the cost of solving the problem, you don't understand the problem.
Thanks for the write-up ... much easier to discuss this type of topic with more than 140 characters at a time!
Fundamentally I feel that the issues with the need for a "Lambda Architecture" are found in the implied statements that (a) a stream processing system can't be reliable and fault tolerant and (b) that the serving layer can't be updated quickly.
I pull these from this paragraph stating the intent of an LA:
The LA aims to satisfy the needs for a robust system that is fault-tolerant, both against hardware failures and human mistakes, being able to serve a wide range of workloads and use cases, and in which low-latency reads and updates are required.
A stream processing system can indeed be fault tolerant, even if just through simple checkpointing mechanisms such as offered by Kafka and Samza (as just one example implementation).
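For example, a minimal checkpointing loop with the kafka-python client might look like the following; the topic name, group id, and process function are made up, and replays after a crash mean processing should be idempotent (at-least-once semantics):

```python
from kafka import KafkaConsumer

def process(payload: bytes) -> None:
    ...  # application logic; assumed idempotent, since replays can happen

consumer = KafkaConsumer(
    "events",                          # hypothetical topic
    group_id="report-builder",         # hypothetical consumer group
    bootstrap_servers="localhost:9092",
    enable_auto_commit=False,          # we checkpoint explicitly
)

for message in consumer:
    process(message.value)
    consumer.commit()  # checkpoint only after processing succeeds, so a
                       # restarted consumer resumes from its last offset
```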
As for the speed of updating the indexes, that seems like an orthogonal concern to how the data is processed. For example, if I throw all my processed data into a Lucene/Solr/ElasticSearch layer for retrieval, why should it matter to the data processing layer if that system needs to have multiple layers of indexes that are merged at search time and sometimes goes through compaction lifecycles?
Additionally, many use cases can easily use a serving tier that supports rapid update, such as simple distributed key/value databases.
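As a sketch of what such a rapid-update serving tier looks like, here is a tiny example using Redis via redis-py; the key names and report shape are hypothetical:

```python
import redis

r = redis.Redis(host="localhost", port=6379)

def on_event(dimension: str, amount: int) -> None:
    # Atomic in-place increment: the "update" path and the "serve" path
    # share one store, so readers are at most one event behind.
    r.hincrby("report:daily", dimension, amount)

def read_report() -> dict:
    return r.hgetall("report:daily")
```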
I find it far simpler to build a single robust processing system that can read either historical data or a real-time stream, define the data processing in a single job (rather than two that must be kept compatible), and have a single serving layer without needing to do a complicated merge of "historical" and "realtime".
Other (more informed) thinking on the topic can be found at these links:
Have Your Cake and Eat It Too -- Further Dispelling the Myths of the Lambda Architecture – https://qconsf.com/presentation/have-your-cake-and-eat-it-too-further-dispelling-myths-lambda-architecture
Questioning the Lambda Architecture – http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html
I think the essential point of quandary is a statement like this one from lambda-architecture.net's "What is Lambda Architecture?": "The speed layer compensates for the high latency of updates to the serving layer"
In engineering terms, if your application is not responsive enough because updates (the C in CQRS) are not reflected fast enough in your queries (the Q in CQRS), then you need to do something. Lambda Arch's "do it on the side and merge it in" seems to be one approach; the other might be "make updates faster".
Not having practical experience on a large-scale application, that's about as far as my thoughts go. My intuition is that lambda is a stopgap measure and the real solution would be to find ways to remove latency in updates; perhaps I-confluence, CRDTs, a re-thinking of your materialized views, etc.
Re-computing time 0 to now on every query seems like a real problem. With judicious use of the above techniques I would hope that wouldn't be necessary: updates wouldn't be as slow, and a separate processing/merge step wouldn't be needed... or rather, the "speed layer" is just an element of the pipeline, but there is no merge step. A simple example would be computing a cumulative average with incremental updates rather than brute-force recounting from zero every time; or perhaps calculating a moving average rather than a cumulative one, if your application is just as happy.
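For instance, here is a minimal sketch of both alternatives; plain Python, no particular framework assumed:

```python
from collections import deque

class CumulativeAverage:
    """Update the running mean in place: O(1) per event,
    no re-scan of [time 0, now]."""
    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, x: float) -> float:
        self.count += 1
        self.mean += (x - self.mean) / self.count  # incremental mean update
        return self.mean

class MovingAverage:
    """Average over only the last `window` values, for applications
    that don't need the full history."""
    def __init__(self, window: int):
        self.values = deque(maxlen=window)

    def update(self, x: float) -> float:
        self.values.append(x)
        return sum(self.values) / len(self.values)
```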
I'm sure not having experience in your specific domain or system parameters makes me over-simplify. But my intuition resonates with @jboner's comment.