This originated from @jboner's tweet (https://twitter.com/jboner/status/588806186667024385).

I was going to email @benjchristensen, but @paulrpayne suggested this may not be the right way to conclude our participation in a Twitter thread about Lambda architecture semantics, stream processing, and data partitioning. Here are some of my thoughts on this topic, as well as my experience building and running such services.
The Lambda architecture's core concept is that ingested/incoming events/messages/datums are forwarded to two different layers: one (the speed layer) buffers them as-is, or with little processing/transformation, while the other (the batch layer) persists them on disk. Frequently, depending on the context and needs, background tasks perform I/O- and compute-intensive transformations and store the results in a batch-layer datastore (e.g. rollups, aggregates for some dimensions and ranges, etc.).

The idea is that incoming queries are executed against both the speed layer, which buffers the (usually raw) data, and the batch layer, and the outputs are merged to produce a single materialized value/response.
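
To make that merge step concrete, here is a minimal sketch in Python; the hourly-rollup batch store, the in-memory speed buffer, and the count query are made-up illustrations, not any particular system's API:

```python
# Hypothetical stores: a batch view of precomputed hourly rollups,
# and a speed buffer holding recent raw events (as timestamps here).
from datetime import datetime

batch_store = {}    # {hour_bucket: count}, rebuilt offline by batch jobs
speed_buffer = []   # raw event timestamps not yet folded into the batch view

def query_count(start: datetime, end: datetime, batch_horizon: datetime) -> int:
    """Answer a count query by merging the batch view (data older than
    the batch horizon) with the speed layer (the recent tail)."""
    # 1. Everything up to the batch horizon comes from the rollups.
    batch_part = sum(
        count for bucket, count in batch_store.items()
        if start <= bucket < min(end, batch_horizon)
    )
    # 2. The recent tail is a scan over the raw in-memory buffer.
    speed_part = sum(
        1 for event_time in speed_buffer
        if max(start, batch_horizon) <= event_time < end
    )
    # 3. Merge into a single materialized response.
    return batch_part + speed_part
```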
There is nothing particularly novel here, except that it now has a name ('Lambda architecture'), and that it has been gaining popularity.

Ben suggested that you don't really need multiple distinct systems to execute queries; a good streaming infrastructure should be able to do it, regardless of your data aggregation and storage strategies. He is right, of course.
I was arguing that, while you can definitely compute a response by processing all data from [time 0, now] for every new request (caching not discussed in this context), it can potentially be expensive in terms of latency and the resources needed to pull it off.

That is, to be able to execute a query that needs to access a (say) multi-TB dataset in 'real time', you'd need to partition it across many shards/nodes, so that each kernel can process a subset of the [time 0, now] span in parallel (optimally, local to the node that holds the data). You'd probably need to employ fanout strategies, backup requests with request cancellations (see Jeff Dean's "Achieving Rapid Response Times in Large Online Services") in order to deal with slow nodes that can stall everything else, state checkpoints with fast recovery to deal with kernel failures, forward-chain data flows, and maybe even super-nodes (see: how Skype works) or an aggregation-node hierarchy, and so on.
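
For illustration, here is a minimal sketch of the backup-request idea from Dean's talk; fetch_from_replica, the replica list, and the hedge delay are all hypothetical placeholders:

```python
import concurrent.futures

# Typically set near a high percentile (e.g. p95) of shard latency,
# so backups only fire for stragglers.
HEDGE_DELAY_SECS = 0.010

def fetch_from_replica(replica, query):
    """Placeholder for the real RPC to one shard; assumed to return
    that shard's partial result for the query."""
    raise NotImplementedError

def hedged_query(query, replicas, pool: concurrent.futures.ThreadPoolExecutor):
    primary = pool.submit(fetch_from_replica, replicas[0], query)
    try:
        # Fast path: the primary answers before the hedge delay expires.
        return primary.result(timeout=HEDGE_DELAY_SECS)
    except concurrent.futures.TimeoutError:
        # Slow path: issue a backup request and take the first response.
        backup = pool.submit(fetch_from_replica, replicas[1], query)
        done, pending = concurrent.futures.wait(
            [primary, backup],
            return_when=concurrent.futures.FIRST_COMPLETED)
        for f in pending:
            f.cancel()  # best-effort cancellation of the straggler
        return next(iter(done)).result()
```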
I believe many of those problems are solved in popular OSS stream processing systems, though I haven't gotten a chance to study them properly yet.
The tradeoff in this case is that the more you rely on offline processing and transformation, the less compute and data access time you'll need: fewer resources, lower latency. On the other hand, if you want to change your processing logic, you'd have to rebuild all that aggregated data and/or create and maintain even more offline transformations.
Contrast this with doing away with offline transformations and always computing the query response for the whole span of [time 0, now]: you can change the logic effortlessly and there is no need to maintain any parallel datasets; it's simple and nice, but you need to do so much more to hit your latency targets, and chances are good that you may not be able to do it anyway unless you secure more hardware. So, considering the pros and cons, you do what's right.
One of our services generates GBs of events daily, and we need to be able to generate real-time reports based on those events across multiple dimensions; the queries are often very complex. We almost never use OSS or third-party/proprietary software (we build everything in-house, but that's a talk for another time ;)), and so we have our own infrastructure for this sort of thing (see earlier for some design characteristics of that system).
Initially, we relied on data partitioning and would use many kernels to access the data, and we were able to run the query in sub-second time, hitting our goal. As the dataset expanded, though, it was time to decide whether we wanted to buy more hardware (we generally don't like that alternative) or consider offline transformation schemes (see earlier) so that we'd need far less compute and data access time. We switched to a Lambda-like architecture and it has worked fine so far. But it's still one system: the same kernels that process the buffered/recent events also process the transformed data (aggregations) -- there are no distinct systems for different types of data.
As @mike_acton said in his 'Data-Oriented Design and C++' talk (https://www.youtube.com/watch?v=rX0ItVEVjHc):

- The purpose of all programs, and all parts of those programs, is to transform data from one form to another.
- If you don't understand the data, you don't understand the problem.
- Conversely, understand the problem by understanding the data.
- Different problems require different solutions.
- If you have different data, you have a different problem.
- If you don't understand the cost of solving the problem, you don't understand the problem.
Thanks for the write-up ... much easier to discuss this type of topic with more than 140 characters at a time!
Fundamentally I feel that the issues with the need for a "Lambda Architecture" are found in the implied statements that (a) a stream processing system can't be reliable and fault tolerant and (b) that the serving layer can't be updated quickly.
I pull these from this paragraph stating the intent of an LA:
The LA aims to satisfy the needs for a robust system that is fault-tolerant, both against hardware failures and human mistakes, being able to serve a wide range of workloads and use cases, and in which low-latency reads and updates are required.
A stream processing system can indeed be fault tolerant, even if just through simple checkpointing mechanisms such as offered by Kafka and Samza (as just one example implementation).
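For example, a minimal checkpointing loop with the kafka-python client might look like the following; the topic name, group id, and process function are made up, and replays after a crash mean processing should be idempotent (at-least-once semantics):

```python
from kafka import KafkaConsumer

def process(payload: bytes) -> None:
    ...  # application logic; assumed idempotent, since replays can happen

consumer = KafkaConsumer(
    "events",                          # hypothetical topic
    group_id="report-builder",         # hypothetical consumer group
    bootstrap_servers="localhost:9092",
    enable_auto_commit=False,          # we checkpoint explicitly
)

for message in consumer:
    process(message.value)
    consumer.commit()  # checkpoint only after processing succeeds, so a
                       # restarted consumer resumes from its last offset
```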
As for the speed of updating the indexes, that seems like an orthogonal concern to how the data is processed. For example, if I throw all my processed data into a Lucene/Solr/ElasticSearch layer for retrieval, why should it matter to the data processing layer if that system needs to have multiple layers of indexes that are merged at search time and sometimes goes through compaction lifecycles?
Additionally, many use cases can easily use a serving tier that supports rapid update, such as simple distributed key/value databases.
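As a sketch of what such a rapid-update serving tier looks like, here is a tiny example using Redis via redis-py; the key names and report shape are hypothetical:

```python
import redis

r = redis.Redis(host="localhost", port=6379)

def on_event(dimension: str, amount: int) -> None:
    # Atomic in-place increment: the "update" path and the "serve" path
    # share one store, so readers are at most one event behind.
    r.hincrby("report:daily", dimension, amount)

def read_report() -> dict:
    return r.hgetall("report:daily")
```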
I find it far simpler to build a single robust processing system that can read either historical data or a real-time stream, define the data processing in a single job (rather than two that must be kept compatible), and have a single serving layer without needing to do a complicated merge of "historical" and "realtime".
Other (more informed) thinking on the topic can be found at these links:
Have Your Cake and Eat It Too -- Further Dispelling the Myths of the Lambda Architecture – https://qconsf.com/presentation/have-your-cake-and-eat-it-too-further-dispelling-myths-lambda-architecture
Questioning the Lambda Architecture – http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html
I think the essential point of quandary is a statement like this one from lambda-architecture.net's "What is Lambda Architecture?": "The speed layer compensates for the high latency of updates to the serving layer"
In engineering terms, if your application is not responsive enough because updates (the C in CQRS) are not reflected fast enough in your queries (the Q in CQRS), then you need to do something. Lambda Arch's "do it on the side and merge it in" seems to be one approach; the other might be "make updates faster".
Not having practical experience on a large-scale application, that's about as far as my thoughts go. My intuition is that lambda is a stopgap measure and the real solution would be to find ways to remove latency in updates; perhaps I-confluence, CRDTs, a re-thinking of your materialized views, etc.
Re-computing time 0 to now on every query seems like a real problem. With judicious use of the above techniques I would hope that wouldn't be necessary: updates wouldn't be as slow, and a separate processing/merge step wouldn't be needed... or rather, the "speed layer" is just an element of the pipeline, but there is no merge step. A simple example would be computing a cumulative average with incremental updates rather than brute-force recounting from zero every time; or perhaps calculating a moving average rather than a cumulative one, if your application is just as happy.
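For instance, here is a minimal sketch of both alternatives; plain Python, no particular framework assumed:

```python
from collections import deque

class CumulativeAverage:
    """Update the running mean in place: O(1) per event,
    no re-scan of [time 0, now]."""
    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, x: float) -> float:
        self.count += 1
        self.mean += (x - self.mean) / self.count  # incremental mean update
        return self.mean

class MovingAverage:
    """Average over only the last `window` values, for applications
    that don't need the full history."""
    def __init__(self, window: int):
        self.values = deque(maxlen=window)

    def update(self, x: float) -> float:
        self.values.append(x)
        return sum(self.values) / len(self.values)
```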
I'm sure not having experience in your specific domain or system parameters makes me over-simplify. But my intuition resonates with @jboner's comment.