This post is in response to the WeekendDevPuzzle of 2022-01-29, which has submissions from people who chose to share their perspectives on the puzzle. I'll be linking this post to this Twitter thread, to avoid polluting the original one.
Unlike the usual peel-the-onion kind of topics, this one focuses more on how we think & on our mental models of architectural decisions. But there's more.
Over the past 15 yrs, programming has become extremely accessible. While that's undoubtedly a good thing, it has also become much easier to just rely on best practices, with or without context. Today's puzzle was designed to bring that context to the discussion table, albeit in a super simplified fashion.
Let's dissect the question, and bring out the core elements of it.
For a distributed system where the call graph looks like A → B → C, it's easy to see that a call succeeds only when all of the components are available. We can write this as: `P(U) = P(A) * P(B) * P(C)`, where `P(U)` is the probability of the call being processed correctly, as seen by the calling user, `P(A)` is the probability of component A being up, and so on.
Side note: If reading the word probability immediately switched off your mind, don't worry. We'll keep the maths super basic, given our super simplified scenarios. Who knows, it might even encourage you to get more comfortable with it!
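If you'd like something to play with, here's a minimal Python sketch of that serial-availability product. The component numbers are made up purely for illustration:

```python
from functools import reduce

def serial_availability(*components: float) -> float:
    """Availability of a serial chain A -> B -> C: the product of the parts."""
    return reduce(lambda acc, p: acc * p, components, 1.0)

# Hypothetical availabilities for A, B and C, purely for illustration
print(f"{serial_availability(0.999, 0.995, 0.998):.4%}")  # ~99.2017%
```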
It would seem like we have the answer to our puzzle then. But do we? Let's dig deeper.
The puzzle mentions the following:
- LB has never gone down.
- Web servers appear to go down for ~3 hrs/mo.
- DB appears to go down for ~1 hr/mo.

What do you think it means for `P(LB)`, `P(WEB)` and `P(DB)`?

- Is `P(LB) == 1.0`, as we've never observed it go down? Is observed availability the same as designed availability? Maybe the switchover (planned flip from active to standby) incurs no loss, but failover (active fails, standby picks up) incurs 4 secs of loss. What number would you take then?
- For `P(WEB)`, it would seem that it should be `1 - 3/(30*24) == 99.58%` (3 hrs down out of 24x30 hrs in a month). But there are two issues with that:
  - It's not stated if this downtime includes the DB-related outages. If it does, then what we're calculating here is really `P(WEB) * P(DB)`, and not `P(WEB)` alone. Can you see why?
  - As with the LB above, this is more like observed availability than designed availability.
- For `P(DB)`, the observed availability should be `1 - 1/(30*24) == 99.86%`.
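To make the arithmetic concrete, here's a tiny sketch of the observed-availability calculation (assuming a 30-day month, as above):

```python
HOURS_PER_MONTH = 30 * 24  # 720 hrs, assuming a 30-day month

def observed_availability(downtime_hours: float) -> float:
    """Fraction of the month a component was observed to be up."""
    return 1 - downtime_hours / HOURS_PER_MONTH

print(f"P(WEB) ~ {observed_availability(3):.2%}")  # 99.58%
print(f"P(DB)  ~ {observed_availability(1):.2%}")  # 99.86%
```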
So, at least we now have a richer perspective on Scenario A, even if more questions got added. What about Scenarios B & C? After all, the puzzle made no mention of downtime numbers for them.
Let's start with Scenario B:
- `P(LB)` - not defined explicitly, but is there any reason for it to change?
- `P(WEB)` - would this change? As we're separating out some code from it, this code is changing. Clearly, that should change its availability too. But for better or for worse?
- `P(LB2)` - not defined explicitly, but is there any reason for it to be different from `P(LB)` above?
- `P(SVC)` - not defined explicitly. Given that this service was created by carving out a piece of code from the web server, are there assumptions we can make about it, relative to `P(WEB)`?
Similarly, we can analyse the picture for Scenario C, with only one additional point of interest:
- The smart client - this topology-aware client will have its own availability. Remember that for it to be topology aware, it'll need to fetch that information over the network. As such, even if it's perfectly implemented, its availability is going to be bounded by that of the network, `P(N)` (am I correct in this statement here?).
Is there anything else we're missing?
We made an implicit assumption above: that our network is perfect. But that isn't always the case (cable breaks, SFP failures, OS network table full, network saturation, router misconfiguration; the list is long). A surprising number of engineers tend to assume a perfect network simply because they've never observed it fail, until it does, again and again.
Let's denote the network's availability as `P(N) < 1.0`. Clearly, every distributed call over that network has its availability reduced by this factor. So, Scenario A would look something like `P(U) = P(LB) * P(N) * P(WEB) * P(N) * P(DB)`. Two network hops have meant a reduction in availability by a factor of `P(N)^2`. Can you see why?
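To see the size of that dent, here's a small sketch using the observed numbers from above, plus assumed values for `P(LB)` and `P(N)` (99.995% each; we'll formalise these assumptions shortly):

```python
P_LB, P_WEB, P_DB = 0.99995, 1 - 3/720, 1 - 1/720  # P(LB) assumed, rest observed
P_N = 0.99995                                      # assumed network availability

ideal = P_LB * P_WEB * P_DB              # pretending the network is perfect
real = P_LB * P_N * P_WEB * P_N * P_DB   # two hops cost us a factor of P(N)^2
print(f"{ideal:.4%} -> {real:.4%}")      # 99.4401% -> 99.4301%
```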
Nitesh correctly points this out, though he mistakes my comment about observed availability for one about designed availability.
There are a few other implicit assumptions being made here, but for the sake of brevity, let's ignore them for now.
At this point, some of you might be thinking: "WTH! Dissecting it was supposed to simplify the question, not add to it!". Let's take a deep breath.
Sometimes, breaking down a problem may noticeably amplify the number of factors, but the story from there only gets better, as we start hacking & slashing away at the problem, by making simplifying (yet informed and explicit) assumptions. So let's start doing that.
Let's quickly go over the calculations we did earlier:

- Scenario A: `P(A) = P(LB) * P(N) * P(WEB) * P(N) * P(DB)`
- Scenario B: `P(B) = P(LB) * P(N) * P(WEB) * P(N) * P(LB2) * P(N) * P(SVC) * P(N) * P(DB)`. Did you notice a hidden assumption here (that every single call requires every component, as against some calls being served by the web server alone)?
- Scenario C: `P(C) = P(LB) * P(N) * P(WEB) * P(SMART_CLIENT) * P(N) * P(SVC) * P(N) * P(DB)`. Some of you might notice how the removal of `P(LB2) * P(N)` might help C do better. Let's see if it does. Also, the hidden assumption of Scenario B applies here too.
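For those who want to play along, here's one way to encode the three formulas as Python functions. The parameter names mirror the `P(...)` terms; nothing here comes from the original puzzle:

```python
def p_scenario_a(lb: float, n: float, web: float, db: float) -> float:
    # LB -> WEB -> DB, with a network hop before WEB and before DB
    return lb * n * web * n * db

def p_scenario_b(lb: float, n: float, web: float, lb2: float,
                 svc: float, db: float) -> float:
    # LB -> WEB -> LB2 -> SVC -> DB, four network hops in total
    return lb * n * web * n * lb2 * n * svc * n * db

def p_scenario_c(lb: float, n: float, web: float, smart_client: float,
                 svc: float, db: float) -> float:
    # The smart client replaces LB2 and one network hop
    return lb * n * web * smart_client * n * svc * n * db
```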
I'm gonna make the following assumptions. The exact numbers don't matter; it's the model that matters:

- `P(LB) == 99.995%`, and it remains the same in all scenarios: `P(LB) == P(LB2)`
- `P(N) == 99.995%`
- `P(SMART_CLIENT) == 99.99%`. There are details to this which are tricky to capture, but let's run with a simplified model for now.
- `P(WEB)` and `P(DB)` are independent of each other, i.e. the web server's availability numbers do not include the DB's availability. We assume this for simplicity for now; you can always play around with the numbers later.
For the remaining parameters, let's evaluate two situations, H1 (for hypothesis 1) and H2.
H1 (refactoring improved availability because of reduced code complexity, no more buggy DB driver code eating up web resources, etc.):

- `P(WEB) == 99.7%` for Scenarios B & C
- `P(SVC) == 99.65%`

H2 (refactoring reduced availability due to poor coding, poor abstractions, less integration testing, etc.):

- `P(WEB) == 99.5%` for Scenarios B & C
- `P(SVC) == 99.58%`. It's difficult for me to visualise the microservice's availability going below that of a more complex web server that includes the same logic.
Let's see what we get:

| Scenario | H1 | H2 |
|---|---|---|
| A | 99.43% | 99.43% |
| B | 99.18% | 98.91% |
| C | 99.18% | 98.91% |
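If you'd like to verify the table, here's a self-contained sketch that plugs the assumptions above straight into the scenario formulas:

```python
LB = LB2 = N = 0.99995
SMART_CLIENT = 0.9999
DB = 1 - 1/720       # observed, ~99.86%
WEB_A = 1 - 3/720    # Scenario A keeps the observed ~99.58%

for label, web, svc in [("H1", 0.997, 0.9965), ("H2", 0.995, 0.9958)]:
    a = LB * N * WEB_A * N * DB
    b = LB * N * web * N * LB2 * N * svc * N * DB
    c = LB * N * web * SMART_CLIENT * N * svc * N * DB
    print(f"{label}: A={a:.2%}  B={b:.2%}  C={c:.2%}")
```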
So, this is very interesting. Whether you believe me or not, I didn't plan for the numbers to be like this. The following observations stand out to me:
- Scenario A is noticeably better than B or C, even in H1, i.e. where we're assuming an improvement in availability.
- Scenario B == Scenario C on availability.
- H2 being worse off in Scenarios B & C is no surprise, as we're deliberately assuming a worsening of post-refactor availabilities.
Let's analyse these outcomes one by one.
One can read this as "Breaking a monolith, while increasing individual availabilities, still reduced the overall availability." How did this happen? Well, two things:

- More moving parts means the product of probabilities decreases faster.
- The network being assumed to be 99.995% available added to the cost. Even if we assume `P(N) == 100%` (an ideal network), we still get `P(B) == 99.20%` for H1, which means that the number of moving parts tends to dominate.
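You can check that ideal-network number directly:

```python
# Scenario B under H1 with a perfect network: every P(N) factor becomes 1.0
p_b_ideal = 0.99995 * 0.99995 * 0.997 * 0.9965 * (1 - 1/720)
print(f"{p_b_ideal:.2%}")  # ~99.20%
```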
Can we assume this to be a universal truth? I'd say no. The takeaway here is that the number of moving parts impacts availability substantially, and when breaking a monolith, the individual availability improvements should be large enough to compensate for the reduction due to the increase in moving parts. E.g. if the refactoring in H1 had instead improved `P(WEB)` to `99.85%` and `P(SVC)` to `99.75%`, Scenarios B & C would become (marginally) better than A.
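Here's a quick check of that claim, with the other assumptions unchanged (note how thin the margin is):

```python
LB = LB2 = N = 0.99995
DB, WEB_A = 1 - 1/720, 1 - 3/720
a = LB * N * WEB_A * N * DB                          # Scenario A, as before
b = LB * N * 0.9985 * N * LB2 * N * 0.9975 * N * DB  # B with the improved numbers
print(f"A={a:.4%}  B={b:.4%}")  # A=99.4301%  B=99.4322%
```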
This is just an artifact of choosing `P(SMART_CLIENT) == 99.99%`, which happens to sit almost exactly at `P(LB2) * P(N)`. A higher number would put Scenario C above B, and a lower number would reverse that. The takeaway here is that when focussing on embedded smart clients, their code quality (which reflects in `P(SMART_CLIENT)`) needs to be very high for them to beat an HA dedicated hardware LB.
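The break-even point falls straight out of the two formulas: Scenario C beats B exactly when `P(SMART_CLIENT) > P(LB2) * P(N)`. With the numbers assumed here:

```python
breakeven = 0.99995 * 0.99995  # P(LB2) * P(N)
print(f"{breakeven:.4%}")      # ~99.9900%, which is exactly our P(SMART_CLIENT)
```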
This part, though so clear mathematically, was a bit of a surprise. The fact that I was surprised tells me that there was a bias in my head, a chink in my mental model that assumed embedded smart clients to be better. Reflecting deeper, I feel the performance implications were leaking into the availability assumptions in my head.
It's not the answer that's important here, it's the model or the factors that lead to the answer. The value of each parameter would vary depending upon the circumstances, so the outcome can be different. So, I'd request you to focus on the takeaways, instead of the answer.
Phew! This was one of the longer ones.
Some of you might say that I've taken a particularly elaborate (to the point of being unnecessary) approach to the puzzle. I wouldn't disagree with you. But, in my defense, this is supposed to be a weekend puzzle 😄, to be noodled over & over, in different forms, lazily & elaborately. It isn't designed to be a race.
My own reason for being this elaborate is simply that I wanted to lay out, as exhaustively as I could, all the different elements of the puzzle. For some of you, it might bring attention to a hidden chink in your mental model. For a few others, it might've helped convert implicit assumptions into explicit ones. For the remaining, it could either be an affirmation of how they thought, or an opportunity to help correct my calculations here.
Irrespective of whether you liked this elaborate approach, or even agree with my take on the puzzle, I hope you still had fun thinking about it, in all its different aspects.
If someone cites this post as a reason not to build out microservices, this comment should help.
A common reason why monoliths start becoming shaky (in terms of availability) is that different parts are moving at different speeds. Sometimes, moving the high-frequency changes into a separate service (if logical abstractions permit) can improve availability significantly.
Another reason to break a monolith can be the trade-off between availability and performance/cost. E.g. if a piece of code gets used in only 10% of the calls but takes up 70% of the memory, it might make sense to pull it out, because you can then keep the rest of the code quite lean (resource-consumption wise).
As I said at the beginning of the OP, context matters. There are no blind rules.