The Paved PaaS to Microservices at Netflix

TLDR

  • Key goals: velocity and reliability
  • Mean Time to Detect (MTTD) and Mean Time to Recovery (MTTR) are the key for reliability.
  • Consistency improves MTTD and MTTR, and therefore reliability.
  • Interoperability improves velocity.
  • A preassembled component needs to be flexible enough but convenient at the same time. Layers and flavours are the key to this.
  • For platform correctness CI and CD are everything.
  • Consolidated operations allow people to jump into production with ease

Some data

  • 100 million customers
  • 190 countries
  • 125 million hours streamed per day

Use case

  • Watch anything, anywhere: TV, phone, PC, ...
  • The data varies per device; for this they have the "Edge API"

Edge API: flexibility, velocity, etc. It consists of different endpoints for each client (iOS, Android, PS4, ...). Complex. They had 1000 instances running.

As a step forward, they broke each of those endpoints out into separate services: some in Node, some in Java, whatever.

The Netflix ethos is: you operate what you own.

The goals are velocity and reliability: we innovate quickly and things always work. Usually the trade-off is being either fast or reliable; having both is hard. We want both.

Standardized components

Microservices all share the same things: RPC, discovery, runtime, alerts, tracing, configuration, etc.

Why have standards for your microservices? To avoid misunderstandings. For example, in the case of RPC there's a ton to pick from, but you want microservice interactions to be simple, so sticking with one is important.
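
As a rough illustration of why settling on one RPC mechanism keeps interactions simple, here is a minimal Java sketch (all names are hypothetical, not Netflix's actual API) of a single shared client abstraction that every service codes against:

```java
import java.util.List;

// The single, blessed RPC abstraction that every service depends on.
interface RpcClient {
    <T> T call(String service, String method, Object request, Class<T> responseType);
}

record Recommendations(List<String> titleIds) {}

// A service only ever codes against the shared interface; discovery, retries,
// tracing and metrics live inside the one RpcClient implementation, not in
// every service.
class RecommendationsGateway {
    private final RpcClient rpc;

    RecommendationsGateway(RpcClient rpc) {
        this.rpc = rpc;
    }

    Recommendations forUser(String userId) {
        return rpc.call("recommendations", "forUser", userId, Recommendations.class);
    }
}
```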

MTTD and MTTR are the key metrics for the platform team. Every manager is always asking "Is it fixed yet?"

Differences between services make fixing things harder. No shared knowledge, tooling is more complex, etc. etc.

Consistency improves MTTD and MTTR

By not reinventing the wheel, people can focus on solving problems.

Interoperability improves velocity too.

By having only a small set of components you can keep improving them instead of spreading yourself thin and decreasing quality.

Culture: Freedom and Responsibility -> you can use whatever you want, which empowers innovation. But don't work in a vacuum: talk to people, discuss with your peers. Be careful with new tech; it might not be the best fit. Pick responsibly.

Assembling a new service is hard: tons of things to put together. Getting out of the blocks requires...

  • Read some docs
  • Copy/paste some sample code
  • Figure out which versions to use
  • Config
  • ...

Preassembled platform

What if we bundle all that together so it's pre-assembled? No need to write a single line of platform code. Just import that one library.
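
A minimal sketch of what "just import that one library" could look like, with a made-up Platform class standing in for the real pre-assembled bundle:

```java
// Hypothetical "Platform" entry point; in a real bundle, withDefaults() would
// register standard metrics, alerts, dashboards, tracing and discovery.
final class Platform {
    private final String serviceName;

    private Platform(String serviceName) {
        this.serviceName = serviceName;
    }

    static Platform bootstrap(String serviceName) {
        return new Platform(serviceName);
    }

    Platform withDefaults() {
        System.out.println("wiring platform defaults for " + serviceName);
        return this;
    }

    void start(Runnable endpoints) {
        endpoints.run(); // hand control over to the service's business logic
    }
}

class BillingService {
    public static void main(String[] args) {
        // The service writes no platform code: one bootstrap call wires everything up.
        Platform.bootstrap("billing")
                .withDefaults()
                .start(() -> System.out.println("billing endpoints up"));
    }
}
```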

Without the bundle, it's easy to miss something, some piece of config. This hurts reliability. Having to create alerts or dashboards manually for every service means some day we'll forget a few. It's only human.

Unless they're bundled, every service ends up emitting a different flavour of these metrics. This, again, increases MTTD and MTTR.

Which versions play nicely with which? By bundling things, this is a non-issue: it becomes a platform concern rather than a problem for the engineers writing the business logic.

What's maintenance vs convenience? What do we include in our base platform? Initially, keep it simple... and have folks maintain their own things. But that defeats the purpose. Solution? Layers and flavours: a base platform and then layers on top.

How do we ensure the platform's correctness? Test. Robust CI and CD. Test everything. Dogfood it with your own services.

How do we ensure that all components are correct? Lock down component versions so there's no variance: a platform version should pin the component versions. Updates require PRs, which trigger tests and the right checks.
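
A minimal sketch of the version lockdown idea (component names and version numbers are made up): a single platform release pins the exact set of component versions that were tested together.

```java
import java.util.Map;

// One platform version -> a fixed, tested set of component versions.
record PlatformRelease(String version, Map<String, String> components) {}

class PlatformReleases {
    static final PlatformRelease V2_3_0 = new PlatformRelease(
            "2.3.0",
            Map.of(
                    "rpc", "1.8.2",
                    "discovery", "4.1.0",
                    "metrics", "3.2.5",
                    "tracing", "0.9.1"));

    public static void main(String[] args) {
        // A service declares only the platform version; the component versions
        // come locked, so there is no variance across services.
        System.out.println("platform " + V2_3_0.version() + " -> " + V2_3_0.components());
    }
}
```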

Flexibility vs reliability and consistency: trade-offs. However, it's important not to lock yourself into a box. All abstractions require a small leak to be fully powerful -> if a colleague requests something the platform doesn't support, being able to peek inside the abstraction helps velocity.

Semantic versioning. Automate conventional changelogs.

Automation and tooling

Steps: dev, test, deploy, operate. Can we automate them?

A CLI for a common dev experience: a unified interface to do anything you need, such as bootstrapping a new service. It integrates with commonly used tools.

For local development... Docker. Attaching debuggers to Docker containers, etc. This also allows pulling down a prod container and debugging there.

Regarding testing, you want to test your business logic, not the platform. Provide first-class mocks, plus mock data generation. Who owns the mocks, though?

Just like you have a runtime API, you need a testing API: a mocks interface for the components.
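
A minimal sketch of what such a testing API could look like, reusing the hypothetical RpcClient interface from the earlier sketch: the platform ships a first-class mock next to the real component, so tests exercise business logic without touching the network.

```java
import java.util.HashMap;
import java.util.Map;

// Same hypothetical interface as in the earlier RPC sketch.
interface RpcClient {
    <T> T call(String service, String method, Object request, Class<T> responseType);
}

// A platform-provided mock: tests register canned responses instead of
// standing up real downstream services.
class MockRpcClient implements RpcClient {
    private final Map<String, Object> cannedResponses = new HashMap<>();

    void whenCalled(String service, String method, Object response) {
        cannedResponses.put(service + "/" + method, response);
    }

    @Override
    public <T> T call(String service, String method, Object request, Class<T> responseType) {
        return responseType.cast(cannedResponses.get(service + "/" + method));
    }
}
```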

Deployment

Production is war. How do we avoid footguns? Put people on rails: preconfigured pipelines for deployment and rollback, a single command to deploy to any stack, integration with automated canary analysis, and pre-configured autoscaling.
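
A minimal sketch of the automated canary analysis idea (metrics and thresholds are invented, not Netflix's actual criteria): compare the canary's key metrics against the current baseline and only continue the rollout if the canary is not measurably worse.

```java
// Snapshot of the standard metrics the platform already collects for every service.
record MetricSnapshot(double errorRate, double latencyP99Millis) {}

class CanaryAnalysis {
    // Promote only if the canary stays within 10% of the baseline.
    static boolean promote(MetricSnapshot baseline, MetricSnapshot canary) {
        boolean errorsOk = canary.errorRate() <= baseline.errorRate() * 1.1;
        boolean latencyOk = canary.latencyP99Millis() <= baseline.latencyP99Millis() * 1.1;
        return errorsOk && latencyOk;
    }

    public static void main(String[] args) {
        MetricSnapshot baseline = new MetricSnapshot(0.0020, 180);
        MetricSnapshot canary = new MetricSnapshot(0.0021, 175);
        System.out.println(promote(baseline, canary) ? "promote canary" : "roll back");
    }
}
```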

Operations

Generate a consolidated view for engineers for everything, so they don't need to learn too many tools. Since we know the platform, we can generate dashboards and alerts for every service.
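
A minimal sketch of that idea (metric names and thresholds are invented): because every service emits the same platform metrics, a standard set of alerts and dashboards can be generated from the service name alone.

```java
import java.util.List;

record Alert(String metric, String condition) {}

class StandardOperations {
    // Every service gets the same consolidated set of alerts on day one,
    // derived purely from the standard platform metrics it already emits.
    static List<Alert> alertsFor(String service) {
        return List.of(
                new Alert(service + ".rpc.error_rate", "> 1% for 5m"),
                new Alert(service + ".rpc.latency_p99", "> 500 ms for 5m"),
                new Alert(service + ".instances.healthy", "< 2"));
    }

    public static void main(String[] args) {
        alertsFor("billing").forEach(System.out::println);
    }
}
```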
