"What I Wish I Had Known Before Scaling Uber to 1000 Services"

Video: https://www.youtube.com/watch?v=kb-m2fasdDY

Slides: https://gotocon.com/dl/goto-chicago-2016/slides/MattRanney_WhatIWishIHadKnownBeforeScalingUberTo1000Services.pdf

Notes

200 engineers -> 2000 engineers in a year and a half

Regarding microservice deployment:

immutability
append only

Uber is more reliable on weekends... because there are fewer changes. If it works, don't touch it.

Why services?

releasing independently
owning uptime
using the "best" tool... (define "best" :)

About costs:

distributed system
RPC everywhere
what if it breaks

Subtle costs:

everything is a tradeoff -> You are ALWAYS giving things up, you might just not know yet which ones.
you can build around problems
trade complexity for politics -> Instead of having an awkward conversation with somebody, you can just write more software.
you get to keep your biases -> if you don't agree with a team or a person... instead of reaching an agreement, you just write more software.

Regarding languages:

Initially it was PHP, then Node.js and Python were introduced, and eventually everything started moving to Go, except data and maps, which are Java and Python.

that's a lot of languages
hard to share code
hard to move between teams -> knowledge transfers poorly from one team to another.
WIWIK (what I wish I knew): fragmented culture -> it is easy to end up with camps around the languages; that is inherent to human nature.

Regarding RPC:

HTTP/REST -> at scale, with lots of people joining every week, HTTP's weaknesses start to show: conventions around status codes, query strings, RESTfulness, etc. What you really want is simply to run a function on a server. Who cares about the semantics at this point?
JSON -> human readable, which is nice, but it is untyped. It becomes a mess later, not right away: somebody suddenly changes something, and two hops downstream a service was depending on the type, or on a subtle difference such as empty string vs null, or some other language quirk.
RPC is slow.
WIWIK: servers are not browsers
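A minimal sketch of the "empty string vs null" problem, assuming a service validates payloads at its boundary instead of passing raw dicts downstream. The payload shape (`rider_id`, `promo_code`) and the function name are made up for illustration; the talk notes do not prescribe this approach, it is just one way to make type drift fail early.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RiderUpdate:
    """Hypothetical payload; the field names are invented for this example."""
    rider_id: int
    promo_code: Optional[str]  # None means "no promo"; "" is rejected below


def parse_rider_update(raw: str) -> RiderUpdate:
    """Decode JSON and fail loudly on the type drift that plain dicts hide."""
    data = json.loads(raw)

    rider_id = data.get("rider_id")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(rider_id, int) or isinstance(rider_id, bool):
        raise TypeError(f"rider_id must be an integer, got {type(rider_id).__name__}")

    promo = data.get("promo_code")
    if promo is not None and not isinstance(promo, str):
        raise TypeError(f"promo_code must be a string or null, got {type(promo).__name__}")
    if promo == "":
        # Pick one representation of "absent" so callers two hops away agree.
        raise ValueError("promo_code must be null or a non-empty string")

    return RiderUpdate(rider_id=rider_id, promo_code=promo)
```

With this in place, an upstream change from `7` to `"7"` breaks at the first hop rather than somewhere two services downstream.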

How many repos?

Multiple repos are good for keeping stuff nicely separated, and even for open-sourcing parts of it.
One repo is good for cross-cutting changes.
Many is bad because it stresses out the build system, navigation system, etc.
One is bad because it gets so big that it becomes unmanageable. At Uber they had 8000 repos?

Operational

What happens when things break? -> Without getting into the downstream services issues, there are a lot of surprising people issues.
Can other teams release your service? -> You own your uptime, but another team needs a fix. Can they release your service as long as the tests pass? Is your automation good enough that other teams can release your service? Usually no. Usually you're blocked, end of story.
Understanding a service in the larger context -> The system is one single machine... even though it's broken into smaller parts. So it's important that we still understand the system working as one.

Performance

With so many languages, profiling depends on each language's tools. In Go you have pprof, but not all runtimes have that. They ended up investing time in a common format for all runtimes -> Flamegraphs.

If tooling is so different, it's friction.

Everybody wants dashboards, but if the standard ones are not easy to spin up, teams will end up building their own. This means that one team's dashboard will be radically different from another's. Getting a standard dashboard with a set of sensible defaults out of the box for each service is invaluable. Now you can even look at other teams' dashboards without knowing the service and still understand things.

But should you even care about performance?

it doesn't matter until it matters. -> the "buying computers is cheaper than hiring engineers" mantra holds... but then one day you have a perf problem, and the mantra no longer helps.
minimum simple perf requirements -> some minimum SLA just so it's a number, but it's a bare minimum. It's a knob to turn at least.
WIWIK: "good" not required, but "known" is

Tracing

Fanout -> causes a lot of perf problems: you always have to wait for the slowest call in the chain. If a service that takes 1ms on average takes a full second 1% of the time, then with fanout the chance of hitting a slow call grows quickly: with 100 calls it becomes 63%. Get tracing; it is the best way to understand fanout. If you can't get tracing, get logs.
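The 63% figure can be checked with a few lines, assuming the calls are independent and each one is slow 1% of the time (independence is an assumption of this sketch, not something the talk states):

```python
def p_any_slow(p_slow: float, fanout: int) -> float:
    """Probability that at least one of `fanout` independent calls is slow."""
    return 1.0 - (1.0 - p_slow) ** fanout


# 1% slow calls: the request is only as fast as its slowest dependency.
for n in (1, 10, 100):
    print(n, round(p_any_slow(0.01, n), 3))
# With fanout 100 this gives about 0.634, i.e. the 63% quoted above.
```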

RPC in a loop -> bulk resolve, don't fan out. But to find these things you need tracing. An ORM doing a ton of SQL queries -> a seemingly innocent traversal turns into 10,000 requests. You need tracing.
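A small sketch of the difference between an RPC in a loop and a bulk resolve. `FakeUserService` is a hypothetical in-memory stand-in that only counts round trips; a real service would pay network latency for each one.

```python
from typing import Dict, Iterable, List


class FakeUserService:
    """Stand-in for a remote service; counts round trips instead of doing I/O."""

    def __init__(self, users: Dict[int, str]) -> None:
        self._users = users
        self.round_trips = 0

    def get_user(self, user_id: int) -> str:
        self.round_trips += 1  # one network call per user
        return self._users[user_id]

    def get_users(self, user_ids: Iterable[int]) -> Dict[int, str]:
        self.round_trips += 1  # one network call for the whole batch
        return {uid: self._users[uid] for uid in user_ids}


def names_one_by_one(svc: FakeUserService, ids: List[int]) -> List[str]:
    # RPC in a loop: N ids cost N round trips.
    return [svc.get_user(i) for i in ids]


def names_bulk(svc: FakeUserService, ids: List[int]) -> List[str]:
    # Bulk resolve: N ids cost a single round trip.
    by_id = svc.get_users(ids)
    return [by_id[i] for i in ids]
```

Both functions return the same list; only the number of round trips differs, which is exactly the kind of thing that stays invisible without tracing.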

it's a lot of work to do tracing -> you might want to do sampling? Implementing this kind of thing requires cross-language context propagation, which is a lot of work. But do it; we regretted not doing it sooner.
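A rough sketch of what sampling plus context propagation could look like, assuming the trace context travels as request headers. The header names, the default sample rate, and the function name are invented for this example and are not what Uber used.

```python
import random
import uuid
from typing import Callable, Dict

TRACE_HEADER = "x-trace-id"          # hypothetical header name
SAMPLED_HEADER = "x-trace-sampled"   # hypothetical header name


def start_or_continue_trace(
    incoming: Dict[str, str],
    sample_rate: float = 0.01,
    rng: Callable[[], float] = random.random,
) -> Dict[str, str]:
    """Return the headers to attach to every outgoing call for this request."""
    if TRACE_HEADER in incoming:
        # Downstream hop: keep the ID and the sampling decision made upstream,
        # so a trace is either recorded on every hop or on none of them.
        return {
            TRACE_HEADER: incoming[TRACE_HEADER],
            SAMPLED_HEADER: incoming.get(SAMPLED_HEADER, "0"),
        }
    # Edge of the system: mint a new ID and decide once whether to sample.
    sampled = rng() < sample_rate
    return {
        TRACE_HEADER: uuid.uuid4().hex,
        SAMPLED_HEADER: "1" if sampled else "0",
    }
```

The hard part the notes refer to is that every language and every RPC client in the fleet has to forward these headers; one service that drops them cuts the trace in two.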

Logging

need consistent structured logging -> Everybody logs in different ways... so the best way around this is to provide tools that are so easy to use that nobody bothers to do it any other way.
multiple languages makes this hard
log flooding -> it isn't an issue until it is.
WIWIK: accounting -> people should have a notion of the costs of what they do; logging too much can be an issue.
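As an illustration of "consistent structured logging", here is a minimal sketch using only the Python standard library: every log line is a JSON object with the same base keys, and extra fields are passed as data instead of being formatted into the message. The `fields` convention and the logger name are assumptions of this example.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render every record as one JSON object with a fixed set of base keys."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        }
        # Structured fields arrive via `extra={"fields": {...}}`.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry, sort_keys=True)


def make_logger(name: str, stream) -> logging.Logger:
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


buf = io.StringIO()
log = make_logger("trips", buf)
log.info("trip completed", extra={"fields": {"trip_id": 42, "duration_ms": 1830}})
print(buf.getvalue(), end="")
```

If a helper like `make_logger` is the default in every service template, the easy path and the consistent path are the same path.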

zap (Uber's open-source structured logging library for Go) -> super fast

Load testing

need to test against production
without breaking metrics -> avoid people getting scared
preferably all the time -> some bugs only show up at peak load, so you want to be at peak all the time.
WIWIK: all systems need to handle "test" traffic
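One way to read "all systems need to handle test traffic", sketched under the assumption that each request carries a flag marking it as synthetic. The `is_test` field, the metric names, and the handler are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    rider_id: int
    is_test: bool = False  # hypothetical flag; must be forwarded on every hop


# Stand-in for a metrics client.
metrics: Counter = Counter()


def handle_trip_request(req: Request) -> str:
    # Test and real traffic go through the same code path, so load tests
    # exercise production behaviour; only the metric tag differs, so
    # business dashboards and alerts are not polluted by synthetic load.
    kind = "test" if req.is_test else "real"
    metrics[f"trip_requests.{kind}"] += 1
    return f"accepted:{req.rider_id}"
```

Keeping the two counters separate is what lets the load test run against production "without breaking metrics".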

Failure testing

Chaos Monkey (Netflix)
WIWIK: people won't like it -> some people hate it, especially if you need to add it later. People won't want to opt in.

Migrations

old stuff still has to work -> some people just work in legacy; from something that used to be old to something that is less old but still isn't shiny. Someone is always migrating something. Maintenance windows are not a thing anymore. It's always peak time somewhere.
what happened to immutable? -> occasionally you're going to need to make a cross-cutting change which touches some service that was last worked on 6 months ago. This is a tricky problem.
WIWIK: mandates are bad -> mandates to migrate are bad. Nobody wants to have to adopt some new system. Making somebody change just because the organization needs to change is bad. Carrots and sticks don't work... it has to be just carrots all the way. Any time the sticks come out... it's just bad.

Open source

build/buy tradeoff is hard
commoditization -> if what you're doing is good, somebody will release it as a service at some point.
WIWIK: this will make people sad -> some people get very invested in their work. It's not good news to hear that Amazon is releasing the same thing you built as a service. This is not obvious. Behind those text editors there are people.

Politics

services allow people to play politics
company > team > self -> politics happen whenever you violate this property. With high velocity there is a temptation to violate this property. When you want to ship, it is easy to not prioritise what is better for the company.

Tradeoffs

everything is a tradeoff
try to make them intentionally

manzanit0/uber-scaling-notes.md