Video: https://www.youtube.com/watch?v=kb-m2fasdDY
200 engineers -> 2000 engineers in a year and a half
Regarding microservice deployment:
- immutability
- append only
Uber is more reliable on weekends... because there are fewer changes. If it works, don't touch it.
Why services?
- releasing independently
- owning uptime
- using the "best" tool... (define best :)
About costs:
- distributed system
- RPC everywhere
- what if it breaks
Subtle costs:
- everything is a tradeoff -> You are ALWAYS giving things up, you might just not know yet which ones.
- you can build around problems
- trade complexity for politics -> Instead of having to have an awkward conversation with somebody, you can just write more software.
- you get to keep your biases -> if you don't agree with a team or a person... instead of agreeing, you just write more software.
Regarding languages:
Initially it was PHP, then Node.js and Python were introduced, and eventually everything is moving to Go, except data and maps, which are Java and Python.
- that's a lot of languages
- hard to share code
- hard to move between teams -> knowledge transfers poorly from one team to another.
- WIWIK (what I wish I knew): fragmented culture -> it is easy to end up with camps around the languages; this is inherent to human nature.
Regarding RPC:
- HTTP/REST -> at scale, with lots of people joining every week, HTTP's weaknesses start to show up: conventions around statuses, query strings, RESTfulness, etc. What you really want is simply to run a function on a server. Who cares about semantics at this point.
- JSON -> human readable. Nice. But... it is untyped. It becomes a mess in the future, not right away: somebody suddenly changes something, and two hops downstream a service was depending on the type, or on a subtle difference such as empty string vs null, or on some other language quirk.
- RPC is slow.
- WIWIK: servers are not browsers
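The empty-string-vs-null problem from the JSON bullet can be sketched as follows. This is a minimal illustration, not anything shown in the talk: the `Rider` message, the `promo_code` field, and both functions are hypothetical names. The naive check treats `""` and `null` differently, while the typed parser validates and normalises at the service boundary so the change cannot leak two hops downstream.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rider:
    # Hypothetical message shape, for illustration only.
    id: int
    promo_code: Optional[str]


def has_promo_naive(raw: str) -> bool:
    # Downstream code that silently assumes "no promo" is always null.
    return json.loads(raw)["promo_code"] is not None


def parse_rider(raw: str) -> Rider:
    # Validate types at the boundary instead of trusting the payload.
    data = json.loads(raw)
    if not isinstance(data.get("id"), int):
        raise TypeError("id must be an integer")
    promo = data.get("promo_code")
    if promo is not None and not isinstance(promo, str):
        raise TypeError("promo_code must be a string or null")
    # Normalise the two spellings of "absent" into one.
    return Rider(id=data["id"], promo_code=promo or None)


old_payload = '{"id": 1, "promo_code": null}'
new_payload = '{"id": 1, "promo_code": ""}'  # an upstream team changed this

print(has_promo_naive(old_payload), has_promo_naive(new_payload))
print(parse_rider(old_payload) == parse_rider(new_payload))
```

A typed interface definition language gives the same guarantee without hand-written checks; the hand-rolled parser here only keeps the sketch dependency-free.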
How many repos?
- Multiple repos are good for keeping stuff nicely separated, and even for open-sourcing parts of it.
- One repo is good for cross cutting changes
- Many is bad because it stresses out build system, navigation system, etc.
- One is bad because it gets so big that it is unmanageable.
At Uber they had 8000 repos?
Operational
- What happens when things break? -> Without getting into the downstream services issues, there are a lot of surprising people issues.
- Can other teams release your service? -> You own your uptime, but another team needs a fix. So can they release your service as long as the tests pass? Is your automation good enough that other teams can release it? Usually no. Usually you're blocked, end of story.
- Understanding a service in the larger context -> The system is one single machine... even though it's broken into smaller parts. So it's important that we still understand the system working as one.
Performance
If there are so many languages... it depends on each language's tools. In Go you have pprof, but not all runtimes have that. They ended up investing time in a common format for all runtimes -> flame graphs.
If tooling is so different, it's friction.
Everybody wants dashboards, but if it's not easy to spin them up, teams will end up creating their own. This means that one team's dashboard will be radically different from another's. Getting a standard dashboard with a set of agreed defaults out of the box for each service is invaluable. Now you can even look at other teams' dashboards without knowing the service and understand things.
But should you even care about performance?
- it doesn't matter until it matters. -> the mantra is that buying computers is cheaper than hiring engineers... but then one day you have a perf problem, and the mantra doesn't help.
- minimum simple perf requirements -> some minimum SLA just so it's a number, but it's a bare minimum. It's a knob to turn at least.
- WIWIK: "good" not required, but "known" is
Tracing
- Fanout -> causes a lot of perf problems, because you always have to wait for the slowest call in the chain. If a service that takes 1 ms on average takes a full second 1% of the time, then with fanout the chance of hitting a slow call grows quickly: with 100 calls it becomes 63%. Get tracing. It is the best way to understand fanout. If not tracing, get logs.
- RPC in a loop -> bulk resolve, don't fan out. But to find these things you need tracing.
- ORM does a ton of SQL queries -> a seemingly innocent traversal is 10,000 requests. You need tracing.
- it's a lot of work to do tracing -> you might want to do sampling? Implementing this kind of thing requires cross-language context propagation... which is a lot of work. But do it; we regretted not doing it sooner.
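The 63% figure in the fanout note follows from a short calculation. As a sketch, assuming each downstream call is slow independently with probability 1% (an assumption of this example, real services often fail together), the chance that at least one of n parallel calls is slow is 1 - 0.99^n. The function name `p_any_slow` is made up for illustration.

```python
def p_any_slow(p_slow: float, fanout: int) -> float:
    """Probability that at least one of `fanout` independent calls is slow."""
    return 1.0 - (1.0 - p_slow) ** fanout


# How the 1% tail of a single service dominates as fanout grows.
for n in (1, 10, 100):
    print(f"fanout={n:>3}: {p_any_slow(0.01, n):.1%} of requests hit a slow call")
```

At a fanout of 100 this gives roughly 63%, so the "rare" one-second case becomes the typical request latency.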
Logging
- need consistent structured logging -> Everybody logs in different ways... so the best way around this is provide tools that are so easy to use it's impossible to do it in a different way.
- multiple languages makes this hard
- log flooding -> it isn't an issue until it is.
- WIWIK: accounting -> people should have a notion of the costs of what they do; logging too much can be an issue.
- zap (Uber's structured logging library for Go) -> super fast
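One way to make consistent structured logging the path of least resistance is a shared helper that every service calls. The following is a minimal sketch using only Python's standard `logging` module; it is not zap and not Uber's tooling. `JsonFormatter`, `make_logger`, the `fields` key, and the `trip-service` name are all hypothetical.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render every record as one JSON object with a fixed set of keys."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        }
        # Structured fields arrive via logging's `extra` mechanism.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry, sort_keys=True)


def make_logger(service: str, stream) -> logging.Logger:
    # The one blessed way to get a logger: same format for every service.
    logger = logging.getLogger(service)
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


buf = io.StringIO()
log = make_logger("trip-service", buf)
log.info("trip started", extra={"fields": {"trip_id": 42, "city": "sf"}})
print(buf.getvalue(), end="")
```

Because every line is a JSON object with the same base keys, logs from different teams can be searched and aggregated with the same queries.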
Load testing
- need to test against production
- without breaking metrics -> avoid people getting scared
- preferably all the time -> some bugs just show in peaks, so you want to be in peak all the time.
- WIWIK: all systems need to handle "test" traffic
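Running load tests against production without breaking metrics implies that every service can recognise test traffic, account for it separately, and pass the marker on to its dependencies. The sketch below shows that idea under stated assumptions: the header name `x-test-traffic`, the `Metrics` class, and `handle_request` are invented for illustration and are not from the talk.

```python
from collections import Counter

TEST_HEADER = "x-test-traffic"  # hypothetical header name


class Metrics:
    """Toy metrics sink that keeps test and production counts apart."""

    def __init__(self) -> None:
        self.counts = Counter()

    def incr(self, name: str, is_test: bool) -> None:
        kind = "test" if is_test else "prod"
        self.counts[(name, kind)] += 1


def is_test_request(headers: dict) -> bool:
    return headers.get(TEST_HEADER, "").lower() == "true"


def handle_request(headers: dict, metrics: Metrics) -> dict:
    test = is_test_request(headers)
    # Tag the metric so load tests never distort production dashboards.
    metrics.incr("requests", test)
    # Return the headers to send downstream, propagating the test flag.
    return {TEST_HEADER: "true"} if test else {}


metrics = Metrics()
handle_request({}, metrics)
downstream = handle_request({TEST_HEADER: "true"}, metrics)
print(dict(metrics.counts), downstream)
```

The propagation step matters as much as the tagging: one service that drops the flag makes everything behind it count test load as real traffic.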
Failure testing
- chaos monkey (netflix)
- WIWIK: people won't like it -> some people hate it, especially if you need to add it later. People won't want to opt in.
Migrations
- old stuff still has to work -> some people work only on legacy: moving from something old to something that is less old but still isn't shiny. Someone is always migrating something. Maintenance windows are not a thing anymore. It's always peak time somewhere.
- what happened to immutable? -> occasionally you're going to need to make a cross-cutting change that touches some service that was last worked on 6 months ago. This is a tricky problem.
- WIWIK: mandates are bad -> mandates to migrate are bad. Nobody wants to have to adopt some new system. Making somebody change just because the organization needs to change is bad. Carrots and sticks don't work... it has to be just carrots all the way. Any time the sticks come out... it's just bad.
Open source
- build/buy tradeoff is hard
- commoditization -> if what you're doing is good, somebody will release it as a service at some point.
- WIWIK: this will make people sad -> some people get very invested in their work. it's not good news to hear that Amazon is releasing your same thing as a service. This is not obvious. Behind those text editors there are people.
Politics
- services allow people to play politics
- company > team > self -> politics happen whenever you violate this property. With high velocity there is a temptation to violate this property. When you want to ship, it is easy to not prioritise what is better for the company.
Tradeoffs
- everything is a tradeoff
- try to make them intentionally