Created
November 21, 2015 16:15
Google: The motivation for a monolithic codebase
Rachel Potvin - Engineering Manager
Started her career in the video game industry.
The company worked on multiple games at once
One game per repo
One copy of the game engine in each repo, and the copies diverged
Features would be wanted across the diverged game engines, and merge conflicts ensued.
One giant shared codebase
Google repository scale
1 billion files (generated source, config files, data files, documentation; includes copies in release branches, etc.)
9 million source files
2 billion lines of code
35 million commits
86 terabytes of content
45k commits/day
Repository Usage
25 thousand Googlers in dozens of offices around the world
15 thousand commits by humans
30 thousand commits by automated systems
800k QPS of reads at daily peak, 500k on average
Mostly from distributed build/test tools. See Bazel.io for a subset.
Perspective
Linux Kernel repo
15 million LOC
40,000 files
Google repo
15 million lines in 250k files changed per week by humans
2 billion LOC, 9 million files
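To put the figures above side by side, here is a quick calculation that uses only the numbers quoted in these notes. The variable names are just for illustration, and it assumes the human and automated commit counts are daily figures, which is consistent with the 45k commits/day total.

```python
# Figures as quoted in the notes above.
linux_loc = 15_000_000
linux_files = 40_000
google_loc = 2_000_000_000
google_files = 9_000_000
human_commits_per_day = 15_000       # assumed to be a daily figure
automated_commits_per_day = 30_000   # assumed to be a daily figure

loc_ratio = google_loc / linux_loc
file_ratio = google_files / linux_files
total_commits_per_day = human_commits_per_day + automated_commits_per_day

print(f"Google repo vs Linux kernel, lines of code: about {loc_ratio:.0f}x")
print(f"Google repo vs Linux kernel, source files: about {file_ratio:.0f}x")
print(f"Total commits per day: {total_commits_per_day}")
```

Note that the 15 million lines changed per week by humans is the same size as the whole Linux kernel figure quoted above.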
Google systems and workflows
Sync user workspace to repo (basically a fork)
Write code
Code review
Commit
All code is reviewed before commit by humans and tooling
Each directory has a set of owners who must approve changes to their area of the repo
Tests and automated checks are performed before and after commit
Auto-rollback of a commit may occur in the case of widespread breakage
Google has a tree structure showing the owners who must give approval before a commit passes.
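The per-directory ownership rule can be sketched roughly as follows. This is a minimal illustration, not Google's implementation: the `OWNERS` table, the names in it, and the assumption that an owner of a parent directory may also approve changes below it are all hypothetical.

```python
from pathlib import PurePosixPath

# Hypothetical ownership table: directory -> people allowed to approve changes there.
OWNERS = {
    "search": {"alice"},
    "search/ranking": {"bob"},
    "ads": {"carol"},
}

def owners_for(path: str) -> set[str]:
    """Collect owners of the file's directory and of all its ancestors (assumed rule)."""
    found: set[str] = set()
    for parent in PurePosixPath(path).parents:
        found |= OWNERS.get(str(parent), set())
    return found

def is_approved(changed_files: list[str], approvers: set[str]) -> bool:
    """Every changed file needs at least one approver who owns its area."""
    return all(owners_for(f) & approvers for f in changed_files)
```

For example, a change touching `search/ranking/score.py` and `ads/bid.py` passes with approvals from `bob` and `carol`, but not with `bob` alone, because nobody who owns `ads` has signed off.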
Piper
Stores a single, large repo
Implemented on top of standard Google infrastructure, replicated across 10 data centers worldwide
CitC
Cloud-based storage backend and a local file system view
Users see local changes overlaid on top of the full Piper source tree
Users can navigate and edit files across the entire codebase
Supports regular tooling on local machines, since (it sounds like) it is essentially NFS.
All writes are saved as CitC snapshots to make rollbacks easy; tooling works from snapshots.
Only modified files are stored in the user's workspace, but CitC allows you to see the entire codebase seamlessly
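The overlay idea described above can be modeled in a few lines. This is a toy sketch, not how CitC is actually built: the `Workspace` class and its methods are invented for illustration, and a plain dictionary stands in for the full source tree.

```python
class Workspace:
    """Toy overlay workspace: local edits on top of a read-only base tree."""

    def __init__(self, base: dict[str, str]):
        self._base = base        # stands in for the full repository at some revision
        self._modified = {}      # only files the user has touched are stored here
        self._snapshots = []     # one saved copy of the local edits per write

    def read(self, path: str) -> str:
        # Local changes win; everything else falls through to the base tree.
        if path in self._modified:
            return self._modified[path]
        return self._base[path]

    def write(self, path: str, content: str) -> int:
        self._modified[path] = content
        self._snapshots.append(dict(self._modified))
        return len(self._snapshots) - 1   # snapshot id

    def restore(self, snapshot_id: int) -> None:
        # Roll the local edits back to an earlier snapshot.
        self._modified = dict(self._snapshots[snapshot_id])

    def modified_files(self) -> list[str]:
        return sorted(self._modified)
```

Reads of untouched files never copy anything into the workspace, which is the property that lets a user browse the whole codebase while storing only their own changes.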
Tools
Critique
Code review
CodeSearch
Code browsing, exploration, understanding and archeology
Tricorder
Static analysis of code, surfaced in Critique and CodeSearch. Code quality, test results, etc.
Can offer suggestions for fixes to common errors with one-click acceptance
Presubmits
Customizable checks and testing; can block a commit
TAP
Comprehensive testing before and after commit, auto-rollback
Allows teams to defend code against breaking changes from others
Rosie
Large-scale change distribution and management
After teams make changes, tests happen and Rosie automatically submits a PR equivalent
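As a rough picture of what "customizable checks that can block a commit" means, here is a small sketch. The check functions and the `run_presubmits` helper are hypothetical examples, not Google's presubmit API.

```python
from typing import Callable, Optional

# A check receives {path: new content} and returns an error message, or None if it passes.
Check = Callable[[dict[str, str]], Optional[str]]

def no_trailing_whitespace(files: dict[str, str]) -> Optional[str]:
    for path, content in files.items():
        for number, line in enumerate(content.splitlines(), start=1):
            if line != line.rstrip():
                return f"{path}:{number}: trailing whitespace"
    return None

def no_todo_without_owner(files: dict[str, str]) -> Optional[str]:
    for path, content in files.items():
        if "TODO:" in content:
            return f"{path}: TODO needs an owner, e.g. TODO(name)"
    return None

def run_presubmits(files: dict[str, str], checks: list[Check]) -> list[str]:
    """Run every check; the commit is blocked if the returned list is non-empty."""
    return [msg for check in checks if (msg := check(files)) is not None]
```

Each team could register its own list of checks for its directories, and a non-empty result would stop the commit until the author fixes the reported problems.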
Google does trunk-based development
Combined with a centralized repo, this defines the monolithic model
Piper users work at “head”, a consistent view of the codebase.
Commits are immediately visible and usable by other engineers
Branching is incredibly rare
Avoids painful merges
Branches are used for releases - a snapshot of trunk with optional cherry-picks.
Simple conditionals can mean different versions of code are executed in production.
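The "simple conditionals" point is essentially feature flagging: both code paths live at head and a runtime setting picks one. A minimal sketch, where the flag name and the two ranking functions are made up for illustration:

```python
# Hypothetical flag store; in practice this would come from configuration, not source code.
FLAGS = {"use_new_ranker": False}

def rank_old(results):
    return sorted(results)

def rank_new(results):
    return sorted(results, reverse=True)

def rank(results, flags=FLAGS):
    # Both code paths live at head; the flag decides which one runs.
    if flags.get("use_new_ranker", False):
        return rank_new(results)
    return rank_old(results)
```

This is what lets new behavior be committed to trunk without a long-lived branch: the new path ships dark and is switched on later.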
Advantages of a monolithic repository
Unified versioning
Single source of truth
No confusion about which is the authoritative version of a file
No forking of shared libraries
No painful cross-repo merging of copied code
No artificial boundaries between teams/projects
Supports gradual refactoring and reorganization of the codebase
Changes to base libraries are instantly propagated through the dependency chain, greatly simplifying dependency management
No broken dependencies downstream (e.g. if D depends on B and C, which both depend on A, and they require different versions of A)
Entire history of the project remains intact and browsable
Extensive code sharing and reuse
Simplified dependency management
Atomic changes
Make large, backwards-incompatible changes easily
Change hundreds/thousands of files in a single consistent operation
Rename a class or function in a single commit, with no broken builds or tests
Large-scale refactoring
Codebase modernization
Single view of the codebase facilitates clean-up and modernization efforts
Can be centrally managed by dedicated specialists
e.g. updating the codebase to make use of C++11 features
Monolithic codebase captures all dependency information
Old APIs can be removed with confidence
Software errors or design mistakes can be found and fixed across the entire codebase, and coupled with new compiler warnings or presubmit checks
Collaboration across teams
Flexible team boundaries and code ownership
Code visibility and clear structure providing implicit team namespacing
Easier to reason about relationships between code
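The D/B/C/A example in the unified versioning notes is the classic diamond dependency problem. The sketch below, with invented package names and version strings, shows how the conflict appears when each package pins its own version and disappears when everything builds at head.

```python
def conflicting_versions(pins: dict[str, dict[str, str]]) -> dict[str, set[str]]:
    """Return libraries that different packages pin to different versions.

    `pins` maps a package to the {library: version} it was built against.
    """
    seen: dict[str, set[str]] = {}
    for deps in pins.values():
        for library, version in deps.items():
            seen.setdefault(library, set()).add(version)
    return {lib: versions for lib, versions in seen.items() if len(versions) > 1}

# Multi-repo world (hypothetical): B and C each pin their own copy of A, and D needs both.
multi_repo = {"B": {"A": "1.0"}, "C": {"A": "2.0"}, "D": {"B": "1.0", "C": "1.0"}}

# Single repo: everything builds against whatever is at head.
single_repo = {"B": {"A": "head"}, "C": {"A": "head"}, "D": {"B": "head", "C": "head"}}
```

Calling `conflicting_versions(multi_repo)` reports `A` with two versions, while `conflicting_versions(single_repo)` returns an empty result because there is only one version of anything to depend on.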
Costs associated with this model
Tooling investments are valuable but can be costly
Development to scale tools
Cost of executing computationally intensive tools (e.g. builds)
Codebase complexity is a risk to productivity
Encourages tons of sharing and reuse
Very easy to add dependencies
Unnecessary dependencies increase:
exposure to build breakage
binary sizes
costs for building/testing and maintenance
Code health must be a priority
Tools have been built to:
Find and remove unused/underused dependencies and dead code
Support large-scale clean-ups and refactoring
Google introduced API visibility, with the default set to “private”
APIs must explicitly be marked as appropriate for use by others
APIs can be marked as deprecated
Lesson learned: Add these early to encourage sane/hygienic dependency structures.
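A private-by-default visibility rule can be sketched like this. The `Target` structure and `check_deps` function are hypothetical and only illustrate the idea from the notes: a dependency is rejected unless the owning side has explicitly opened it up, and deprecated APIs are flagged.

```python
from dataclasses import dataclass, field

@dataclass
class Target:
    name: str
    package: str
    deps: list[str] = field(default_factory=list)
    # Empty means private: only targets in the same package may depend on this one.
    visible_to: set[str] = field(default_factory=set)
    deprecated: bool = False

def check_deps(targets: dict[str, Target]) -> list[str]:
    """Return human-readable problems for dependencies that break the rules."""
    problems = []
    for target in targets.values():
        for dep_name in target.deps:
            dep = targets[dep_name]
            same_package = dep.package == target.package
            if not same_package and target.package not in dep.visible_to:
                problems.append(f"{target.name} may not depend on {dep.name} (not visible)")
            elif dep.deprecated:
                problems.append(f"{target.name} depends on deprecated {dep.name}")
    return problems
```

Making private the default means every cross-package dependency is a deliberate decision by the API owner, which is the hygiene the lesson learned above is about.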
Conclusions
Monolithic codebase != monolithic software design
Monolithic model of source management works well when coupled with an engineering culture of transparency and collaboration
Google has invested heavily in scalability and productivity tooling to support this model, due to the significant advantages it provides
This may or may not be the right approach for all companies
Google has shown this model can scale to a repo with 1 billion files and 35 million commits, and thousands of users around the globe