This is why I'm grumpy and sulking in the (metaverse) corner:
- Businesses solve problems via solutions they sell on the open market
- When it's a software based solution, it's really easy to get things right
- Everything is simple, however everyone overcomplicates it
- It's entirely possible businesses choose certain technology stacks because they feel that by not choosing them they're unable to attract the latest and greatest talent
- And the labour market also feels that by not choosing to learn and understand the latest and greatest technologies, they become unhirable
- An insidious cycle is produced that means everyone is building solutions, and everyone is learning how to build them, based purely on a fear of missing out (FOMO), and not on the idea that less is more.
So I want to talk about this, but to be clear - I'm an operations guy. I've done software engineering too, and I see the decline there as well, so this will mostly be about operations, but with a bit of software in there too.
This entire piece is broken down into the problem I think exists, what I believe my solution is, and then a big breakdown of some rules (broken into specific problem domains) I try to follow when innovating a new solution for a client.
Just a quick warning: I do absolutely think Docker and Kubernetes (k8s) are simply not required at all. In most cases they're never needed. If you're sick of anti-Docker, anti-k8s content, then feel free to look away.
"Solve the problem with as few moving parts as possible. In fact, engineer and build things that junior engineers can understand, maintain and scale." - Me.
Look at this coffee machine:
The "Over-Kill T-1000" Coffee Machine
It's a tablet that you interact with to get a machine to make you coffee. It breaks down a lot, and when it does, you have no other choice but to find a working machine or go buy a coffee.
The same applies to those Zip tap things. You push a button and get boiling water. The other button gives you chilled water. My wife was telling me how the one at her place of work broke. Because they're a proprietary product, it took three months to get it fixed. Luckily, "[a colleague] pulled a kettle from the cupboard and we could boil water again"... boiling water is now locked behind proprietary, complex boilers and taps so that it's slightly more convenient and looks neat (and maybe they're slightly safer, too?)
I recently added smart LEDs to my home and now use Siri to control them. It's great when it works, but when it doesn't, it's a pain in the arse. I've overcomplicated the concept of turning on lights and dimming them.
We're doing the same with our operational practices...
- We're losing sight of the fundamentals, and eventually no one will know how anything works except a few massive corporations who own all the resources and, most importantly, the highly proficient (deep) knowledge workers
- Eventually, development and operational departments will simply be consuming their APIs and they will know nothing else
- Docker has made ops lazy, and Kubernetes is just even worse
- Abstractions upon abstractions, when the OS has been there for decades and has been very easy to maintain and secure.
- And operating systems today, especially with systemd (love it or hate it - I used to hate it), are very capable of creating security boundaries between applications
- You can also use systemd with plain, non-privileged user accounts to enable, start, and operate services without ever touching the root user
- Ansible (and other CAC tools) make configuring the OS simple - really simple
- Packer makes baking images really, really simple, for every Cloud provider you could care about
- AppArmor and SELinux are simple - you just have to know how and learn the tools
- Ultimately, it's extremely easy to take code written in any language, using any set of libs/dependencies, frameworks, etc., and get it onto a disk and served on the network in a few steps. You don't need containers; you don't need K8s
And some very well respected experts in our field have some interesting things to say about software engineering today, which now extends into operations, too:
- "Preventing the Collapse of Civilization / Jonathan Blow (Thekla, Inc)": https://www.youtube.com/watch?v=ZSRHeXYDLko
- "The Thirty Million Line Problem": https://www.youtube.com/watch?v=kZRE7HIO3vk
- My favourite part is about how it shouldn't take 56,000,000 lines of code to deliver a text file to an end user
- And both Jonathan and Casey aren't even addressing the same problem that we're seeing in operations... that's just the software development cycle. It's getting bad in ops, too.
- We've forgotten how to choose boring software stacks: https://mcfunley.com/choose-boring-technology
This is a solution to the operations (ops) problem, and a bit of the software problem. You can pick and choose what works for you, from zero to everything. It's up to you. But as a contractor I move around a lot, so I believe I've experienced enough situations, problem domains both small and large, and solutions to all of this to confidently say: we've gone too far, made things too complicated, and the benefits are bloat and abstractions that are making people dumb and our future generations completely ignorant.
Some quotes from Hacker News that I liked on this subject:
"Simple solutions are also easier to refactor into more complex solutions when the complexity becomes necessary. Going the other way is much harder." - https://news.ycombinator.com/item?id=30770305
"When I was younger, I always thought the old guys pushing boring solutions just didn't want to learn new things. Now I'm starting to realize that after several decades of experience, they simply got burned enough times to learn a thing or two [and] had developed much better BS-detectors than 20-something me." - https://news.ycombinator.com/item?id=30769004
So my thoughts on this are:
- Serverless is likely a good option for your (probably) web based application or service...
And yes, I know, Serverless is likely being made available via Docker and probably Kubernetes, but that's fine as you've got literally nothing to manage but the code and the account/bill with the vendor. Or put another way - it's perfectly acceptable to pay a barista to make you a coffee instead of making it yourself at home, but if you do choose to make your own at home, you don't need a $15,000 industrial coffee making setup, you need a decent Moka Pot that will cost you sub $50 and some coffee grinds.
- PaaS is also the next best bet, downwards from Serverless. From there, it's IaaS, not managed K8s.
- But assuming Serverless and PaaS are not an option...
- Avoid Docker and Kubernetes. They're just tools, sure, but they're also overcomplicated abstractions that give you the false impression that you've done something you wouldn't otherwise have been able to do (run some software)
"You cannot operate at scale without Docker or K8s. It's impossible!"
This is simply false...
"My company runs without containers. We process petabytes of data monthly, thousands of CPU cores, hundreds of different types of data pipelines running continously, etc etc. Definitely a distributed system with lots of applications and databases.
We use Nix for reproducible builds and deployments. Containers only give reproducible deployments, not builds, so they would be a step down. The reason that's important is that it frees us from troubleshooting "works on my machine" issues, or from someone pushing an update somewhere and breaking our build. That's not important to everyone if they have few dependencies that don't change often, but for an internet company, the trend is accelerating towards bigger and more complex dependency graphs.
Kubernetes has mostly focused on stateless applications so far. That's the easy part! The hard part is managing databases. We don't use Kubernetes, but there's little attraction because it would be addressing something that's already effortless for us to manage.
What works for us is to do the simplest thing that works, then iterate. I remember being really intimidated about all the big data technologies coming out a decade ago, thinking they are so complex that they must know what they're doing! But I'd so often dive in to understand the details and be disillusioned about how much complexity there is for relatively little benefit. I was in a sort of paralysis of what we'd do after we outgrew postgresql, and never found a good answer. Here we are years later, with a dozen+ postgresql databases, some measuring up to 30 terabytes each, and it's still the best solution for us.
Perhaps I've read too far into the intent of the question, but maybe you can afford to drop the research project into containers and kubernetes, and do something simple that works for now, and get back to focusing on product?"
(https://news.ycombinator.com/item?id=30768244.)
And this:
"I worked at WhatsApp, prior to moving to Facebook infra, we had some jails for specific things, but mostly ran without containers.
Stack looked like:
FreeBSD on bare metal servers (host service provided a base image, our shell script would fetch source, apply patches, install a small handful of dependencies, make world, manage system users, etc)
OTP/BEAM (Erlang) installed via rsync etc from build machine
Application code rsynced and started via Makefile scripts
Not a whole lot else. Lighttpd and php for www. Jails for stud (a tls terminator, popular fork is called hitch) and ffmpeg (until end to end encrypted media made server transcoding unpossible).
No virtualized servers (I ran a freebsd vm on my laptop for dev work, though).
When WA moved to Facebook infra, it made sense to use their deployment methodology for the base system (Linux containers), for organizational reasons. There was no consideration for which methodology was technically superior; both are sufficient, but running a very different methodology inside a system that was designed for everyone to use one methodology is a recipie for operational headaches and difficulty getting things diagnosed and fixed as it's so tempting to jump to the conclusion that any problem found on a different setup is because of the difference and not a latent problem. We had enough differences without requiring a different OS."
(https://news.ycombinator.com/item?id=30768485.)
But I will say this is a very interesting comment on the above:
"Containers are just tar.gz files, you know? The whole layers thing it’s just an optimization. You can actually very simply run those tar.gz files without docker involved, just cgroups. But then you’ll have to write some daemon scripts to start, stop, restart, etc
Follow this path and soon you’ll have a (worst) custom docker. Try to create a network out of those containers and soon a (worst) SDN network appears.
Try to expand that to optimal node usage and soon a (worst) Kubernetes appears.
My point here is: it’s just software packaged with their dependencies. The rest seems inconsequential, but it’s actually the hard part."
(https://news.ycombinator.com/item?id=30768921.)
The above is clearly indicative of the fact Docker, and Kubernetes, are what you'd end up with if you tried to use the Linux kernel primitives they're utilising... but...
The fact others are running complex, distributed software solutions at petabyte scale with thousands of CPU cores tells me that Docker and Kubernetes are not a requirement.
You can skip all of that and you're still able to:
- Run the software, probably in a virtual environment
- Use the networking stack the OS provides directly
- Use systemd functionality to lock down the process
- Use systemd as a non-privileged user to run everything, without ever needing root/sudo
- Use AppArmor or SELinux to really lock down the system
- CIS harden the base image everything is using
- And still end up with the same results... a running process.
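To make the systemd points above a little more concrete, here is a minimal sketch in Python that renders a locked-down service unit as text. It is an illustration, not a recommended baseline: the function name `render_unit`, the `myapp` account, and the paths are all hypothetical, and the unit is written as a system-level service that drops to a non-privileged account via `User=`. The sandboxing directives are standard systemd options, but some of them behave differently (or not at all) under `systemctl --user`, so treat the list as a starting point to test on your own distribution.

```python
def render_unit(description: str, exec_start: str, user: str, writable_dir: str) -> str:
    """Return the text of a systemd service unit with common sandboxing directives.

    All arguments are placeholders chosen for this example.
    """
    if user == "root":
        raise ValueError("the service should run as a non-privileged account")

    lines = [
        "[Unit]",
        f"Description={description}",
        "After=network-online.target",
        "Wants=network-online.target",
        "",
        "[Service]",
        f"User={user}",                    # run as a plain account, not root
        f"ExecStart={exec_start}",
        "Restart=on-failure",
        "NoNewPrivileges=yes",             # block privilege escalation via setuid etc.
        "ProtectSystem=strict",            # mount the OS file hierarchy read-only
        "ProtectHome=yes",                 # hide /home, /root and /run/user
        "PrivateTmp=yes",                  # give the service its own /tmp
        "PrivateDevices=yes",              # no access to physical devices
        f"ReadWritePaths={writable_dir}",  # the one place the service may write
        "CapabilityBoundingSet=",          # empty value drops all capabilities
        "",
        "[Install]",
        "WantedBy=multi-user.target",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical application living in a Python virtual environment.
    print(render_unit(
        description="My app",
        exec_start="/opt/myapp/venv/bin/python -m myapp",
        user="myapp",
        writable_dir="/var/lib/myapp",
    ))
```

The output is an ordinary text file, which is the point: the security boundary lives in a unit a junior engineer can read top to bottom.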
Now I want to break down the Simple First Engineering (SFE) rules that have been written to help businesses decide on their technology choices.
- Pick a single language and use it across the entire organisation
- Pick a single framework and use it across the entire organisation
- Pick a single CI/CD stack and use it across the entire organisation
- Pick a particular testing methodology and use it across the entire organisation
- Pick a particular GitOps methodology and use it across the entire organisation
- Use Linux on workstations. Windows is not a usable operating system; macOS is slow and getting slower
- Use OKRs
- Use DevOps methodologies to deliver OKRs
- Find an Agile workflow that works and stick to it
- And improve it over time, tweaking as need be
- Use static documentation - do not use a wiki or Confluence
- Use MkDocs
- Keep the docs inside the repository
- Ask for as little data as you can from customers
- Transmit that data across as few wires/networks as you can
- Process it via as few lines of code as you can
- Security scan, secure, and protect as many of those lines of code as you can
- Regression test as many of those lines of code as you can
- Store the (original or transmuted) data in as few places as you can
- Get the (original or transmuted) data to that storage via as few networks as you can
- Create as few copies of the (original or transmuted) data as you can
- Create encrypted backups of that (original or transmuted) data...
- 3 in total;
- at least 2 across different storage mediums
- with one being remote/off-network
- Regularly check the data can be retrieved and utilised as securely as possible, whenever you can
- Use as little (original or transmuted) data as possible for the process/person that needs the data
- Delete as much (original or transmuted) data as you can, as soon as you can
- Audit every single interaction with the (original or transmuted) data
- Audit every single interaction with the code that captures, processes or otherwise interacts with the (original or transmuted) data
- Write software that is Cloud aware and can write logs and metrics to S3 directly
- Write software that executes as close to the hardware as possible
- This generally means Go, Rust, ...
- But Python is awesome too
- Pick a particular Cloud provider and stick with it
- Avoid multi Cloud and its false promises
- Use Infrastructure As Code
- Specifically, Terraform
- Avoid CDKs like Terraform CDK, AWS CDK, and Pulumi
- Use Configuration As Code
- Specifically, Ansible in pull mode
- Avoid creating complex clustered solutions like Salt, Puppet, Chef
- Use Images As Code
- Specifically, Packer
- Maintain baseline images
- Only use Linux, not Windows Server
- This pairs well with your use of Linux workstations
- Use as few abstractions as possible to deliver software to end users
- Kubernetes is an abstraction. It is too complicated for what it (eventually) does: runs a process
- Docker is an overly complex abstraction. You don't need it, except in a few cases
- If you're running Linux locally, it's likely not needed at all, but it's a good tool for running esoteric services like databases, etc.
- Learn to use the operating system
- Allow as few writes to operating system filesystems as possible
- Make the file system immutable, if possible
- Allow as few permissions to manipulate the operating systems as possible
- Make the entire system immutable, if possible
- Produce immutable artifacts and deploy those
- Build an image and deploy it
- Roll forward whenever possible
- Logs and metrics should never touch the local filesystem
- Logs and metrics should cross as few networks as possible
- Logs and metrics should be stored in "dumb" storage first, in their raw state
- S3, for example
- Logs and metrics should be ingested from "dumb" storage, analysed, and acted upon
- Such as processing the event and alerting when an error is found
- Use life-cycle policies to migrate old logs and metrics to cheaper storage options
- Do not try to maintain live clusters that try to keep 100s of GBs of logs indexed and searchable
- Build JIT solutions for looking at logs as and when you need them
- Do not use Kubernetes until scaling becomes a pain point
- Do not use Docker to package and ship software if you're not using Kubernetes
- Do use SAST and DAST tools to scan all code, always
- Use Open Policy Agent (OPA) wherever you can, for everything you can
- Do shift Terraform, Ansible, and other operational tools into the CI/CD pipeline
- Do not allow engineers to run Terraform, Ansible, and other operational tools from their local CLIs
- The only case this is permissible is when you're setting everything up for the first time
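Some of these rules are simple enough to check mechanically. As one example, the backup rule above (encrypted backups, 3 in total, at least 2 across different storage mediums, one remote/off-network) could be sketched as a small check in Python. Everything here is an assumption for illustration: the `BackupCopy` type, the `check_backup_plan` function, and the medium labels are made up, and "3 in total" is read as exactly three copies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    name: str        # hypothetical label for the copy
    medium: str      # e.g. "disk", "object-storage", "tape"
    offsite: bool    # True if the copy is remote/off-network
    encrypted: bool  # True if the copy is encrypted at rest


def check_backup_plan(copies: list[BackupCopy]) -> list[str]:
    """Return a list of rule violations; an empty list means the plan passes."""
    problems = []

    # "3 in total" is interpreted here as exactly three copies.
    if len(copies) != 3:
        problems.append(f"expected 3 backups in total, found {len(copies)}")

    # At least 2 different storage mediums.
    media = {copy.medium for copy in copies}
    if len(media) < 2:
        problems.append(f"expected at least 2 storage mediums, found {len(media)}")

    # At least one copy must be remote/off-network.
    if not any(copy.offsite for copy in copies):
        problems.append("no remote/off-network copy")

    # Every copy must be encrypted.
    unencrypted = [copy.name for copy in copies if not copy.encrypted]
    if unencrypted:
        problems.append("unencrypted copies: " + ", ".join(unencrypted))

    return problems


if __name__ == "__main__":
    plan = [
        BackupCopy("local-snapshot", "disk", offsite=False, encrypted=True),
        BackupCopy("nas-copy", "disk", offsite=False, encrypted=True),
        BackupCopy("bucket-copy", "object-storage", offsite=True, encrypted=True),
    ]
    for problem in check_backup_plan(plan) or ["plan passes"]:
        print(problem)
```

A check like this could run in the CI/CD pipeline alongside the other operational tooling, so a plan that drifts from the rule fails a build rather than being discovered during a restore.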
...