Cloudflare, the CDN provider, suffered a massive outage today. Some of the world's most popular apps and web services were left inaccessible for several hours whilst the Cloudflare team scrambled to fix a whole swathe of the internet.
And that might be a good thing.
The proximate cause of the outage was pretty mundane: a bad config file triggered a latent bug in one of Cloudflare's services. The file was too large (details still hazy) and this led to a cascading failure across Cloudflare operations. There is probably a useful post-mortem to be had about canary releases and staged rollouts.
But the bigger problem, the ultimate cause, behind today's chaos is the creeping centralisation of the internet and a society that is sleepwalking into assuming the net is always on and always working.
It wasn't just "trivial" stuff like Twitter and League of Legends that was affected, either. A friend of mine remarked caustically about his experience this morning:
I couldn't get air for my tyres at two garages because of cloudflare going down. Bloody love the lack of resilience that goes into the design when the machine says "cash only" and there's no cash slot. So flat tires for everyone! Brilliant.
We are living in a society where every part of our lives is increasingly mediated through the internet: work, banking, retail, education, entertainment, dating, family, government ID and credit checks. And the internet is increasingly tied up in fewer and fewer points of failure.
It's ironic because the internet was actually designed for decentralisation, building on ideas for a network that governments could use to coordinate their response in the event of nuclear war. But due to the economics of the internet, and challenges like bots and scrapers, more and more web services are holed up in citadels like AWS or behind content distribution networks like Cloudflare.
Outages like today's are a good thing because they're a warning. They can force redundancy and resilience into systems. They can make the pillars of our society - governments, businesses, banks - provide reliable alternatives when things go wrong.
(Ideally ones that are completely offline)
You can draw a parallel to how COVID-19 shook up global supply chains: the logic up until 2020 was that you wanted your system to be as lean and efficient as possible, even if it meant relying totally on international supplies or keeping as little spare inventory as possible. After 2020, businesses realised they needed to diversify and build slack into the system to tolerate shocks.
In the same way that growing one kind of banana nearly resulted in bananas going extinct, we're drifting towards a society that can't survive without digital infrastructure, and a digital infrastructure that can't operate without two or three key players. One day there's going to be an outage, a bug, or a cyberattack from a hostile state that demonstrates how fragile that system is.
Embrace outages, and build redundancy.
The first problem is that there are more and more people on this planet, so you need more infra, more providers, bigger bandwidth, and so on. The second problem is the sheer number of AI models draining a sh*t ton of resources, plus botted comments, accounts, etc., and it's getting even worse. More of the internet is now botted than organic; it was already an issue, but now it's an issue on 'roids. The third problem, which I nearly forgot to mention, is companies that used to be ISPs or plain telcos building data centers instead and hosting them to train models.
So, my suggestion is to push back against this internet monopolization and advocate more for self-hosting. In case of an outage there should also be multiple servers backing each other up, like a circuit breaker: when one server dies, another is activated while the first is reset. In the meantime, people can be shown static content or whatever was last cached on the server, so it doesn't just show the error screen, which also reduces the server load. With so many of these data centers caching information, and LLMs you can run locally even offline, I think this could be a viable solution.
So, if anyone proposes an idea, I would love to help and take part. Currently I want to become a Systems Architect, but for starters I want to be an SRE or a sysadmin and work more on open-source projects. Thanks.
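The failover idea in the comment above (skip a server that keeps failing, try a backup, and show the last cached copy rather than an error screen) can be sketched roughly as follows. This is only a minimal illustration, not how Cloudflare or any real CDN works: the names `Backend` and `ResilientFetcher`, the failure threshold, and the cooldown are all hypothetical, and each "server" is just a function that returns a page or raises.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple


class Backend:
    """One server, guarded by a simple circuit breaker (hypothetical sketch)."""

    def __init__(self, name: str, fetch: Callable[[str], str],
                 max_failures: int = 3, cooldown: float = 30.0) -> None:
        self.name = name
        self.fetch = fetch                # function that returns a page or raises
        self.max_failures = max_failures  # consecutive failures before tripping
        self.cooldown = cooldown          # seconds to wait before retrying
        self.failures = 0
        self.opened_at: Optional[float] = None  # when the breaker tripped

    def available(self, now: float) -> bool:
        # Closed breaker: always usable. Open breaker: usable again only
        # once the cooldown has passed (a single trial request).
        if self.opened_at is None:
            return True
        return now - self.opened_at >= self.cooldown

    def call(self, path: str, now: float) -> str:
        try:
            body = self.fetch(path)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = now  # trip (or re-trip) the breaker
            raise
        # Success resets the breaker.
        self.failures = 0
        self.opened_at = None
        return body


class ResilientFetcher:
    """Tries backends in order and falls back to the last cached copy."""

    def __init__(self, backends: List[Backend],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.backends = backends
        self.clock = clock
        self.cache: Dict[str, str] = {}  # last good response per path

    def get(self, path: str) -> Tuple[str, str]:
        """Return (body, source); source is a backend name or 'stale-cache'."""
        for backend in self.backends:
            now = self.clock()
            if not backend.available(now):
                continue  # breaker is open, skip without hitting the server
            try:
                body = backend.call(path, now)
            except Exception:
                continue  # this backend failed, try the next one
            self.cache[path] = body
            return body, backend.name
        if path in self.cache:
            # Everything is down: serve the last copy instead of an error page.
            return self.cache[path], "stale-cache"
        raise RuntimeError(f"no backend or cached copy available for {path}")
```

To use it, you would build `ResilientFetcher([Backend("primary", fetch_a), Backend("secondary", fetch_b)])`, where `fetch_a` and `fetch_b` are whatever functions actually fetch the page, and call `get("/some/path")`. Skipping a tripped backend is also what takes load off a struggling server, which is the other benefit the comment mentions. Note that the backup only helps if it does not share the failed provider, which is the whole point of the post.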