Monday’s paralysis, which laid low global giants from Zoom to Slack to Fortnite, landed like a body blow to the IT industry. In the age of the ‘decentralised’, infinitely scalable cloud, how can a problem in a single physical location in Virginia (the famous AWS US-EAST-1 region) take down half the internet? After all, we left our bursting-at-the-seams basement server rooms precisely to avoid such scenarios.
And yet here we were, back to square one. Staring at red dashboards, we had to face a brutal truth: this failure is not an argument against the cloud as a concept. It is a painful lesson in economics, and a confirmation of the compromises the industry accepts every day when it chooses convenience and low cost and confuses flexibility with immortality.
The reality of centralisation
For years we have been sold the promise of the public cloud. You pay only for what you use. You scale your infrastructure in minutes, not months. You don’t worry about power, cooling or replacing drives. It is a genuine revolution, and it is what made startups like Snapchat and Canva possible.
However, this revolution comes at a price: dependence. Today’s internet, contrary to idealistic visions, is not a decentralised utopia. It is an oligopoly in which three companies – Amazon Web Services, Microsoft Azure and Google Cloud – hold the keys to almost the entire digital economy.
The failure in the US-EAST-1 region is perfect proof of this. It is AWS’ oldest, largest and often default region. Over the years it has become a technological centre of gravity, hosting key services on which Amazon’s other global functions depend. In this way, even within a single-provider ecosystem, we have created a single point of failure (SPOF) for services that consider themselves ‘global’. When that one domino fell, it pulled down all the rest.
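One concrete way this centre of gravity shows up in practice: some AWS capabilities are effectively anchored to us-east-1 no matter where an application nominally runs. A minimal sketch (the domain is hypothetical), relying on the well-known requirement that certificates used by CloudFront must be issued in us-east-1:

```python
# Sketch: a service "hosted" in Europe with a quiet dependency on Virginia.
import boto3

# The application's own data lives in eu-west-1...
dynamodb_eu = boto3.resource("dynamodb", region_name="eu-west-1")

# ...but the TLS certificate for its CloudFront distribution has to be
# requested from ACM in us-east-1, so a Virginia outage can still hurt.
acm_virginia = boto3.client("acm", region_name="us-east-1")
response = acm_virginia.request_certificate(
    DomainName="app.example.com",   # hypothetical domain
    ValidationMethod="DNS",
)
print(response["CertificateArn"])
```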
The myth of “sufficient backup”
Many IT managers and CFOs, seeing the scale of the problem, have probably breathed a sigh of relief: ‘at least we have backups’. That reaction rests on a fundamental misunderstanding, and it needs to be corrected clearly.
Backup protects against data loss. It does not protect against service failure.
Monday’s incident, which originated from bugs in the DynamoDB database API, was not (in all likelihood) about data loss. Snapchat user data or Duolingo game progress was almost certainly safe, replicated and secured in Amazon’s data centres.
The problem was that the API – the ‘door’ that allows the application to access this data – was not working.
Having a backup in such a situation is useless. It is like holding a perfect copy of the key to a safe while the safe itself sits inside a burning building you cannot enter. You can have hundreds of backups, but if the entire computing, network and service platform fails, your data is simply inaccessible until the failure is fixed.
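To make the distinction concrete, here is a minimal sketch (the table name and key schema are hypothetical) of what affected applications effectively saw: the read fails not because the data is gone, but because the regional API in front of it is unreachable.

```python
# Sketch: the data behind this table may be intact and backed up many times
# over, yet if the regional DynamoDB API is down, no backup makes this call work.
import boto3
from botocore.exceptions import ClientError, EndpointConnectionError

table = boto3.resource("dynamodb", region_name="us-east-1").Table("user-profiles")

try:
    response = table.get_item(Key={"user_id": "12345"})   # hypothetical key
    print(response.get("Item"))
except (ClientError, EndpointConnectionError) as err:
    # During a regional outage this branch is all you get: the 'door' to the
    # data is what has failed, not the data itself.
    print(f"Data is safe somewhere, but unreachable: {err}")
```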
“Digital twin” – the holy grail of reliability
So what is the real safeguard against such a situation? There is one answer, albeit an extremely complicated one: architectural redundancy at the provider level.
The holy grail of resilience is multi-cloud architecture. We’re not talking about using one service from Google and another from Amazon. We’re talking about running a full ‘digital twin’ of the entire application on a second, competing cloud provider.
Imagine the ideal scenario: our service runs simultaneously on AWS infrastructure and, in parallel, on Microsoft Azure. Dedicated systems (for example, DNS-based health checks) monitor the status of both platforms. When the US-EAST-1 region in AWS starts reporting errors, all user traffic is automatically redirected within seconds to the twin infrastructure in Azure. The end user notices nothing, except perhaps a temporary slowdown.
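Stripped to its essence, that failover logic is just health checks plus a routing decision. A simplified sketch with hypothetical health-check URLs; a real deployment would push this into DNS failover records or a global load balancer rather than application code:

```python
# Sketch of provider-level failover: probe a health endpoint on each cloud
# and send traffic to the first provider that answers.
import requests

ENDPOINTS = [
    "https://aws.example.com/healthz",    # hypothetical primary (AWS)
    "https://azure.example.com/healthz",  # hypothetical twin (Azure)
]

def active_endpoint() -> str:
    for url in ENDPOINTS:
        try:
            if requests.get(url, timeout=2).status_code == 200:
                return url
        except requests.RequestException:
            continue  # this provider looks unhealthy, try the twin
    raise RuntimeError("Both providers are down")

print("Routing traffic to:", active_endpoint())
```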
Sounds ideal. So why does almost nobody do it?
The brutal economics of reliability
The answer is trivial and brutally honest: money. Implementing a true multi-cloud architecture is simply not cost-effective for 99% of companies worldwide, including many listed giants.
We’re not talking about doubling your monthly cloud bill. We are talking about multiplying it, and the biggest costs are hidden.
1. Technological costs (complexity): You cannot simply copy an application from AWS to Azure. Each provider has its own unique services and different APIs. A DynamoDB (AWS) database is not the same as Cosmos DB (Azure) or Spanner (Google). Maintaining application logic that can run seamlessly on two different technology foundations is a mammoth engineering challenge (see the sketch after this list).
2. Operational costs (people): This architecture requires two highly specialised engineering teams. You need AWS experts and Azure experts. In an era of IT talent shortages, this is a luxury only a few can afford.
3. Data synchronisation costs: This is the hardest part. How do you ensure that user data (e.g. a new bank transaction or an item won in a game) is consistent between databases in Virginia (AWS) and Texas (Azure) within the same millisecond? The data transfer costs and the complexity of the replication logic are astronomical.
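To give a feel for the first point, here is a sketch of the ‘same’ write performed against DynamoDB and Cosmos DB (the table, container and credentials are hypothetical). The two SDKs share nothing: different clients, different authentication, different data models.

```python
# Sketch: one logical write, implemented twice. Names and keys are hypothetical.
import boto3
from azure.cosmos import CosmosClient

record = {"id": "order-42", "user": "alice", "total_cents": 9990}

# AWS: DynamoDB via boto3.
boto3.resource("dynamodb", region_name="us-east-1").Table("orders").put_item(Item=record)

# Azure: Cosmos DB via the azure-cosmos SDK - a different client, a different
# auth model and different consistency/partitioning semantics.
cosmos = CosmosClient("https://myaccount.documents.azure.com", credential="<account-key>")
cosmos.get_database_client("shop").get_container_client("orders").upsert_item(record)
```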
And here we come to the bottom line. Business knows how to count. Companies like Zoom, Duolingo and Roblox have consciously made a risk calculation: the cost and reputational damage of a few hours of downtime once every year or two is acceptable, and far lower than the constant, gigantic cost of maintaining true multi-cloud redundancy.
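The arithmetic behind that calculation is blunt. With purely hypothetical numbers:

```python
# Back-of-the-envelope version of the trade-off. All figures are hypothetical.
outage_hours_per_year = 4            # one multi-hour regional incident per year
loss_per_hour = 250_000              # lost revenue, credits and image damage (USD)
expected_outage_cost = outage_hours_per_year * loss_per_hour        # $1.0M

annual_cloud_bill = 5_000_000        # spend with a single provider (USD)
multicloud_overhead = 0.8            # duplicated infra, teams, data transfer
redundancy_cost = annual_cloud_bill * multicloud_overhead           # $4.0M

print(f"Accepting the occasional outage: ~${expected_outage_cost:,} per year")
print(f"Paying for the digital twin:     ~${redundancy_cost:,.0f} per year")
```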
A lesson we must learn
The failure, then, is not a failure of AWS engineers or evidence of the weakness of the cloud. It is a failure of our illusions about it.
The cloud is a tool. It can be used to build a low-cost, flexible infrastructure that is nevertheless fundamentally dependent on a single provider. Or it can be used to build an extremely expensive, complex and truly fault-tolerant fortress, running across multiple providers.
Having all three at once – cheap, simple and 100% reliable – is a privilege that almost no one can afford.
The Virginia incident forced the entire industry to answer the question: how much are we really willing to pay for 100% availability? The answer, it turns out, is much less than we like to claim at industry conferences.
