A single point of failure triggered the Amazon outage affecting millions

-



In turn, the delay in network state propagations spilled over to a network load balancer that AWS services rely on for stability. As a result, AWS customers experienced connection errors from the US-East-1 region. AWS network functions affected included the creating and modifying Redshift clusters, Lambda invocations, and Fargate task launches such as Managed Workflows for Apache Airflow, Outposts lifecycle operations, and the AWS Support Center.

For the time being, Amazon has disabled the DynamoDB DNS Planner and the DNS Enactor automation worldwide while it works to fix the race condition and add protections to prevent the application of incorrect DNS plans. Engineers are also making changes to EC2 and its network load balancer.

A cautionary tale

Ookla outlined a contributing factor not mentioned by Amazon: a concentration of customers who route their connectivity through the US-East-1 endpoint and an inability to route around the region. Ookla explained:

The affected US‑EAST‑1 is AWS’s oldest and most heavily used hub. Regional concentration means even global apps often anchor identity, state or metadata flows there. When a regional dependency fails as was the case in this event, impacts propagate worldwide because many “global” stacks route through Virginia at some point.

Modern apps chain together managed services like storage, queues, and serverless functions. If DNS cannot reliably resolve a critical endpoint (for example, the DynamoDB API involved here), errors cascade through upstream APIs and cause visible failures in apps users do not associate with AWS. That is precisely what Downdetector recorded across Snapchat, Roblox, Signal, Ring, HMRC, and others.

The event serves as a cautionary tale for all cloud services: More important than preventing race conditions and similar bugs is eliminating single points of failure in network design.

“The way forward,” Ookla said, “is not zero failure but contained failure, achieved through multi-region designs, dependency diversity, and disciplined incident readiness, with regulatory oversight that moves toward treating the cloud as systemic components of national and economic resilience.”



Source link

Latest news

Lenovo’s Legion Go 2 Is a Good Handheld for Power Users

The detachable controllers go a long way towards making the device more portable and usable. The screen has...

Why Tehran Is Running Out of Water

This story originally appeared on Bulletin of the Atomic Scientists and is part of the Climate Desk collaboration.During...

Move Over, MIPS—There’s a New Bike Helmet Safety Tech in Town

Over the course of several hours and a few dozen trail miles, I had little to say about...

Security News This Week: Oh Crap, Kohler’s Toilet Cameras Aren’t Really End-to-End Encrypted

An AI image creator startup left its database unsecured, exposing more than a million images and videos its...

Gevi’s Espresso Machine Works Fine, but There Are Better Options at This Price Point

The coffee gadget market has caused a massive proliferation of devices for all tastes, preferences, and budgets, but...

Gear News of the Week: Google Drops Another Android Update, and the Sony A7 V Is Here

It was only back in June that Android 16 delivered a raft of new features for Google's operating...

Must read

You might also likeRELATED
Recommended to you