A single point of failure triggered the Amazon outage affecting millions


In turn, the delay in network state propagation spilled over to a network load balancer that AWS services rely on for stability. As a result, AWS customers experienced connection errors from the US-East-1 region. Affected AWS functions included creating and modifying Redshift clusters, Lambda invocations, and Fargate task launches, along with services such as Managed Workflows for Apache Airflow, Outposts lifecycle operations, and the AWS Support Center.

For the time being, Amazon has disabled the DynamoDB DNS Planner and the DNS Enactor automation worldwide while it works to fix the race condition and add protections to prevent the application of incorrect DNS plans. Engineers are also making changes to EC2 and its network load balancer.
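Amazon has not published the code for these protections. As a rough illustration only, one common guard against this class of race is to give every plan an increasing generation number and to check and apply it in a single locked step, so that a delayed worker cannot overwrite newer state with an older plan. The sketch below assumes that approach. The names `DnsPlan`, `PlanStore`, and `generation` are hypothetical and are not taken from AWS.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class DnsPlan:
    # Hypothetical plan record: an increasing generation number plus
    # the records the plan would publish.
    generation: int
    records: tuple


class PlanStore:
    """Holds the currently applied plan and rejects stale or empty ones."""

    def __init__(self):
        self._lock = threading.Lock()
        self._applied = None

    def apply(self, plan):
        # The check and the write happen under one lock, so a slow worker
        # cannot slip an older plan in between the check and the update.
        with self._lock:
            if not plan.records:
                return False  # refuse a plan that would publish no records
            if self._applied is not None and plan.generation <= self._applied.generation:
                return False  # refuse a plan that is not newer than the applied one
            self._applied = plan
            return True

    @property
    def applied(self):
        with self._lock:
            return self._applied
```

In this sketch, a worker that wakes up late holding generation 1 is turned away once generation 2 has been applied, and a plan with no records is never accepted. A real system would also need the same check at the point where old plans are cleaned up.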

A cautionary tale

Ookla outlined a contributing factor not mentioned by Amazon: a concentration of customers who route their connectivity through the US-East-1 endpoint and an inability to route around the region. Ookla explained:

The affected US‑EAST‑1 is AWS’s oldest and most heavily used hub. Regional concentration means even global apps often anchor identity, state or metadata flows there. When a regional dependency fails as was the case in this event, impacts propagate worldwide because many “global” stacks route through Virginia at some point.

Modern apps chain together managed services like storage, queues, and serverless functions. If DNS cannot reliably resolve a critical endpoint (for example, the DynamoDB API involved here), errors cascade through upstream APIs and cause visible failures in apps users do not associate with AWS. That is precisely what Downdetector recorded across Snapchat, Roblox, Signal, Ring, HMRC, and others.

The event serves as a cautionary tale for all cloud services: More important than preventing race conditions and similar bugs is eliminating single points of failure in network design.

“The way forward,” Ookla said, “is not zero failure but contained failure, achieved through multi-region designs, dependency diversity, and disciplined incident readiness, with regulatory oversight that moves toward treating the cloud as systemic components of national and economic resilience.”
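For application developers, the simplest form of the multi-region design Ookla describes is a client that tries a second region when the first cannot be reached. The Python sketch below is a minimal illustration under stated assumptions: the function name `call_with_regional_fallback` and the `request` callable are hypothetical, and the approach only helps when the data being requested is already replicated to the fallback region.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def call_with_regional_fallback(
    regions: Sequence[str], request: Callable[[str], T]
) -> T:
    """Try `request` against each region in order and return the first success."""
    errors = []
    for region in regions:
        try:
            return request(region)
        except (ConnectionError, TimeoutError) as exc:
            # Record the failure and move on to the next region.
            errors.append(f"{region}: {exc}")
    # Every region failed; surface all of the collected errors at once.
    raise ConnectionError("all regions failed: " + "; ".join(errors))
```

A caller would pass an ordered list such as `["us-east-1", "us-west-2"]` and a function that performs the actual request for a given region. Failure in the first region is then contained rather than passed straight to the user, which is the outcome Ookla argues for.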


