AWS Blackout: The Centralized Cloud Crash That Took Down the Web

By Diya Poluru ’29

Tech Associate, The Lawrenceville School, NJ


On Monday, October 20, 2025, Amazon Web Services suffered a major outage. The disruption did not just affect Amazon’s own sites; it took a multitude of applications and online tools offline, affecting people all around the world. AWS is described by CNBC as the “leading provider of cloud infrastructure technology, accounting for about a third of the market”. When AWS crashed, the outage took down services like Canvas, Amazon, Snapchat, Disney+, Roblox, Fortnite, Starbucks, and Reddit, all built on this US-based giant. But why did the AWS blackout happen in the first place?



Firstly, to get to the root of the outage, it is critical to recognize what AWS itself depends on to function. Amazon Web Services relies on DynamoDB, a database service that holds enormous numbers of records and that customers use extensively to store data. On the night of October 19th, the DynamoDB service in the US-EAST-1 region went down, impacting users around the world for roughly 15 hours. The crash reached far beyond US-EAST-1 because this particular AWS region is one of the oldest and largest, and multiple critical services worldwide depend on it.
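
To picture what that dependence looks like in practice, here is a minimal Python sketch of a service reading a record from DynamoDB with the boto3 library. The table name and key are hypothetical, invented purely for illustration; the error handling shows roughly what applications ran into once the endpoint became unreachable.

```python
import boto3
from botocore.exceptions import ClientError, EndpointConnectionError

# Read one record from DynamoDB in us-east-1, the region where the
# outage originated. The table name "user-sessions" and the key are
# hypothetical, invented purely for illustration.
dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("user-sessions")

try:
    response = table.get_item(Key={"user_id": "example-user"})
    print(response.get("Item"))
except EndpointConnectionError:
    # During the outage, the endpoint's DNS record was empty, so a
    # call like this could not even reach a server to fail politely.
    print("Could not reach the DynamoDB endpoint")
except ClientError as err:
    print(f"DynamoDB returned an error: {err}")
```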


The whole situation occurred because of a “race condition” in DynamoDB’s DNS system. DNS, which stands for Domain Name System, is often described as a “phone book for the internet”. Essentially, when one looks up a site or an application to connect to, their computer needs to find the site’s actual IP address, which is where DNS comes in. DNS takes site names like Google.com and converts them into the numeric IP addresses that computers can identify and read. DynamoDB’s endpoints sit in front of a massive, constantly changing pool of IP addresses, which means the DNS records pointing to them need constant automated updates.
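
To make the phone-book analogy concrete, here is a small Python sketch of that lookup step, using only the standard library. The hostname is DynamoDB’s public US-EAST-1 endpoint; during the outage, this is roughly the step that failed for clients.

```python
import socket

# Resolve a hostname to its IP addresses -- exactly the step that
# broke during the outage: the name still existed, but its record
# had been emptied, so lookups returned nothing usable.
hostname = "dynamodb.us-east-1.amazonaws.com"

try:
    results = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
    for family, _type, _proto, _canon, sockaddr in results:
        print(f"{hostname} -> {sockaddr[0]}")
except socket.gaierror:
    # Roughly what clients experienced on October 20: the lookup
    # failed because the DNS record held no IP addresses.
    print(f"DNS lookup failed for {hostname}")
```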


Now, back to the “race condition”. According to Tobias Schmidt of AWS Fundamentals, a race condition is “the condition of an electronics, software, or other system where the system’s substantive behavior is dependent on the sequence or timing of other uncontrollable events”. In DynamoDB’s case, two components of the DNS management system, called enactors, performed the same task at the same time but at different speeds. Call them Enactor A and Enactor B. Enactor A was applying an older update plan and running slowly. Meanwhile, Enactor B generated and applied a newer, faster plan, then ran a cleanup process to delete outdated plans like Enactor A’s. Just before that cleanup ran, Enactor A’s delayed plan finished and was applied to the main DynamoDB endpoint. The cleanup then deleted Enactor A’s plan, which was now the active one, so the endpoint’s DNS record was wiped of every IP address it contained. Users worldwide felt the effects immediately. The automated system also had to be shut down, causing an extended delay before the servers could work again, because AWS engineers had to go in and fix everything by hand.
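
To make that sequence easier to follow, here is a deliberately simplified Python sketch of the same kind of race, with two threads standing in for the two enactors. All names, timings, and data structures are invented for illustration; AWS’s real DNS management system is far more complex.

```python
import threading
import time

# Shared state: the "DNS record" for the endpoint, plus the set of
# update plans the system still considers valid. Everything here is
# invented to mirror the published description of the failure.
endpoint_record = {"plan": "plan-0", "ips": ["192.0.2.1"]}
live_plans = {"plan-0", "plan-1"}   # plan-1: older plan, not yet applied
lock = threading.Lock()

def enactor_a():
    """Slowly applies the OLD plan, which is stale by the time it lands."""
    time.sleep(0.2)                 # A is slow; B's newer plan lands first
    with lock:
        # Missing safety check: A never asks whether a NEWER plan was
        # already applied. That absent check is the race condition.
        endpoint_record["plan"], endpoint_record["ips"] = "plan-1", ["192.0.2.2"]

def enactor_b():
    """Quickly applies a NEW plan, then cleans up plans it thinks are old."""
    with lock:
        endpoint_record["plan"], endpoint_record["ips"] = "plan-2", ["192.0.2.3"]
        live_plans.add("plan-2")
    time.sleep(0.4)                 # cleanup runs AFTER A's stale write
    with lock:
        live_plans.difference_update({"plan-0", "plan-1"})  # delete old plans
        if endpoint_record["plan"] not in live_plans:
            # A's stale plan is now the active one, and it was just
            # deleted -- so the endpoint is left with no IPs at all.
            endpoint_record["ips"] = []

a = threading.Thread(target=enactor_a)
b = threading.Thread(target=enactor_b)
a.start(); b.start(); a.join(); b.join()
print(endpoint_record)              # {'plan': 'plan-1', 'ips': []} -- wiped
```

The key bug in the sketch, as in the real incident, is that the slow enactor never checks whether a newer plan has already been applied before overwriting the record.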



Essentially, the AWS outage was a domino, or ripple, effect. The DNS records for DynamoDB went down, and AWS went down with them. As a result, countless companies that relied on AWS were affected: applications such as Canvas, Snapchat, Roblox, Fortnite, and many other widely used sites could no longer function, because they could not reach their customer data, and those who tried to access them were met with an error screen.


Following a major outage that started at one central point and ended up affecting much of the world, critical concerns about centralized cloud services have been raised. When so many systems rely on one common component, a single point of failure is all it takes for everything to come crashing down. A tiny glitch creates major ripples globally, damaging the economy and users all around the world. Because the world is so interconnected, one tiny mishap can cause a huge domino effect that touches companies in every area: entertainment, business, and even government services.
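
One common way engineers soften such a single point of failure is to avoid pinning everything to one region. The Python sketch below, again using boto3, tries a primary region and falls back to a replica elsewhere; it assumes the data is already replicated across regions (for example, with DynamoDB global tables), and the table name is hypothetical.

```python
import boto3
from botocore.exceptions import ClientError, EndpointConnectionError

# Try a primary region first, then fall back to a replica elsewhere.
# Assumes the data is already replicated across regions (for example,
# with DynamoDB global tables); the table name is hypothetical.
REGIONS = ["us-east-1", "us-west-2"]

def get_user(user_id):
    for region in REGIONS:
        table = boto3.resource("dynamodb", region_name=region).Table("user-sessions")
        try:
            return table.get_item(Key={"user_id": user_id}).get("Item")
        except (EndpointConnectionError, ClientError):
            print(f"{region} unavailable, trying the next region...")
    raise RuntimeError("all configured regions failed")

print(get_user("example-user"))
```

A fallback like this does not eliminate risk, but it means one region’s DNS records being wiped no longer takes the whole application down.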


Additionally, fixing the problem brings financial costs as well as damage to companies’ reputations, revenue, and, for certain sites, user engagement. For many companies, recovery costs cut directly into profit. Crashes also erode engagement and reputation: applications like Roblox and Fortnite, which people constantly try to access for entertainment, lose users’ interest when they are unreachable for an extended period. If the issue is prolonged, people might stop using certain services, including bank services, out of concern about their reliability.

The outage emphasizes that over-reliance on just a few major providers can let one tiny defect harm enormous numbers of people, and that even US-based giants make mistakes. For the future, the AWS outage sets a strong precedent, heavily suggesting that the world needs less dependency on a select handful of cloud services and more variety in the systems it runs on. After all, we have already seen the massive impact one glitch can have on the world in a matter of minutes. If the issue is not addressed now, who knows how much bigger the ripple could become if history repeats itself?



References

ByteMonk. (2025, November 4). How a Tiny Bug Crashed AWS | DynamoDB us-east-1 Outage Explained. YouTube. Retrieved November 7, 2025, from https://www.youtube.com/watch?v=ey0HsdZSpoc


Dmitracova, O. (2025, October 21). Huge global outage impacts Amazon, Fortnite and Snapchat. CNN. Retrieved November 9, 2025, from https://www.cnn.com/business/live-news/amazon-tech-outage-10-20-25-intl


Faguy, A. (2025, October 21). AWS outage: Are we relying too much on US big tech? BBC. Retrieved November 9, 2025, from https://www.bbc.com/news/articles/c0jdgp6n45po


Paksu, S. (2025, May 24). The Risks of Over‑Reliance on Centralized Digital Services and Platforms. Medium. Retrieved November 9, 2025, from https://spaksu.medium.com/the-risks-of-over-reliance-on-centralized-digital-services-and-platforms-932ccd1a35bf


Palmer, A., Lockwood, T., & Bishop, K. (2025, October 20). AWS services recover after daylong outage hits major sites. CNBC. Retrieved November 9, 2025, from https://www.cnbc.com/2025/10/20/amazon-web-services-outage-takes-down-major-websites.html


Schmidt, T. (2022, September 15). Understanding & Handling Race Conditions at DynamoDB. AWS Fundamentals. Retrieved November 9, 2025, from https://awsfundamentals.com/blog/understanding-and-handling-race-conditions-at-dynamodb


Taylor, J. (2025, October 24). Amazon reveals cause of AWS outage that took everything from banks to smart beds offline. The Guardian. Retrieved November 9, 2025, from https://www.theguardian.com/technology/2025/oct/24/amazon-reveals-cause-of-aws-outage

What is DNS? | How DNS works. (n.d.). Cloudflare. Retrieved November 9, 2025, from https://www.cloudflare.com/learning/dns/what-is-dns/


