On October 20, 2025, a major disruption rippled through the internet: Amazon Web Services (AWS), a backbone of online services worldwide, experienced a significant outage that impacted countless applications, websites, and businesses. In this post, we'll explore what happened, why it matters, and how organisations (and individuals) can respond and build resilience.
What happened?
The scope of the outage
The incident started early Monday morning (US time) in the US‑East‑1 (Northern Virginia) region of AWS.
The outage manifested as elevated error rates and latencies across multiple AWS services — including core components like databases, networking, serverless computing and DNS‑based services.
Many major platforms were affected: gaming titles (e.g., Fortnite), social apps (Snapchat), cloud‑powered fintech, streaming providers and more.
It wasn’t just a single app glitch — it was a cascading disruption across services that depend on AWS infrastructure.
The initial cause
While a full root-cause analysis may take much longer to complete (both internally and publicly), AWS cited issues such as network congestion, DNS resolution failures and internal resource-scaling triggers. For example:
- Reddit users noted “DNS record for dynamodb.us-east-1.amazonaws.com not resolving”.
- Historical precedent: AWS once attributed an outage to “automated scaling activity” which triggered latent issues on network devices between its main and internal networks.
- Broadly, these are symptoms of dependencies within the cloud provider’s internal architecture failing (control plane, networking, routing), which then cascade to customers.
The impact & timeline
Customers began reporting issues globally; Downdetector logged thousands of incident reports.
By mid‑morning UK time, many services showed signs of recovery, though some backlogs and residual latency persisted.
Importantly, even though “services are up” quickly becomes the public message, full restoration of all dependent workloads can take more time (backlogs get processed, queues drain, caches repopulate, and so on).
Why this matters (and what it reveals)
1. The cloud’s centrality & fragility
AWS is ubiquitous — many businesses, big and small, rely on its infrastructure. When AWS stutters, the ripple effect can hit disparate industries. The October 2025 outage illustrated how dependent the digital economy is on a handful of major cloud providers.
The irony: cloud promises high availability and resiliency, yet when the cloud provider itself has an issue, many downstream services become vulnerable.
2. Single‑region risk & cascading dependencies
Most cloud architectures (hopefully) span multiple availability zones or even regions. But when a core region like US‑East‑1 experiences systemic issues, many services anchored there will see disruption.
Moreover, some global services are so tightly coupled (data plane + control plane) that even if they are served from another region, dependencies can still bring them down.
3. Control plane / internal dependencies matter as much as compute
The outage highlights an important point: it isn’t just “our servers went down” — often the control infrastructure (networking between internal services, DNS, authorization, monitoring) is what breaks first. When that happens, the outward‑visible services may fail even if the compute hardware is healthy. AWS has acknowledged this in previous outage analyses.
This means that when designing systems, you must account not only for your own apps failing, but also for your provider’s internal plumbing breaking.
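One defensive pattern for exactly this is a circuit breaker around calls to provider-backed dependencies: after repeated failures you fail fast (and serve a degraded response) instead of stacking retries onto an already-struggling service. Here is a minimal, generic Python sketch; the thresholds are illustrative, and production systems would typically use a hardened library rather than hand-rolling this:

```python
import time

class CircuitBreaker:
    """Stop hammering a failing dependency and fail fast while it recovers."""

    def __init__(self, failure_threshold=5, reset_after=30.0):
        self.failure_threshold = failure_threshold  # failures before opening
        self.reset_after = reset_after              # seconds before a retry
        self.failures = 0
        self.opened_at = None                       # time the circuit opened

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: dependency presumed down")
            self.opened_at = None  # half-open: allow one trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
                self.failures = 0
            raise
        self.failures = 0  # success resets the count
        return result

# Usage (hypothetical): breaker.call(table.put_item, Item=item)
```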
4. Communications & perception matter
Outages of this magnitude don’t just impact functionality: they impact trust. Customers of affected services often get frustrated, angry, or anxious. Timely, transparent communication can make a difference. One critique of earlier AWS outages: their support and status updates were deemed inadequate.
If your business sits downstream of AWS, part of your responsibility is communicating to your customers when you’re impacted and what you’re doing about it.
How organisations can respond & build resilience
Given that AWS outages — or more generally cloud‑provider disruptions — are practically inevitable at some scale, here are best practices and strategies to minimise impact:
1. Multi‑region and multi‑availability zone redundancy
Deploy workloads across multiple regions (if your business allows) or at least multiple availability zones. If one region goes down, traffic can failover to another.
For example: Use regional failover mechanisms (such as DNS via Amazon Route 53) and make sure standby capacity is warm (not zero) to handle sudden load.
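As a sketch of what that can look like with Route 53’s failover routing policy via boto3 (the hosted zone ID, domain names and health-check settings below are placeholders, not details from the actual outage):

```python
import boto3  # assumes AWS credentials are already configured

route53 = boto3.client("route53")
HOSTED_ZONE_ID = "Z0000000EXAMPLE"  # placeholder hosted zone

# Health check that Route 53 uses to decide when to fail over.
check_id = route53.create_health_check(
    CallerReference="primary-endpoint-check-1",  # must be unique per request
    HealthCheckConfig={
        "Type": "HTTPS",
        "FullyQualifiedDomainName": "primary.example.com",
        "ResourcePath": "/healthz",
        "RequestInterval": 30,  # seconds between checks
        "FailureThreshold": 3,  # consecutive failures before "unhealthy"
    },
)["HealthCheck"]["Id"]

# PRIMARY answer is served while healthy; SECONDARY takes over otherwise.
route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={"Changes": [
        {"Action": "UPSERT", "ResourceRecordSet": {
            "Name": "app.example.com", "Type": "CNAME",
            "SetIdentifier": "primary-us-east-1", "Failover": "PRIMARY",
            "TTL": 60, "HealthCheckId": check_id,
            "ResourceRecords": [{"Value": "primary.example.com"}],
        }},
        {"Action": "UPSERT", "ResourceRecordSet": {
            "Name": "app.example.com", "Type": "CNAME",
            "SetIdentifier": "secondary-eu-west-1", "Failover": "SECONDARY",
            "TTL": 60,
            "ResourceRecords": [{"Value": "standby.example.com"}],
        }},
    ]},
)
```

The low TTL matters: clients re-resolve within a minute or so of failover rather than caching the dead endpoint.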
2. Use caching, offline modes & backup paths
Design your applications so that, if the cloud backend slows or fails:
- Frequently used static content is cached (CDNs, edge caches)
- Critical functionality can operate in reduced mode (queuing writes, local caching)
For example, you may buffer writes locally and flush to the cloud once connectivity returns (ensuring eventual consistency).
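A minimal Python sketch of that buffering pattern, assuming a DynamoDB table named `orders` (a made-up name) and an in-memory queue as the local buffer; a real implementation would persist the buffer to disk and make replays idempotent:

```python
import queue
import boto3
from botocore.exceptions import ClientError, EndpointConnectionError

# Writes queue up here when the backend is unreachable.
pending_writes: "queue.Queue[dict]" = queue.Queue()
table = boto3.resource("dynamodb").Table("orders")  # hypothetical table

def save(item: dict) -> None:
    """Try the cloud first; fall back to the local buffer on failure."""
    try:
        table.put_item(Item=item)
    except (ClientError, EndpointConnectionError):
        pending_writes.put(item)  # degrade: accept the write locally

def flush_pending() -> None:
    """Replay buffered writes once connectivity returns (eventual consistency)."""
    while not pending_writes.empty():
        item = pending_writes.get()
        try:
            table.put_item(Item=item)
        except (ClientError, EndpointConnectionError):
            pending_writes.put(item)  # still down; stop and retry later
            break
```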
3. Monitor the cloud provider’s health & your dependencies
Don’t just monitor your app; monitor the underlying cloud services you depend on. Set alerts for upstream issues (latency, error rates, region status) and activate your disaster response plan accordingly. Tools like the AWS Service Health Dashboard or third‑party aggregators help.
Quick detection gives you lead time to shift traffic or degrade gracefully.
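For instance, you can poll the AWS Health API with boto3, as in this sketch. Two caveats: the Health API is served only from us-east-1, and it requires a Business or Enterprise support plan; the alerting action here is just a placeholder:

```python
import boto3

# The AWS Health API endpoint lives in us-east-1 (a quirk of the service).
health = boto3.client("health", region_name="us-east-1")

def open_issues(region: str = "us-east-1") -> list:
    """Return currently open AWS issues affecting the given region."""
    resp = health.describe_events(
        filter={
            "regions": [region],
            "eventStatusCodes": ["open"],
            "eventTypeCategories": ["issue"],
        },
        maxResults=20,
    )
    return resp["events"]

# Placeholder action: page the on-call and consider the failover runbook.
if open_issues():
    print("Upstream AWS issue detected; consider activating failover runbook")
```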
4. Define and rehearse incident response plans
Have a documented incident response plan for cloud‑provider outages. This includes:
- Failover procedures (DNS switch, region reroute)
- Customer communication templates (what you tell users when you’re impacted)
- Roles & responsibilities during the outage (who monitors, who decides to switch traffic)
- Back-out plans (when to revert once the primary region recovers)
A prepared team can turn a crisis into a rehearsed procedure rather than a scramble.
5. Review your business continuity & architecture assumptions
Ask yourself:
- What happens if your cloud provider’s primary region has elevated error rates for 2–4 hours?
- What if your provider’s entire service control plane is degraded?
- Could you operate with reduced functionality (e.g., read-only mode, sketched after this list) while you recover writes later?
- Are you over-reliant on a single cloud provider? Some organisations choose multi-cloud setups for this reason, although multi-cloud also brings complexity.
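On the read-only question, here is a minimal sketch of a degraded-mode switch, assuming a Flask application; the flag, route and status code are all illustrative:

```python
from flask import Flask, jsonify, request, abort

app = Flask(__name__)
READ_ONLY = False  # flipped by an operator or an automated health check

@app.before_request
def reject_writes_when_degraded():
    # While the backend is impaired, keep serving reads but refuse mutations.
    if READ_ONLY and request.method in ("POST", "PUT", "PATCH", "DELETE"):
        abort(503, description="Service is temporarily read-only")

@app.route("/items")
def list_items():
    # Reads can still be served from a cache or replica during the outage.
    return jsonify(["served from cache or replica"])
```

In practice you would flip the flag from configuration or a feature-flag service rather than a module-level global.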
6. Post‑mortems and learning
When an outage hits you (or the provider), do a retrospective: what worked, what didn’t? How long did failovers take? Where were communication gaps? These learnings feed into improved architecture, runbooks and training.
What about the individual / end‑user side?
Even if you’re not an enterprise architect, this AWS outage has relevance:
- If you use a service (app, streaming, gaming) and it’s down, chances are you’re collaterally affected by a broader infrastructure issue, not just your device or internet connection.
- For developers or hobbyists using AWS, it’s a reminder: no matter how reliable the provider, resiliency starts with architecture. Design your apps with failure in mind.
- In your personal tech usage, have alternate apps/services for when your primary one goes down. Redundancy applies to consumers too.
Looking ahead: key questions and implications
- Will cloud providers shift architecture to reduce single-region impact? Likely yes. We may see more emphasis on global active-active designs.
- Will businesses move to hybrid or edge-cloud models? To reduce dependency on centralised cloud regions, more companies may adopt on-premises, edge or multi-cloud strategies.
- Will pricing or service-level agreements (SLAs) change? Providers may offer more granular outage guarantees for critical services, or companies may negotiate stronger remedies (refunds, service credits) for wide disruptions.
- Will regulatory scrutiny increase? As cloud becomes infrastructure-critical, governments may push for greater transparency on outages and resilience plans. The recent AWS outage raised questions about concentration risk in digital infrastructure.
Final thoughts
Cloud computing has transformed how businesses operate — offering agility, scalability and cost‑efficiency. Yet, events like the October 2025 AWS outage highlight a fundamental truth: outsourcing infrastructure does not outsource risk.
For organisations and developers alike, the key takeaway is: design with failure in mind. Redundancy, monitoring, communication and rehearsal are your shields when the cloud cracks.
And to the end‑user: when your favourite app is suddenly “not working”, it may not be your internet — it might be the invisible infrastructure behind the scenes.