
Why Millions of Apps Went Dark: The Anatomy of the AWS Outage

By Edmund Adu Asamoah · October 20, 2025 · 6 min read
[Image: rows of servers in a data center]
Even the cloud has cloudy days. Here is how one AWS region took a big part of the internet down.

It started as a slow drizzle: timeouts, retries, and dashboards that would not load. Then the lights went out for a lot of the internet. Food delivery stalled, internal tools froze, and streams buffered. An AWS outage rippled across apps that millions use every day.

This is a clear look at what likely happened, why one region can shake the web, and what builders can do to keep products usable when the cloud has a bad day.

  • Primary region: US-EAST-1
  • Blast radius: Global
  • Root cause class: Network / DNS
  • Service impact: Auth, APIs

Single regions fail. Design so a regional hiccup does not become a company outage.

The chain reaction

US-EAST-1 carries a lot of control plane traffic. When internal lookups and routing misbehave there, authentication, service discovery, and data replication stumble elsewhere. APIs that depend on internal calls slow down, then fail. Retries pile up and create their own traffic spike.

Technical root cause, simplified

AWS later pointed to a network and DNS problem inside its own systems. Misrouted or failing internal DNS means services cannot find each other. Timeouts trigger retries, retries increase load, and the extra load causes more timeouts: a feedback loop. The fix is to stop the loop, correct routing, and let caches repopulate and health checks recover.
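
One way clients keep from feeding that loop is capped exponential backoff with jitter, so retries spread out instead of arriving in sync. Below is a minimal sketch of the idea, not AWS's internal fix; call_api is a hypothetical stand-in for whatever request your code makes.

    # A minimal sketch of capped exponential backoff with full jitter, so client
    # retries spread out instead of hammering a struggling endpoint in lockstep.
    # call_api is a hypothetical stand-in for whatever request your code makes.
    import random
    import time

    def call_with_backoff(call_api, max_attempts=5, base_delay=0.2, max_delay=10.0):
        for attempt in range(max_attempts):
            try:
                return call_api()
            except TimeoutError:
                if attempt == max_attempts - 1:
                    raise  # give up loudly instead of retrying forever
                # Sleep a random slice of a capped, exponentially growing window.
                window = min(max_delay, base_delay * (2 ** attempt))
                time.sleep(random.uniform(0, window))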

  • Symptom: increased error rates and timeouts for control plane APIs.
  • Effect: auth, queueing, and storage endpoints intermittently unreachable.
  • Result: downstream customer apps report 5xx errors and stalled jobs.

Who felt it most

Apps with tight real-time loops and many dependencies: streaming, delivery, payments, internal dashboards, and CI runners. Even static sites on S3 and Amplify saw degraded behavior when control plane operations were needed.

Why it matters

The internet is concentrated. A wobble in a single region can cost thousands of businesses money and trust. Resilience is not a checkbox; it is a design choice that touches architecture, operations, and communication.

Lessons for engineers

  • Design for failure. Assume every dependency can be slow or unavailable.
  • Use health checks, timeouts, retries, and circuit breakers to avoid cascades (a circuit breaker sketch follows this list).
  • Spread critical workloads across multiple AZs and regions. Test the failover.
  • Keep a dark launch path for status banners and feature flags when the backend is unhappy.
  • Practice incident drills and write postmortems people can learn from.
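
The circuit breaker bullet deserves a concrete shape. Here is a minimal sketch, not a production library: after a run of failures it fails fast for a cooldown window, so a sick dependency is not hammered while it recovers. The threshold and cooldown values are illustrative assumptions.

    import time

    class CircuitBreaker:
        """Fail fast after repeated failures, then retry after a cooldown."""

        def __init__(self, failure_threshold=5, cooldown_seconds=30.0):
            self.failure_threshold = failure_threshold
            self.cooldown_seconds = cooldown_seconds
            self.failures = 0
            self.opened_at = None  # None means the circuit is closed (healthy)

        def call(self, fn):
            if self.opened_at is not None:
                if time.monotonic() - self.opened_at < self.cooldown_seconds:
                    raise RuntimeError("circuit open: failing fast")
                self.opened_at = None  # cooldown over, allow one trial call
            try:
                result = fn()
            except Exception:
                self.failures += 1
                if self.failures >= self.failure_threshold:
                    self.opened_at = time.monotonic()  # trip the breaker
                raise
            self.failures = 0  # a success closes the circuit again
            return result

Failing fast here is what turns a cascade into a contained, visible error the rest of the system can route around.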

Key takeaways for teams

  • Know your single points of failure and remove them where possible.
  • Keep status pages, runbooks, and contact trees easy to reach.
  • Fail fast and loud inside the team, fail soft for users.
  • After the storm, log what worked and what hurt, then ship one improvement.

Outages happen. What matters is how fast you degrade gracefully, how clearly you talk to users, and how you harden the system after.

Run a resilience sprint this week

Pick one user-facing flow and one backend dependency. Add a failure path, a status banner, and a clear runbook entry. Small steps compound.
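
As one concrete starting point, here is a minimal sketch of a soft-degrade path: probe the dependency, and when it fails, serve cached data with an honest banner instead of a broken page. health_check, fetch_live, and fetch_cached are hypothetical placeholders for your own flow.

    def render_dashboard(health_check, fetch_live, fetch_cached):
        # health_check, fetch_live, and fetch_cached are placeholders for your flow.
        try:
            health_check()  # cheap probe of the one dependency you picked
            return {"data": fetch_live(), "banner": None}
        except Exception:
            # Degrade softly: stale data plus an honest banner beats a broken page.
            return {
                "data": fetch_cached(),
                "banner": "Some data may be out of date while we recover.",
            }

Wire something like this into one flow, write down what you learned, and your first resilience sprint is shipped.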
