It started as a slow drizzle: timeouts, retries, and dashboards that would not load. Then the lights went out for a large part of the internet. Food delivery stalled, internal tools froze, and streams buffered. An AWS outage rippled across apps that millions use every day.
This is a clear look at what likely happened, why one region can shake the web, and what builders can do to keep products usable when the cloud has a bad day.
Single regions fail. Design so a regional hiccup does not become a company outage.
us-east-1 carries a lot of control plane traffic. When internal lookups and routing misbehave there, authentication, service discovery, and data replication stumble elsewhere. APIs that depend on internal calls slow down, then fail. Retries pile up and create their own traffic spike.
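One concrete way to keep that pile-up bounded is a hard time budget around internal calls. Here is a minimal Python sketch, with `fn` as a hypothetical stand-in for any internal dependency call (an auth check, a service-discovery lookup), not a real AWS API:

```python
import time

def call_with_deadline(fn, *, attempts=3, deadline_s=2.0):
    """Retry fn a few times, but never past a hard overall deadline.

    `fn` is a hypothetical stand-in for any internal call. The point:
    once the time budget is spent, fail fast instead of stacking more
    requests behind a slow dependency.
    """
    start = time.monotonic()
    last_error = None
    for _ in range(attempts):
        remaining = deadline_s - (time.monotonic() - start)
        if remaining <= 0:
            break  # budget spent: surface the failure, do not pile on
        try:
            return fn(timeout=remaining)
        except Exception as exc:  # real code would catch the client's timeout errors
            last_error = exc
    raise TimeoutError(f"gave up after {attempts} attempts / {deadline_s}s") from last_error
```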
AWS later pointed to a network and DNS problem inside their systems. Misrouted or failing internal DNS means services cannot find each other. Timeouts trigger retries, which increase load, which creates a feedback loop. The fix is to stop the loop, correct routing, and let caches and health checks repopulate.
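Breaking the loop on the client side usually comes down to two habits, sketched below as illustration rather than as AWS's actual remediation: jittered backoff so clients do not retry in lockstep, and a circuit breaker that stops calling a dependency that keeps failing.

```python
import random
import time

def backoff_with_jitter(attempt, base=0.2, cap=5.0):
    """Full-jitter backoff: wait a random time up to an exponentially
    growing cap, so synchronized clients spread their retries out."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

class CircuitBreaker:
    """Toy circuit breaker: after `threshold` consecutive failures,
    stop calling the dependency for `cooldown_s` seconds so retries
    cannot keep feeding the overload loop."""

    def __init__(self, threshold=5, cooldown_s=30.0):
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown_s:
            # Half-open: let one probe through after the cooldown.
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def record(self, success):
        if success:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
```

In use, a caller checks `allow()` before the request and reports the outcome with `record()`; while the breaker is open, it serves a cached or degraded response instead of hammering the failing service.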
Apps with tight real-time loops and many dependencies were hit hardest: streaming, delivery, payments, internal dashboards, and CI runners. Even static sites on S3 and Amplify saw degraded behavior when control plane operations were needed.
The internet is concentrated. A single-region wobble can cost thousands of businesses money and trust. Resilience is not a checkbox; it is a design choice that touches architecture, operations, and communication.
Outages happen. What matters is how fast you degrade gracefully, how clearly you communicate with users, and how you harden the system afterward.
Pick one user-facing flow and one backend dependency. Add a failure path, a status banner, and a clear runbook entry. Small steps compound.
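A failure path plus a status banner can be as small as the sketch below, with every name hypothetical: serve last-known-good data and say so, rather than rendering an error page.

```python
# Hypothetical flow: show a menu from cached data and raise a status banner
# when the live pricing service is down, instead of failing the whole page.
CACHED_MENU = [{"item": "coffee", "price_cents": 450}]

def fetch_live_menu():
    """Stand-in for the real dependency call; assume it may raise."""
    raise TimeoutError("pricing service unreachable")

def load_menu_page():
    try:
        return {"menu": fetch_live_menu(), "banner": None}
    except Exception:
        # Failure path: serve last-known-good data and tell the user why.
        return {
            "menu": CACHED_MENU,
            "banner": "Live prices are temporarily unavailable; showing recent data.",
        }

print(load_menu_page())
```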