FM1 — When One Component Takes Down Everything

October 21, 2016. Dyn, a managed DNS provider, suffers a massive DDoS attack. Twitter goes offline. Netflix goes offline. Reddit, GitHub, Spotify, and hundreds of other services go dark. One DNS provider. One component. Half the internet.

These companies ran redundant servers across multiple regions. They had failover for their databases, load balancers, and application tiers. None of that mattered. They all depended on the same DNS provider. When Dyn went down, no amount of internal redundancy could save them.

What Happened

Dyn provided authoritative DNS resolution for thousands of domains. When a user typed twitter.com, their browser asked a DNS resolver to translate that name into an IP address. The resolver asked Dyn. Dyn answered.

On October 21, a botnet of compromised IoT devices flooded Dyn's servers with traffic. Three waves of attacks hit across the day. Each wave overwhelmed Dyn's capacity. DNS queries for thousands of domains went unanswered. Browsers could not resolve hostnames. Services appeared completely dead even though their servers were running fine.

User → DNS Resolver → [DYN — DOWN]

                     Twitter servers  (running, unreachable)
                     Netflix CDN      (running, unreachable)
                     GitHub           (running, unreachable)
                     Spotify          (running, unreachable)

The irony: Twitter's servers never went down. Netflix's CDN kept running. The applications were healthy. Users simply could not reach them because the name resolution layer had a single point of failure.

Why It Was Not Obvious

Managed services feel redundant. Dyn operated a globally distributed DNS network. They had servers on multiple continents. They had handled DDoS attacks before. Choosing Dyn felt like the responsible engineering decision.

The hidden assumption: the vendor's redundancy is your redundancy. It is not. Dyn's internal redundancy protected against individual server failures. It did not protect against an attack that overwhelmed their entire network. The companies that depended on Dyn had outsourced fault tolerance to a single vendor. They had created a single point of failure at a layer they did not control.

One DNS provider is simpler to operate. Using multiple DNS providers eliminates the single point of failure, but requires keeping records synchronized across providers. On October 21, every team that had chosen simplicity paid for it with a full outage.

The Failure Mode

FM1 is the most fundamental failure mode in systems engineering. One component fails. The entire system fails. No degraded mode. No partial availability. Complete outage.

FM1 hides in layers you do not think about. DNS. Certificate authorities. Payment processors. Cloud provider regions. The load balancer itself. Any component that sits on every request path and has no backup is an FM1 risk.

The Absent Principle

Fault tolerance demands you assume failure will happen and design to detect, contain, and recover from it.

The Dyn outage violated this at the organizational level. Each company had fault tolerance inside their own infrastructure. None had fault tolerance at the DNS layer. Fault tolerance is not transitive. Your vendor's redundancy does not make your system fault-tolerant. It makes your vendor fault-tolerant. You need your own redundancy at every critical dependency.
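The arithmetic behind "not transitive" can be sketched in a few lines. This is a rough illustration, not a model of the Dyn incident: the availability figures are invented for the example, and the redundancy formula assumes the providers fail independently, which a large attack may not respect.

```python
from math import prod


def serial_availability(availabilities):
    """Availability of a chain where every component must be up."""
    return prod(availabilities)


def redundant_availability(availabilities):
    """Availability of alternatives where any single one is sufficient.

    Assumption: failures are independent of each other.
    """
    return 1 - prod(1 - a for a in availabilities)


# Illustrative numbers only (assumptions, not measurements).
internal_stack = 0.9999   # your own redundant infrastructure
dns_provider = 0.999      # one managed DNS provider

# Every request needs DNS and the stack, so the availabilities multiply.
single_dns = serial_availability([internal_stack, dns_provider])

# Two independent DNS providers in parallel, then in series with the stack.
dual_dns = serial_availability(
    [internal_stack, redundant_availability([dns_provider, dns_provider])]
)

print(f"one DNS provider:  {single_dns:.6f}")
print(f"two DNS providers: {dual_dns:.6f}")
```

With a single provider, the system can never be more available than that provider, however redundant the internal stack is.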

Three Prevention Patterns

Pattern 1: Multi-Provider DNS

Configure two or more DNS providers as authoritative for your domain. If Provider A goes down, Provider B continues to resolve. Automated zone transfers, or synchronization through each provider's API, keep the records identical across providers. After the Dyn outage, a number of affected companies added a second DNS provider.
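Two checks follow from this pattern: does the domain's nameserver set span more than one provider, and have the providers' records drifted apart? The sketch below shows both under stated assumptions. The nameserver names and records are hypothetical, the input is supplied by hand rather than queried from DNS, and grouping by the last two hostname labels is a simplification (real code would consult the Public Suffix List).

```python
from collections import defaultdict


def provider_of(nameserver):
    """Crude provider key: the last two labels of the nameserver hostname.

    Assumption: good enough for names like ns1.provider-a.example; it
    misgroups names under multi-label suffixes such as .co.uk.
    """
    labels = nameserver.rstrip(".").lower().split(".")
    return ".".join(labels[-2:])


def group_by_provider(nameservers):
    """Map each provider key to the nameservers that belong to it."""
    groups = defaultdict(list)
    for ns in nameservers:
        groups[provider_of(ns)].append(ns)
    return dict(groups)


def is_single_provider(nameservers):
    """True when every nameserver belongs to the same provider."""
    return len(group_by_provider(nameservers)) < 2


def record_drift(zone_a, zone_b):
    """Records present at one provider but missing at the other.

    Each zone is a set of (name, type, value) tuples.
    """
    return {"only_in_a": zone_a - zone_b, "only_in_b": zone_b - zone_a}


# Hypothetical data for illustration.
before = ["ns1.provider-a.example.", "ns2.provider-a.example."]
after = before + ["ns1.provider-b.example."]

zone_a = {("www", "A", "192.0.2.10")}
zone_b = {("www", "A", "192.0.2.10"), ("api", "A", "192.0.2.20")}

print(is_single_provider(before))    # True
print(is_single_provider(after))     # False
print(record_drift(zone_a, zone_b))
```

A non-empty drift result is the synchronization cost of this pattern made visible: the second provider only helps if it answers with the same records.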

Pattern 2: Health-Check Based Failover

Deploy external health checks that monitor every critical dependency from outside your network. When a dependency fails, trigger automatic failover. Monitoring only from inside your network can miss what users experience: during the Dyn outage, the servers looked healthy from within while users could not reach them.
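The failover decision itself can be small. The following is a minimal sketch, assuming a probe function that runs from an external vantage point and returns True when a target is healthy. The class name, the target names, and the threshold of consecutive failures are all illustrative choices, not a prescribed design.

```python
from typing import Callable, List


class FailoverController:
    """Switch to the next healthy target after N consecutive failed probes."""

    def __init__(self, targets: List[str], probe: Callable[[str], bool],
                 failure_threshold: int = 3):
        if not targets:
            raise ValueError("at least one target is required")
        self.targets = targets
        self.probe = probe
        self.failure_threshold = failure_threshold
        self.active_index = 0
        self.consecutive_failures = 0

    @property
    def active(self) -> str:
        return self.targets[self.active_index]

    def tick(self) -> str:
        """Run one health check and return the target that is now active."""
        if self.probe(self.active):
            self.consecutive_failures = 0
            return self.active

        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            # Try the other targets in order; switch to the first healthy one.
            for offset in range(1, len(self.targets)):
                index = (self.active_index + offset) % len(self.targets)
                if self.probe(self.targets[index]):
                    self.active_index = index
                    self.consecutive_failures = 0
                    break
        return self.active


# Usage with a fake probe; a real probe would query the dependency
# from outside your own network.
health = {"primary": True, "secondary": True}
controller = FailoverController(["primary", "secondary"],
                                probe=lambda t: health[t],
                                failure_threshold=2)
print(controller.tick())   # primary
health["primary"] = False
print(controller.tick())   # primary (one failure, below threshold)
print(controller.tick())   # secondary (threshold reached)
```

Requiring several consecutive failures trades a slower reaction for fewer false failovers. If no alternative target is healthy, the controller stays where it is and tries again on the next tick.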

Pattern 3: Architect for Degraded Mode

Design the system to degrade gracefully rather than fail completely. For DNS: implement client-side DNS caching with extended TTLs. If resolution fails, the client uses a cached IP. Stale access is better than no access.
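A minimal sketch of "stale access is better than no access" follows. It assumes the lookup function is injected and raises OSError on failure (Python's socket.gaierror is a subclass of OSError). The class name, the TTL, and the maximum staleness window are illustrative assumptions.

```python
import time
from typing import Callable, Dict, Tuple


class StaleTolerantResolver:
    """Cache lookups; when resolution fails, fall back to a stale answer."""

    def __init__(self, lookup: Callable[[str], str],
                 ttl_seconds: float = 300.0,
                 max_stale_seconds: float = 86400.0,
                 clock: Callable[[], float] = time.monotonic):
        self._lookup = lookup
        self._ttl = ttl_seconds
        self._max_stale = max_stale_seconds
        self._clock = clock
        self._cache: Dict[str, Tuple[str, float]] = {}  # host -> (ip, stored_at)

    def resolve(self, host: str) -> str:
        now = self._clock()
        cached = self._cache.get(host)

        # Fresh entry: answer from the cache without a lookup.
        if cached is not None and now - cached[1] < self._ttl:
            return cached[0]

        try:
            ip = self._lookup(host)
        except OSError:
            # Degraded mode: serve the expired entry if it is not too old.
            if cached is not None and now - cached[1] < self._max_stale:
                return cached[0]
            raise

        self._cache[host] = (ip, now)
        return ip


# Usage with a fake lookup and a fake clock (both hypothetical).
now = [0.0]
upstream_ok = [True]


def fake_lookup(host):
    if not upstream_ok[0]:
        raise OSError("resolver unavailable")
    return "192.0.2.1"


resolver = StaleTolerantResolver(fake_lookup, ttl_seconds=60,
                                 max_stale_seconds=3600, clock=lambda: now[0])
print(resolver.resolve("example.com"))   # fresh lookup
upstream_ok[0] = False
now[0] = 120.0
print(resolver.resolve("example.com"))   # expired, but served stale
```

The staleness cap matters: an address cached for too long may point at a server that has since moved, so the sketch gives up and raises once the entry passes that limit.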

For every component on the critical path, answer: what happens if this component is completely unavailable for one hour? If the answer is "total outage," you have found an FM1 risk.
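That question can be turned into a simple audit. The sketch below assumes a hand-maintained inventory mapping each critical-path component to its alternatives. The component names are hypothetical, and an empty list stands for "no alternative".

```python
from typing import Dict, List


def find_fm1_risks(critical_path: Dict[str, List[str]]) -> List[str]:
    """Return the components that have no alternative listed."""
    return sorted(name for name, alternatives in critical_path.items()
                  if not alternatives)


# Hypothetical inventory: component -> alternatives that can take over.
critical_path = {
    "dns_provider": [],
    "load_balancer": ["standby_load_balancer"],
    "payment_processor": [],
    "cloud_region": ["secondary_region"],
}

for component in find_fm1_risks(critical_path):
    print(f"FM1 risk: {component} has no alternative")
```

The audit is only as good as the inventory. It flags the dependencies you listed without a backup, not the ones you forgot to list.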

Platforms and Blast Radius

Platforms that other systems depend on become single points of failure by definition. Dyn was a platform. AWS is a platform. Stripe is a platform. Every consumer of the platform inherits the platform's failure modes. The more successful a platform becomes, the larger the blast radius when it fails.

Platform builders must design for the failure modes they create in others. Platform consumers must design for the failure modes they inherit.

The Dyn outage taught one lesson above all others. Your architecture diagram shows the systems you built. It does not show the systems you depend on. FM1 lives in the dependencies you forgot to draw.

Concept: FM1 (Single Point of Failure)

Tradeoff: AT5 — one DNS provider is simpler to manage; multiple providers eliminate the single point of failure at the cost of synchronization complexity

Failure Mode: FM1 — one component fails, the entire system goes offline with no degraded mode, regardless of internal redundancy

Signal: Any component that sits on every request path and has no alternative

Series: Book 3, Ch 2