What it is: A component whose failure brings down the entire system or a critical path within it. If removing one component from the architecture produces a total failure, that component is a SPOF.
How it forms: Usually from convenience rather than malice. A single DNS server, a single load balancer, a single database node — each was likely chosen because running one is simpler than running three. The cost of that simplicity is paid in downtime.
How to detect it: Draw the architecture diagram. Pick any component. Ask: if this component fails right now, what else stops working? If the answer is “the critical path,” it is a SPOF.
Prevention: Redundancy (F2 P9 and mental model MM9). The minimum is N+1: one extra instance of every component on the critical path. The correct level of redundancy is determined by the availability requirement in the SLO.
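The N+1 rule can be grounded in arithmetic. Assuming instances fail independently, the combined availability of n replicas is 1 − (1 − a)^n; the sketch below (illustrative names, not from any library) computes the smallest replica count that meets an SLO target:

```python
# Sketch: smallest number of independent replicas needed to meet an
# availability SLO, assuming each instance fails independently with
# availability `a`. The independence assumption is optimistic in practice
# (correlated failures reduce the benefit of redundancy).

def replicas_for_slo(instance_availability: float, slo_target: float) -> int:
    """Smallest n such that 1 - (1 - a)^n >= slo_target."""
    n = 1
    combined = instance_availability
    while combined < slo_target:
        n += 1
        combined = 1 - (1 - instance_availability) ** n
    return n

# A single 99% instance cannot meet a 99.9% SLO; two can (if independent):
print(replicas_for_slo(0.99, 0.999))  # 2
```

The independence assumption is the weak point: replicas sharing a rack, a region, or a deploy pipeline fail together, which is why redundancy is paired with failure-domain isolation.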
Interaction with other failure modes: FM1 is almost always the trigger for FM2. The SPOF fails, and its failure propagates to everything that depends on it.
What it is: One component failure produces load or latency increases in dependent components, which then fail, producing further increases in their own dependents. The failure propagates in a wave.
How it forms: Tight dependencies without isolation. Service A depends on Service B. Service B becomes slow. Service A’s thread pool fills waiting for B to respond. Service A becomes slow. Service C depends on A. Service C fills. Total outage from one slow service.
How to detect it: Trace every dependency chain. For each component, ask: if its latency increased to infinity, what would happen to its callers? If the callers would fail, there is a cascade risk.
Prevention: Circuit breakers (open the circuit when error rate exceeds threshold), bulkheads (isolate failure domains so one failure cannot consume all shared resources), timeouts (bound the time any call waits for a dependency).
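A circuit breaker can be sketched in a few lines. This illustrative version trips on consecutive failures rather than an error rate (simpler, same principle): while open, calls fail fast instead of queueing behind a degraded dependency; after a cooldown, one trial call is allowed through.

```python
# Minimal circuit-breaker sketch (illustrative, not a production library).
# Opens after `failure_threshold` consecutive failures; while open, calls
# fail fast; after `reset_after` seconds one trial call is let through.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_after=30.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            # Cooldown elapsed: half-open, allow one trial request through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0      # any success closes the circuit
        self.opened_at = None
        return result
```

Failing fast is the point: the caller's threads are released immediately instead of accumulating while they wait on a dependency that is already known to be unhealthy.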
Interaction with other failure modes: Cascading failures frequently compound with FM7 (Thundering Herd) — the cascade causes timeouts, which cause retries, which amplify the load on the recovering system.
What it is: A component or process consumes memory, file descriptors, database connections, or threads without a configured limit, eventually exhausting the resource and crashing the host or the process.
How it forms: Missing limits. A job that opens one database connection per request and never closes them. A cache that evicts nothing and grows until the process runs out of heap. An event queue that has no maximum depth and backs up until it fills memory.
How to detect it: Check that every resource pool (connection pools, thread pools, memory allocators, queue depths) has a configured maximum. Audit for any resource that can grow in proportion to request volume without a bound.
Prevention: Explicit bounds on every resource pool. Thread pool max size. Connection pool max size. Cache max entries or max bytes with an eviction policy. Queue depth limit with backpressure or drop-with-alerting.
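The queue-depth case can be sketched with Python's standard library: `queue.Queue(maxsize=...)` refuses to grow past its bound, turning a slow consumer into an explicit, alertable signal rather than unbounded memory growth. The bound and function name here are illustrative.

```python
# Sketch: a bounded queue with drop-with-alerting on overflow instead of
# unbounded growth. `put_nowait` raises queue.Full once the bound is hit.
import queue

events = queue.Queue(maxsize=1000)  # illustrative bound

def enqueue_or_shed(event) -> bool:
    """Try to enqueue; on a full queue, shed the event and report it
    (the caller should increment a 'shed' metric and alert on it)."""
    try:
        events.put_nowait(event)
        return True
    except queue.Full:
        return False  # bounded: the failure is visible, not a slow OOM
```

The alternative, blocking `put`, gives backpressure instead of shedding; which to choose depends on whether upstream can tolerate waiting. Either way, the resource has a bound and the overflow has a defined behaviour.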
Interaction with other failure modes: FM3 is often the mechanism behind FM2. The cascading failure propagates because each component in the chain runs out of connections or threads waiting for a slow dependency.
What it is: Two or more components disagree on the state of the world. The database says the inventory count is 0. The cache says it is 5. A customer buys the item. The database was right.
How it forms: Any system with replicated or cached state can have consistency failures. The cache was not invalidated when the database was updated. Two replicas processed writes in different orders. The read was routed to a follower with replication lag during a critical transaction.
How to detect it: Identify every place where the same data exists in more than one location. For each pair: what is the mechanism for keeping them consistent? What is the window of inconsistency? Is that window acceptable for the use case?
Prevention: Explicit consistency model selection per use case. Strong consistency (synchronous replication, quorum reads/writes) for financial data. Eventual consistency with bounded staleness for social content. Never choose a consistency model by default — choose it deliberately.
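The quorum variant of strong consistency reduces to one inequality: with N replicas, read quorum R and write quorum W, every read overlaps the latest write whenever R + W > N. A one-line check makes the trade-off concrete:

```python
# Sketch of the quorum condition for strong consistency in replicated
# storage: any read quorum must intersect any write quorum, which holds
# exactly when R + W > N.

def is_strongly_consistent(n: int, r: int, w: int) -> bool:
    """True if every read quorum intersects every write quorum."""
    return r + w > n

print(is_strongly_consistent(3, 2, 2))  # True: a typical quorum config
print(is_strongly_consistent(3, 1, 1))  # False: fast, but only eventual
```

The R=1, W=1 configuration is the "chosen by default" trap the paragraph above warns against: it is the fastest option, and it silently gives up the read-your-latest-write guarantee.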
Interaction with other failure modes: FM4 and FM9 (Silent Data Corruption) are closely related. Consistency failures produce incorrect data; if there is no alerting or validation, the incorrect data propagates silently, becoming FM9.
What it is: A single request triggers multiple sequential downstream calls, each adding latency. The total response time is the sum of all downstream latencies. At P99, this sum can violate the SLO even when each individual call is within budget.
How it forms: Synchronous call chains. Service A calls B, B calls C, C calls D. The latency of the request to A is at minimum the sum of latencies to B, C, and D. Fan-out patterns are worse: A calls B, C, and D in parallel, but the response waits for the slowest of them.
How to detect it: Distributed tracing (F8 #26) maps the full call chain for any request. The critical path of the trace shows the latency accumulation. Any call chain with more than three synchronous hops is a latency amplification risk at high percentiles.
Prevention: Parallelise independent downstream calls. Cache results of expensive calls where freshness permits. Set per-call timeouts that protect the total budget. Move non-critical downstream calls to async paths.
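The first and third preventions can be sketched together with `asyncio`: independent calls run concurrently (the fan-out costs the slowest call, not the sum), and one deadline bounds the whole fan-out. Service names and latencies are illustrative.

```python
# Sketch: parallel fan-out under a single total budget. Sequential calls
# would cost the *sum* of latencies; gather costs the *max*; wait_for caps
# the whole thing at the request's latency budget.
import asyncio

async def call_service(name: str, latency: float) -> str:
    await asyncio.sleep(latency)  # stand-in for a real downstream RPC
    return f"{name}: ok"

async def handle_request() -> list:
    calls = [call_service("b", 0.01),
             call_service("c", 0.02),
             call_service("d", 0.015)]
    try:
        # One budget for the whole fan-out, not per-call budgets that sum.
        return await asyncio.wait_for(asyncio.gather(*calls), timeout=0.5)
    except asyncio.TimeoutError:
        return []  # degrade explicitly rather than blow the caller's budget

print(asyncio.run(handle_request()))
```

Note the deadline protects the caller's budget, not each call individually: a per-call timeout of 200 ms on a five-hop sequential chain still allows a one-second response.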
Interaction with other failure modes: FM5 is often a precursor to FM2. When latency amplification brings a service close to its timeout, any further latency increase tips it over into failure, which propagates upstream.
What it is: One node in a distributed system receives disproportionately more traffic than others. That node degrades under load while other nodes sit idle. The system appears to have capacity but cannot use it.
How it forms: Poor key distribution in hash-based partitioning. A hash ring with too few nodes and no virtual nodes. A partition key that concentrates writes (all writes for a popular user go to the same shard). A load balancer using IP hash routing where one IP has many clients behind it.
How to detect it: Per-node throughput metrics. If one node is at 80% load and others are at 20%, there is a hotspot. Check partition key cardinality — a low-cardinality key (e.g., country code) caps the number of usable partitions at the number of distinct values, and skew between those values concentrates load further.
Prevention: Virtual nodes in consistent hashing distribute traffic more evenly. High-cardinality partition keys (user ID, order ID). Adding randomness to the partition key for high-volume single entities (e.g., celebrity accounts get a random suffix for fan-out).
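The virtual-node technique can be sketched briefly: each physical node is hashed to many points on the ring, which smooths the key distribution compared with one point per node. This is an illustrative toy, not a production router.

```python
# Sketch: a consistent-hash ring with virtual nodes. With `vnodes` points
# per physical node, each node's share of the keyspace averages out instead
# of depending on a few arbitrary hash positions.
import bisect
import hashlib
from collections import Counter

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes, vnodes=100):
        # Each node appears `vnodes` times on the ring under distinct labels.
        self._ring = sorted((_hash(f"{n}#{i}"), n)
                            for n in nodes for i in range(vnodes))
        self._points = [p for p, _ in self._ring]

    def node_for(self, key: str) -> str:
        # A key belongs to the first ring point clockwise from its hash.
        idx = bisect.bisect(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["node-a", "node-b", "node-c"])
counts = Counter(ring.node_for(f"user-{i}") for i in range(9000))
print(counts)  # roughly even split across the three nodes
```

With `vnodes=1` the same experiment typically shows one node taking a badly disproportionate share — the arc sizes between three random points on a ring vary wildly, which is exactly the hotspot this technique removes.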
Interaction with other failure modes: FM6 leads to FM3 (Unbounded Resource Consumption) on the hot node and potentially FM1 (SPOF) if the hot node is the only one handling a particular partition.
What it is: Many clients simultaneously attempt an operation that they were previously blocked from, overwhelming the system they are targeting.
How it forms: Two common triggers: a popular cache entry expires simultaneously for all clients (each client misses and hits the database); a service recovers from an outage and all clients reconnect simultaneously. Both create a sudden spike that the system was not designed to handle.
How to detect it: Sudden spikes in database QPS or connection count following a cache expiry event or service restart. These spikes are regular and predictable when cache entry expiry times and service restart schedules are known.
Prevention: Jittered TTLs spread cache expiry events across a time window rather than all expiring at the same instant. Mutex or lock on cache miss ensures only one request populates the cache while others wait. Probabilistic early expiry refreshes the cache before it expires.
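Two of these techniques fit in a few lines. Jitter spreads absolute expiry times; probabilistic early refresh (in the spirit of the "XFetch" approach to cache-stampede prevention) makes the chance of renewal rise as the deadline approaches, so one request repopulates the entry while the rest still hit. Constants and names are illustrative.

```python
# Sketch: herd-avoidance for cache expiry. Assumed constants are illustrative.
import math
import random
import time

BASE_TTL = 300.0  # seconds; illustrative

def jittered_ttl(base=BASE_TTL, jitter=0.1):
    """TTL spread uniformly within +/-10% so entries cached at the same
    moment do not all expire at the same instant."""
    return base * random.uniform(1 - jitter, 1 + jitter)

def should_refresh_early(expires_at, recompute_cost=1.0, beta=1.0, now=None):
    """Probabilistic early expiry: -log(rand) is a positive random amount,
    scaled by the recompute cost, so the refresh probability grows as the
    deadline nears. A few requests refresh early; the herd never forms."""
    now = time.time() if now is None else now
    return now - recompute_cost * beta * math.log(random.random()) >= expires_at
```

The third technique from the paragraph above — a mutex on cache miss — trades this probabilistic smoothing for a hard guarantee that exactly one request recomputes, at the cost of making the others wait.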
Interaction with other failure modes: FM7 amplifies FM2 (Cascading Failures). The thundering herd adds load to a recovering system; the load causes the system to re-degrade; it fails again; the herd forms again.
What it is: One side of a system boundary changes its interface, data format, or semantic contract without coordinating with the other side. The receiving side breaks.
How it forms: A service removes a field from its API response. A Kafka topic changes its Avro schema. A database table renames a column. In each case, the producer changed the contract without ensuring all consumers were updated first.
How to detect it: Consumer-driven contract testing catches this before deployment. Schema registries (for Kafka, for API schemas) prevent incompatible changes from reaching production.
Prevention: Backward-compatible changes only. Add fields, never remove them. Add enum values, never remove them. Never change field semantics. Hyrum’s Law (F9 #16) means every observable behaviour is a contract; treat it as such.
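The add-never-remove rule is mechanically checkable. As a sketch, model a response schema as a dict of field name to type name (a real system would delegate this to a schema registry): a change is backward compatible for consumers if every old field survives with the same type, while new fields are harmless.

```python
# Sketch: a minimal backward-compatibility check between two response
# schemas, modelled as field-name -> type-name dicts. Illustrative only;
# real registries also handle defaults, nesting, and promotion rules.

def is_backward_compatible(old: dict, new: dict) -> bool:
    """Every field a consumer may read must survive, unchanged in type."""
    return all(field in new and new[field] == t for field, t in old.items())

v1 = {"id": "long", "email": "string"}
v2 = {"id": "long", "email": "string", "created_at": "timestamp"}  # adds
v3 = {"id": "long"}                                                # removes

print(is_backward_compatible(v1, v2))  # True: additive change
print(is_backward_compatible(v1, v3))  # False: a consumer of email breaks
```

Run in CI against the schema currently in production, a check like this turns a contract violation from a production incident into a failed build.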
Interaction with other failure modes: FM8 can produce FM9 (Silent Data Corruption) — if the contract violation causes fields to be silently misinterpreted rather than causing an explicit error, incorrect data propagates without alerting.
What it is: Incorrect data is written to, or propagated through, a system without triggering any alerts or errors. The corruption is only discovered later — often much later — when the incorrect data produces visible consequences.
How it forms: Missing validation at write time. Missing checksums at rest or in transit. Non-idempotent operations retried (F2 P5 absent). Type coercions that silently truncate values (a 64-bit ID stored in a 32-bit field). Double-processing of an event due to at-least-once delivery without deduplication.
How to detect it: End-to-end reconciliation between source of truth and derived systems. Checksums on stored data. Anomaly detection on data distributions. Regular data audits comparing expected to actual values.
Prevention: Idempotency keys for all write operations. Input validation at every system boundary (not just at the user-facing edge). Checksums on stored data. Reconciliation jobs that compare derived state to source of truth.
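The idempotency-key mechanism can be sketched directly: the first call with a given key performs the write and records its result; a retry with the same key returns the recorded result instead of writing twice. The in-memory dict stands in for what would be the database itself or a shared cache in production.

```python
# Sketch: idempotency keys defeating double-processing under at-least-once
# delivery. The store and function names are illustrative.

_processed = {}  # idempotency_key -> recorded result (illustrative store)

def apply_payment(idempotency_key: str, amount: int, balance: list) -> int:
    if idempotency_key in _processed:
        return _processed[idempotency_key]  # retry: replay, don't re-apply
    balance[0] += amount                    # the non-idempotent write itself
    _processed[idempotency_key] = balance[0]
    return balance[0]

account = [100]
apply_payment("req-1", 50, account)
apply_payment("req-1", 50, account)  # at-least-once retry of the same request
print(account[0])  # 150, not 200
```

Without the key, the retry is indistinguishable from a new request, and the corruption is exactly the silent kind described above: no error, no alert, just a balance that is wrong.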
Interaction with other failure modes: FM9 is the most dangerous failure mode because it does not produce an alert. FM4 (Data Consistency Failure) is visible — systems disagree. FM9 is invisible — systems agree on the wrong value.
What it is: Unauthorised access to data, compute, or credentials. Data exfiltration, privilege escalation, account takeover, and supply chain compromise are all instances of FM10.
How it forms: Absent or misconfigured Security Boundaries (F2 P12). Least Privilege violations (F2 P14). Unpatched vulnerabilities. Social engineering. Credential theft from code repositories, environment variables, or log output.
How to detect it: Anomaly detection on access patterns. Monitoring for access to data from unusual sources, unusual times, unusual volumes. Audit logs of all privileged operations.
Prevention: Security Boundaries at every service-to-service call (Zero Trust). Least Privilege for all credentials. Secrets management (not environment variables or source code). Regular dependency audits for known vulnerabilities. OWASP Top 10 as a design checklist.
Interaction with other failure modes: FM10 often follows FM3 or FM5 — an attacker uses a system’s resource consumption or latency amplification to extract information (timing attacks) or exhaust resources as a distraction while exfiltrating data.
What it is: The system is behaving incorrectly, but the team cannot see what is happening, where the problem is, or why it is occurring.
How it forms: Missing metrics for critical operations. Unstructured log output that cannot be queried efficiently. No distributed tracing, so the latency of multi-service requests cannot be attributed. Alerts that fire too late or not at all.
How to detect it: During any incident, ask: how long did it take to identify the root cause? If the answer is more than fifteen minutes, the system has observability blindness for that failure mode.
Prevention: The three pillars as first-class requirements (F8 #24–26): metrics (what is happening), logs (what happened), traces (why is this slow). The minimum viable stack is one metric system, one structured log aggregator, one tracing system, with alerting on all SLO-impacting conditions.
Interaction with other failure modes: FM11 does not cause other failures directly — it prevents them from being fixed quickly. Any failure mode combined with FM11 becomes a multi-hour incident instead of a fifteen-minute recovery.
What it is: Two nodes in a distributed system each believe they are the authoritative primary. Both accept writes. The writes diverge. When the partition heals, the system has two conflicting histories that must be reconciled — or one must be discarded.
How it forms: A network partition between two nodes in a leader-follower replication setup. Both nodes can still receive client requests. The follower, unable to reach the leader, concludes the leader is dead and promotes itself. The leader, unable to reach the follower, continues serving requests. Two primaries, conflicting writes.
How to detect it: Monitor leadership state across nodes and alert whenever more than one node reports itself as leader. Every write should verify it is being made to the current leader, so that writes from a deposed leader fail loudly rather than diverging silently.
Prevention: Consensus protocols (Raft, Paxos) that require a quorum for leader election — if a partition leaves neither side with a quorum, neither side accepts writes. Fencing tokens — monotonically increasing tokens that allow storage systems to reject writes from deposed leaders. Avoid leader election mechanisms that allow leadership to be claimed without quorum.
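Fencing tokens are simple enough to sketch end to end. The lock service issues a monotonically increasing token with each leadership grant; the storage layer remembers the highest token it has seen and rejects writes carrying an older one, so a partitioned node that still believes it is primary cannot corrupt state. Names here are illustrative.

```python
# Sketch: a store that enforces fencing tokens. A deposed leader still
# holds an old token, so its writes are rejected after a new leader
# (holding a higher token) has written.

class FencedStore:
    def __init__(self):
        self.highest_token = -1
        self.data = {}

    def write(self, token: int, key: str, value) -> bool:
        if token < self.highest_token:
            return False  # stale leader: reject the write
        self.highest_token = token  # raise the fence
        self.data[key] = value
        return True

store = FencedStore()
store.write(1, "x", "from old leader")  # accepted
store.write(2, "x", "from new leader")  # accepted, raises the fence
print(store.write(1, "x", "old leader, partitioned, still writing"))  # False
```

The crucial property is that the check lives in the storage layer, not in the leaders: a node that wrongly believes it is primary cannot be trusted to police itself.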
Interaction with other failure modes: FM12 and FM4 are deeply related. Split-brain is the mechanism; Data Consistency Failure is the consequence. In split-brain, both primaries accept writes. When the partition heals, the data on each side is consistent with itself but inconsistent with the other. This is FM4 (Data Consistency Failure) at its most severe.