The Computing Series

Where It Fails

FM1 — Single Point of Failure

Every single-machine architecture is a single point of failure. At small scale this is acceptable: the machine rarely fails, and when it does, the cost is low. At large scale, failures stop being rare events and become a regular occurrence that the system must survive. The failure mode is not “the machine crashed”; it is “we designed as if the machine would never crash, and now it has.”
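The shift from "rare" to "regular" can be made concrete with a little arithmetic. The sketch below assumes each machine fails independently with the same daily probability. The figure of 0.1% per day and the function names are illustrative assumptions, not numbers from the text.

```python
def prob_at_least_one_failure(p_single: float, machines: int) -> float:
    """Probability that at least one of `machines` fails in a day,
    assuming independent failures with per-machine probability p_single."""
    return 1.0 - (1.0 - p_single) ** machines


def expected_failures(p_single: float, machines: int) -> float:
    """Expected number of machine failures per day under the same assumption."""
    return p_single * machines


# Hypothetical per-machine daily failure probability (an assumption).
P_DAILY = 0.001

for n in (1, 100, 10_000):
    print(
        f"{n:>6} machines: "
        f"P(any failure today) = {prob_at_least_one_failure(P_DAILY, n):.4f}, "
        f"expected failures/day = {expected_failures(P_DAILY, n):.1f}"
    )
```

Under these assumptions a single machine fails on roughly one day in a thousand, while a fleet of 10,000 should expect about ten failures every day. The independence assumption is optimistic, since real failures are often correlated (shared racks, power, deployments).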

FM3 — Unbounded Resource Consumption

Systems that work at 1,000 requests/second often hide memory leaks, connection pool limits, and unbounded queues that only manifest at 100,000 requests/second. The failure mode: the system appears to work at low scale, but at high scale it consumes resources until it crashes. The signal: latency rises without an obvious cause, memory grows monotonically, or the system becomes unresponsive under load spikes.
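One common mitigation for the unbounded-queue case is to give the queue a hard capacity and reject work once it is full, so that overload shows up as explicit rejections rather than as unbounded memory growth. The following is a minimal sketch of that idea using Python's standard `queue` module. The class name `BoundedInbox` and its methods are hypothetical, chosen only for illustration.

```python
import queue


class BoundedInbox:
    """A request queue with a fixed capacity that sheds load when full.

    Hypothetical helper for illustration; not taken from the text.
    """

    def __init__(self, capacity: int) -> None:
        # maxsize > 0 makes queue.Queue refuse items beyond the capacity.
        self._q: queue.Queue = queue.Queue(maxsize=capacity)
        self.rejected = 0  # count of shed requests, useful as a metric

    def offer(self, item) -> bool:
        """Try to enqueue without blocking. Return False if the queue is full."""
        try:
            self._q.put_nowait(item)
            return True
        except queue.Full:
            self.rejected += 1
            return False

    def depth(self) -> int:
        """Approximate number of queued items."""
        return self._q.qsize()


inbox = BoundedInbox(capacity=3)
results = [inbox.offer(i) for i in range(5)]
print(results, inbox.depth(), inbox.rejected)
```

With a capacity of 3, the first three offers succeed and the last two are rejected. Memory use is then bounded by the capacity, and the `rejected` counter gives an explicit overload signal instead of a slow climb toward a crash.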

