The Computing Series

Failure Modes in This System

FM1 — Single Point of Failure: Any component with no redundancy. The mitigation is active/passive or active/active redundancy. Identifying SPOFs requires listing every component that, if it fails, takes the system down.
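
The listing exercise described above can be sketched as a simple inventory check. This is a minimal illustration, not a complete audit tool: the `Component` record, its `replicas` and `critical` fields, and the sample inventory are all hypothetical names introduced here, and it assumes you already know which components the system cannot run without.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    replicas: int   # independent instances able to take over the work
    critical: bool  # True if the system goes down when this component fails


def find_spofs(components):
    """Return the names of critical components that have no redundancy."""
    return [c.name for c in components if c.critical and c.replicas < 2]


# Hypothetical inventory, for illustration only.
inventory = [
    Component("load_balancer", replicas=2, critical=True),
    Component("primary_db", replicas=1, critical=True),
    Component("report_worker", replicas=1, critical=False),
]

spofs = find_spofs(inventory)
```

Here only `primary_db` is flagged: it is critical and has a single instance. `report_worker` also has one instance, but its failure does not take the system down, so it falls outside the definition above.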

FM11 — Observability Blindness: A system that cannot be observed cannot be debugged in production. Metrics, logs, and traces must be designed in from the start, not bolted on after the first outage. Signal: the team cannot answer “how many requests failed in the last five minutes” without a code deployment.
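
One way to make the five-minute question answerable at runtime is to keep a sliding window of failure timestamps inside the service. The sketch below is an assumption-laden illustration: `FailureWindow` and its methods are hypothetical names, it keeps state in memory for a single process, and a production system would more likely export counters to a metrics backend.

```python
import time
from collections import deque


class FailureWindow:
    """Counts failures recorded within the last `window_seconds` seconds."""

    def __init__(self, window_seconds=300.0, clock=time.monotonic):
        self._window = window_seconds
        self._clock = clock        # injectable so the window can be tested
        self._failures = deque()   # timestamps, oldest first

    def record_failure(self):
        """Call this wherever a request is known to have failed."""
        self._failures.append(self._clock())

    def count(self):
        """Drop timestamps older than the window, then return what is left."""
        cutoff = self._clock() - self._window
        while self._failures and self._failures[0] <= cutoff:
            self._failures.popleft()
        return len(self._failures)
```

Exposing `count()` through an existing status endpoint or log line lets the team read the number without shipping new code, which is the signal the paragraph above asks for.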

FM2 — Cascading Failures: A failure in one component propagates to others because dependencies are not isolated. Circuit breakers and bulkheads are the mitigations. Any synchronous call between services without a timeout or circuit breaker is a cascade risk.
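
A circuit breaker can be sketched in a few lines. The version below is a simplified illustration, not a reference implementation: `CircuitBreaker`, `CircuitOpenError`, and the parameter names are hypothetical, it is not thread-safe, and it assumes the wrapped function enforces its own timeout and raises an exception when that timeout expires.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_after = reset_after              # seconds to wait before a trial call
        self._clock = clock
        self._failures = 0
        self._opened_at = None                      # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                # Fail fast instead of waiting on an unhealthy dependency.
                raise CircuitOpenError("dependency marked unhealthy")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()     # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Once the breaker is open, callers get an immediate `CircuitOpenError` rather than blocking on the failing service, so the failure stays contained in one dependency instead of tying up resources in every caller upstream.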
