| ID | Failure Mode | Recall Trigger |
|---|---|---|
| FM1 | Single Point of Failure | One component down = system down |
| FM2 | Cascading Failures | One failure triggers the next |
| FM3 | Unbounded Resource Consumption | Memory / connections / threads exhausted |
| FM4 | Data Consistency Failure | Systems disagree on the state of the world |
| FM5 | Latency Amplification | Small latencies × many hops = large total |
| FM6 | Hotspotting | One node or key takes a disproportionate share of the traffic |
| FM7 | Thundering Herd | Many clients retry simultaneously, overwhelming recovery |
| FM8 | Schema / Contract Violation | One side changes; the other side breaks |
| FM9 | Silent Data Corruption | Incorrect data propagates without alerts |
| FM10 | Security Breach | Unauthorised access |
| FM11 | Observability Blindness | System failing; team cannot see why |
| FM12 | Split-Brain | Two nodes each think they are primary |

Use: In a pre-mortem, run this list against each component. In a post-mortem, name the failure mode first; the prevention follows.
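
The pre-mortem use above amounts to crossing every component with every failure mode and asking one question per pair. A minimal sketch of that, assuming a hypothetical `premortem_checklist` helper and made-up component names (`api-gateway`, `orders-db`); only the failure-mode IDs and names come from the table:

```python
from itertools import product

# Failure-mode IDs and names copied from the table above.
FAILURE_MODES = {
    "FM1": "Single Point of Failure",
    "FM2": "Cascading Failures",
    "FM3": "Unbounded Resource Consumption",
    "FM4": "Data Consistency Failure",
    "FM5": "Latency Amplification",
    "FM6": "Hotspotting",
    "FM7": "Thundering Herd",
    "FM8": "Schema / Contract Violation",
    "FM9": "Silent Data Corruption",
    "FM10": "Security Breach",
    "FM11": "Observability Blindness",
    "FM12": "Split-Brain",
}


def premortem_checklist(components):
    """Pair every component with every failure mode, component by component."""
    return [
        (component, fm_id, f"{fm_id} {name}: how would this show up in {component}?")
        for component, (fm_id, name) in product(components, FAILURE_MODES.items())
    ]


# Hypothetical component names, for illustration only.
components = ["api-gateway", "orders-db"]
checklist = premortem_checklist(components)

# Print the first few prompts as a quick look at the output shape.
for component, fm_id, question in checklist[:3]:
    print(question)
```

Each tuple is one question to answer in the review; a pair with no plausible failure story can be struck off, and the rest become the pre-mortem agenda.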