The Computing Series

Why This Framework Exists

On 28 February 2017, Amazon S3 in the US-EAST-1 region became unavailable. The cause was a single command, executed by an engineer debugging a billing system, that removed more servers from the subsystem managing S3 metadata than intended. The metadata subsystem could not recover with so few servers. S3 went down. Because so many services depended on S3 — not just for storage, but for configuration loading at startup — dozens of other AWS services and thousands of products that relied on them went down too. Some products could not restart because they could not load their configuration files from S3.

Four failure modes are visible in this one incident. A Single Point of Failure: one command could take out the entire metadata subsystem. Cascading Failures: S3’s unavailability propagated through every dependent service. A failure of Fault Tolerance design: services that could not start without S3 had a hard dependency that should have been softened. And Observability Blindness: in the early minutes, the scope of the failure was not yet understood.
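Softening a hard dependency is a concrete design move, not just a slogan. One minimal sketch of the idea, in Python: prefer the remote store, but keep a last-known-good local cache so startup can survive an outage. All names here (`load_remote_config`, `CONFIG_CACHE_PATH`) are illustrative assumptions, not anything from the incident report.

```python
import json
import os

# Hypothetical path for the last-known-good config cache.
CONFIG_CACHE_PATH = "/tmp/config.cache.json"

def load_remote_config():
    """Stand-in for a call to a remote store such as S3.
    Here it always fails, to simulate the outage."""
    raise ConnectionError("remote store unavailable")

def load_config():
    """Soft dependency: try the remote store first, but fall back to
    the cached copy if the store is unreachable. Only when there is
    no cache at all does the dependency become a hard failure."""
    try:
        config = load_remote_config()
        with open(CONFIG_CACHE_PATH, "w") as f:
            json.dump(config, f)  # refresh the cache on success
        return config
    except (ConnectionError, TimeoutError):
        if os.path.exists(CONFIG_CACHE_PATH):
            with open(CONFIG_CACHE_PATH) as f:
                return json.load(f)  # degrade gracefully
        raise  # no cache available: genuinely cannot start
```

A service built this way still degrades when the store is down (its config may be stale), but it restarts, which is exactly the property the products in the outage lacked.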

The twelve failure modes are not a historical curiosity. They are a pre-mortem checklist. Before any system goes to production, run this list against every component. The failure modes you do not name before launch will name themselves in an incident.

Every failure mode corresponds to a principle in F2 that was absent. When you find a failure mode in a post-mortem, look up the matching principle in F2 to find its prevention.
