The Computing Series

The Four-Step Method

Step 1 — Identify the Archetype (F6)

Before looking at any code or diagram, classify the system by archetype. Use whatever description is available — the product README, the service name, the alert that fired, the team that owned it.

Ask: what does this system do for its users? Match the answer to one of the six archetypes. If it matches two, note the seam between them.

Why first: The archetype tells you which questions are most likely to matter (Q3 for Marketplace systems, Q4 for Social systems), which failure modes are most common (FM6 for Search systems, FM12 for Platform systems), and which infrastructure components you are most likely to find (IC13 for Data Intelligence, IC3 + IC4 for Platform & API).

In the 2:47am case: “The alert is from the checkout flow” → Marketplace & Transaction (A3). The most important question is Q3 (where is the state?) and the most likely failure mode is FM4 (Data Consistency Failure) or FM9 (Silent Data Corruption).
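The archetype-to-priors pairings named above can be sketched as a small lookup table. This is an illustrative structure, not part of the book's taxonomy; only the pairings actually cited in the text are filled in.

```python
# Sketch only: diagnostic priors keyed by archetype, restricted to the
# pairings named in the text. The dictionary shape is an assumption.
ARCHETYPE_PRIORS = {
    "Marketplace & Transaction (A3)": {"key_questions": ["Q3"],
                                       "likely_fms": ["FM4", "FM9"]},
    "Social":            {"key_questions": ["Q4"]},
    "Search":            {"likely_fms": ["FM6"]},
    "Platform":          {"likely_fms": ["FM12"]},
    "Platform & API":    {"likely_components": ["IC3", "IC4"]},
    "Data Intelligence": {"likely_components": ["IC13"]},
}

def priors_for(archetype: str) -> dict:
    """Return the recorded priors for an archetype, or an empty dict."""
    return ARCHETYPE_PRIORS.get(archetype, {})
```

With a table like this, classifying the system is enough to pull up the questions and failure modes worth checking first.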


Step 2 — Apply the Review Questions (F5)

Run the seven review questions in order. You will not have complete answers to all of them. That is fine. The gaps are what you are looking for — they are the places the system was not designed for, and therefore the most likely locations of the current failure.

For each question, write down what is established and what remains unknown:

Q1 (Scale): What is the traffic level? Is this incident load-related? Check the metrics dashboard for traffic spikes.

Q2 (Failure): What is the failure? Which component? Start with the component named in the alert, then trace upstream and downstream.

Q3 (State): What state is involved? What databases, caches, or queues does the critical path touch? Check whether the state is consistent across all stores.

Q4 (Latency): What is the latency profile? Is this a latency spike or an error rate spike? Both look like degradation but have different root causes.

Q5 (Evolution): Was there a recent deploy? Check the deployment log. The large majority of incidents correlate with a recent change.

Q6 (Security): Is there any indication this is a security event? Look for unusual traffic volumes, unusual access patterns, or data being accessed from unexpected sources.

Q7 (Observability): Can you see what is happening? Are there distributed traces for the failing requests? Are the logs structured and queryable? If not, you have FM11 (Observability Blindness) and the incident will take longer to resolve.


Step 3 — Name the Tradeoffs (F4)

Once you have a hypothesis about the failure, identify which architecture tradeoff created the conditions for it.

This step serves two purposes. First, it confirms the hypothesis — if you can name the tradeoff, you understand the decision that was made. Second, it tells you what the fix will require — the fix must respect the tradeoff, or it will create a new problem on the other side.

Examples:

Failure: cache serving stale data four minutes after a write. The tradeoff is Consistency vs Performance (AT1). The cache TTL is set high for performance; the consequence is stale reads. The fix (reduce TTL) makes reads more consistent but increases database load. Name the tradeoff before proposing the fix.
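The cost side of AT1 can be put on the back of an envelope: under a steady-state model where every hot key is re-fetched once per TTL window, database refill load scales inversely with the TTL. The model and the numbers are illustrative assumptions, not figures from the text.

```python
# Back-of-envelope sketch of AT1: shorter TTL means fresher reads
# but proportionally more database refill traffic.
# Idealized model: every hot key misses roughly once per TTL window.
def db_refills_per_second(hot_keys: int, ttl_seconds: float) -> float:
    return hot_keys / ttl_seconds

before = db_refills_per_second(hot_keys=100_000, ttl_seconds=240)  # ~417/s
after  = db_refills_per_second(hot_keys=100_000, ttl_seconds=30)   # ~3,333/s
```

Cutting the TTL from four minutes to thirty seconds bounds staleness at thirty seconds, but (in this model) roughly octuples refill load on the database. That is the other side of the tradeoff, quantified before the fix ships.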

Failure: database connection pool exhausted. The tradeoff is Latency vs Throughput (AT2) and potentially Automation vs Control (AT7). The connection pool size was not configured correctly for the traffic volume. The fix (increase pool size) works until the next scale event — the real fix is autoscaling configuration.
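A first-order sizing for the pool falls out of Little's law (average concurrency = arrival rate x service time). The headroom factor and the sample numbers are assumptions for illustration, not values from the text.

```python
import math

# Sketch: connection pool sizing via Little's law.
# in_flight = peak_qps * avg_query_seconds (average concurrent queries);
# headroom is an assumed safety margin for bursts.
def pool_size(peak_qps: float, avg_query_seconds: float,
              headroom: float = 1.5) -> int:
    in_flight = peak_qps * avg_query_seconds
    return math.ceil(in_flight * headroom)

pool_size(peak_qps=800, avg_query_seconds=0.025)  # 20 in flight -> pool of 30
```

The same formula is what an autoscaling policy has to evaluate continuously: a hand-picked constant is only correct for one value of peak_qps, which is why the real fix named above lives in the scaling configuration rather than the pool size.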


Step 4 — Map the Failure Mode (F3)

Name the failure from the F3 taxonomy. This is the final step because it requires the preceding steps to be accurate.

The value of naming the failure mode is not taxonomic precision — it is speed. Once named, the prevention pattern becomes accessible, which points directly to where the fix lives:

Failure Mode | Prevention Pattern | Where the Fix Lives
FM1 (SPOF) | Redundancy | Add replicas; remove single-node config
FM2 (Cascade) | Circuit breaker, bulkhead | Resilience configuration in service mesh or application
FM3 (Unbounded) | Connection pool limits, queue depth | Resource pool configuration
FM4 (Consistency) | Consistency model selection | Data store configuration; application write logic
FM7 (Thundering Herd) | Jitter, mutex on miss | Cache TTL configuration; cache population logic
FM8 (Contract Violation) | Schema registry, contract tests | Deploy pipeline; schema registry integration
FM9 (Silent Corruption) | Idempotency keys, validation | Application write path
FM11 (Blindness) | Instrument the missing path | Observability configuration
FM12 (Split-Brain) | Consensus protocol (Raft) | Database configuration; leader election
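As one concrete instance, the two FM7 preventions listed in the table can be combined in a cache read path: TTL jitter so expirations do not align, and a per-key lock so only one caller repopulates on a miss. The cache structure and loader interface here are illustrative assumptions, not the book's implementation.

```python
import random
import threading
import time

# Sketch of the FM7 preventions: jittered TTL plus mutex-on-miss.
_cache: dict = {}   # key -> (value, expires_at)
_locks: dict = {}   # key -> per-key lock
_locks_guard = threading.Lock()

def _lock_for(key):
    with _locks_guard:
        return _locks.setdefault(key, threading.Lock())

def get(key, loader, ttl: float = 60.0, jitter: float = 0.1):
    entry = _cache.get(key)
    if entry and entry[1] > time.monotonic():
        return entry[0]                        # fresh hit
    with _lock_for(key):                       # mutex on miss: one loader per key
        entry = _cache.get(key)                # re-check after acquiring the lock
        if entry and entry[1] > time.monotonic():
            return entry[0]                    # another caller already refilled
        value = loader(key)
        lifetime = ttl * (1 + random.uniform(-jitter, jitter))  # spread expiry
        _cache[key] = (value, time.monotonic() + lifetime)
        return value
```

The jitter prevents a fleet of keys populated together from expiring together; the double-checked lock turns a thundering herd of identical loads into one.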
