Full Worked Example

System: A fictional e-commerce checkout system. You joined the company yesterday, and a performance incident is in progress.

Step 1 — Archetype: Checkout is Marketplace & Transaction (A3). The defining concern is correctness — no double charges, no lost orders.

Step 2 — Review Questions:

- Q2 (Failure): P99 response time for the checkout API is 4.2 seconds. Normal is 300 ms. The alert fired when P99 exceeded 2 seconds.
- Q3 (State): The checkout service writes to three dependencies: the Orders DB (records the order), the Inventory DB (decrements stock), and the Payment Service (processes the charge). All three calls are synchronous and sit in the critical path.
- Q4 (Latency): The Payment Service call accounts for 3.8 seconds of the 4.2-second P99. The Payment Service vendor posted an incident on its status page 45 minutes ago.
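The Q2 and Q4 figures can be turned into a quick attribution. The sketch below is only back-of-the-envelope arithmetic on the numbers quoted above. The function name `latency_breakdown` is made up for illustration, and because percentiles are not strictly additive, the split should be read as a rough estimate rather than an exact decomposition.

```python
# Figures from Q2 and Q4 of the review; all values in seconds.
P99_TOTAL = 4.2     # observed checkout P99 during the incident
P99_PAYMENT = 3.8   # portion attributed to the Payment Service call
NORMAL_P99 = 0.3    # normal checkout P99


def latency_breakdown(total: float, payment: float) -> dict:
    """Split the observed latency into payment and non-payment parts."""
    other = total - payment
    return {
        "other_seconds": round(other, 3),
        "payment_share": round(payment / total, 3),
    }


breakdown = latency_breakdown(P99_TOTAL, P99_PAYMENT)
print(breakdown)
print(f"slowdown vs normal: {P99_TOTAL / NORMAL_P99:.0f}x")
```

On these numbers, roughly 0.4 seconds remain for everything except the payment call, which is close to the normal 300 ms. That supports reading the incident as a payment-provider problem rather than a database problem.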

Step 3 — Tradeoff: The checkout service makes a synchronous call to an external payment provider (Synchronous vs Asynchronous, AT10). The latency of the provider directly adds to the latency of every checkout request.

Step 4 — Failure Mode: FM5 (Latency Amplification) — the external service latency is being directly propagated to the user. Secondary risk: FM2 (Cascading Failure) — if the checkout service’s thread pool fills waiting for the payment provider, other operations that use the checkout service will also degrade.
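The cascade risk can be made concrete with Little's Law (average requests in flight = arrival rate × time in system). The sketch below uses an arrival rate and a pool size that are pure assumptions, since the example does not state them. It also treats the P99 figures as if every request took that long, so the incident number is a pessimistic bound rather than a prediction.

```python
def threads_in_use(arrival_rate_per_s: float, latency_s: float) -> float:
    """Little's Law: average requests in flight = arrival rate x latency."""
    return arrival_rate_per_s * latency_s


ARRIVAL_RATE = 50.0  # ASSUMPTION: checkouts per second, not from the example
POOL_SIZE = 100      # ASSUMPTION: worker threads in the checkout service

normal = threads_in_use(ARRIVAL_RATE, 0.3)    # latency at the normal 300 ms
incident = threads_in_use(ARRIVAL_RATE, 4.2)  # latency at the incident P99

print(f"normal:   {normal:.0f} threads busy of {POOL_SIZE}")
print(f"incident: {incident:.0f} threads busy of {POOL_SIZE}")
print("pool exhausted" if incident > POOL_SIZE else "pool has headroom")
```

Under these assumed numbers the pool is comfortably sized at normal latency and oversubscribed during the incident, which is the mechanism behind the FM2 risk.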

Immediate action: Set a timeout on the payment provider call equal to the P99 checkout SLO minus the combined latency of all other operations in the request. If the vendor exceeds this timeout, return an error to the user promptly rather than waiting. Bounding the wait caps how long each request can hold a thread, which limits the cascade risk identified in Step 4.
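A minimal sketch of that timeout budget, using `asyncio.wait_for` from the Python standard library. The example does not state the SLO, so the 2-second alert threshold is used here as a stand-in; that value, along with the names `charge`, `PaymentTimeout`, and `slow_provider`, is an assumption for illustration only.

```python
import asyncio

CHECKOUT_SLO_S = 2.0   # ASSUMPTION: alert threshold used as a stand-in for the SLO
OTHER_OPS_S = 0.4      # 4.2 s total minus 3.8 s payment, from the review questions
PAYMENT_TIMEOUT_S = CHECKOUT_SLO_S - OTHER_OPS_S  # budget left for the provider


class PaymentTimeout(Exception):
    """Raised when the provider does not answer within the budget."""


async def charge(provider_call, timeout_s: float = PAYMENT_TIMEOUT_S):
    """Await the provider call, failing fast once the budget is spent."""
    try:
        return await asyncio.wait_for(provider_call(), timeout=timeout_s)
    except asyncio.TimeoutError:
        raise PaymentTimeout(f"payment provider exceeded {timeout_s:.2f}s") from None


async def slow_provider():
    # Hypothetical stand-in for the vendor during the incident.
    await asyncio.sleep(0.2)
    return "charged"


async def demo() -> str:
    try:
        # Tiny timeout so the demo finishes quickly.
        return await charge(slow_provider, timeout_s=0.05)
    except PaymentTimeout as exc:
        return f"error: {exc}"


result = asyncio.run(demo())
print(result)
```

A timeout says nothing about whether the charge went through on the provider side, which matters given the correctness concern named in Step 1.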

Longer-term fix: Add a circuit breaker on the payment provider call. When the provider is slow, the circuit opens, and checkout requests are immediately returned with a degraded state (e.g., “payment is being processed, you will be notified”) rather than waiting.
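The circuit breaker can be sketched in a few lines. This is a simplified illustration, not a production implementation: the class `CircuitBreaker`, its parameters `failure_threshold` and `reset_after_s`, and the helper functions are all hypothetical. It assumes a slow provider call has already been turned into an exception by a timeout, so "slow" and "failed" are counted the same way.

```python
import time


class CircuitBreaker:
    """Opens after consecutive failures; allows one trial call after a cooldown."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                return fallback()  # open: skip the provider entirely
            # Cooldown elapsed: half-open, let this call through as a trial.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            return fallback()
        # Success: close the circuit and reset the failure count.
        self.failures = 0
        self.opened_at = None
        return result


provider_calls = {"n": 0}


def timed_out_provider():
    # Hypothetical provider call that always times out during the incident.
    provider_calls["n"] += 1
    raise TimeoutError("provider too slow")


def degraded():
    return "payment is being processed, you will be notified"


breaker = CircuitBreaker(failure_threshold=3, reset_after_s=30.0)
responses = [breaker.call(timed_out_provider, degraded) for _ in range(5)]
print(provider_calls["n"], responses[-1])
```

In this run the provider is contacted only for the first three requests; once the circuit opens, later requests receive the degraded response without waiting on the vendor.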

Concept: Reading a System You Did Not Build

Thread: T12 ← Systematic debugging (Book 1, Ch 1) → Incident response (Book 6, Ch 17)

Core Idea: Four steps — identify archetype (F6), apply review questions (F5), name tradeoffs (F4), map failure modes (F3) — form a portable method for quickly understanding any unfamiliar system under any level of time pressure. The method works in interviews, architecture reviews, incident response, and system onboarding.

Tradeoff: Latency vs Thoroughness — the four steps can be done in 5 minutes (incident response) or 3 days (onboarding); the depth of each step scales with the available time

Failure Mode: FM11 (Observability Blindness) — if Step 2 (Q7) reveals missing observability, the remaining steps cannot be completed accurately; fix the observability first

Signal: 2am incident on an unfamiliar system; first week at a new company; system design interview; pre-deployment architecture review

Maps to: Reference Book Ch 5–13 (all framework chapters are the source material for this method); Book 4 Ch 1 (system design methodology); Book 6 Ch 2 and Ch 5 (leadership applications)
