Three Decision Scenarios

Scenario 1 — Build vs Buy

The engineering team proposes replacing the in-house authentication system with a third-party identity provider. The proposal is technically reasonable. Should you do it?

Mental Model (F1): The Security Boundaries model (MM12) and the Layered Abstraction model (MM10). Authentication is a security boundary — you are delegating the verification of identity to a third party. The abstraction is good if the third party’s interface is stable and their failure modes are tolerable.

Principle (F2): Security Boundaries (P12) and Fault Tolerance (P9). The third-party provider is now a component on your critical path. What is their SLA? What happens when they have a regional outage? Is your system designed to degrade gracefully if they are unavailable?
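One way to make the SLA question concrete is to treat the provider as a serial dependency: if every login must pass through both your system and theirs, and their failures are roughly independent, the availabilities multiply. The sketch below shows that arithmetic. The two availability figures are illustrative assumptions, not numbers from any real provider, and the function names are invented for this example.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def serial_availability(*availabilities: float) -> float:
    """Availability of a chain in which every component must be up.

    Assumes independent failures, which is an approximation.
    """
    result = 1.0
    for availability in availabilities:
        result *= availability
    return result


def downtime_minutes_per_year(availability: float) -> float:
    """Expected minutes per year during which the chain is down."""
    return (1.0 - availability) * MINUTES_PER_YEAR


own_system = 0.9995    # assumed, illustrative
provider_sla = 0.999   # assumed, illustrative

combined = serial_availability(own_system, provider_sla)
print(f"combined login availability: {combined:.4f}")
print(f"expected login downtime: {downtime_minutes_per_year(combined):.0f} min/year")
```

The combined figure is always lower than either input, which is why the provider's SLA sets a ceiling on login availability unless the system can degrade gracefully without it.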

Tradeoff (F4): Automation vs Control (AT7). You are trading control of the authentication implementation for the automation of not maintaining it. The cost of control (maintenance, security updates, scaling) is real. The cost of lost control (vendor lock-in, outage dependency, contract terms) is also real. Name both sides.

Law (F9): Hyrum’s Law (L14). Once you adopt the third-party provider, your system will depend on observable behaviours of their API that are not in the contract — specific error message formats, token structure, response timing. Migration away becomes harder over time, not easier.
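A common way to limit this implicit coupling is to confine every provider-specific detail to a single adapter, so the rest of the codebase sees only your own types. The sketch below is hypothetical: the payload fields (`error`, `sub`), the error codes, and the `AuthFailure` and `AuthResult` types are illustrative assumptions, not any particular vendor's API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class AuthFailure(Enum):
    INVALID_CREDENTIALS = "invalid_credentials"
    ACCOUNT_LOCKED = "account_locked"
    PROVIDER_ERROR = "provider_error"


@dataclass(frozen=True)
class AuthResult:
    ok: bool
    user_id: Optional[str] = None
    failure: Optional[AuthFailure] = None


# Hypothetical provider error codes mapped to our own vocabulary.
_CODE_MAP = {
    "invalid_grant": AuthFailure.INVALID_CREDENTIALS,
    "user_locked": AuthFailure.ACCOUNT_LOCKED,
}


def translate_provider_response(payload: dict) -> AuthResult:
    """Convert a raw provider payload into our own result type.

    Only the error code is read. Message text is deliberately ignored
    so that callers cannot come to depend on its wording.
    """
    if "error" in payload:
        failure = _CODE_MAP.get(payload["error"], AuthFailure.PROVIDER_ERROR)
        return AuthResult(ok=False, failure=failure)
    return AuthResult(ok=True, user_id=str(payload["sub"]))
```

An adapter like this does not remove the dependency, but it keeps the undocumented behaviours in one place, which is the part of a later migration that is otherwise hardest to find.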

Review Questions (F5): Q2 (How does it fail?) is the most important. If the identity provider is unavailable, can users still log in from an active session? Can you authenticate with a fallback? If the answer is “no logins when the provider is down,” that is a significant availability tradeoff that the board should know about.
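The “active session” question can be sketched as a session check that keeps working during a provider outage, while new logins fail with an explicit error. Everything named here is hypothetical: `verify_with_provider` stands in for whatever call the real provider exposes (assumed to return a user ID or raise), and the locally signed session token is one assumed design among several.

```python
import hashlib
import hmac
import time
from typing import Callable, Optional

SECRET = b"replace-with-a-real-secret"  # assumption: key held by your own system
SESSION_TTL_SECONDS = 8 * 60 * 60       # assumption: 8-hour sessions


class ProviderUnavailable(Exception):
    """Raised by the (hypothetical) provider client during an outage."""


class LoginUnavailable(Exception):
    """Raised to the caller when new logins cannot be processed."""


def _sign(message: str) -> str:
    return hmac.new(SECRET, message.encode(), hashlib.sha256).hexdigest()


def issue_session(user_id: str, now: Optional[float] = None) -> str:
    """Create a token of the form user_id:expiry:signature."""
    current = time.time() if now is None else now
    body = f"{user_id}:{int(current + SESSION_TTL_SECONDS)}"
    return f"{body}:{_sign(body)}"


def session_is_valid(token: str, now: Optional[float] = None) -> bool:
    """Validate a session locally, with no call to the identity provider."""
    try:
        user_id, expires, signature = token.rsplit(":", 2)
    except ValueError:
        return False
    expected = _sign(f"{user_id}:{expires}")
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        return False
    current = time.time() if now is None else now
    return current < int(expires)


def login(username: str, password: str,
          verify_with_provider: Callable[[str, str], str]) -> str:
    """New logins depend on the provider; an outage surfaces explicitly."""
    try:
        user_id = verify_with_provider(username, password)
    except ProviderUnavailable as exc:
        raise LoginUnavailable("identity provider is unreachable") from exc
    return issue_session(user_id)
```

Under this design the answer to Q2 is “existing sessions survive, new logins do not,” and the session lifetime becomes the lever that controls how much of an outage users actually notice.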

Decision: Build vs buy is a valid tradeoff. The correct answer depends on whether the provider’s reliability and the cost of vendor dependency are acceptable given the business’s availability requirements. Name all of the above, present the tradeoff clearly, and let the requirements determine the answer.

Scenario 2 — Responding to a Major Outage

The payment service is down. 40,000 transactions per minute are failing. The team has five competing hypotheses about the cause. You need to guide the incident to resolution in the next fifteen minutes.

Mental Model (F1): Feedback (MM7). Something has put the system into a self-amplifying loop. The five hypotheses are probably all correct — they are describing the same feedback loop from five different angles.

Failure Mode (F3): Start here, not with the debate. What failure mode is visible in the metrics? FM5 (Latency Amplification) — payment provider response time at 3.2 seconds. FM3 (Unbounded Resource Consumption) — database connection pool at 98% utilisation.

Tradeoff (F4): Synchronous vs Asynchronous (AT10). The root cause is that a synchronous call to the payment provider is holding connections. The immediate mitigation is a shorter timeout, not a root cause fix. The long-term fix is an async payment confirmation pattern.

Law (F9): Little’s Law (L4). At 40,000 transactions per minute (about 667 per second) and an average latency of 3.2 seconds, the system is holding roughly 2,100 concurrent transactions, an order of magnitude more than it was designed for.
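Little’s Law says the average number of requests in flight equals the arrival rate multiplied by the average time each request spends in the system (L = λW). The result is very sensitive to the unit of the arrival rate, so the sketch below works it out for both readings of “40,000”: per minute, as in the scenario description, and per second. The function and variable names are illustrative.

```python
def in_flight(arrival_rate_per_second: float, latency_seconds: float) -> float:
    """Little's Law: L = lambda * W."""
    return arrival_rate_per_second * latency_seconds


LATENCY_SECONDS = 3.2

# 40,000 transactions per minute is about 667 per second.
per_minute_reading = in_flight(40_000 / 60, LATENCY_SECONDS)

# The same figure read as transactions per second.
per_second_reading = in_flight(40_000, LATENCY_SECONDS)

print(f"40,000 per minute: {per_minute_reading:,.0f} in flight")
print(f"40,000 per second: {per_second_reading:,.0f} in flight")
```

The per-minute reading gives about 2,133 transactions in flight and the per-second reading gives 128,000. Either way, the useful comparison during the incident is against the size of the connection pool, which is what the in-flight transactions are holding.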

Decision: The immediate action is clear: reduce the payment provider timeout to 500ms. Transactions that exceed the timeout get a “payment in processing” state and are retried asynchronously. This stops the connection pool from filling. The root cause fix (async payment confirmation with webhook) is a next-sprint project.
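The mitigation can be sketched as a bounded provider call that falls back to a “processing” state. This is a minimal illustration under stated assumptions: `call_provider` is a hypothetical client function that accepts a timeout in seconds and raises `TimeoutError` when it is exceeded, the in-memory queue stands in for whatever durable queue the real system uses, and the provider is assumed to accept the transaction ID as an idempotency key so that a retry cannot charge twice.

```python
import queue
from dataclasses import dataclass
from enum import Enum
from typing import Callable

PROVIDER_TIMEOUT_SECONDS = 0.5  # the 500 ms limit from the scenario


class PaymentState(Enum):
    CONFIRMED = "confirmed"
    DECLINED = "declined"
    PROCESSING = "processing"


@dataclass(frozen=True)
class Transaction:
    txn_id: str        # assumed to double as an idempotency key
    amount_cents: int


# Stand-in for a durable retry queue drained by a background worker.
retry_queue: "queue.Queue[Transaction]" = queue.Queue()


def submit_payment(
    txn: Transaction,
    call_provider: Callable[[Transaction, float], bool],
) -> PaymentState:
    """Try the provider once with a short timeout; defer on timeout."""
    try:
        approved = call_provider(txn, PROVIDER_TIMEOUT_SECONDS)
    except TimeoutError:
        # The connection is released here instead of being held for seconds.
        # The original request may still have succeeded, so the retry must
        # be idempotent.
        retry_queue.put(txn)
        return PaymentState.PROCESSING
    return PaymentState.CONFIRMED if approved else PaymentState.DECLINED
```

The design choice is that a slow provider now costs at most half a second of connection time per transaction, at the price of a state the customer-facing flow has to be able to display.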

The CTO’s role here: Not to diagnose the incident — the on-call engineer does that. To unblock the decision. The team was arguing about root cause because the root cause is interesting. The CTO redirects to the mitigation that stops the bleeding, then schedules the root cause fix.

Scenario 3 — Hiring a Principal Engineer

Two candidates for a principal engineer role. Both are technically strong. The hiring panel is split.

Mental Model (F1): Networks (MM6). A principal engineer’s value is multiplied through the network — how many other engineers become better because of this person? The question is not “is this person technically excellent?” but “does this person make the network stronger?”

Principle (F2): Observability (P10) and Measure & Adapt (P15). Can you tell how this person’s decisions are affecting the system? Do they instrument their work, document their decisions, create feedback loops for the team to learn from? A technically brilliant engineer who produces no signal for the team around them is less valuable than a slightly less brilliant engineer who actively transfers knowledge.

Law (F9): Conway’s Law (L10). What communication structure does this person create? Do they document decisions, or do they become a knowledge bottleneck? Do they design systems that can be maintained without them, or do they become irreplaceable in ways that limit the team’s ability to evolve the system?

Tradeoff (F4): Generality vs Specialisation (AT6). If both candidates are strong, which type of strength does the team need? Deep specialisation in the primary technical domain? Or generalist architectural breadth that can bridge between teams?

Decision: The frameworks do not produce a single answer — they produce a clearer set of questions that the panel should answer before deciding. Present the framework to the panel. “Here is what we are actually choosing between.” The decision will be better for being explicit about what it is.
