Reading a System You Didn’t Build

Introduction

At 2am, the on-call alert fires. Checkout is degraded. Orders are not completing. The engineer who built the payment integration left six months ago. The runbook is three versions out of date. You have a distributed system you did not build, a failure that is active, and no time to read documentation.

This is not an unusual situation — it is the normal situation for any technical leader who has joined a company, inherited a team, or scaled beyond the systems they personally designed. The question is not whether you will face unfamiliar systems. The question is how fast you can become productive on them.

The gap between reading a system in 90 minutes and reading it in three days is not intelligence. It is method. Engineers who get productive quickly on unfamiliar codebases are not smarter — they have a traversal strategy. They start at the right abstraction level, ask the right questions in the right order, and know when to stop reading and start forming hypotheses.

The Decision

The decision is methodological: given a system you did not build, in what order do you gather information, and at what point do you trust your model enough to act?

This question has two modes. The first is non-urgent: you have joined a team and have days or weeks to build a working model of the system. The second is urgent: something is wrong now, and you need a hypothesis within minutes. The method is the same; the depth of each step differs.

What the Frameworks Say

F6 (Archetypes) is where you start, not the code. Before reading a single file, identify what category of system this is. A checkout service at an e-commerce company is almost certainly a request-response system with synchronous dependencies on inventory and payment, and likely follows an event-driven archetype at the integration boundary. Knowing the archetype gives you the prior: what the system probably looks like, what failure modes it is naturally exposed to, what tradeoffs were almost certainly made.

F5 (Review Questions) gives you the structured probe. The seven questions (adapted from F5 for the leadership context) are: what does this do, who uses it, what are its dependencies, what does it guarantee, what does it not guarantee, what would break it, and what does failing look like. They are not asked in arbitrary order. They are asked in this order because each answer primes the next. Answering “what does it guarantee” (question four) without first knowing “what are its dependencies” (question three) produces a false answer. The dependencies are what make the guarantee expensive to maintain.
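One way to make the ordering concrete is to treat the seven questions as an ordered checklist that always hands back the earliest unanswered question. The sketch below is illustrative only: the function name next_question and the sample answers are assumptions, not part of F5.

```python
from typing import Optional

# The seven review questions, in the order the text gives them.
REVIEW_QUESTIONS = [
    "What does this do?",
    "Who uses it?",
    "What are its dependencies?",
    "What does it guarantee?",
    "What does it not guarantee?",
    "What would break it?",
    "What does failing look like?",
]


def next_question(answers: dict[str, str]) -> Optional[str]:
    """Return the earliest unanswered question, or None when all seven are answered."""
    for question in REVIEW_QUESTIONS:
        if not answers.get(question, "").strip():
            return question
    return None


# Hypothetical partial read of a checkout service (sample answers are invented).
answers = {
    "What does this do?": "Takes a cart and produces a paid order.",
    "Who uses it?": "The web storefront and the mobile app.",
    # Answered out of order: a guarantee stated before the dependencies are known.
    "What does it guarantee?": "An order is never charged twice.",
}

# The checklist sends the reader back to question three before the
# guarantee in question four is trusted.
print(next_question(answers))
```

The design choice is that a later answer never counts as progress while an earlier question is still open, which mirrors the claim that each answer primes the next.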

F3 (Failure Modes) runs after F5. Once you understand the dependencies and guarantees, you can enumerate which failure modes the system is exposed to. A synchronous dependency on a payment provider is exposure to FM5 (Latency Amplification) when the provider is slow, and FM1 (SPOF) if the provider has no retry path. These are not guesses — they follow from the architecture.
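The step from architecture to failure exposure can be sketched as a small rule table. This is a minimal illustration that covers only the two exposures named above; the Dependency type, its field names, and the function failure_exposures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Dependency:
    name: str
    synchronous: bool      # the caller blocks until the dependency responds
    has_retry_path: bool   # some retry route exists if a call fails


def failure_exposures(dep: Dependency) -> list[str]:
    """Derive failure-mode exposure from the shape of the dependency alone."""
    exposures = []
    if dep.synchronous:
        # A slow provider makes every blocked caller slow as well.
        exposures.append("FM5 (Latency Amplification)")
        if not dep.has_retry_path:
            # With no retry path, a provider failure is a caller failure.
            exposures.append("FM1 (SPOF)")
    return exposures


# Hypothetical example: the checkout service's payment dependency.
payment = Dependency("payment provider", synchronous=True, has_retry_path=False)
for exposure in failure_exposures(payment):
    print(f"{payment.name}: {exposure}")
```

Nothing in the function looks at logs or symptoms. The exposures come from the answers to the dependency and guarantee questions, which is the sense in which they are not guesses.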

F4 (Tradeoffs) names the decisions that produced the failure exposure. The synchronous dependency was a choice — AT10 (Synchronous vs Asynchronous). It bought simplicity. It cost resilience at the dependency boundary. Understanding the tradeoff does not mean the decision was wrong. It means you know what condition would make it wrong.

The Reading Process

[Figure: architecture diagram]

The traversal moves from category to structure to risk — not from symptom to fix. Starting at the archetype gives a prior about what the system probably looks like; the seven review questions test that prior against reality; the failure mode enumeration converts the tested model into actionable hypotheses. Engineers who reverse this order — starting at the log line and working outward — build detailed knowledge of the wrong layer while the actual failure continues.

The Forces at Play

The force working against structured reading under time pressure is cognitive shortcutting. When something is broken, the instinct is to start at the error — find the log line, trace the stack, fix the immediate cause. This is productive for shallow failures. It is counterproductive for architectural failures, where the log line is a symptom and the cause is a decision made months earlier.

The opposing force is analysis paralysis: reading the system until you feel you understand it, which never happens for complex systems. A working model is not a complete model. The question is when the model is good enough to generate useful hypotheses. F5 gives a proxy answer: when you can answer all seven questions with confidence, the model is sufficient to generate hypotheses worth testing.

There is also the organisational force of institutional knowledge concentrated in individuals. The engineer who built the system knows things that are not in the code. Getting to that person — or their documentation — is part of reading the system, not a shortcut around it.

The Options and Tradeoffs

Reading breadth-first — understanding the system boundary before the internals — applies AT6 (Generality vs Specialisation). Breadth-first reading is slower to get to any specific file but faster to understand the whole. Depth-first reading gets to specific files fast but can produce a detailed understanding of the wrong layer.

For urgent situations, F6 and F3 are sufficient to generate the first hypothesis. Identify the archetype, enumerate the most likely failure modes given what is visibly broken, check the most exposed component. This is not a complete system read — it is a diagnostic probe. AT6 applies in the other direction: accepting less completeness for faster action.
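The urgent-mode probe can be sketched as a lookup: the archetype supplies candidate failure modes, and the visible symptoms narrow them. Both tables below are invented for illustration. Only the FM codes come from the text; the symptom wording and the mappings are assumptions.

```python
# Hypothetical priors: failure modes an archetype is naturally exposed to.
ARCHETYPE_PRIORS = {
    "request-response": ["FM5", "FM1"],
}

# Hypothetical hints: which failure modes a visible symptom points toward.
SYMPTOM_HINTS = {
    "slow responses": {"FM5"},
    "requests failing outright": {"FM1"},
}


def first_hypotheses(archetype: str, symptoms: list[str]) -> list[str]:
    """Return candidate failure modes, narrowed by what is visibly broken."""
    priors = ARCHETYPE_PRIORS.get(archetype, [])
    hinted = set().union(*(SYMPTOM_HINTS.get(s, set()) for s in symptoms))
    matched = [fm for fm in priors if fm in hinted]
    # If no symptom narrows the list, fall back to the archetype prior.
    return matched or priors


print(first_hypotheses("request-response", ["slow responses"]))
```

The fallback to the full prior is deliberate: in the urgent mode, an unrecognised symptom should widen the search back to the archetype rather than leave the engineer with no hypothesis.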

The tradeoff in both cases is AT9 (Correctness vs Performance). A more complete system read produces better hypotheses, but better hypotheses are only valuable if there is still time to act on them.

What Great CTOs Do

Technical leaders who read unfamiliar systems well do two things that distinguish them. First, they separate the system model from the failure model. They build the model of the system (F6, F5) before they build the model of the failure (F3). Engineers who skip to the failure model build a narrative that makes the symptoms make sense but may miss the actual cause.

Second, they commit to a hypothesis before gathering more evidence. Experienced technical leaders name their current best hypothesis explicitly — “I think this is FM5 at the payment dependency, and the symptom is order queue backup” — because naming it forces the hypothesis to become falsifiable. Evidence gathering that has no hypothesis is just reading.
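Naming a hypothesis so that it is falsifiable can be sketched as a record that carries its own refutation test. The metric names and thresholds below are hypothetical placeholders, not values from any real system.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Hypothesis:
    claim: str        # the named failure mode and where it sits
    prediction: str   # what should be observable if the claim is true
    refuted_by: Callable[[dict[str, float]], bool]  # True means the evidence contradicts it


# The hypothesis from the text, with invented metric names and thresholds.
h = Hypothesis(
    claim="FM5 at the payment dependency",
    prediction="payment latency is elevated and the order queue is backing up",
    refuted_by=lambda m: m["payment_p99_ms"] < 500 or m["order_queue_depth"] < 100,
)

# Hypothetical readings taken from dashboards during the incident.
observed = {"payment_p99_ms": 120.0, "order_queue_depth": 4000.0}

if h.refuted_by(observed):
    print(f"Refuted: {h.claim}. Name the next hypothesis.")
else:
    print(f"Still standing: {h.claim}. Check the next prediction.")
```

Writing the refutation condition before looking at the metrics is the point: the evidence then either contradicts the named hypothesis or it does not, and reading stops being open-ended.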

They also make the system read collaborative when time permits. One person asking the F5 questions out loud while another verifies against the actual system is faster than one person doing both — the externalised model catches the errors the single reader misses.

What Goes Wrong

The most common failure is starting at the symptom instead of the archetype. Engineers who trace the stack from the error log are using depth-first traversal when the failure is architectural. They find technically correct information — the call stack, the failing service, the log message — that does not explain what produced the failure.

FM11 (Observability Blindness) is both a cause and an effect here. Systems without good observability are harder to read under pressure. But observability blindness is also a product of failing to read the system: engineers who do not understand their systems cannot instrument them correctly. The missing metric is almost always the one the system builder did not think would matter.

When frameworks disagree — the archetype suggests a caching layer should be present, but the review questions reveal there is none — the disagreement is information. It means either the archetype is wrong (the system is not what it appears to be) or a known best practice was consciously skipped (and there is a reason worth finding). Both possibilities change the failure model.
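The disagreement check can be sketched as a set comparison between what the archetype predicts and what the review questions found. The component names are invented for illustration, and the function disagreements is hypothetical.

```python
def disagreements(expected: set[str], observed: set[str]) -> dict[str, list[str]]:
    """Compare archetype expectations with what the system actually contains."""
    return {
        # Predicted by the archetype but absent: a skipped practice, or the wrong archetype.
        "missing": sorted(expected - observed),
        # Present but not predicted: the system may not be what it appears to be.
        "unexpected": sorted(observed - expected),
    }


# Hypothetical example matching the text: the archetype predicts a caching layer.
expected = {"load balancer", "application tier", "caching layer", "database"}
observed = {"load balancer", "application tier", "database", "message queue"}

for kind, components in disagreements(expected, observed).items():
    for component in components:
        print(f"{kind}: {component}")
```

Each entry in the result is a question to ask, not a defect to fix. The comparison only shows where the prior and the reality diverge; the reason still has to be found.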

Concept: Reading a System You Didn’t Build

Thread: T12 (Tradeoffs) ← reading breadth-first vs depth-first → right traversal for the failure type

Core Idea: Archetype identification before code reading; the seven review questions in order; failure mode enumeration from the architecture, not the log.

Tradeoff: AT9 — completeness of system model vs speed of first useful hypothesis

Failure Mode: FM11 — observability blindness; missing the metric that would have made the failure obvious

Signal: An on-call engineer who cannot form a hypothesis within ten minutes is probably reading depth-first on symptoms; redirect them to the archetype and the first three review questions.

Maps to: Book 0, Frameworks 5, 6, 7

Reflection Questions

These questions are most useful when answered in writing before a team discussion, or when used as a retrospective prompt after a decision has been made.

  1. Think of the last system you inherited or debugged without full context. What was your traversal strategy? How did it serve you?
  2. If a new technical leader joined your team tomorrow, how long would it take them to pass the seven-question test on your main system? What would block them?
  3. Where in your system is the archetype violated? What was the tradeoff that produced the violation?
  4. What observability gap in your system would most slow down a competent engineer reading it under pressure at 2am?

Design: Choose a system your organisation operates that no single engineer fully understands. Write a system reading guide — using F6 (archetype identification) and F5 (seven review questions in order) — that would allow a competent engineer unfamiliar with the system to form a useful hypothesis within 90 minutes during an active incident. Identify what observability investments are required to make the guide reliable.