What the question covers: Write volume, read volume, data volume, and the growth curve of each. At current load, the system may work. The question is whether it continues to work at 2×, 10×, 100× current load.
What to ask specifically:
- What is the current and projected QPS for reads and writes?
- What is the data volume now and in 12 months?
- Which component is the first to saturate? (This is the bottleneck.)
- Does throughput scale linearly with machines, or does coordination overhead reduce the scaling efficiency?
- Which operations are O(n) in the number of users or items?
The failure it prevents: FM3 (Unbounded Resource Consumption), FM6 (Hotspotting). Systems that work at launch and fail at scale are the most common deployment failures. The failure was always visible in the design — the question just was not asked.
Example application: A social graph query that finds all mutual friends between two users. Correct at 100 users. At 100 million users with a naive implementation (O(n²) comparison of friend lists), this query brings down the database. Asking Q1 during design reveals this before it happens.
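The difference is easy to see in code. A minimal sketch (the `friends` adjacency map and user names are hypothetical, for illustration only):

```python
# Hypothetical in-memory social graph: user id -> set of friend ids.
friends = {
    "alice": {"bob", "carol", "dave"},
    "bob": {"alice", "carol", "erin"},
}

def mutual_friends_naive(a, b):
    # O(len(a) * len(b)): compares every pair of friend ids.
    return {x for x in friends[a] for y in friends[b] if x == y}

def mutual_friends_fast(a, b):
    # O(min(len(a), len(b))) with hashed sets: a single intersection.
    return friends[a] & friends[b]
```

Both return the same answer at any size; only the second survives friend lists with millions of entries, because the cost grows with the smaller list rather than the product of both.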
What the question covers: Component failure, network failure, dependency failure, and the propagation of each. Every component in the system will eventually be unavailable. The question is whether the system continues to function when it is.
What to ask specifically:
- What happens when each component is unavailable for 1 second? 1 minute? 10 minutes?
- What happens when each dependency is slow (high latency) rather than down?
- What data is at risk of loss when a component fails?
- Is there a SPOF? If so, is it acceptable given the availability SLO?
- What is the cascade path if the most depended-upon component fails?
The failure it prevents: FM1 (SPOF), FM2 (Cascading Failures), FM9 (Silent Data Corruption). The payment system failure at the start of this chapter was a Q2 failure — the slow dependency case was not analysed.
The distinction: Ask about slow dependencies, not just unavailable ones. Unavailability produces immediate errors that are usually handled. Slowness fills queues and thread pools silently, producing a cascade only when the resource is exhausted.
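One defence is to convert slowness into a bounded, explicit failure rather than silent queueing. A minimal sketch using Python's standard library (the 0.05 s budget, pool size, and `slow_dependency` are illustrative assumptions, not recommendations):

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

# Bounded pool: a slow dependency can tie up at most 4 threads, not the process.
pool = ThreadPoolExecutor(max_workers=4)

def slow_dependency():
    time.sleep(0.2)  # simulates a dependency that is up but slow
    return "ok"

def call_with_deadline(fn, budget_s):
    future = pool.submit(fn)
    try:
        return future.result(timeout=budget_s)
    except TimeoutError:
        future.cancel()  # stop waiting instead of queueing indefinitely
        return None      # caller degrades gracefully (fallback, cached value)

result = call_with_deadline(slow_dependency, budget_s=0.05)  # returns None: too slow
```

The deadline turns the dangerous case (slow) into the handled case (failed fast), so thread pools and queues stay bounded.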
What the question covers: Every piece of mutable state in the system — where it is stored, who owns it, what happens to it when a component fails, and how consistency is maintained across multiple owners.
What to ask specifically:
- What are all the stores of mutable state? (Database, cache, in-memory, message queue offsets, local files)
- For each: what is the consistency model? (Strong, eventual, none)
- For each: what is the durability guarantee? (Survives process restart? Node failure? Datacenter failure?)
- Are any two components allowed to modify the same state?
- Is any state stored in-process (not externalised)? If so, what happens when the process restarts?
The failure it prevents: FM4 (Data Consistency Failure), FM12 (Split-Brain). State that is not explicitly managed is state that will produce unexpected behaviour under failure conditions.
The horizontal scaling consequence: Stateless services scale horizontally without coordination. Stateful services require either consistent routing (the same client must always reach the same instance) or externalisation of their state. Q3 is the question that determines whether a service can scale horizontally without architectural change.
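The shape of externalised state can be sketched in a few lines. Here a dict-backed `SessionStore` stands in for the interface (in production this would be Redis or a database; the class and field names are hypothetical):

```python
# Hypothetical externalised session store. A dict stands in for the
# interface that Redis or a database would provide in production.
class SessionStore:
    def __init__(self):
        self._data = {}

    def get(self, session_id):
        return self._data.get(session_id, {})

    def put(self, session_id, state):
        self._data[session_id] = state

store = SessionStore()

def handle_request(session_id, item):
    # Stateless handler: any instance can serve any request, because all
    # mutable state lives in the external store, not in the process.
    cart = store.get(session_id)
    cart[item] = cart.get(item, 0) + 1
    store.put(session_id, cart)
    return cart
```

Because the handler holds nothing between requests, a process restart loses no state and a load balancer can route each request to any instance.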
What the question covers: The latency requirements of each user-facing operation, and whether the proposed architecture can meet them given the sum of all component latencies on the critical path.
What to ask specifically:
- What is the P50, P95, P99 latency SLO for each operation?
- What is the critical path for each operation? (Trace every synchronous hop.)
- What is the latency budget for each hop on the critical path?
- What happens at P99 when one hop is at its slowest expected latency?
- Are there sequential synchronous calls that could be parallelised?
The failure it prevents: FM5 (Latency Amplification). In a chain of five synchronous service calls, each with a P99 of 100ms, roughly 1 request in 20 (1 − 0.99⁵ ≈ 4.9%) hits at least one hop's slowest percentile, and the worst case is the 500ms sum of all five. The caller's tail degrades without any single component being slow.
The tail latency trap: Tail latency (P99, P99.9) is almost always 3–10× the average. Design for tail latency, not average latency. The user who experiences the worst case is the one who leaves a negative review.
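The amplification across a chain can be computed directly. A short sketch, assuming hop latencies are independent (an assumption, not a property of real systems, where slowness is often correlated):

```python
# Probability that a chain of n sequential hops contains at least one
# hop in its own slowest 1%. Independence between hops is assumed.
def p_any_hop_in_tail(n_hops, per_hop_tail=0.01):
    return 1 - (1 - per_hop_tail) ** n_hops

# One hop: 1% of requests hit the tail. Five hops: roughly 1 in 20 do.
print(round(p_any_hop_in_tail(1), 3))  # 0.01
print(round(p_any_hop_in_tail(5), 3))  # 0.049
```

This is why chains of synchronous calls turn rare per-hop tail events into routine caller-visible slowness, and why parallelising independent hops pays off at the tail.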
What the question covers: How the system will change over the next 12–24 months, and whether the current design supports that change without a rewrite.
What to ask specifically:
- What are the most likely changes to requirements in the next 12 months?
- Can the schema be changed without downtime? (Expand-contract migration?)
- Can individual components be updated and deployed independently?
- Are there any decisions that, once made, are very difficult to reverse? (Database choice, message format, API contract)
- Is the abstraction boundary in the right place to absorb anticipated change?
The failure it prevents: FM8 (Schema / Contract Violation), architectural rigidity (the technical debt that makes every future change expensive). The cost of a wrong architecture decision grows proportionally to how long it remains in production.
The dependency rule: Code dependencies must point in the direction of stability. Stable abstractions should be depended upon; unstable implementations should depend on stable interfaces. A change in the database schema should not require a change in the business logic.
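The dependency rule can be sketched with a stable interface between business logic and storage. A minimal illustration (the `OrderRepository` interface and its in-memory implementation are hypothetical names, not a prescribed design):

```python
from typing import Protocol

class OrderRepository(Protocol):
    # Stable interface: business logic depends on this, never on a schema.
    def save(self, order_id: str, total: float) -> None: ...
    def total_for(self, order_id: str) -> float: ...

def apply_discount(repo: OrderRepository, order_id: str, pct: float) -> float:
    # Business logic knows nothing about tables, columns, or SQL; a schema
    # change is absorbed inside a repository implementation.
    new_total = repo.total_for(order_id) * (1 - pct)
    repo.save(order_id, new_total)
    return new_total

class InMemoryOrders:
    # One (unstable, replaceable) implementation behind the stable interface.
    def __init__(self):
        self._totals = {}

    def save(self, order_id, total):
        self._totals[order_id] = total

    def total_for(self, order_id):
        return self._totals[order_id]
```

Swapping `InMemoryOrders` for a SQL-backed implementation changes no line of `apply_discount`: the dependency points from the unstable implementation toward the stable interface.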
What the question covers: Authentication (who are you), authorisation (what are you allowed to do), data in transit (is it encrypted), data at rest (is it protected), and the blast radius of a component compromise.
What to ask specifically:
- How does each component verify the identity of its callers?
- What is the permission model? Can a misconfigured caller access data it should not?
- Is all data in transit encrypted (TLS)?
- What data is sensitive? Is it encrypted at rest?
- If this component is compromised, what does an attacker gain access to?
- Does each component follow the principle of least privilege, with exactly the access it needs and no more?
The failure it prevents: FM10 (Security Breach). Security reviews after design are expensive — they require changes to data models, APIs, and deployment infrastructure. Security in the design is cheap — it shapes the architecture from the start.
The Zero Trust principle: Never assume that a request is authorised because it comes from inside the network. Verify identity and authorisation at every service-to-service call.
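The shape of per-request verification can be sketched with the standard library. Real systems use mTLS or tokens signed by an identity provider (e.g. JWTs); a shared-secret HMAC stands in here purely to show the two distinct checks, and every name below is illustrative:

```python
import hashlib
import hmac

SECRET = b"demo-secret"  # hypothetical; never hardcode real secrets

def sign(caller_id: str) -> str:
    # Stand-in for a token issued by an identity provider.
    return hmac.new(SECRET, caller_id.encode(), hashlib.sha256).hexdigest()

def handle(caller_id: str, signature: str, action: str) -> str:
    # Authentication: verify identity on EVERY call, even from "inside".
    if not hmac.compare_digest(sign(caller_id), signature):
        return "403 unauthenticated"
    # Authorisation: a separate check from authentication.
    allowed = {"billing-service": {"charge"}}  # illustrative permission model
    if action not in allowed.get(caller_id, set()):
        return "403 forbidden"
    return "200 ok"
```

Note the two failure modes are distinct: a forged identity is rejected before the permission model is even consulted, and a verified caller is still limited to its least-privilege set of actions.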
What the question covers: Whether the system produces sufficient signals for the team to understand its behaviour in production, diagnose problems, and enforce SLOs.
What to ask specifically:
- What metrics are emitted? Are all SLO-relevant operations instrumented?
- Are logs structured (JSON, key-value) and indexed?
- Is distributed tracing in place? Can a single request be traced across all services it touches?
- What alerts exist? Do alerts fire before the SLO is violated, not after?
- How long would it take to identify the root cause of a P1 incident with the current observability?
The failure it prevents: FM11 (Observability Blindness). The observability question must be asked during design, not retroactively. Instrumenting an existing system is an order of magnitude harder than designing for observability from the start.
The minimum viable stack: One metric system (Prometheus, Datadog, CloudWatch), one structured log aggregator (Elasticsearch, Splunk, CloudWatch Logs), one distributed tracing system (Jaeger, Zipkin, X-Ray). All three are required. None is a substitute for another.
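The structured-log half of that stack is cheap to get right from the start. A minimal sketch using only the standard library (the field names are illustrative, not a standard schema):

```python
import json
import time

def log_event(operation: str, duration_ms: float, status: str, trace_id: str) -> str:
    # Structured (key-value) logs are queryable by an aggregator;
    # free-text logs are not.
    record = {
        "ts": time.time(),
        "operation": operation,
        "duration_ms": duration_ms,
        "status": status,
        # Propagated trace id: the hook that lets one request be
        # followed across every service it touches.
        "trace_id": trace_id,
    }
    line = json.dumps(record)
    print(line)
    return line
```

Emitting one such line per operation, with the trace id threaded through every downstream call, is the design-time decision; which aggregator indexes the lines can be changed later.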