The dial: During a network partition, you must choose. A consistent system refuses requests it cannot guarantee are correct. An available system responds to every request, potentially with stale or conflicting data.
The CAP theorem context: The CAP theorem (F9 #3) says this choice is forced during a partition. In the absence of a partition — most of the time — you can have both. The important word in CAP is during. Design for the partition case; default to both the rest of the time.
Setting the dial: Financial transactions, inventory counts, seat reservations — set toward consistency. Strongly consistent means no double charges, no overselling. The cost is reduced availability and higher latency (synchronous replication, quorum operations). Social feeds, product recommendations, user profiles — set toward availability. Eventual consistency means a post might appear 2 seconds late or a recommendation list might be 30 seconds old. The cost is bounded staleness.
The failure mode: FM4 (Data Consistency Failure) and FM12 (Split-Brain) are what you pay when you set this dial incorrectly toward availability for data that required consistency.
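The dial can be reduced to a quorum check. A minimal sketch with hypothetical names, assuming 3 replicas and a majority quorum:

```python
# Minimal sketch (hypothetical names): a consistent system refuses a request
# it cannot guarantee is correct; an available system serves what it can reach.

def handle_read(reachable_replicas: int, total_replicas: int,
                prefer_consistency: bool) -> str:
    quorum = total_replicas // 2 + 1          # majority quorum
    if reachable_replicas >= quorum:
        return "serve"                        # no partition pain: both modes serve
    # Below quorum -- i.e. during a partition -- the dial decides:
    return "refuse" if prefer_consistency else "serve-possibly-stale"

# A partition leaves this node able to reach only 1 of 3 replicas:
handle_read(1, 3, prefer_consistency=True)    # "refuse" (consistent)
handle_read(1, 3, prefer_consistency=False)   # "serve-possibly-stale" (available)
```

Note that with a quorum reachable, both settings behave identically — the dial only matters below quorum, which is the "during" in CAP.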
The dial: Latency is the time to complete one request. Throughput is the number of requests completed per second. Optimising for one often costs the other.
The tension: A larger batch size increases throughput (more items per operation) but increases the latency of each individual item (it must wait for the batch to fill). A write-ahead log with infrequent fsync increases write throughput but increases the window of data loss on crash. A connection pool allows high concurrent throughput but each individual connection must wait for a pool slot.
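The batching tension can be put in numbers. A sketch with illustrative costs — the 10 ms fixed overhead and 1 ms per item are assumptions, not measurements:

```python
# Illustrative costs (assumptions, not measurements): one write operation
# carries a fixed 10 ms overhead (fsync, network round trip) plus 1 ms per item.

def batch_metrics(batch_size: int, overhead_s: float = 0.010,
                  per_item_s: float = 0.001) -> tuple:
    op_time = overhead_s + batch_size * per_item_s   # one batched operation
    throughput = batch_size / op_time                # items completed per second
    return throughput, op_time                       # op_time bounds each item's latency

batch_metrics(1)     # ~91 items/s, 11 ms per item
batch_metrics(100)   # ~909 items/s, but every item in the batch waits ~110 ms
```

The fixed overhead is amortised across the batch — that is the throughput win — while every item inherits the whole batch's completion time, which is the latency loss.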
Setting the dial: User-facing read APIs — set toward latency. P99 latency matters; a user waiting 500ms for a page load does not care whether the server was handling 1,000 or 10,000 other requests at the time. Batch data pipelines — set toward throughput. Processing 1 billion records overnight cares about total throughput, not the latency of record 347,291,445.
Little’s Law: Concurrency = throughput × average latency (L = λW). With a fixed concurrency limit — a worker pool, a connection cap — throughput is capped at concurrency divided by latency, so when latency increases, throughput falls unless you add capacity. This relationship is non-negotiable — it is a consequence of queueing theory, not a design choice.
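A worked example with a fixed pool of 100 concurrent workers (the numbers are illustrative):

```python
# Little's Law: L = λ × W (concurrency = throughput × average latency),
# so a fixed concurrency limit caps throughput at concurrency / latency.

def max_throughput(concurrency: int, avg_latency_s: float) -> float:
    return concurrency / avg_latency_s

max_throughput(100, 0.050)   # ~2000 req/s at 50 ms average latency
max_throughput(100, 0.200)   # ~500 req/s: 4x the latency, 1/4 the throughput
```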
The dial: A simple system is easy to understand, easy to debug, easy to hire for. A flexible system can adapt to new requirements without restructuring. Adding flexibility usually adds abstraction — which adds complexity.
The tension: Microservices are flexible — each service can be changed, scaled, and deployed independently. A monolith is simple — one deployment, one codebase, easy to trace a request. Microservices buy flexibility at the cost of operational complexity (distributed tracing, network failures, service discovery).
Setting the dial: Early in a product’s life — set toward simplicity. The requirements are not yet understood. Flexibility for requirements you cannot anticipate yet adds complexity that slows you down. As the product matures and requirements stabilise — the cost of rebuilding for flexibility is now justified by the concrete requirements you have accumulated.
Gall’s Law (F9 #6): A complex system that works always evolved from a simple system that worked. Start simple. Evolve toward flexibility only when the requirements demand it.
The dial: Pay the cost of computation at write time (precomputation) or at read time (on-demand). Precomputation makes reads fast at the cost of increased write cost and storage. On-demand makes writes cheap at the cost of slower reads.
Where this appears: A news feed that precomputes each user’s feed on every post write (fan-out-on-write) — fast reads, expensive writes. A news feed that assembles the feed from followed accounts at read time (fan-out-on-read) — cheap writes, slow reads. A search index that is precomputed — fast searches, expensive indexing. A full-table scan query — no indexing overhead, slow searches.
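The two feed strategies can be sketched side by side. Plain dicts stand in for real stores, and the follower graph is hypothetical:

```python
from collections import defaultdict

posts = defaultdict(list)      # author -> their posts
feeds = defaultdict(list)      # user -> precomputed feed (fan-out-on-write)
followers = {"alice": ["bob", "carol"]}              # hypothetical graph
following = {"bob": ["alice"], "carol": ["alice"]}   # the inverse view

def publish(author: str, post: str) -> None:
    posts[author].append(post)
    # Fan-out-on-write: one insert per follower. Expensive write, cheap read.
    for user in followers.get(author, []):
        feeds[user].append(post)

def feed_precomputed(user: str) -> list:
    return feeds[user]         # O(1) lookup: the work was paid at write time

def feed_on_read(user: str) -> list:
    # Fan-out-on-read: assemble from followed authors. Cheap write, expensive read.
    return [p for author in following.get(user, []) for p in posts[author]]
```

The write cost of the precomputed path scales with follower count, which is why real systems often mix the two: fan-out-on-write for most accounts, fan-out-on-read for accounts with millions of followers.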
Setting the dial: Read-heavy systems with expensive computation — set toward precomputation. The cost is paid once per write and amortised across many reads. Write-heavy systems with infrequent reads — set toward on-demand. Paying the precomputation cost for data that is rarely read wastes resources.
The storage consequence: Precomputation always increases storage. The precomputed search index takes more space than the raw documents. The precomputed feed takes more space than storing only posts. Storage cost is part of the precomputation tradeoff.
The dial: A central authority is easy to reason about and easy to make consistent. A distributed system has no single point of failure but requires coordination protocols.
Where this appears: A single database server (centralised) vs a distributed database cluster (distributed). A central API gateway (centralised) vs a service mesh with sidecars (distributed). A single team owning a platform component (centralised ownership) vs each team owning its slice (distributed ownership).
Setting the dial: Strong consistency requirements — centralisation is easier. One source of truth is consistent by definition. High availability requirements — distribution is safer. No single point of failure means no single failure takes down the system. The cost of distribution is the coordination overhead and the consensus protocols required to keep distributed state aligned.
The paradox: A distributed system with a centralised coordinator is still a SPOF at the coordinator. ZooKeeper, etcd, and similar systems are widely used as centralised coordinators for distributed systems — they solve the distributed coordination problem with careful SPOF management (by running the coordinator in a Raft quorum).
The dial: A general-purpose component handles many use cases. A specialised component handles one use case optimally. Postgres is general-purpose. An in-memory sorted set is specialised. Using Postgres for a leaderboard works; using Redis Sorted Sets for a leaderboard can be orders of magnitude faster.
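A toy version of the leaderboard comparison, with in-memory stand-ins — the sorted structure plays the role a Redis Sorted Set plays in the text:

```python
import bisect

# General-purpose shape: flat storage, sort on every read (what ORDER BY does).
scores = {}

def top_n_general(n: int) -> list:
    return sorted(scores.items(), key=lambda kv: -kv[1])[:n]   # O(m log m) per read

# Specialised shape: keep entries sorted at write time; reads need no sort.
class SortedLeaderboard:
    def __init__(self):
        self._entries = []                                 # (-score, player), kept sorted

    def add(self, player: str, score: int) -> None:
        # O(log m) search; a real sorted set would also update a player in place.
        bisect.insort(self._entries, (-score, player))

    def top_n(self, n: int) -> list:
        return [(p, -s) for s, p in self._entries[:n]]     # already ordered
```

The general shape pays nothing at write time and re-sorts on every read; the specialised shape pays a small ordered insert per write so that reads are a slice.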
Where this appears: SQL database vs specialised time-series database for metrics storage. HTTP REST API vs gRPC for high-throughput internal service communication. A general task queue vs a specialised workflow engine for complex multi-step processes.
Setting the dial: Optimise for the common case. If 90% of operations are range queries on time-series data, a specialised time-series database is justified. If 90% of operations are arbitrary queries on relational data, the general-purpose SQL database is correct. Premature specialisation adds operational complexity for a performance gain that may never be needed.
The operational cost: Every specialised component is another system to operate, monitor, and hire expertise for. The performance benefit must exceed the operational cost.
The dial: An automated system makes decisions without human intervention — faster, more consistent, never tired. A controlled system defers decisions to humans — slower, but benefits from human judgment in novel situations.
Where this appears: Auto-scaling (automated) vs manual capacity adjustment (controlled). Automated deployment pipelines vs manual deploy approvals. Automated fraud detection vs human fraud review. Automated database failover vs manual failover with human confirmation.
Setting the dial: Routine decisions with clear criteria — set toward automation. Auto-scaling based on CPU utilisation is a routine decision that is faster and more reliable automated than manual. Novel situations with ambiguous criteria — set toward control. An automated fraud system that lets a genuinely fraudulent transaction through (a false negative) is worse than routing that transaction to a human reviewer. High-stakes, irreversible actions — set toward control. Automated database failover is acceptable; automated data deletion is not.
The failure mode: Automation failure is a scale failure. A misconfigured auto-scaling policy that provisions 10,000 instances instead of 10 is an automation failure that would be caught immediately under manual control. Automation amplifies both correct and incorrect decisions.
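One common mitigation is to keep automation for the routine case but clamp its blast radius, escalating to a human when the clamp is hit. A sketch with hypothetical policy numbers:

```python
# Hypothetical scaling policy: automation handles the routine case, a hard
# clamp bounds the blast radius, and breaching the clamp escalates to a human.

def desired_instances(cpu_utilisation: float, current: int,
                      target: float = 0.6, max_instances: int = 50) -> tuple:
    proposed = max(1, round(current * cpu_utilisation / target))
    if proposed > max_instances:
        return max_instances, "page-a-human"   # automation amplified something
    return proposed, "apply"                   # routine: no human in the loop

desired_instances(0.9, 10)      # (15, "apply"): routine scale-up
desired_instances(0.9, 10000)   # (50, "page-a-human"): runaway clamped
```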
The dial: High cohesion means related things are together — a module owns a clear, single responsibility and all logic for it lives in one place. Low coupling means things are independent — changing one component does not require changing another. The tension: high cohesion often requires knowledge of a component’s internals, which creates coupling to those internals.
Where this appears: A monolith has high cohesion (all related code in one place) and high coupling (many components depend on many others). Microservices have low coupling (independent deployment) and potentially low cohesion (business logic split across multiple services). The correct architecture is high cohesion within a service and low coupling between services.
Setting the dial: Within a bounded context — set toward cohesion. All logic for the payment domain lives in the payment service. Between bounded contexts — set toward low coupling. The payment service does not know about the shipping service.
Conway’s Law (F9 #1): Team structure determines coupling. Services owned by the same team will be coupled — the team’s communication flows through the code. Services owned by different teams will be loosely coupled — the team boundary is enforced by the organisational boundary.
The dial: A correct system always produces the right answer. A performant system produces the answer quickly. When computation is expensive, these conflict: approximation algorithms sacrifice correctness for speed; exact algorithms sacrifice speed for correctness.
Where this appears: Bloom filters — can say “definitely not in set” but may produce false positives (approximate correctness for dramatic space savings). HyperLogLog — approximate distinct count with 1-2% error versus exact count requiring O(n) memory. Nearest neighbour search — approximate (HNSW index, sub-linear time) versus exact (full scan, linear time). Driver matching in ride-sharing — approximate greedy matching in milliseconds versus exact optimal matching that is computationally infeasible within an interactive latency budget.
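A minimal Bloom filter makes the one-sided error concrete — "no" is exact, "yes" is probabilistic. The sizes and hash scheme below are illustrative, not tuned:

```python
import hashlib

# Minimal Bloom filter sketch (sizes and hash scheme illustrative, not tuned).
class BloomFilter:
    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size = size_bits
        self.k = num_hashes
        self.bits = 0                              # an int used as a bitset

    def _positions(self, item: str):
        for i in range(self.k):                    # k derived hash positions
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item: str) -> bool:
        # False is exact ("definitely not in set"); True may be a false positive.
        return all(self.bits & (1 << pos) for pos in self._positions(item))

bf = BloomFilter()
bf.add("alice")
bf.might_contain("alice")   # True: no false negatives, ever
bf.might_contain("bob")     # almost certainly False, but could be a false positive
```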
Setting the dial: The question is: what is the cost of the wrong answer? In financial settlement, the cost of an incorrect answer is high — use exact algorithms regardless of performance cost. In search ranking, the cost of a slightly suboptimal ranking is low — use approximate algorithms to meet latency budgets.
Approximation is not failure: A recommendation engine that is 95% as good as the optimal recommendation at 0.1× the computational cost is not a compromise — it is the correct engineering decision. What matters is that the tradeoff is made explicitly against the requirements, not discovered after the fact.
The dial: A synchronous operation blocks the caller until the operation completes. An asynchronous operation returns immediately; the result is delivered later. Synchronous is simple to reason about. Asynchronous is complex but non-blocking.
Where this appears: An API that returns the result of a payment synchronously (caller waits) vs an API that returns a payment reference and delivers the result via webhook (caller is notified). A Kafka consumer that commits offsets synchronously after processing each message (slower, no duplicate risk) vs committing asynchronously in batches (faster, small duplicate risk on failure).
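The two shapes side by side, using asyncio as a stand-in for the network. The payment names and the `pay_123` reference are hypothetical:

```python
import asyncio

async def charge_card(amount: int) -> dict:
    await asyncio.sleep(0.01)                 # stands in for the payment processor
    return {"status": "captured", "amount": amount}

async def pay_sync(amount: int) -> dict:
    # Synchronous shape: the caller blocks until the result exists.
    return await charge_card(amount)

async def pay_async(amount: int, on_result) -> dict:
    # Asynchronous shape: return a reference immediately; deliver the
    # result later via a callback (standing in for a webhook).
    async def deliver():
        on_result(await charge_card(amount))
    asyncio.create_task(deliver())
    return {"payment_ref": "pay_123"}         # hypothetical reference id

async def demo():
    results = []
    synchronous = await pay_sync(5)                   # caller waited for the result
    reference = await pay_async(7, results.append)    # returned immediately
    await asyncio.sleep(0.05)                         # ... the "webhook" arrives later
    return synchronous, reference, results

asyncio.run(demo())
```

Note where the complexity lands: the synchronous caller gets its answer in one line, while the asynchronous caller must correlate the reference with a result that arrives on a different code path at a different time.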
Setting the dial: When the caller needs the result to continue — use synchronous. When the caller does not need the result immediately, or the operation takes too long to block on — use asynchronous. The key question is: does the caller actually need to wait, or is it waiting only because synchronous is the default?
The complexity cost: Asynchronous systems are harder to debug (the request and the result are separated in time), harder to trace (the call stack is broken at the async boundary), and harder to test (you must wait for asynchronous delivery in tests). The latency benefit must justify this complexity.