The Computing Series

Each Principle in Detail

P1 — Abstraction

The principle: A good abstraction hides the complexity that callers do not need to know about, exposes exactly the interface they need, and does not change its contract when the implementation changes. The key property is that it does not leak.

Why it matters: Every abstraction that leaks forces all its callers to understand the implementation they were supposed to be shielded from. A database connection pool that leaks its max-connections setting into the application forces the application to reason about connection pool management — which is the connection pool’s job.

The failure it prevents: Tight coupling. When callers depend on implementation details, every implementation change requires caller changes. The cost of change grows proportionally to the number of things that know too much.

Where it breaks: The Law of Leaky Abstractions (F9 #13) says that every non-trivial abstraction will eventually leak. The principle is not to eliminate leakage — it is to delay it as long as possible by designing contracts that are stable even as implementations evolve.


P2 — Modularity

The principle: A modular system is composed of units that can be understood, built, tested, and deployed independently. The module boundary defines what is inside (its responsibility) and what is outside (everything else).

Why it matters: You cannot test what you cannot isolate. You cannot deploy safely what you cannot deploy independently. Module boundaries are the unit of ownership, the unit of testing, and the unit of deployment. If module boundaries are unclear, all three break.

The failure it prevents: Content coupling — where components directly access each other’s internals. When coupling is tight, a change anywhere affects everything. Velocity degrades proportionally to coupling.

Where it breaks: When modules share a database. The database becomes a hidden coupling point. Two services that look independent but share tables are not actually independent — they share a state boundary, which means they share a deployment boundary.


P3 — Composability

The principle: Well-designed components can be combined to produce new behaviours without surprising interactions between them. Unix pipes are the canonical example: each command does one thing; composing them produces arbitrary workflows.

Why it matters: Systems grow by composition. New features are added by combining existing components in new ways. If components have hidden side effects or implicit dependencies, composition produces unexpected results.

The failure it prevents: Action-at-a-distance bugs. When component A is composed with component B and something in component C breaks, the source of the problem is invisible to anyone who does not know about the hidden dependency.

Where it breaks: Stateful components that depend on a specific invocation order. A component that must be initialised before it can be composed, or that behaves differently depending on what was called before it, is not composable. It is just tightly coupled with extra steps.
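The Unix-pipe idea translates directly into function composition. The sketch below is illustrative (the `pipe` helper and the stage names are invented): each stage is a pure function with no hidden state, so stages can be recombined in new pipelines without surprising interactions.

```python
from functools import reduce
from typing import Callable

def pipe(*stages: Callable) -> Callable:
    """Compose stages left to right, Unix-pipe style:
    pipe(f, g)(x) == g(f(x))."""
    return lambda value: reduce(lambda acc, stage: stage(acc), stages, value)

# Each stage is pure: no side effects, no dependence on invocation order.
def strip_blanks(lines: list[str]) -> list[str]:
    return [line for line in lines if line.strip()]

def upper(lines: list[str]) -> list[str]:
    return [line.upper() for line in lines]

process = pipe(strip_blanks, upper)
```

Because no stage depends on what ran before it, `pipe(upper, strip_blanks)` is just as valid as `pipe(strip_blanks, upper)`; a stage that required prior initialisation would break this property.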


P4 — Separation of Concerns

The principle: Each component has one clearly defined responsibility. It does not mix concerns from different domains. Business logic does not contain database queries. Controllers do not contain authorisation logic. Payment processing does not contain email sending.

Why it matters: Mixed concerns produce components that change for multiple reasons and are tested with multiple scenarios in mind. When a single class handles business logic, database access, and logging, changing the logging framework requires re-testing the business logic.

The failure it prevents: The “god object” — a component that has grown to own everything, cannot be tested independently, and is changed by every engineer for every new feature.

Where it breaks: When the concern boundary is drawn incorrectly. A microservice boundary drawn by team rather than by domain concern may enforce separation in the wrong place, creating distributed-monolith problems where the services are decoupled in deployment but coupled in semantics.
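One common way to keep business logic free of database queries is to inject the persistence concern behind an interface. The sketch below is hypothetical (the `checkout` pricing rule and repository names are invented for illustration): the business rule can be tested with an in-memory repository and never touches SQL.

```python
from typing import Protocol

class OrderRepository(Protocol):
    """Persistence concern: how orders are stored is not checkout's problem."""
    def save(self, order_id: str, total: float) -> None: ...

class InMemoryOrders:
    """Test double for the persistence concern."""
    def __init__(self) -> None:
        self.saved: dict[str, float] = {}

    def save(self, order_id: str, total: float) -> None:
        self.saved[order_id] = total

def checkout(items: list[float], repo: OrderRepository, order_id: str) -> float:
    """Business concern only: pricing. Storage is delegated to the repository."""
    total = sum(items)
    if total > 100:           # hypothetical bulk-discount rule
        total *= 0.9
    repo.save(order_id, total)
    return total
```

Swapping the logging framework, or the database, now touches only the repository implementation; the pricing rule and its tests are unaffected.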


P5 — Idempotency

The principle: Applying the same operation multiple times produces the same result as applying it once. POST /payments with the same idempotency key produces one charge, not N charges, regardless of how many times the request is sent.

Why it matters: Networks are unreliable. Requests can be delivered more than once. Consumers can process messages more than once. If the operation is not idempotent, retries and at-least-once delivery semantics produce duplicate effects: double charges, duplicate emails, double inventory decrements.

The failure it prevents: Silent Data Corruption (F3 #9) from duplicate processing. The idempotency key pattern is the standard mechanism: client generates a unique key for each logical operation; server stores (key → result); duplicate requests return the stored result.

Where it breaks: When idempotency keys are not scoped correctly. An idempotency key that is reused across different operations by mistake converts multiple distinct operations into a single idempotent one — silently dropping the subsequent operations.
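The idempotency-key pattern described above can be sketched as follows. This is a minimal in-memory illustration (the `PaymentService` class is invented; a real server would persist the key-to-result map and scope keys per operation): duplicate requests replay the stored result instead of repeating the side effect.

```python
class PaymentService:
    """Sketch of the idempotency-key pattern: the server stores
    (key -> result) and returns the stored result for duplicates."""
    def __init__(self) -> None:
        self._results: dict[str, str] = {}
        self.charges_made = 0      # counts the real side effect

    def charge(self, idempotency_key: str, amount_cents: int) -> str:
        if idempotency_key in self._results:
            # Duplicate delivery: replay the stored result, charge nothing.
            return self._results[idempotency_key]
        self.charges_made += 1     # the side effect happens exactly once
        result = f"charged:{amount_cents}"
        self._results[idempotency_key] = result
        return result
```

Sending the same request N times produces one charge and N identical responses, which is exactly what at-least-once delivery requires.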


P6 — Reproducibility

The principle: The same inputs produce the same outputs, every time. Tests produce the same results in CI as on a laptop. Deployments produce the same behaviour in staging as in production. Builds produce the same binary from the same source.

Why it matters: Non-reproducible systems cannot be debugged. A bug that only happens in production, with data that cannot be reproduced in staging, has no reliable fix path. A build that produces different binaries from the same source cannot be audited or rolled back safely.

The failure it prevents: Heisenbugs — bugs that disappear when you look at them. The financial exchange incident at the start of this chapter: legacy code that was activated in production but had not been tested because the test environment did not match production.

Where it breaks: Hidden sources of non-determinism: timestamp calls inside business logic, random number generators without fixed seeds in tests, environment variables that differ between environments, race conditions in test infrastructure.
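Of the non-determinism sources listed above, unseeded random number generators are the easiest to fix. A minimal sketch (the function name and latency range are invented): pinning the seed and using a local generator, rather than the global one, makes the same inputs produce the same outputs on every run and every machine.

```python
import random

def sample_latencies(n: int, seed: int) -> list[float]:
    """Generate reproducible 'random' test data: the same (n, seed)
    always yields the same list, in CI and on a laptop alike."""
    rng = random.Random(seed)   # local generator: no hidden global state
    return [round(rng.uniform(1.0, 100.0), 3) for _ in range(n)]
```

The same discipline applies to the other sources: inject the clock instead of calling it inside business logic, and pin environment configuration per environment.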


P7 — Immutability

The principle: Once created, data does not change. New state is represented by new data, not by mutating existing data. Event logs, append-only stores, and functional data structures all implement this principle.

Why it matters: Mutable shared state is the source of almost every concurrency bug. If data cannot change, multiple readers can access it simultaneously without locks. If past state is preserved (not mutated), you have an audit trail and the ability to replay events.

The failure it prevents: Data Consistency Failure (F3 #4) from concurrent mutation. The event sourcing pattern (Thread T7) is immutability at the infrastructure level — the event log is append-only; current state is derived by replaying events.

Where it breaks: When immutability is applied at the wrong layer. Making every data structure immutable in a high-throughput write path creates excessive garbage collection pressure. Immutability must be applied where it provides the most correctness benefit, not uniformly everywhere.
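The append-only log with replay-derived state can be sketched in a few lines. This is an illustrative toy (the event names and `balance` function are invented, not the event sourcing machinery of Thread T7): events are frozen once created, and current state is computed by replaying them.

```python
from dataclasses import dataclass

@dataclass(frozen=True)      # events never change after creation
class Deposited:
    amount: int

@dataclass(frozen=True)
class Withdrawn:
    amount: int

def balance(events: list) -> int:
    """Current state is derived by replaying the append-only log;
    the log itself is the audit trail."""
    total = 0
    for event in events:
        if isinstance(event, Deposited):
            total += event.amount
        elif isinstance(event, Withdrawn):
            total -= event.amount
    return total

log = [Deposited(100), Withdrawn(30), Deposited(5)]
```

Because past events are never mutated, the replay is repeatable: running `balance(log)` twice gives the same answer, and any historical state can be reconstructed by replaying a prefix of the log.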


P8 — Locality

The principle: Data and the compute that processes it should live as close together as possible. Network hops are expensive. Disk reads are expensive. Reduce the distance between data and compute.

Why it matters: Every network call adds latency: typically 0.5ms within a datacenter, 100ms across the internet. A service that makes ten sequential downstream calls pays the sum of their latencies: at minimum ten network round trips, and never less than ten times the fastest call. Locality is the principle that keeps latency under control.

The failure it prevents: Latency Amplification (F3 #5) — the accumulation of small latencies across many hops into a large total latency that violates the SLO.

Where it breaks: When locality optimisation creates coupling. Colocating a service with its data may mean they must be deployed together — losing the independent deployability that modularity was supposed to provide. Locality and modularity are in tension; the tradeoff must be named (F4 #8: Coupling vs Cohesion).


P9 — Fault Tolerance

The principle: The system continues operating correctly when individual components fail. Not “avoids failure” — tolerates it.

Why it matters: In a distributed system with many components, at any given moment some subset of them are degraded or unavailable. A system designed under the assumption that all components are healthy will produce cascading failures when one fails. A fault-tolerant system has designed for failure as the expected condition.

The failure it prevents: Cascading Failures (F3 #2) — where one component failure propagates to its dependants, which propagate to their dependants, producing a full outage from a single root cause.

Where it breaks: When fault tolerance mechanisms themselves are not fault tolerant. A circuit breaker whose own failure leaves traffic flowing to the broken dependency, a retry policy that amplifies load rather than reducing it, a fallback that produces stale data without disclosing this to the caller — all are fault tolerance patterns that create new failure modes.
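The circuit breaker mentioned above can be sketched minimally. This is a simplified illustration (a production breaker would also have a half-open state that probes for recovery after a timeout; that is omitted here): after a threshold of consecutive failures, the breaker opens and rejects calls instead of hammering a failing dependency.

```python
from typing import Callable

class CircuitBreaker:
    """Minimal circuit-breaker sketch: trip open after `threshold`
    consecutive failures and fail fast until reset."""
    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.failures = 0

    @property
    def is_open(self) -> bool:
        return self.failures >= self.threshold

    def call(self, fn: Callable):
        if self.is_open:
            # Reject immediately rather than adding load to a failing dependency.
            raise RuntimeError("circuit open: failing fast")
        try:
            result = fn()
        except Exception:
            self.failures += 1   # count consecutive failures
            raise
        self.failures = 0        # any success resets the breaker
        return result
```

Note how the sketch embodies the caveat in the text: if the breaker's own state tracking is wrong, it either never trips (passing load through) or trips forever (a new outage of its own).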


P10 — Observability

The principle: You can tell what the system is doing from the outside, without modifying it. Metrics, logs, and traces are the three pillars. A system that cannot be observed cannot be debugged, cannot be optimised, and cannot have its SLOs enforced.

Why it matters: Observability Blindness (F3 #11) is not just a missing metric — it is an inability to determine what is happening in the system. During an incident, a system that cannot be observed may produce symptoms that are entirely disconnected from the root cause.

The failure it prevents: Observability Blindness (F3 #11) — the failure mode where the system is breaking and you cannot tell where or why.

Where it breaks: When observability is added retroactively. Adding logging and metrics to an existing system that was not designed for them is significantly harder than building them in. The principle is most powerful when treated as a first-class design requirement.
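Building observability in from the start can be as simple as instrumenting at the function boundary. A hypothetical sketch (the `timed` decorator and metric name are invented; real systems would export to a metrics backend rather than a module-level dict): latency is recorded on every call without modifying the function body.

```python
import time
from collections import defaultdict
from typing import Callable

# Toy metrics store: metric name -> list of observed durations in seconds.
METRICS: dict[str, list[float]] = defaultdict(list)

def timed(name: str) -> Callable:
    """Record the duration of every call under `name`, so behaviour can be
    observed from outside the function without changing it."""
    def decorator(fn: Callable) -> Callable:
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                METRICS[name].append(time.perf_counter() - start)
        return wrapper
    return decorator

@timed("handler.latency")
def handle_request(x: int) -> int:
    return x * 2
```

Because the instrumentation lives in the decorator, adding a new metric is a one-line change at design time; retrofitting the same coverage onto an uninstrumented codebase means touching every call site.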


P11 — Consistency

The principle: The system behaves the same way under the same conditions. The same request produces the same response. The same data in produces the same data out. Inconsistency is the enemy of debuggability and user trust.

Why it matters: Inconsistent systems are harder to test (different runs produce different results), harder to debug (the bug does not reproduce reliably), and harder to trust (users cannot predict what will happen).

The failure it prevents: Data Consistency Failure (F3 #4) — the class of failure where different components have different views of the same data, producing contradictory behaviour.

Where it breaks: In distributed systems, strong consistency has a cost. CAP (F9 #3) says during a network partition you must choose between consistency and availability. Eventual consistency is the deliberate relaxation of this principle to gain availability. The principle is not “always be consistent” — it is “be explicit about the consistency model you have chosen and design for it.”


P12 — Security Boundaries

The principle: Trust is not assumed; it is verified at every crossing. A security boundary is a point at which identity is verified and permissions are checked. Components on the same side of a boundary share a trust level. Crossing a boundary requires re-verification.

Why it matters: The most common security failures come from implicit trust that was not designed as explicit trust. Internal services that assume that any request from the internal network is authorised are one misconfigured network rule away from a breach.

The failure it prevents: Security Breach (F3 #10) via privilege escalation or unauthorised access.

Where it breaks: When security boundaries are coarse. A single boundary around the perimeter of the system — “everything inside is trusted” — means that any component breach inside the perimeter grants access to everything else. Zero Trust architecture applies the principle at every service-to-service call, not just at the perimeter.
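Verifying trust at every crossing, rather than once at the perimeter, can be sketched as a per-call permission check. This is an illustrative toy (the decorator and permission strings are invented; real systems would verify a signed token, not a plain set): every handler re-checks the caller instead of trusting "internal" traffic.

```python
from typing import Callable

def require_permission(permission: str) -> Callable:
    """Re-verify at the boundary: each call checks the caller's
    permissions rather than assuming internal traffic is authorised."""
    def decorator(fn: Callable) -> Callable:
        def wrapper(caller_permissions: set, *args, **kwargs):
            if permission not in caller_permissions:
                raise PermissionError(f"caller lacks '{permission}'")
            return fn(caller_permissions, *args, **kwargs)
        return wrapper
    return decorator

@require_permission("orders:read")
def get_order(caller_permissions: set, order_id: str) -> str:
    return f"order {order_id}"
```

A misconfigured network rule no longer grants access by itself: a request that reaches the service without the right permission is still rejected at the service's own boundary.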


P13 — Fail Fast

The principle: Detect errors at the earliest possible point and stop rather than propagating incorrect state. A database migration that fails validation should not run. A payment request with a missing field should return a 400 immediately, not a 500 after partial processing.

Why it matters: Errors caught early are cheap. Errors caught late — after incorrect state has propagated through multiple systems — are expensive and sometimes unrecoverable. The financial exchange incident: if the system had detected anomalous order flow and stopped within seconds, the exposure would have been millions, not billions.

The failure it prevents: Silent Data Corruption (F3 #9) — incorrect state that propagates silently because no component checked its assumptions.

Where it breaks: When Fail Fast is applied too aggressively at runtime rather than during design. A system that halts on any unexpected input will have poor availability in environments with legitimate edge cases. The principle applies most strongly to startup validation, migration validation, and explicit precondition checks in critical paths.
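The "400 immediately, not a 500 after partial processing" idea amounts to checking preconditions before any side effect. A hypothetical sketch (field names and messages are invented): the request is validated in full up front, and processing only begins once validation passes.

```python
REQUIRED_FIELDS = ("amount_cents", "currency", "customer_id")

def validate_payment(request: dict) -> None:
    """Check all preconditions before any side effect: a malformed
    request is rejected immediately, not mid-processing."""
    missing = [f for f in REQUIRED_FIELDS if f not in request]
    if missing:
        raise ValueError(f"missing fields: {missing}")   # the '400', up front
    if request["amount_cents"] <= 0:
        raise ValueError("amount must be positive")

def process_payment(request: dict) -> str:
    validate_payment(request)   # fail fast, before touching any state
    return f"processed {request['amount_cents']} {request['currency']}"
```

Because validation happens before the first write, a rejected request leaves no partial state behind, and the caller gets an immediate, specific error instead of a downstream failure.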


P14 — Least Privilege

The principle: Every component has only the access it needs to do its job and no more. The API gateway does not need write access to the user database. The read replica does not need to accept write connections. The batch job does not need access to production customer data.

Why it matters: Privilege not granted cannot be exploited. When components have broader access than needed, a compromise of that component gives an attacker access to more than the component needs. The blast radius of any breach is bounded by the privileges of the breached component.

The failure it prevents: Security Breach (F3 #10) via privilege escalation — an attacker who compromises a low-privilege component can access only what that component can access.

Where it breaks: When least privilege is not implemented at the infrastructure level. An IAM policy that grants S3:* because it was easier than specifying the exact bucket operations needed is a least-privilege failure. Most least-privilege failures are operational convenience choices, not design decisions.


P15 — Measure & Adapt

The principle: Systems are improved empirically, not by assumption. You do not know that a caching layer will improve performance until you measure it. You do not know that a new algorithm is faster until you benchmark it on production-like data.

Why it matters: Assumptions about system performance are reliably wrong. Engineers systematically misidentify bottlenecks without measurement. The optimisation you spent two weeks implementing may have been in a path that represents 0.1% of total load.

The failure it prevents: Premature optimisation, which adds complexity without measured benefit. It also guards against Goodhart's Law: measure the right things and improvement follows; measure the wrong things and the system optimises for the metric rather than the outcome.

Where it breaks: When measurement changes the system being measured. A profiler that adds 10% overhead changes the performance profile. An experiment with too small a sample produces statistically insignificant results that are treated as definitive.
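"Benchmark it before believing it" can be as lightweight as a few lines. A hypothetical sketch (the helper and the two concatenation strategies are invented examples): measure both candidates on representative input before choosing, rather than assuming which is faster.

```python
import timeit
from typing import Callable

def measure(fn: Callable, number: int = 1000) -> float:
    """Average seconds per call: measure before optimising."""
    return timeit.timeit(fn, number=number) / number

def concat_loop(n: int = 200) -> str:
    """Candidate A: repeated string concatenation."""
    s = ""
    for i in range(n):
        s += str(i)
    return s

def concat_join(n: int = 200) -> str:
    """Candidate B: build once with join."""
    return "".join(str(i) for i in range(n))
```

The caveat from the text applies to the measurement itself: a small `number` gives noisy results, and timing on non-production-like data can point at the wrong winner, so treat single runs as hints, not verdicts.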

