CAP Theorem: Consistency vs Availability

Consistency, Availability, Partition Tolerance. Choose two.

Every engineer has heard this. Most treat it as a general principle — something to mention in system design interviews before moving on. The ones who have operated distributed systems at scale treat it differently. To them it is not a principle. It is a wall. Every distributed database, every replication strategy, every caching layer eventually hits it.

The reason CAP theorem matters is not that it tells you what to build. It is that it tells you what you cannot have. And you cannot negotiate with it.

What category of law this is

CAP theorem is a mathematical constraint — formally proven by Seth Gilbert and Nancy Lynch at MIT in 2002, building on Eric Brewer's conjecture from 2000. It belongs in the same category as Amdahl's Law and Little's Law. Not a rule of thumb. Not a best practice. Not a design guideline. A theorem. The proof exists. The constraint is real.

This matters because engineers sometimes treat CAP as something to work around with clever architecture. You cannot work around a proven theorem. You can only understand which side of the constraint your system is on.

What the theorem actually says

Three properties. Any distributed system can guarantee at most two simultaneously:

Consistency — every read returns the most recent write, or an error. Not "eventually" — immediately. Every node in the system sees the same data at the same time. If you write to node A and then read from node B, node B returns what you just wrote.

Availability — every request to a non-failing node receives a response. Not a guarantee that the response contains the most recent data — just a guarantee that you get a response. The system never refuses to answer.

Partition Tolerance — the system continues operating when network partitions occur. A network partition is when some nodes cannot communicate with other nodes. Messages are dropped. The network splits into two islands that cannot see each other.

The theorem states: when a partition occurs, you must choose between consistency and availability. You cannot have both.

The critical clarification most engineers miss: partition tolerance is not optional. Networks partition. Hardware fails. Cables are cut. Cloud availability zones lose connectivity. If your system stores data on more than one machine, partitions will happen. You are not choosing between all three properties. You are choosing between consistency and availability when a partition occurs.

CAP is really a binary choice: CP or AP.

What CP and AP mean in production

CP systems — Consistency over Availability

When a partition occurs, a CP system refuses to serve requests it cannot guarantee are consistent. It returns an error or times out rather than return potentially stale data.

The production implication: your system becomes unavailable during a partition. Users see errors. Requests fail. But every successful response is correct — no stale data, no conflicting writes, no phantom reads.

When to choose CP: payment systems, inventory management, financial ledgers, any system where serving wrong data is worse than serving no data. A banking system that shows a user the wrong balance is more dangerous than one that returns an error and asks them to try again.

Real systems that choose CP: HBase, Zookeeper, etcd, most SQL databases configured with synchronous replication.

AP systems — Availability over Consistency

When a partition occurs, an AP system continues serving requests from whatever data it has locally, even if that data may be stale. It never returns an error due to a partition — it returns its best current answer.

The production implication: your system stays up during a partition. Users get responses. But those responses may be based on data that has not received recent writes from the other side of the partition. After the partition heals, the system reconciles the diverged state — this is called eventual consistency.

When to choose AP: social media feeds, product catalogues, DNS, caching layers, any system where brief staleness is acceptable and availability is critical. A social feed that shows a post from two minutes ago rather than two seconds ago is acceptable. A feed that returns an error is not.

Real systems that choose AP: Cassandra, DynamoDB (by default), CouchDB, most CDNs.

The failure mode on each side

FM4 — Data Consistency Failure on the AP side

An AP system during a partition accepts writes on both sides independently. Node A accepts a write. Node B accepts a different write to the same key. When the partition heals, both writes exist. The system must reconcile them — last-write-wins, vector clocks, application-level merge logic. If reconciliation is wrong or missing, you have silent data corruption: the database contains data that no user wrote and that violates application invariants.

The signal: after a network event, users report seeing data they did not write, or data they wrote has disappeared.

FM12 — Split-Brain on the CP side

A CP system during a partition must decide which partition is authoritative. If it gets this wrong — if both partitions believe they are the leader and continue accepting writes — you have split-brain: two nodes both think they are correct, both accepting writes, both unaware of each other's state. When the partition heals, there is no safe reconciliation. Data is lost or corrupted.

The signal: after a partition heals, two nodes have conflicting state with no clear winner.

Properly implemented CP systems prevent split-brain by requiring a quorum — a majority of nodes must agree before any write is accepted. If a partition leaves a node with fewer than half the cluster, that node stops accepting writes. It cannot form a majority. This is the correct behaviour: better to be unavailable than to risk split-brain.

The three questions that determine which side to choose

Question 1: What is the cost of serving stale data?

If a user sees a product listed as available when it sold out thirty seconds ago, the cost is an embarrassing error message at checkout. Acceptable. If a user sees an account balance that does not reflect a recent withdrawal, the cost is a potential fraud vector. Not acceptable.

Map your data to this spectrum. Most data in most applications is closer to the product catalogue end than the bank balance end.

Question 2: What is the cost of returning an error?

If your checkout flow returns an error during a network partition, users cannot complete purchases. That is direct revenue loss. If your analytics dashboard returns an error during a network partition, analysts see a blank screen for a few minutes. That is inconvenience.

Most applications have mixed data — some requires CP, some can tolerate AP. Modern distributed databases like Cassandra allow per-operation consistency levels. You can require strong consistency for writes to critical fields while allowing eventual consistency for reads of less critical data.

Question 3: How often do partitions occur in your infrastructure?

In a single data centre with quality hardware, partitions are rare — perhaps once a year. In a multi-region deployment, partitions between regions are routine — every few weeks. The tradeoff has different weight depending on how often you will actually face it.

What the theorem does not say

CAP theorem describes behaviour during partitions. It says nothing about behaviour when there is no partition — which is most of the time. When the network is healthy, you can have both consistency and availability. The choice only matters when something breaks.

This is why the theorem is often misapplied. Engineers design their entire system for the partition case — accepting latency penalties and coordination overhead — when their infrastructure rarely partitions. The correct application is to understand which partition case you are in, design for it, and not over-engineer for edge cases that your infrastructure handles well.

The PACELC extension to CAP addresses this: even without a partition, there is a latency-consistency tradeoff. Stronger consistency requires more coordination between nodes, which adds latency. This is the tradeoff that affects your system every request, not just during partitions.

The signal that tells you this applies to your system

You have a distributed database. A deployment fails halfway through. Half your nodes have the new schema. Half have the old one. Reads from different nodes return different shapes of data. Your application crashes on the inconsistent response.

Or: a network event separates your primary from your replica for forty seconds. During those forty seconds, forty writes went to the primary. The replica served reads from its stale copy. A user checked their order status twice — once during the partition, once after. They got two different answers.

Both scenarios are CAP theorem in production. Neither is a bug in the usual sense. They are the theorem applying to your system. The question is whether you designed for them.

CAP Theorem: The Constraint That Defines Every Distributed System