AT1 — Consistency vs Availability: The Decision That Defines Your System

Your payment service and your social feed cannot make the same consistency choice. One must never show a wrong balance. The other must never show a loading spinner. These requirements are mutually exclusive during a network partition.

This is not a philosophy question. It is an engineering decision with a measurable cost in each direction.

The Tradeoff

Consistency vs availability is a dial, not a switch.

Consistent ◄────────────────────────────────────► Available

  Strong      Bounded     Session      Read-your    Eventual
  Consistency Staleness   Consistency  Writes       Consistency

  (every read  (reads up   (ordered     (you see    (replicas
  is current)  to N sec    within one   your own    converge
               old)        session)     writes)     eventually)

At the consistent end: every read returns the most recent write, guaranteed. At the available end: every request gets a response, guaranteed. During normal operation, you can have both. During a network partition, you cannot. The CAP Theorem proves this. Engineering is choosing where on the dial each of your data stores sits.

Side A: Consistency

A consistent system guarantees that every read reflects the most recent write. Two users reading the same data at the same time see the same value.

The cost is threefold.

Higher write latency. A consistent write must propagate to a quorum of replicas before acknowledging. A three-node cluster with quorum writes waits for two nodes to confirm, so the write is only as fast as the second-fastest node. One slow node is tolerated. If two are slow, the write waits.
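A minimal sketch of the quorum wait, assuming the write is sent to all replicas in parallel and a quorum is a simple majority. The function names and the latency figures are illustrative, not taken from any particular database.

```python
def quorum_size(replicas: int) -> int:
    """Smallest majority of a cluster: floor(n / 2) + 1."""
    return replicas // 2 + 1


def quorum_ack_latency_ms(replica_latencies_ms: list[float]) -> float:
    """Time until a majority of replicas has confirmed the write.

    With parallel fan-out, the write is acknowledged when the
    quorum-th fastest replica confirms.
    """
    needed = quorum_size(len(replica_latencies_ms))
    return sorted(replica_latencies_ms)[needed - 1]


# Hypothetical confirmation times for one write, in milliseconds.
print(quorum_size(3))                                # 2
print(quorum_ack_latency_ms([4.0, 9.0, 250.0]))      # 9.0
print(quorum_ack_latency_ms([4.0, 240.0, 250.0]))    # 240.0
```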

Reduced availability during partitions. When the network splits, the minority partition cannot form a quorum. It rejects writes. Clients connected to the minority side see errors.
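The majority rule behind this can be sketched in a few lines. The function name is hypothetical, and the sketch assumes a simple-majority quorum over a fixed cluster size.

```python
def can_accept_writes(reachable_nodes: int, cluster_size: int) -> bool:
    """A side of a partition may commit writes only if it holds a strict majority."""
    return reachable_nodes > cluster_size // 2


# Five-node cluster split 3 / 2: only the larger side keeps accepting writes.
print(can_accept_writes(3, 5))  # True
print(can_accept_writes(2, 5))  # False

# Four-node cluster split 2 / 2: neither side has a majority.
print(can_accept_writes(2, 4))  # False
```

An even split is the worst case for a consistent store: both sides reject writes until the partition heals.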

Coordination overhead. Consensus protocols like Raft require at least one round trip between the leader and a majority of nodes on every write. This adds latency proportional to the network round-trip time between replicas.

The failure mode when you need consistency but do not enforce it: two nodes accept conflicting writes. The inventory says zero on one replica and five on another. A customer buys an out-of-stock item. The damage is financial and operational.

Side B: Availability

An available system responds to every request. No errors. No timeouts. The user always sees something.

The cost is also threefold.

Stale reads. A write to node A has not yet replicated to node B. A read from node B returns the old value. The staleness window depends on replication lag.
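A toy model of that window, assuming one primary and one follower with asynchronous replication. The class and its methods are invented for illustration. Real systems replicate continuously rather than in explicit steps.

```python
class AsyncReplicatedStore:
    """Primary acknowledges writes at once; the follower catches up later."""

    def __init__(self):
        self.primary = {}
        self.follower = {}
        self.pending = []  # writes acknowledged but not yet replicated

    def write(self, key, value):
        self.primary[key] = value
        self.pending.append((key, value))

    def replicate(self):
        """Apply all pending writes to the follower, ending the lag window."""
        for key, value in self.pending:
            self.follower[key] = value
        self.pending.clear()

    def read_from_follower(self, key):
        return self.follower.get(key)


store = AsyncReplicatedStore()
store.write("price", 10)
store.replicate()
store.write("price", 12)                    # acknowledged by the primary only
stale = store.read_from_follower("price")   # follower still holds 10
store.replicate()
fresh = store.read_from_follower("price")   # follower has caught up to 12
print(stale, fresh)                         # 10 12
```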

Conflicting writes. During a partition, both sides accept writes. When the partition heals, those writes must be reconciled. Last-write-wins is simple but lossy: the write with the earlier timestamp is silently discarded.
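A small sketch of how last-write-wins loses data, assuming each node stamps writes with its own wall-clock time. The Write type, the node names, and the stock numbers are hypothetical.

```python
from typing import NamedTuple


class Write(NamedTuple):
    timestamp: float  # wall-clock time assigned by the accepting node
    node: str
    value: int


def last_write_wins(a: Write, b: Write) -> Write:
    """Keep the write with the later timestamp; break ties by node name."""
    return max(a, b, key=lambda w: (w.timestamp, w.node))


# Both sides of a partition start from stock = 5, unaware of each other.
side_a = Write(timestamp=100.0, node="a", value=4)  # sold one unit
side_b = Write(timestamp=100.5, node="b", value=3)  # sold two units

winner = last_write_wins(side_a, side_b)
print(winner.value)  # 3, although three units were sold and 2 should remain
```

The merge is deterministic and cheap, but the sale recorded on node a disappears. Stores that cannot tolerate this need a merge that understands the operation, not only the final value.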

Eventual convergence. "Eventually consistent" means all replicas converge to the same value once writes stop, but the model sets no upper bound on how long that takes. For some systems the window is milliseconds. During a partition it can be hours.

The failure mode when the team does not know which consistency model is active: the system serves stale data and no dashboard shows it. Users report wrong values. Engineers cannot tell whether the data is stale, corrupted, or correct-but-delayed.

The Critical Case: Marketplaces

Marketplace systems expose this most painfully. A marketplace has two data domains with opposite requirements.

Payments must be consistent. A double-charge destroys user trust. Payment data must use strong consistency: quorum writes, serializable transactions, no stale reads on balance checks.

Product catalog can be available. A product listing that shows a price from 30 seconds ago is acceptable. A catalog search that returns results during a partition is better than returning an error.

The marketplace forces you to run both sides of the dial in the same system. This is why marketplace architectures split their data layer: a strongly consistent store for transactions and an eventually consistent store for browsing.

Choosing one model for the entire system fails in both directions. Consistent everywhere: the catalog goes down during minor partitions, users cannot browse, revenue drops. Available everywhere: the payment system serves stale balances, double-charges occur, refund costs mount.
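The split can be sketched as a per-store policy consulted on every read. This is an illustration under stated assumptions: the store names, the enum, and the exception are hypothetical, and a real system would get has_quorum from its replication layer.

```python
from enum import Enum


class PartitionBehavior(Enum):
    CONSISTENCY = "reject requests that cannot reach a quorum"
    AVAILABILITY = "serve the local copy, even if stale"


# One declared choice per data store, not one choice for the whole system.
STORES = {
    "payments": PartitionBehavior.CONSISTENCY,
    "catalog": PartitionBehavior.AVAILABILITY,
}


class QuorumUnavailable(Exception):
    """Raised when a consistent store cannot confirm it has the latest value."""


def read_during_partition(store: str, has_quorum: bool, local_value):
    behavior = STORES[store]
    if behavior is PartitionBehavior.CONSISTENCY and not has_quorum:
        raise QuorumUnavailable(f"{store}: refusing to serve a possibly stale value")
    return local_value


print(read_during_partition("catalog", has_quorum=False, local_value=19.99))  # 19.99
try:
    read_during_partition("payments", has_quorum=False, local_value=100)
except QuorumUnavailable as err:
    print(err)  # payments: refusing to serve a possibly stale value
```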

Three Deciding Questions

Before choosing a consistency model for any data store, answer these:

1. What is the cost of a wrong read? If a wrong read means a customer is charged twice, you need consistency. If it means a user sees yesterday's follower count, you can favor availability.

2. What is the cost of an unavailable read? If an unavailable checkout page during a flash sale costs revenue, availability matters. If an empty analytics dashboard during a brief outage costs nothing, consistency is fine.

3. How often do partitions occur in this network path? A single-region deployment partitions rarely. A multi-region deployment over the public internet partitions frequently. Rare partitions favor consistency. Frequent partitions favor availability.

These answers are different for every data store in the system.
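One way to make the three questions concrete is a toy scoring helper, run once per data store. The 0 to 3 cost scale, the field names, and the tie-break rule are assumptions made for this sketch. They do not replace the judgment the questions ask for.

```python
from dataclasses import dataclass


@dataclass
class StoreProfile:
    name: str
    wrong_read_cost: int        # question 1, hypothetical scale: 0 harmless .. 3 severe
    unavailable_read_cost: int  # question 2, same scale
    partitions_frequent: bool   # question 3


def suggest_model(profile: StoreProfile) -> str:
    """Compare the two costs; use partition frequency only to break a tie."""
    if profile.wrong_read_cost > profile.unavailable_read_cost:
        return "consistency"
    if profile.unavailable_read_cost > profile.wrong_read_cost:
        return "availability"
    return "availability" if profile.partitions_frequent else "consistency"


profiles = [
    StoreProfile("payments", wrong_read_cost=3, unavailable_read_cost=2, partitions_frequent=False),
    StoreProfile("follower_counts", wrong_read_cost=0, unavailable_read_cost=2, partitions_frequent=True),
]
for p in profiles:
    print(p.name, "->", suggest_model(p))
# payments -> consistency
# follower_counts -> availability
```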

The Unnamed Tradeoff

When the tradeoff is unnamed, the team does not know what their system chose. The database has a default consistency model. The application assumes a different one. During a partition, the behavior surprises everyone.

Every storage architecture document should contain one sentence: "During a network partition, this data store chooses [consistency / availability / bounded staleness], and here is why that fits this use case."

If that sentence is missing, the tradeoff is unnamed. Unnamed tradeoffs surface as incidents.
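A check like this could run against an architecture manifest to find stores where the sentence is missing. The manifest layout, the key names, and the store entries are hypothetical. The sketch only shows the shape of the check.

```python
ALLOWED_CHOICES = {"consistency", "availability", "bounded staleness"}

# Hypothetical manifest: one entry per data store.
manifest = {
    "payments": {
        "partition_choice": "consistency",
        "rationale": "a stale balance can cause a double-charge",
    },
    "catalog": {
        "partition_choice": "availability",
        "rationale": "a price that is 30 seconds old is acceptable",
    },
    "sessions": {},  # nobody wrote the sentence for this store
}


def unnamed_tradeoffs(stores: dict) -> list[str]:
    """Return stores that lack a valid partition choice or a reason for it."""
    missing = []
    for name, entry in stores.items():
        choice = entry.get("partition_choice")
        rationale = entry.get("rationale", "").strip()
        if choice not in ALLOWED_CHOICES or not rationale:
            missing.append(name)
    return sorted(missing)


print(unnamed_tradeoffs(manifest))  # ['sessions']
```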


Concept: AT1 — Consistency vs Availability

Tradeoff: AT1 — consistency guarantees correct reads at the cost of write latency and partition-time errors; availability guarantees responses at the cost of stale or conflicting data

Failure Mode: FM4 — conflicting writes corrupt state when consistency is needed but not enforced; FM11 — the team cannot diagnose failures when the consistency model is unnamed

Signal: When someone says "we need both consistency and availability," ask what happens during a network partition

Series: Book 4, Ch 1