A user transfers $200 from savings to checking. The savings account lives on database shard A. The checking account lives on shard B. Two operations must happen: debit A by $200, credit B by $200. Both must succeed, or neither must. There is no acceptable outcome where $200 leaves savings without arriving in checking — or arrives in checking without leaving savings.
Either outcome leaves components that disagree about the truth, and that is FM4, Data Consistency Failure. It can take three months to find because it does not crash anything. The system stays up. It just quietly holds two contradictory beliefs, and the contradiction only surfaces when someone audits the numbers.
Why One Database Makes This Easy and Two Make It Hard
On a single database, the transfer is atomic for free. The database engine has total control over all the data it manages, so "all operations complete or none do" is enforced by the engine. Across two independent shards there is no single engine in control. Each shard can fail independently, and network messages between them can be delayed or lost. Atomicity across independent nodes therefore requires a protocol, and every such protocol carries costs that single-node transactions do not.
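The single-database case can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not a production ledger: the `accounts` table, the `transfer` function, and the overdraft rule are assumptions made for the example. The point is that one engine owns both rows, so a failure partway through rolls back the debit as well.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst inside one local transaction."""
    try:
        # `with conn` commits if the block succeeds and rolls back
        # if an exception escapes it.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
            row = conn.execute(
                "SELECT balance FROM accounts WHERE id = ?", (src,)
            ).fetchone()
            if row is None or row[0] < 0:
                # Assumed business rule: no overdrafts. Raising here
                # undoes the debit that was just applied.
                raise ValueError("insufficient funds or unknown account")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
        return True
    except ValueError:
        return False


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)",
    [("savings", 500), ("checking", 100)],
)
conn.commit()

ok = transfer(conn, "savings", "checking", 200)       # succeeds
failed = transfer(conn, "savings", "checking", 1000)  # rolled back
balances = dict(conn.execute("SELECT id, balance FROM accounts"))
```

After both calls the balances reflect only the successful transfer. Nothing equivalent exists once the two rows live in engines that cannot see each other.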
Two-Phase Commit and the In-Doubt Trap
Two-Phase Commit (2PC) is the fundamental protocol. A coordinator collects votes, then issues a decision. Phase 1, PREPARE: the coordinator asks every participant to lock its resources and vote. Phase 2, COMMIT/ABORT: if every participant voted COMMIT, the coordinator tells everyone to commit; if any participant voted ABORT or failed to respond, it tells everyone to abort.
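The two phases can be sketched in memory as follows. This is a toy model under stated assumptions: `Participant`, `two_phase_commit`, and the overdraft check are hypothetical names and rules chosen for illustration, there is no network, and nothing crashes. It only shows the happy-path control flow: stage and vote, then apply one shared decision.

```python
from dataclasses import dataclass, field


@dataclass
class Participant:
    name: str
    balance: int
    pending: dict = field(default_factory=dict)  # txid -> staged delta

    def prepare(self, txid, delta):
        """Phase 1: stage the change and vote. True means COMMIT."""
        if self.balance + delta < 0:
            return False  # vote ABORT: the change would overdraw
        self.pending[txid] = delta
        return True

    def commit(self, txid):
        """Phase 2: apply the staged change."""
        self.balance += self.pending.pop(txid)

    def abort(self, txid):
        """Phase 2: discard the staged change, if any."""
        self.pending.pop(txid, None)


def two_phase_commit(txid, work):
    """work is a list of (participant, delta). Returns the decision."""
    # Phase 1: collect a vote from every participant.
    votes = [p.prepare(txid, delta) for p, delta in work]
    decision = "COMMIT" if all(votes) else "ABORT"
    # Phase 2: every participant executes the same decision.
    for p, _ in work:
        if decision == "COMMIT":
            p.commit(txid)
        else:
            p.abort(txid)
    return decision


shard_a = Participant("shard A (savings)", 500)
shard_b = Participant("shard B (checking)", 100)

first = two_phase_commit("t1", [(shard_a, -200), (shard_b, +200)])
second = two_phase_commit("t2", [(shard_a, -1000), (shard_b, +1000)])
```

In the second transaction shard A votes ABORT, so shard B discards a change it had already staged, and neither balance moves.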
2PC has a critical flaw, and it is the one that produces FM4. Suppose the coordinator collects all COMMIT votes, writes its decision, and then crashes — before sending a single Phase 2 message. The participants are now in-doubt. Each one has voted COMMIT and is holding its locks. It cannot commit on its own — for all it knows the coordinator decided ABORT. It cannot abort on its own — for all it knows the coordinator decided COMMIT and other participants have already committed, so aborting would violate atomicity. The participant's only option is to wait.
The consequence is not a wrong answer — it is no answer, and every resource locked in Phase 1 stays locked for the entire duration of the coordinator outage. Other transactions that touch those rows block. The outage is not bounded by the transaction; it is bounded by when the coordinator recovers. A coordinator whose storage fails permanently can leave participants blocked indefinitely.
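The blocking behaviour can be made concrete with a small sketch of a participant that has voted and is waiting. The `Shard` class, the `RowLocked` exception, and the single-row locking scheme are assumptions for illustration only. Real engines typically queue or time out the second transaction rather than raising immediately, but the effect is the same: no progress on that row until a decision arrives.

```python
class RowLocked(Exception):
    """The row is held by a prepared transaction awaiting a decision."""


class Shard:
    def __init__(self, rows):
        self.rows = dict(rows)
        self.locks = {}   # row key -> txid holding the lock
        self.staged = {}  # txid -> (row key, delta)

    def prepare(self, txid, key, delta):
        """Phase 1: lock the row, stage the change, vote COMMIT."""
        if key in self.locks:
            raise RowLocked(f"{key} is locked by {self.locks[key]}")
        self.locks[key] = txid
        self.staged[txid] = (key, delta)
        return "COMMIT"

    def resolve(self, txid, decision):
        """Phase 2: only the coordinator's decision releases the lock."""
        key, delta = self.staged.pop(txid)
        if decision == "COMMIT":
            self.rows[key] += delta
        del self.locks[key]


shard = Shard({"savings": 500})
shard.prepare("t1", "savings", -200)  # voted; coordinator now crashes

# An unrelated transaction touching the same row cannot proceed.
try:
    shard.prepare("t2", "savings", -50)
    blocked = False
except RowLocked:
    blocked = True

# Only when the coordinator recovers and delivers its decision
# does the row become usable again.
shard.resolve("t1", "COMMIT")
vote_after_recovery = shard.prepare("t2", "savings", -50)
```

Note that `Shard` deliberately has no method that lets it decide on its own. That absence is the in-doubt state.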
The mitigation is one line that teams treat as an implementation detail and should not: the coordinator must write its decision to a durable write-ahead log before sending any Phase 2 message. On restart it reads the log and re-sends every unacknowledged message. Without that log, a coordinator crash after Phase 1 leaves the transaction permanently inconsistent — that is FM12 (Split-Brain), and it is FM4 made permanent. The log is not an optimisation. It is the recovery mechanism without which 2PC provides no durability guarantee at all.
2PC — the in-doubt trap that produces FM4
Coordinator             Participant A         Participant B
│ PREPARE ─────────────▶│                     │
│ PREPARE ───────────────────────────────────▶│
│◀──── vote COMMIT ─────│                     │
│◀──── vote COMMIT ───────────────────────────│
│ write decision to log │                     │
╳ CRASH (before any Phase 2 message)
                        │ in-doubt:           │ in-doubt:
                        │ locks held,         │ locks held,
                        │ cannot commit       │ cannot abort
                        └─────── wait ────────┘
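The durable-log rule can be sketched as an append-only file that is flushed to disk before any Phase 2 message goes out. Everything named here is an assumption for illustration: the `CoordinatorLog` class, the JSON-lines record format, the `recover` function, and the `send` callback. The sketch also ignores torn writes and log compaction, which a real write-ahead log has to handle.

```python
import json
import os
import tempfile


class CoordinatorLog:
    def __init__(self, path):
        self.path = path

    def append(self, record):
        """Write one record and force it to stable storage."""
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())  # must complete before Phase 2 starts

    def replay(self):
        """Rebuild per-transaction state from the log."""
        state = {}
        if not os.path.exists(self.path):
            return state
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                rec = json.loads(line)
                if rec["type"] == "decision":
                    state[rec["txid"]] = {
                        "decision": rec["decision"],
                        "participants": rec["participants"],
                        "acked": set(),
                    }
                elif rec["type"] == "ack":
                    state[rec["txid"]]["acked"].add(rec["participant"])
        return state


def recover(log, send):
    """On restart, re-send each decision to participants without an ack."""
    resent = []
    for txid, st in log.replay().items():
        for participant in st["participants"]:
            if participant not in st["acked"]:
                send(participant, txid, st["decision"])
                resent.append((participant, txid, st["decision"]))
    return resent


path = os.path.join(tempfile.mkdtemp(), "coordinator.log")
log = CoordinatorLog(path)
log.append({"type": "decision", "txid": "t1", "decision": "COMMIT",
            "participants": ["A", "B"]})
log.append({"type": "ack", "txid": "t1", "participant": "A"})
# Simulated crash: a fresh object stands in for the restarted process.
delivered = []
resent = recover(CoordinatorLog(path),
                 lambda p, txid, d: delivered.append((p, txid, d)))
```

Because the decision was on disk before the crash, the restarted coordinator can tell participant B what was decided. Participants must treat a repeated decision as harmless, since recovery may re-send a message that was in fact delivered.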
The Saga Alternative
Sagas replace one distributed transaction with a sequence of local transactions, each with a compensating transaction that undoes it. An order saga: reserve inventory → charge payment → update order → notify warehouse. If charging payment fails, the compensations run in reverse: release inventory.
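A minimal orchestrator for the order saga might look like the sketch below. The `run_saga` function, the step names, and the in-memory `state` dictionary are hypothetical, and the compensations here are assumed never to fail, which is exactly the assumption that breaks in practice.

```python
def run_saga(steps):
    """steps is a list of (name, action, compensation).

    Runs each action in order. On the first failure, runs the
    compensations of the completed steps in reverse order.
    Returns (succeeded, log).
    """
    log = []
    completed = []
    for name, action, compensate in steps:
        try:
            action()  # one local transaction in one service
            log.append(f"did {name}")
            completed.append((name, compensate))
        except Exception as exc:
            log.append(f"failed {name}: {exc}")
            for prev_name, prev_compensate in reversed(completed):
                prev_compensate()
                log.append(f"compensated {prev_name}")
            return False, log
    return True, log


state = {"reserved": 0, "charged": 0}


def reserve_inventory():
    state["reserved"] += 1


def release_inventory():
    state["reserved"] -= 1


def charge_payment():
    raise RuntimeError("card declined")


def refund_payment():
    state["charged"] -= 1


ok, log = run_saga([
    ("reserve inventory", reserve_inventory, release_inventory),
    ("charge payment", charge_payment, refund_payment),
    ("update order", lambda: None, lambda: None),
])
```

The failed step itself is not compensated, because its local transaction never committed. Only the steps that completed before it are undone.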
A saga never blocks — every participant uses its own local transaction and makes progress. But it has a fundamentally different consistency model. Between steps, the system is in an intermediate state — inventory reserved, payment not yet charged — and other transactions can observe it. In most business contexts that is fine; "inventory reserved pending payment" is a valid business state, not an arbitrary inconsistency. In a financial domain it may be unacceptable. Know which domain you are in.
The Tradeoff: Consistency vs Availability
This is AT1 (Consistency vs Availability) in its purest form. 2PC maximises consistency (the coordinator decides, and everyone executes the same decision) at the cost of availability: it blocks during coordinator failure. Sagas maximise availability by always making progress, at the cost of observable intermediate inconsistency. There is no perfect option. Every distributed transaction protocol either may block or weakens its consistency guarantee. You are choosing which.
There is also AT8 (Coupling vs Cohesion): sagas decouple services — each step uses its own service's local transaction — but the price is the observable inconsistency window between steps.
Where It Fails: Cascading Compensation
The saga's signature failure is FM2 (Cascading Failures). If a compensation fails — the payment service is down exactly when the saga needs to issue a refund — the saga cannot complete and is stuck in a partially-compensated state. If other sagas are waiting on resources this stuck saga holds — inventory items, payment slots — they fail too. A compensation failure in one saga propagates to block unrelated ones. The defences: retry compensations with exponential backoff, alert on repeated failures, and design compensations to be idempotent — safe to run multiple times — because the orchestrator may retry them.
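The three defences can be sketched together. The `RefundService` class is a hypothetical stand-in for a payment service that is down for the first few calls, and `compensate_with_retry`, the idempotency key, and the delay values are illustrative assumptions. The sketch omits jitter, which real retry loops usually add to avoid synchronised retries.

```python
import time


class RefundService:
    """Hypothetical payment service that fails before recovering."""

    def __init__(self, failures_before_recovery):
        self.remaining_failures = failures_before_recovery
        self.refunded = {}  # idempotency key -> amount

    def refund(self, key, amount):
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise ConnectionError("payment service unavailable")
        # Idempotent: repeating a key is a no-op, not a second refund.
        self.refunded.setdefault(key, amount)


def compensate_with_retry(op, max_attempts=5, base_delay=0.01,
                          sleep=time.sleep, alert=print):
    """Retry op with exponential backoff. Returns the attempt count."""
    for attempt in range(1, max_attempts + 1):
        try:
            op()
            return attempt
        except ConnectionError as exc:
            if attempt == max_attempts:
                # Repeated failure: surface it to a human, then give up.
                alert(f"compensation failing after {attempt} attempts: {exc}")
                raise
            sleep(base_delay * 2 ** (attempt - 1))


service = RefundService(failures_before_recovery=2)
delays = []  # record the backoff delays instead of actually sleeping
attempts = compensate_with_retry(
    lambda: service.refund("order-42-refund", 200),
    sleep=delays.append,
)

# The orchestrator may retry a compensation that already succeeded.
service.refund("order-42-refund", 200)
total_refunded = sum(service.refunded.values())
```

The idempotency key is what makes the retry safe: without it, the duplicate call at the end would refund the customer twice.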
Real Systems
Google Spanner runs a variant of 2PC, but the coordinator role is distributed across Paxos group leaders: coordinator state is replicated, so a single coordinator failure is non-blocking. This is the production answer to the in-doubt problem: a quorum holds the decision, so no single node failure can leave participants in-doubt. DynamoDB's TransactWriteItems API provides all-or-nothing semantics across up to 100 items per request (the limit was 25 until 2022), with a transaction record acting as the durable coordinator log. Stripe and Shopify use the saga pattern for payment flows, where each step (fraud check, card authorisation, ledger update) has a compensation. Apache Kafka transactions use a 2PC-like protocol with a transaction-log topic as the durable coordinator log.
The One Sentence
FM4 takes three months to find because a data consistency failure does not crash — it leaves two components quietly disagreeing about the truth — and the only defences are choosing AT1 deliberately (2PC's blocking or a saga's visible intermediate state) and treating the coordinator's durable log not as an implementation detail but as the one mechanism standing between a crashed coordinator and a permanently inconsistent system.
Concept: FM4 — a data consistency failure — is two components quietly disagreeing about the truth without crashing anything.
Core Idea: Atomicity across independent nodes needs a protocol — 2PC, which can block in-doubt, or sagas, which expose an observable intermediate state.
Tradeoff: AT1 — Consistency vs Availability: 2PC maximises consistency and blocks on coordinator failure; sagas always progress but show intermediate inconsistency.
Failure Mode: FM12 — Split-Brain: a coordinator crash after Phase 1 with no durable log leaves the transaction permanently inconsistent.
Signal: When a numbers audit finds two systems disagreeing and nothing ever crashed, FM4 has been running silently.
Series: Book 3, Ch 26