Distributed Key-Value Store

Introduction

In 2007, Amazon published the Dynamo paper. The company had built a shopping cart service and kept running into the same problem: a relational database, no matter how well tuned, eventually became the bottleneck. Worse, the database’s strong consistency guarantee meant that a network partition — a routine event at Amazon’s scale — could take the entire cart service down. The paper described a different trade: give up strong consistency, accept eventual consistency, and in return get a system that stays available even when nodes fail and network partitions occur. Dynamo became the blueprint for a generation of distributed key-value stores.

The distributed key-value store is the foundational read/write primitive of large-scale systems. Understanding how it works — and what it costs — is prerequisite knowledge for every other system in this book.

The Problem

Store billions of key-value pairs across hundreds of servers. Support reads and writes at hundreds of thousands of operations per second. Tolerate individual server failures without downtime. Distribute data automatically without manual shard assignment. Scale horizontally by adding nodes.

The constraints are fundamental: no single server can hold all the data, no single server can handle all the traffic, and any server can fail at any time.

Naive Approach and Why It Fails

The naive approach is a hash table: node = hash(key) % N where N is the number of nodes. This works until a node is added or removed. When N changes, almost every key maps to a different node. All cached data is invalid simultaneously. All clients must be updated. The system experiences a thundering herd as all clients reload from the database.

This is the modulo problem: simple hash-based distribution is not resilient to topology changes. At scale, topology changes are routine — nodes fail, capacity is added, hardware is replaced. An algorithm that invalidates the entire key space on every topology change is unusable.

Architecture Walkthrough

Consistent Hashing

Place all nodes on a circular hash ring. Each node owns a range of the hash space. Keys map to the ring by their hash value and are assigned to the first node clockwise from that position.

When a node is added or removed, only the keys in the adjacent range on the ring are reassigned. Adding a node with hash position 200 between Node A and Node B transfers only the keys in range [100, 200] — not the entire key space.

Virtual nodes solve hotspotting: each physical node is assigned multiple positions on the ring (typically 100–200 virtual nodes). This distributes load more uniformly and prevents a single node from owning a disproportionate slice of the ring.

Replication

Each key is stored on N nodes — the node that owns the key plus the next N-1 nodes clockwise on the ring. With N=3, any key survives the simultaneous failure of two of its three owners.

Quorum Reads and Writes

Consistency is tunable via the quorum formula: a read or write is considered successful when R read replicas and W write replicas respond, where R + W > N guarantees that at least one node has the latest write.

N = 3 (replication factor)
W = 2 (write quorum)
R = 2 (read quorum)
R + W = 4 > N = 3 → strong consistency

W = 1, R = 1 → eventual consistency, lowest latency
W = 3, R = 1 → strongly consistent reads, slow writes
W = 1, R = 3 → strongly consistent reads, fast writes

Low quorum values prioritise availability and latency. High quorum values prioritise consistency. The choice is AT1 (Consistency/Availability).

Gossip Protocol

Nodes maintain cluster membership through gossip: each node periodically contacts a random peer and exchanges state. Membership changes (nodes joining or leaving) propagate through the cluster in O(log N) rounds. No central coordinator is required.

Gossip propagation:
Round 1: Node A tells Node B [A joined]
Round 2: Node B tells Node C [A joined, B knows]
Round 3: Node C tells Node D [all three know]
...
O(log N) rounds to reach all N nodes

Conflict Resolution

With asynchronous replication, two clients can write to the same key concurrently. Conflict resolution strategies: last-write-wins (LWW) using wall clock timestamps (simple, loses updates), vector clocks (tracks causal ordering, complex), or application-level merge (most flexible, requires client logic). Dynamo uses vector clocks with application-level conflict resolution for the shopping cart — a cart merge is semantically meaningful.

Data Flow

Key Design Decisions

AT1 — Consistency/Availability: The quorum parameters W and R are runtime-configurable in Cassandra and DynamoDB. This is not a build-time choice — it is a per-operation choice. Critical operations (payment confirmation) use high quorums. High-throughput operations (view count updates) use low quorums.

AT5 — Centralisation/Distribution: Gossip-based membership eliminates the central coordinator — a SPOF (FM1). Every node can serve any request. The coordinator role is assumed by whichever node receives the request, not by a designated master.

AT9 — Correctness/Performance: Vector clocks are more correct than last-write-wins but more expensive to propagate and reason about. Most production systems accept LWW for non-critical data and use application-level logic only where correctness matters.

Failure Modes in This System

FM12 — Split-Brain: When a network partition separates nodes into two groups, each group may accept writes to the same key. When the partition heals, both versions exist. Split-brain is not a failure that can be fully prevented — it is an inherent consequence of the Consistency/Availability tradeoff. The mitigation is detecting it (vector clocks surface conflicts) and resolving it deterministically.

FM6 — Hotspotting: Virtual nodes mitigate but do not eliminate hotspotting. A key accessed by millions of clients simultaneously — a trending topic, a celebrity’s profile — concentrates load on the nodes that own that key regardless of ring distribution. Application-level caching or request deduplication is the mitigation.

FM4 — Data Consistency Failure: Eventual consistency means reads can return stale data. If an application assumes consistency where the storage layer provides only eventual consistency, it will produce incorrect results. Every eventual consistency deployment requires explicit staleness bounds and client-side handling of stale reads.

How It Evolves at Scale

At 10×: the ring has more nodes. Virtual node count increases to maintain uniform distribution. Quorum sizes may decrease to maintain acceptable write latency as the cluster grows.

At 100×: geographic distribution becomes necessary. Multi-region replication introduces cross-region latency into the quorum path. Strong consistency across regions requires W and R values that span regions — adding 50–100ms to every write. Most systems relax to per-region consistency with async cross-region replication.

The evolution path is: single cluster → multi-rack cluster → multi-data-centre cluster → multi-region cluster. Each step increases the consistency/latency tradeoff.

Real-World Variants

Amazon DynamoDB provides tunable consistency per read (eventually consistent or strongly consistent). It adds automatic partitioning, DAX (a dedicated caching layer), and global tables for multi-region replication. The operational complexity of Dynamo is hidden behind a managed service API.

Apache Cassandra exposes the quorum parameters directly. It uses leaderless replication (no primary node) and LWW by default. It excels at write-heavy workloads and time-series data. Read performance requires careful data modelling — secondary indexes are costly.

Redis Cluster uses a different sharding approach: 16,384 hash slots, each assigned to a primary node. Clients maintain a routing table and contact the correct primary directly. It provides strong consistency within a single shard but no cross-shard transactions. Suitable for caching and session storage, not general-purpose distributed storage.

Concept: Distributed Key-Value Store

Thread: T1 (Hashing) ← Book 2, Ch 3 → Ch 4 (Distributed Cache); T6 (Redundancy) ← Book 3, Ch 5 → Ch 6 (API Gateway); T9 (Consensus) ← Book 3, Ch 8 → Ch 17 (Payment Processing)

Core Idea: Consistent hashing distributes keys across nodes so that topology changes affect only adjacent key ranges. Quorum reads and writes (R + W > N) provide tunable consistency without a central coordinator.

Tradeoff: AT1 — Consistency vs Availability: tunable quorum parameters let operators shift between strong consistency (high R+W) and high availability (low R+W) per operation.

Failure Mode: FM12 — Split-Brain: network partitions allow both sides to accept conflicting writes; vector clocks surface the conflict, but application logic must resolve it.

Signal: When a system needs to store billions of key-value pairs with high write throughput and must remain available during node failures.

Maps to: Book 0, Framework 6 (System Archetypes)

Exercises

Level 1 — Understand

What consistency model does the distributed KV store provide when W=1, R=1, N=3, and why does R+W > N guarantee that at least one node has the latest write?
Name three properties of consistent hashing that make it preferable to simple modulo hashing for distributed storage.
What is the purpose of virtual nodes in a consistent hash ring, and what problem do they solve?

Level 2 — Apply

A distributed KV store has N=5 replicas. You set W=3 and R=3. (a) Does R + W > N hold? (b) Is this configuration strongly consistent? (c) If you change to W=1, R=1, what consistency guarantee does the system provide? (d) What FM code describes the failure risk of the W=1, R=1 configuration?
A consistent hash ring has three nodes at positions 100, 300, and 500 on a 0–1000 ring. A new node joins at position 200. Which keys are reassigned? Which nodes are affected? How does this compare to a simple modulo scheme with N=3 adding one node to get N=4?

Level 3 — Design

A gaming platform stores player session state (current level, inventory, score) in a distributed KV store. Sessions are read on every game action (200ms intervals), written on every level-up or purchase. 10M concurrent players. Session reads must be strongly consistent — a player must always see their latest inventory. Design the quorum configuration. What AT1 tradeoff does strong consistency impose? What FM12 risk exists if the data centre hosting the primary quorum experiences a network partition? How would you detect and recover from it?

A complete answer will: (1) specify a concrete quorum configuration — e.g., N=3 replicas with W=2, R=2 satisfying W+R>N for strong consistency — and calculate the read/write QPS required (10M players at 200ms intervals = 50M reads/sec, requiring a sharded cluster), (2) name AT1 (Consistency/Availability) and state the cost precisely: strong quorum reads block until W replicas confirm, so write latency increases and availability drops during node failures, (3) name FM12 (network partition) and explain that a partition isolating the primary replica forces the quorum to either reject reads (preserving consistency) or serve stale data (violating it) — specify which side of CAP the design chooses, and (4) propose a concrete detection and recovery mechanism — e.g., heartbeat-based leader election with fencing tokens to prevent split-brain writes, and a read-repair protocol to reconcile diverged replicas after partition heals.

Read in the book →