Range partitioning assigns contiguous ranges of the key space to each shard. A routing table maps ranges to nodes.
// Range partition routing table — half-open ranges [min, max),
// so keys between letters (e.g. "Moo", which sorts after "M") are covered
routing_table = [
    {min: "A", max: "N", node: shard_1},
    {min: "N", max: "~", node: shard_2},  // "~" sorts after all letters
]

function route(key):
    for range in routing_table:
        if range.min <= key < range.max:
            return range.node
    raise RoutingError("key out of range")
// Range scan: efficient — all keys in a range live on the same shard
function range_scan(start_key, end_key):
    affected_shards = routing_table.filter(overlaps(start_key, end_key))
    results = []
    for shard in affected_shards:
        results += shard.scan(start_key, end_key)
    return merge_sorted(results)
The problem: if keys are generated sequentially (timestamps, auto-increment IDs), all writes go to the shard that owns the current key range. That shard receives 100% of write traffic while the others sit idle.
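The hotspot is easy to demonstrate in a few lines of Python. This is a sketch with illustrative shard names and a two-range alphabetical routing table (half-open ranges), not any particular system's API:

```python
from collections import Counter

# Illustrative two-shard routing table with half-open ranges [min, max).
# "~" sorts after all ASCII letters, acting as an upper sentinel.
ROUTING_TABLE = [
    ("A", "N", "shard_1"),
    ("N", "~", "shard_2"),
]

def route(key: str) -> str:
    for lo, hi, node in ROUTING_TABLE:
        if lo <= key < hi:
            return node
    raise KeyError(f"key out of range: {key}")

# Sequentially generated keys share a common prefix, so they all fall
# into the same range: every single write lands on shard_2.
writes = Counter(route(f"order_{i:06d}") for i in range(1000))
# writes == Counter({"shard_2": 1000})
```

With a real timestamp or auto-increment key the effect is the same: the shared prefix pins every new key to one range, and therefore to one shard.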
Hash partitioning maps each key to a shard by applying a hash function.
num_shards = 4

function route(key):
    shard_index = hash(key) % num_shards
    return shards[shard_index]
// Write: distributed evenly for random keys
function write(key, value):
    shard = route(key)
    shard.put(key, value)

// Point lookup: O(1) — hash the key, go directly to the shard
function read(key):
    shard = route(key)
    return shard.get(key)
// Range scan: NOT possible without scanning all shards
// Adjacent keys hash to arbitrary shards — there is no locality
function range_scan(start_key, end_key):
    results = []
    for shard in shards: // must scan ALL shards
        results += shard.scan_range(start_key, end_key)
    return merge_sorted(results)
The rebalancing problem: when num_shards changes from 4 to 5, the formula hash(key) % num_shards gives a different result for almost every key. Nearly all data must be moved.
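The cost is easy to quantify. In this sketch, integer keys stand in for hash values (so hash(k) = k), which is enough to count how many assignments change:

```python
# How much data moves when num_shards goes from 4 to 5 under
# modulo hashing? Count the keys whose shard assignment changes.
keys = range(10_000)
moved = sum(1 for k in keys if k % 4 != k % 5)
fraction_moved = moved / len(keys)
# fraction_moved == 0.8 — 80% of all keys must be migrated
```

The assignments k % 4 and k % 5 agree only when k mod 20 is 0, 1, 2, or 3, i.e. for 4 keys in every 20, so exactly 80% of keys move.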
Consistent hashing places both keys and nodes on a circular ring (hash space 0 to 2^32 − 1). A key is assigned to the first node clockwise from its hash position.
// Ring with virtual nodes for balance
ring = SortedMap() // maps hash positions to node identifiers

function add_node(node, virtual_node_count=150):
    for i in 0..virtual_node_count:
        position = hash(node.id + ":" + i)
        ring[position] = node

function route(key):
    key_position = hash(key)
    // Find the first node clockwise from key_position
    node_position = ring.find_first_ge(key_position)
    if node_position is None:
        node_position = ring.first() // wrap around
    return ring[node_position]
// When a new node joins, only the keys between its predecessor
// on the ring and the new node itself move.
// Expected fraction of keys that move = 1 / (num_nodes + 1)
function add_node_and_rebalance(new_node):
    add_node(new_node)
    predecessor = ring.predecessor(new_node.position)
    // Only keys in (predecessor, new_node] need to move
    migrate_keys(from: next_node_after(new_node),
                 to:   new_node,
                 key_range: (predecessor.position, new_node.position])
When a node joins a ring of N nodes, only about 1/(N + 1) of the keys move on average — compared to near-total movement for modulo hashing. Cassandra, Amazon DynamoDB, and Riak use consistent hashing.
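A compact runnable version of the ring makes the claim checkable. The class and function names here are my own, and MD5 stands in for the ring's hash function; this is a sketch, not any production implementation:

```python
import bisect
import hashlib

def ring_hash(s: str) -> int:
    # Position on a 32-bit ring
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2 ** 32)

class Ring:
    def __init__(self, vnodes: int = 150):
        self.vnodes = vnodes
        self.positions: list[int] = []   # sorted vnode positions
        self.owner: dict[int, str] = {}  # position -> node id

    def add_node(self, node: str) -> None:
        for i in range(self.vnodes):
            pos = ring_hash(f"{node}:{i}")
            bisect.insort(self.positions, pos)
            self.owner[pos] = node

    def route(self, key: str) -> str:
        # First vnode clockwise from the key's position
        idx = bisect.bisect_right(self.positions, ring_hash(key))
        if idx == len(self.positions):
            idx = 0  # wrap around the ring
        return self.owner[self.positions[idx]]

ring = Ring()
for n in ["node_1", "node_2", "node_3", "node_4"]:
    ring.add_node(n)

keys = [f"key_{i}" for i in range(10_000)]
before = {k: ring.route(k) for k in keys}
ring.add_node("node_5")
moved = sum(1 for k in keys if ring.route(k) != before[k])
# moved / 10_000 comes out near 1/5 — the keys node_5 now owns —
# versus ~80% for modulo hashing in the earlier example
```

The 150 virtual nodes per physical node keep the per-node share of the ring close to its expected value; with a single position per node, the variance in range sizes would be large.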
A hot shard occurs when one partition receives disproportionate traffic — reads, writes, or both.
// Scenario: celebrity user with 50 million followers
// All reads for that user's feed route to the same shard
function get_feed(user_id):
    shard = route(user_id) // always shard_3 for user "celebrity_A"
    return shard.read_feed(user_id)

// Shard 3 receives millions of requests/second
// Other shards sit at 5% utilisation
// Mitigation 1: key splitting — append a random suffix
function write_celebrity_data(user_id, data):
    suffix = random_int(0, 9)
    shard_key = user_id + ":" + suffix // spreads across 10 shards
    route(shard_key).put(shard_key, data)
// Trade-off: reads must now fan out to all 10 suffixed keys
// Mitigation 2: read replicas for hot keys
function read_hot_key(key):
    replicas = hot_key_registry.get_replicas(key)
    chosen = replicas[random_int(0, len(replicas) - 1)]
    return chosen.read(key)
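Key splitting trades write balance for read fan-out: a value written under a random suffix can be under any of the 10 suffixed keys, so every read must query all suffixes and merge the partial results. A minimal sketch of both sides, using an in-memory dict as a stand-in for the shards and hypothetical function names:

```python
import random

NUM_SUFFIXES = 10

# Toy stand-in for the sharded store: suffixed key -> list of items
store: dict[str, list[str]] = {}

def write_split(user_id: str, item: str, rng: random.Random) -> None:
    # Write side: pick a random suffix so writes spread across shards
    suffix = rng.randrange(NUM_SUFFIXES)
    store.setdefault(f"{user_id}:{suffix}", []).append(item)

def read_split(user_id: str) -> list[str]:
    # Read side: fan out to every suffixed key and merge
    results: list[str] = []
    for suffix in range(NUM_SUFFIXES):
        results.extend(store.get(f"{user_id}:{suffix}", []))
    return sorted(results)

rng = random.Random(42)
for i in range(100):
    write_split("celebrity_A", f"post_{i:03d}", rng)
posts = read_split("celebrity_A")
# All 100 posts come back, regardless of which suffix each landed on
```

Because of this 10x read amplification, key splitting suits write-hot keys, while the read-replica approach (Mitigation 2) suits read-hot keys.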