A hash function, a load balancer, and a database shard walk into production. They are the same algorithm.
The function takes a large input space and maps it to a small output space. Every layer of the computing stack uses this operation. The shape never changes. The consequences of getting it wrong escalate with each layer.
The Mathematical Root
Hashing starts as modular arithmetic. Take any integer k. Compute k mod N. The result is always between 0 and N-1.
This is a many-to-one mapping. Many inputs collapse to the same output. That collision is not a bug. It is the fundamental property. The entire design question at every layer is: what happens when two inputs land on the same output?
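A few lines of Python make the collapse concrete; the modulus of 10 and the sample integers are arbitrary, picked only so that several inputs land on one output:

N = 10
for k in (7, 17, 1_007, 2_000_017):
    # Every one of these integers collapses to bucket 7: a many-to-one mapping.
    print(f"{k} mod {N} = {k % N}")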
The Algorithm Level
A hash table maps keys to array positions using h(k) mod N. Average lookup: O(1). This makes hash tables the default choice for key-value access.
Collisions resolve through chaining or open addressing. The load factor determines when performance degrades. At low load factors, O(1) holds. Past 0.75, collision chains grow. Resizing doubles the table and rehashes every key.
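A minimal Python sketch shows chaining, the load-factor trigger, and the doubling rehash in one place; the class name, the 0.75 threshold, and the policy details are illustrative rather than any particular runtime's implementation:

class ChainedHashTable:
    # Toy hash table: separate chaining, doubling resize past load factor 0.75.

    def __init__(self, capacity=8):
        self.buckets = [[] for _ in range(capacity)]
        self.size = 0

    def put(self, key, value):
        if self.size / len(self.buckets) > 0.75:
            self._resize()                      # O(n): touches every entry
        chain = self.buckets[hash(key) % len(self.buckets)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain[i] = (key, value)         # overwrite existing key
                return
        chain.append((key, value))
        self.size += 1

    def get(self, key):
        chain = self.buckets[hash(key) % len(self.buckets)]
        for k, v in chain:
            if k == key:
                return v                        # collision cost: a chain walk
        raise KeyError(key)

    def _resize(self):
        old = self.buckets
        self.buckets = [[] for _ in range(len(old) * 2)]
        for chain in old:
            for k, v in chain:                  # rehash every existing key
                self.buckets[hash(k) % len(self.buckets)].append((k, v))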
That rehash moves every existing entry. The cost is O(n). The system absorbs it because the table lives in one process on one machine. This assumption breaks at the next layer.
The Infrastructure Level
Distributed caches run the same function: h(key) mod N, where N is the number of cache nodes.
Add one node. N changes from 10 to 11. The function h(key) mod 11 produces different results for nearly every key. Approximately 90% of cached data now maps to the wrong node. Every request misses cache and hits the database.
Naive hashing: add node 11 to a 10-node cluster
Before: h("user:42") mod 10 = 7 → node 7 (cache hit)
After: h("user:42") mod 11 = 3 → node 3 (cache miss, DB query)
Result: ~90% of keys misrouted → database floods
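The figure is easy to reproduce; a small sketch, assuming MD5 as a stable hash and 100,000 synthetic keys:

import hashlib

def node_for(key, n_nodes):
    # MD5 gives a stable hash across processes (Python's hash() is salted).
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % n_nodes

keys = [f"user:{i}" for i in range(100_000)]
moved = sum(node_for(k, 10) != node_for(k, 11) for k in keys)
print(f"{moved / len(keys):.0%} of keys change nodes")   # roughly 91%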
Consistent hashing solves this. Place nodes on a circular hash space. Place keys on the same ring. Each key maps to the nearest node clockwise. Adding a node only migrates keys from its immediate neighbour. One-eleventh of keys move instead of nine-tenths.
Consistent hashing ring (simplified):
            0
       ┌────┴────┐
N=11   │         │   N=3
       │  ring   │
N=8    │         │   N=5
       └────┬────┘
           N=7
Add a node between 8 and 11 at position 9:
Only keys in arc (8, 9] migrate.
All other keys stay on their current node.
~1/N of the keys move instead of nearly all of them.
Virtual nodes complete the design. Each physical node claims 100-200 positions on the ring. This distributes load evenly and prevents hotspots from uneven node placement. Amazon's Dynamo paper made this the standard approach.
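Both ideas fit in a short sketch: a sorted list of positions plus binary search gives the clockwise lookup, and virtual nodes are just extra positions per physical node. The MD5 hash, the 100 vnodes per node, and the names ConsistentHashRing, add_node, and node_for are assumptions made for illustration, not the Dynamo paper's implementation:

import bisect
import hashlib

def ring_pos(label):
    # Stable position on the ring for any string label.
    return int(hashlib.md5(label.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, nodes, vnodes=100):
        self._positions = []          # sorted vnode positions
        self._owner = {}              # vnode position -> physical node
        for node in nodes:
            self.add_node(node, vnodes)

    def add_node(self, node, vnodes=100):
        for i in range(vnodes):
            p = ring_pos(f"{node}#vnode{i}")
            bisect.insort(self._positions, p)
            self._owner[p] = node

    def node_for(self, key):
        # First vnode clockwise from the key's position, wrapping past the end.
        i = bisect.bisect_right(self._positions, ring_pos(key)) % len(self._positions)
        return self._owner[self._positions[i]]

ring = ConsistentHashRing(["cache-1", "cache-2", "cache-3"])
ring.node_for("user:42")
ring.add_node("cache-4")              # only keys near cache-4's vnodes migrate

Re-running node_for over a fixed key set before and after add_node shows only roughly a quarter of the keys changing owner here, the new node's fair share, instead of the near-total reshuffle of the modulo scheme.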
The System Design Level
Cassandra's partition strategy is consistent hashing at cluster scale. The partition key determines ring position. The replication factor determines how many nodes store each partition. The quorum formula R + W > N governs consistency guarantees, where N is the replication factor, R the number of replicas that must answer a read, and W the number that must acknowledge a write.
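The guarantee is a pigeonhole argument: when R + W > N, any R replicas consulted on a read must overlap any W replicas that acknowledged a write, so at least one copy the read sees is current. A minimal check, with illustrative values:

# N = replication factor, W = write quorum, R = read quorum (illustrative values).
N, W, R = 3, 2, 2
print(R + W > N)   # True: any 2 read replicas overlap any 2 write replicas
print(1 + 2 > N)   # False: with W=1, a 2-replica read can miss the only fresh copy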
A distributed ring eliminates the central coordinator. Every node can route any request without asking a master. The ring trades simplicity for horizontal scalability.
DynamoDB, Riak, and Cassandra all use variants of the same ring. The partition strategy determines data locality. The replication strategy determines fault tolerance. Both rest on the same h(k) mod N that started in a single-process hash table.
The Leadership Level
Technology leaders face the same mapping problem with teams. A monolithic codebase is one large space. Teams are a small number of units. Assigning code ownership is hashing: mapping a large space of responsibilities to a small number of teams.
Poor partitioning creates coupling. Two teams share one module. Neither can deploy independently. Conway's Law predicts this: the system structure mirrors the team structure.
What Stays Constant
At every layer, the core operation is identical. Map a large space to a small space. What changes is the consequence of collision.
In a hash table, collision costs a pointer chase. In a distributed cache, collision after a resize costs a cache stampede. In a database cluster, collision means hotspotting on a single partition. In an organisation, collision means two teams blocked on one module.
The algorithm stays the same. The blast radius grows with each layer.
Concept: Hashing across the computing stack
Tradeoff: AT5 — a single hash table is simple; a distributed ring eliminates the coordinator but adds partition management and quorum complexity
Failure Mode: FM6 — uneven hash distribution concentrates load on one node, one partition, or one team
Signal: You are assigning items from a large set to a small number of buckets; the blast radius of a collision is your layer's failure mode
Series: Book 2, Ch 2