Hashing in the Wild

Introduction

In 2012, GitHub added a feature: every file in every repository has a permanent address — the SHA-1 hash of its contents. Not a path. Not a filename. The hash. Change one byte in the file and the address changes. The file at the old address remains permanently available and permanently unchanged.

This is not a convenience feature. It is an architectural decision with a specific algorithmic basis. The hash function makes storage immutable. Immutable storage makes branching cheap. Cheap branching makes version control distributed. Every Git operation you have ever run — clone, branch, merge, push — depends on the properties of a hash function.

You built hash tables in Book 1, Chapter 22. You saw modular arithmetic on a ring in Chapter 39. Now those structures appear inside three systems you use daily. The shape is identical. The scale and the consequences are different.

Thread Activation

In Book 1, Ch 22, you built a hash table: a hash function maps keys to bucket indices; collisions are resolved by chaining or probing. In Ch 39, you saw modular arithmetic applied to a ring — the basis of consistent hashing.

This chapter traces Thread 1 (Hashing) into three production systems. Each system uses the same mathematical properties — uniform distribution, determinism, and (in the case of SHA-1) collision resistance — but applies them to different engineering problems. The thread continues in Book 3, Chapters 4–5, where consistent hashing becomes the foundation of distributed database sharding.

The Concept

A hash function takes an input of arbitrary length and produces a fixed-length output (the hash, digest, or fingerprint). Two properties make hash functions useful in engineering:

Determinism: the same input always produces the same output. Given the same content, the hash is always the same value.
Uniform distribution: outputs are distributed uniformly across the output space. This prevents hotspotting — no single output bucket is disproportionately likely.

Cryptographic hash functions add a third property:

Collision resistance: it is computationally infeasible to find two different inputs that produce the same output. This enables trust: if two files have the same SHA-256 hash, you can be confident (but not certain) they have identical content.

The engineering applications of hash functions vary, but each one exploits one or more of these three properties.

How It Works

Git: Content-Addressable Storage

Git stores every version of every file as a blob — an object identified by the SHA-1 hash of its contents. The file path is irrelevant to storage. The content determines the address.

This creates content-addressable storage (CAS). In a content-addressable store, the key is derived from the value, not assigned independently.

import hashlib

def git_blob_hash(content: bytes) -> str:
    """
    Compute the Git blob SHA-1 for file content.
    Git prepends 'blob <size>\0' before hashing.
    """
    header = f"blob {len(content)}\0".encode()
    sha1 = hashlib.sha1(header + content)
    return sha1.hexdigest()

# The same content always produces the same address
content = b"def hello():\n    return 'world'\n"
address = git_blob_hash(content)
# address is always "9b0d5e6a..." regardless of filename or location

The architectural consequence is profound. If you modify one file in a 10,000-file repository, only one blob changes. All other blobs are shared. Git does not copy unchanged files across commits — it references the same blobs. A branch is not a copy of the repository; it is a new commit object that points to the same tree of blobs. Creating a branch is O(1) regardless of repository size.

The hash function’s determinism enables deduplication. The hash function’s collision resistance enables trust — you can verify that a file you fetched matches the file that was committed, without trusting the network or the server.

Load Balancers: Consistent Hashing

A load balancer must distribute requests across N servers. The naive approach: server = servers[hash(request_key) % N]. This works until a server is added or removed. At that point, N changes, and hash(key) % N changes for almost every key. Every session that was pinned to server 3 is now pinned to a different server. Sessions break.

Consistent hashing solves this. Both keys and servers are mapped to the same circular ring using the same hash function. A key is served by the first server clockwise from the key’s position on the ring.

import hashlib
import bisect

class ConsistentHashRing:
    """
    Consistent hash ring: only O(K/N) keys migrate when a node is added or removed,
    where K is the total number of keys and N is the number of nodes.
    """
    def __init__(self):
        self._ring: dict[int, str] = {}   # hash position -> node name
        self._sorted_keys: list[int] = []

    def _hash(self, key: str) -> int:
        return int(hashlib.sha256(key.encode()).hexdigest(), 16)

    def add_node(self, node: str) -> None:
        h = self._hash(node)
        self._ring[h] = node
        bisect.insort(self._sorted_keys, h)

    def remove_node(self, node: str) -> None:
        h = self._hash(node)
        self._ring.pop(h, None)
        self._sorted_keys.remove(h)

    def get_node(self, key: str) -> str:
        if not self._ring:
            raise ValueError("Ring is empty")
        h = self._hash(key)
        # Find the first node clockwise from key's position
        idx = bisect.bisect_right(self._sorted_keys, h) % len(self._sorted_keys)
        return self._ring[self._sorted_keys[idx]]

When a server is added or removed, only the keys between the new/removed node and its predecessor on the ring migrate. On average, K/N keys migrate when one of N nodes changes — compared to nearly all K keys with modular hashing.

Virtual Nodes: Fixing Uneven Distribution

A three-node ring with one hash position per node produces uneven distribution. With three positions on a 360-degree ring, the spacing between them is random — one arc might cover 180 degrees, another 10 degrees, another 170 degrees. The server on the large arc handles far more traffic.

The fix is virtual nodes: each physical server is mapped to multiple positions on the ring (100–200 is typical). The positions for “server-A” are derived by hashing “server-A-1”, “server-A-2”, …, “server-A-200”.

class VirtualNodeRing(ConsistentHashRing):
    """
    Consistent hash ring with virtual nodes for even distribution.
    More virtual nodes per server = more even load distribution.
    """
    def __init__(self, virtual_nodes: int = 150):
        super().__init__()
        self._virtual_nodes = virtual_nodes

    def add_node(self, node: str) -> None:
        for i in range(self._virtual_nodes):
            virtual_key = f"{node}-vnode-{i}"
            h = self._hash(virtual_key)
            self._ring[h] = node
            bisect.insort(self._sorted_keys, h)

    def remove_node(self, node: str) -> None:
        for i in range(self._virtual_nodes):
            virtual_key = f"{node}-vnode-{i}"
            h = self._hash(virtual_key)
            self._ring.pop(h, None)
            if h in self._sorted_keys:
                self._sorted_keys.remove(h)

With 150 virtual nodes per server, the load deviation across servers drops to within a few percent, regardless of the number of physical servers.

Distributed Caches: Sharding by Key

A distributed cache (Redis Cluster, Memcached) must decide which cache node holds a given key. The key-to-node mapping must be consistent — the same key must always go to the same node — and must survive node additions and removals without invalidating most cached data.

This is exactly the load-balancer problem. Redis Cluster uses a variant: it divides the key space into 16,384 fixed hash slots (slot = crc16(key) % 16384). Each node owns a contiguous range of slots. When a node is added or removed, only the affected slots migrate.

# Redis Cluster slot assignment (simplified)
def redis_slot(key: str) -> int:
    """Returns the hash slot for a key (0–16383)."""
    import binascii
    # If key contains braces, only hash the content inside {}
    # This enables co-location: {user:123}:profile and {user:123}:settings
    # will always land on the same slot
    start = key.find('{')
    end = key.find('}', start)
    if start != -1 and end != -1 and end > start + 1:
        key = key[start + 1:end]
    return binascii.crc_hqx(key.encode(), 0) % 16384

The hash tag {} mechanism is a pragmatic escape hatch: if two keys must be on the same node (for atomic operations), wrap the shared identifier in braces. The hash function then only hashes the content inside the braces, guaranteeing co-location.

Tradeoffs

AT3 — Simplicity vs. Flexibility

Naive modular hashing (hash(key) % N) is simple: O(1), one line of code, easy to understand. Consistent hashing is more complex: a ring data structure, O(log N) lookup via binary search, virtual nodes to correct uneven distribution. The cost of flexibility is complexity.

The decision point is: does your server pool change? If N is fixed forever, modular hashing is correct. If N changes — due to failures, autoscaling, planned capacity changes — the cost of modular hashing is session breakage on every cluster change. For any system that operates at scale, the server pool changes. Consistent hashing is the standard choice.

AT9 — Correctness vs. Performance

SHA-1 (used by Git) is cryptographic: collision-resistant but slow (~200 MB/s on modern hardware). MurmurHash and xxHash are non-cryptographic: not collision-resistant but extremely fast (5–10 GB/s). For a load balancer routing decision, collision resistance is irrelevant — you do not need a security guarantee, you need speed. For Git, where the hash is a trust anchor, collision resistance matters. Use SHA-2 or SHA-3 for security; use MurmurHash or xxHash for performance.

Where It Fails

FM6 — Hotspotting

Without virtual nodes, consistent hashing produces uneven distribution. A three-node ring will, by the random placement of three points on a circle, almost certainly have one arc that is substantially larger than the others. The node responsible for the large arc handles more traffic. Virtual nodes fix this by creating many more points, averaging out the distribution.

The failure mode is silent: the system works correctly, but one node is overloaded. Monitor per-node request rates and connection counts. If one node consistently handles 40% of traffic in a three-node pool, you have hotspotting.

FM9 — Silent Data Corruption

Git’s use of SHA-1 has a known weakness: SHA-1 is theoretically broken for collision resistance (demonstrated by the SHAttered attack in 2017). An attacker who can construct two different files with the same SHA-1 hash can substitute one for the other in a repository. Git has migrated to SHA-256 for new repositories for this reason. The failure mode (two different objects with the same address) would corrupt the DAG silently.

Real Systems

DynamoDB uses consistent hashing to distribute data across nodes. Each item’s partition key is hashed using a consistent hash function; the hash determines which partition the item lives on. When DynamoDB adds capacity, only the partitions adjacent to the new nodes need to migrate data.

Cassandra uses a consistent hash ring with virtual nodes (called “vnodes” in Cassandra documentation). Each node is assigned multiple token ranges on the ring. The token range assignment is configurable; nodes with more capacity receive more tokens.

CDNs (Akamai, Fastly, CloudFlare) use consistent hashing to route requests to edge nodes. A request for a given URL is always routed to the same edge node (or a small set of edge nodes), maximising cache hit rates. When edge nodes are added (during a traffic spike), consistent hashing ensures most cached content remains at the same nodes.

Concept: Consistent Hashing

Core Idea: Map both keys and nodes onto a circular ring using the same hash function. A key is served by the first node clockwise from its ring position. When a node joins or leaves, only the keys between the new/removed node and its predecessor migrate — on average K/N keys out of K total. Virtual nodes (multiple ring positions per physical node) correct uneven load distribution.

Tradeoff: AT3 — Simplicity vs. Flexibility (modular hashing is O(1) and trivial but breaks all sessions on cluster change; consistent hashing is O(log N) and complex but minimises migration)

Failure Mode: FM6 — Hotspotting (without virtual nodes, arc-length variance creates uneven load across nodes)

Signal: When node membership changes frequently (failures, autoscaling, rolling deployments) and session affinity or cache locality must survive those changes.

Exercises

Level 1 — Understand

What property of a hash function makes Git’s content-addressable storage work? What would break if two different files could produce the same hash?
Explain the difference between modular hashing (hash(key) % N) and consistent hashing. In which scenario does modular hashing break, and what is the consequence?
What problem do virtual nodes solve in a consistent hash ring? What is the cost of using more virtual nodes?

Level 2 — Apply

A consistent hash ring has three nodes: A, B, C. Their positions on the ring (0–99) are: A=10, B=45, C=72.

Which node handles keys that hash to: 5, 30, 60, 80?
Node B is removed. Which keys migrate, and to which node?
Node D is added at position 55. Which keys migrate, and to which node?

Level 3 — Design

You are designing the routing layer for a distributed session store. The session store has 10 nodes. Sessions must be sticky — a user’s session must always be routed to the same node for the duration of the session. The node pool changes: nodes fail ~2 times per month and are replaced; capacity scales up by adding 2–3 nodes per quarter.

Design the routing strategy. Choose between modular hashing and consistent hashing. Name the tradeoff explicitly.
How many virtual nodes per physical node would you configure? What determines this number?
When a node fails mid-session, some sessions are lost. Propose a strategy to reduce session loss on node failure without breaking the consistent hashing model. Name every tradeoff you introduce.

A complete answer will: define the chosen approach, quantify its cost or complexity, and identify what breaks first as scale or load increases.

Read in the book →