The Computing Series

How It Works

Git: Content-Addressable Storage

Git stores every version of every file as a blob — an object identified by the SHA-1 hash of its contents. The file path is irrelevant to storage. The content determines the address.

This creates content-addressable storage (CAS). In a content-addressable store, the key is derived from the value, not assigned independently.

import hashlib

def git_blob_hash(content: bytes) -> str:
    """
    Compute the Git blob SHA-1 for file content.
    Git prepends 'blob <size>\0' before hashing.
    """
    header = f"blob {len(content)}\0".encode()
    sha1 = hashlib.sha1(header + content)
    return sha1.hexdigest()

# The same content always produces the same address
content = b"def hello():\n    return 'world'\n"
address = git_blob_hash(content)
# address is always "9b0d5e6a..." regardless of filename or location

The architectural consequence is profound. If you modify one file in a 10,000-file repository, only one blob changes. All other blobs are shared. Git does not copy unchanged files across commits — it references the same blobs. A branch is not a copy of the repository; it is a new commit object that points to the same tree of blobs. Creating a branch is O(1) regardless of repository size.
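The sharing is easy to see with a toy object store (hypothetical helper names, not Git's actual code): two commits that differ in one file share every other blob.

```python
import hashlib

# Toy content-addressable store: the address is derived from the content,
# so writing identical content twice stores it only once.
store: dict[str, bytes] = {}

def put_blob(content: bytes) -> str:
    header = f"blob {len(content)}\0".encode()
    sha = hashlib.sha1(header + content).hexdigest()
    store[sha] = content  # idempotent for identical content
    return sha

# Two "commits" as path -> blob-address maps; only app.py changes
commit_1 = {"app.py": put_blob(b"print('v1')\n"),
            "README.md": put_blob(b"docs\n")}
commit_2 = {"app.py": put_blob(b"print('v2')\n"),
            "README.md": put_blob(b"docs\n")}

assert commit_1["README.md"] == commit_2["README.md"]  # shared blob
assert len(store) == 3  # four file versions, but only three unique contents
```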

The hash function’s determinism enables deduplication. The hash function’s collision resistance enables trust — you can verify that a file you fetched matches the file that was committed, without trusting the network or the server.
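The verification is nothing more than recomputing the hash locally. A standalone sketch (the blob-hash helper is repeated so the snippet runs on its own; the well-known hash of `b"hello\n"` serves as the check):

```python
import hashlib

def git_blob_hash(content: bytes) -> str:
    header = f"blob {len(content)}\0".encode()
    return hashlib.sha1(header + content).hexdigest()

def verify_fetched(content: bytes, expected_sha: str) -> bool:
    # No trust in the server or transport needed: the address is the proof
    return git_blob_hash(content) == expected_sha

sha = git_blob_hash(b"hello\n")
assert sha == "ce013625030ba8dba906f756967f9e9ca394464a"
assert verify_fetched(b"hello\n", sha)
assert not verify_fetched(b"hello\r\n", sha)  # any tampering changes the address
```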

Load Balancers: Consistent Hashing

A load balancer must distribute requests across N servers. The naive approach: server = servers[hash(request_key) % N]. This works until a server is added or removed. At that point, N changes, and hash(key) % N changes for almost every key. Every session that was pinned to server 3 is now pinned to a different server. Sessions break.
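The scale of the breakage is easy to measure. A sketch using SHA-256 as a stable stand-in for the hash (Python's built-in `hash` is randomized per process, so it would not reproduce):

```python
import hashlib

def bucket(key: str, n_servers: int) -> int:
    # Stable stand-in for hash(request_key) % N
    h = int(hashlib.sha256(key.encode()).hexdigest(), 16)
    return h % n_servers

keys = [f"session-{i}" for i in range(10_000)]
before = {k: bucket(k, 5) for k in keys}
after = {k: bucket(k, 6) for k in keys}  # one server added: N = 5 -> 6

moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved / len(keys):.0%} of sessions changed servers")  # ~83% (1 - 1/6)
```

A key stays put only when `h % 5 == h % 6`, which holds for just 5 of every 30 residues, so roughly five in six sessions are rehomed by a single added server.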

Consistent hashing solves this. Both keys and servers are mapped to the same circular ring using the same hash function. A key is served by the first server clockwise from the key’s position on the ring.

import hashlib
import bisect

class ConsistentHashRing:
    """
    Consistent hash ring: only O(K/N) keys migrate when a node is added or removed,
    where K is the total number of keys and N is the number of nodes.
    """
    def __init__(self):
        self._ring: dict[int, str] = {}   # hash position -> node name
        self._sorted_keys: list[int] = []

    def _hash(self, key: str) -> int:
        return int(hashlib.sha256(key.encode()).hexdigest(), 16)

    def add_node(self, node: str) -> None:
        h = self._hash(node)
        self._ring[h] = node
        bisect.insort(self._sorted_keys, h)

    def remove_node(self, node: str) -> None:
        h = self._hash(node)
        if self._ring.pop(h, None) is not None:
            self._sorted_keys.remove(h)

    def get_node(self, key: str) -> str:
        if not self._ring:
            raise ValueError("Ring is empty")
        h = self._hash(key)
        # Find the first node clockwise from key's position
        idx = bisect.bisect_right(self._sorted_keys, h) % len(self._sorted_keys)
        return self._ring[self._sorted_keys[idx]]

When a server is added or removed, only the keys between the new/removed node and its predecessor on the ring migrate. On average, K/N keys migrate when one of N nodes changes — compared to nearly all K keys with modular hashing.
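This can be checked directly. A standalone sketch of the same ring (one hash position per node, as above), adding a sixth node and counting migrations:

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.sha256(key.encode()).hexdigest(), 16)

def owner(ring: dict[int, str], positions: list[int], key: str) -> str:
    # First node clockwise from the key's position
    idx = bisect.bisect_right(positions, _hash(key)) % len(positions)
    return ring[positions[idx]]

ring = {_hash(n): n for n in ["node-a", "node-b", "node-c", "node-d", "node-e"]}
positions = sorted(ring)
keys = [f"key-{i}" for i in range(10_000)]
before = {k: owner(ring, positions, k) for k in keys}

ring[_hash("node-f")] = "node-f"   # add a sixth node
positions = sorted(ring)
after = {k: owner(ring, positions, k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
# Every migrated key moved to the new node; no key moved between old nodes
assert all(after[k] == "node-f" for k in moved)
print(f"{len(moved) / len(keys):.1%} of keys migrated")
```

With a single position per node the migrated fraction varies run to run of the node names, but the invariant holds: migration only flows toward the node that changed.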

Virtual Nodes: Fixing Uneven Distribution

A three-node ring with one hash position per node produces uneven distribution. With three positions on a 360-degree ring, the spacing between them is random — one arc might cover 180 degrees, another 10 degrees, another 170 degrees. The server on the large arc handles far more traffic.

The fix is virtual nodes: each physical server is mapped to multiple positions on the ring (100–200 is typical). The positions for “server-A” are derived by hashing “server-A-1”, “server-A-2”, …, “server-A-200”.

class VirtualNodeRing(ConsistentHashRing):
    """
    Consistent hash ring with virtual nodes for even distribution.
    More virtual nodes per server = more even load distribution.
    """
    def __init__(self, virtual_nodes: int = 150):
        super().__init__()
        self._virtual_nodes = virtual_nodes

    def add_node(self, node: str) -> None:
        for i in range(self._virtual_nodes):
            virtual_key = f"{node}-vnode-{i}"
            h = self._hash(virtual_key)
            self._ring[h] = node
            bisect.insort(self._sorted_keys, h)

    def remove_node(self, node: str) -> None:
        for i in range(self._virtual_nodes):
            virtual_key = f"{node}-vnode-{i}"
            h = self._hash(virtual_key)
            if self._ring.pop(h, None) is not None:
                self._sorted_keys.remove(h)

With 150 virtual nodes per server, the load deviation across servers drops to within a few percent, regardless of the number of physical servers.
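The effect shows up clearly when comparing 1 virtual node per server against 150. A standalone sketch of the ring (compressed into functions so it runs on its own):

```python
import bisect
import hashlib
from collections import Counter

def _hash(key: str) -> int:
    return int(hashlib.sha256(key.encode()).hexdigest(), 16)

def build_ring(names: list[str], vnodes: int):
    ring = {_hash(f"{n}-vnode-{i}"): n for n in names for i in range(vnodes)}
    return ring, sorted(ring)

def owner(ring, positions, key):
    idx = bisect.bisect_right(positions, _hash(key)) % len(positions)
    return ring[positions[idx]]

names = ["server-a", "server-b", "server-c"]
keys = [f"key-{i}" for i in range(30_000)]
deviation = {}
for vnodes in (1, 150):
    ring, positions = build_ring(names, vnodes)
    load = Counter(owner(ring, positions, k) for k in keys)
    mean = len(keys) / len(names)
    deviation[vnodes] = max(abs(load[n] - mean) for n in names) / mean
    print(f"{vnodes:>3} vnode(s): worst deviation from mean = {deviation[vnodes]:.1%}")
```

With one position per server the worst deviation depends entirely on where three hashes happened to land; with 150 it is pinned near the mean.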

Distributed Caches: Sharding by Key

A distributed cache (Redis Cluster, Memcached) must decide which cache node holds a given key. The key-to-node mapping must be consistent — the same key must always go to the same node — and must survive node additions and removals without invalidating most cached data.

This is exactly the load-balancer problem. Redis Cluster uses a variant: it divides the key space into 16,384 fixed hash slots (slot = crc16(key) % 16384). Each node owns a subset of the slots (contiguous ranges by convention). When a node is added or removed, only the affected slots migrate.

# Redis Cluster slot assignment (simplified)
import binascii

def redis_slot(key: str) -> int:
    """Return the hash slot (0–16383) for a key."""
    # If the key contains a hash tag, hash only the content inside {}.
    # This enables co-location: {user:123}:profile and {user:123}:settings
    # will always land on the same slot.
    start = key.find('{')
    end = key.find('}', start)
    if start != -1 and end != -1 and end > start + 1:
        key = key[start + 1:end]
    return binascii.crc_hqx(key.encode(), 0) % 16384

The hash tag {} mechanism is a pragmatic escape hatch: if two keys must be on the same node (for atomic operations), wrap the shared identifier in braces. The hash function then only hashes the content inside the braces, guaranteeing co-location.
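A quick check of the co-location guarantee (the slot function is repeated so the snippet runs standalone):

```python
import binascii

def redis_slot(key: str) -> int:
    start = key.find('{')
    end = key.find('}', start)
    if start != -1 and end != -1 and end > start + 1:
        key = key[start + 1:end]
    return binascii.crc_hqx(key.encode(), 0) % 16384

# Same tag -> same slot -> same node, so multi-key operations are possible
assert redis_slot("{user:123}:profile") == redis_slot("{user:123}:settings")

# Without tags, related keys usually scatter across slots
print(redis_slot("user:123:profile"), redis_slot("user:123:settings"))
```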

