Git stores every version of every file as a blob — an object identified by the SHA-1 hash of its contents. The file path is irrelevant to storage. The content determines the address.
This creates content-addressable storage (CAS). In a content-addressable store, the key is derived from the value, not assigned independently.
```python
import hashlib

def git_blob_hash(content: bytes) -> str:
    """
    Compute the Git blob SHA-1 for file content.
    Git prepends 'blob <size>\0' before hashing.
    """
    header = f"blob {len(content)}\0".encode()
    sha1 = hashlib.sha1(header + content)
    return sha1.hexdigest()

# The same content always produces the same address
content = b"def hello():\n    return 'world'\n"
address = git_blob_hash(content)
# address is always "9b0d5e6a..." regardless of filename or location
```

The architectural consequence is profound. If you modify one file in a 10,000-file repository, only one blob changes. All other blobs are shared. Git does not copy unchanged files across commits — it references the same blobs. A branch is not a copy of the repository; it is a new commit object that points to the same tree of blobs. Creating a branch is O(1) regardless of repository size.
The hash function’s determinism enables deduplication. The hash function’s collision resistance enables trust — you can verify that a file you fetched matches the file that was committed, without trusting the network or the server.
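That verification step is mechanical. Here is a minimal sketch (reproducing the blob-hash function above, with hypothetical content) that checks fetched bytes against a known address:

```python
import hashlib

def git_blob_hash(content: bytes) -> str:
    # Git's blob hash: SHA-1 over 'blob <size>\0' followed by the content
    header = f"blob {len(content)}\0".encode()
    return hashlib.sha1(header + content).hexdigest()

committed_address = git_blob_hash(b"print('hello')\n")

# Later, after fetching the blob from an untrusted mirror:
fetched = b"print('hello')\n"
assert git_blob_hash(fetched) == committed_address  # content is intact

tampered = b"print('hell0')\n"
assert git_blob_hash(tampered) != committed_address  # any change is detected
```

No signature or trusted channel is needed: if the recomputed hash matches the address you asked for, the content is the content that was committed.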
A load balancer must distribute requests across N servers. The naive approach: server = servers[hash(request_key) % N]. This works until a server is added or removed. At that point, N changes, and hash(key) % N changes for almost every key. Every session that was pinned to server 3 is now pinned to a different server. Sessions break.
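The scale of the disruption is easy to measure. This sketch (hypothetical session keys; SHA-256 stands in for Python's built-in hash(), which is salted per process) counts how many keys move when N grows from 4 to 5:

```python
import hashlib

def stable_hash(key: str) -> int:
    # Deterministic hash; Python's built-in hash() varies between runs
    return int(hashlib.sha256(key.encode()).hexdigest(), 16)

keys = [f"session-{i}" for i in range(10_000)]
moved = sum(1 for k in keys if stable_hash(k) % 4 != stable_hash(k) % 5)
print(f"{moved / len(keys):.0%} of keys change servers")  # roughly 80%
```

A key stays put only when its hash has the same residue mod 4 and mod 5, which holds for 4 of every 20 hash values, so about 80% of sessions move.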
Consistent hashing solves this. Both keys and servers are mapped to the same circular ring using the same hash function. A key is served by the first server clockwise from the key’s position on the ring.
```python
import hashlib
import bisect

class ConsistentHashRing:
    """
    Consistent hash ring: only O(K/N) keys migrate when a node is
    added or removed, where K is the total number of keys and N is
    the number of nodes.
    """

    def __init__(self):
        self._ring: dict[int, str] = {}   # hash position -> node name
        self._sorted_keys: list[int] = []

    def _hash(self, key: str) -> int:
        return int(hashlib.sha256(key.encode()).hexdigest(), 16)

    def add_node(self, node: str) -> None:
        h = self._hash(node)
        self._ring[h] = node
        bisect.insort(self._sorted_keys, h)

    def remove_node(self, node: str) -> None:
        h = self._hash(node)
        # Only remove the position if the node was actually on the ring
        if self._ring.pop(h, None) is not None:
            self._sorted_keys.remove(h)

    def get_node(self, key: str) -> str:
        if not self._ring:
            raise ValueError("Ring is empty")
        h = self._hash(key)
        # Find the first node clockwise from the key's position
        idx = bisect.bisect_right(self._sorted_keys, h) % len(self._sorted_keys)
        return self._ring[self._sorted_keys[idx]]
```

When a server is added or removed, only the keys between the new/removed node and its predecessor on the ring migrate. On average, K/N keys migrate when one of N nodes changes — compared to nearly all K keys with modular hashing.
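The migration invariant can be checked directly. This standalone sketch (hypothetical node and key names, mirroring the ring logic above) verifies that every key that moves lands on the newly added node:

```python
import bisect
import hashlib

def stable_hash(key: str) -> int:
    return int(hashlib.sha256(key.encode()).hexdigest(), 16)

def build_ring(nodes: list[str]) -> tuple[list[int], dict[int, str]]:
    ring = {stable_hash(n): n for n in nodes}
    return sorted(ring), ring

def lookup(positions: list[int], ring: dict[int, str], key: str) -> str:
    # First node clockwise from the key's position
    idx = bisect.bisect_right(positions, stable_hash(key)) % len(positions)
    return ring[positions[idx]]

keys = [f"key-{i}" for i in range(10_000)]
old = build_ring(["node-a", "node-b", "node-c"])
new = build_ring(["node-a", "node-b", "node-c", "node-d"])

moved = [k for k in keys if lookup(*old, k) != lookup(*new, k)]
# Every migrated key was taken over by the new node; no other key moved
assert all(lookup(*new, k) == "node-d" for k in moved)
```

With modular hashing, adding a fourth server would reshuffle keys among all four; here the three existing nodes keep every key they do not hand to node-d.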
A three-node ring with one hash position per node produces uneven distribution. With three positions on a 360-degree ring, the spacing between them is random — one arc might cover 180 degrees, another 10 degrees, another 170 degrees. The server on the large arc handles far more traffic.
The fix is virtual nodes: each physical server is mapped to multiple positions on the ring (100–200 is typical). The positions for “server-A” are derived by hashing “server-A-1”, “server-A-2”, …, “server-A-200”.
```python
class VirtualNodeRing(ConsistentHashRing):
    """
    Consistent hash ring with virtual nodes for even distribution.
    More virtual nodes per server = more even load distribution.
    """

    def __init__(self, virtual_nodes: int = 150):
        super().__init__()
        self._virtual_nodes = virtual_nodes

    def add_node(self, node: str) -> None:
        for i in range(self._virtual_nodes):
            virtual_key = f"{node}-vnode-{i}"
            h = self._hash(virtual_key)
            self._ring[h] = node
            bisect.insort(self._sorted_keys, h)

    def remove_node(self, node: str) -> None:
        for i in range(self._virtual_nodes):
            virtual_key = f"{node}-vnode-{i}"
            h = self._hash(virtual_key)
            self._ring.pop(h, None)
            if h in self._sorted_keys:
                self._sorted_keys.remove(h)
```

With 150 virtual nodes per server, the load deviation across servers drops to within a few percent, regardless of the number of physical servers.
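The effect is measurable. This standalone sketch (hypothetical server and key names) compares the load spread, the ratio of the busiest server's key count to the quietest's, with 1 versus 150 virtual nodes:

```python
import bisect
import hashlib
from collections import Counter

def stable_hash(key: str) -> int:
    return int(hashlib.sha256(key.encode()).hexdigest(), 16)

def load_spread(nodes: list[str], virtual_nodes: int, num_keys: int = 30_000) -> float:
    # Place virtual_nodes positions per physical node on the ring
    ring = {}
    for node in nodes:
        for i in range(virtual_nodes):
            ring[stable_hash(f"{node}-vnode-{i}")] = node
    positions = sorted(ring)

    # Count how many keys each physical node serves
    counts = Counter()
    for i in range(num_keys):
        idx = bisect.bisect_right(positions, stable_hash(f"key-{i}")) % len(positions)
        counts[ring[positions[idx]]] += 1
    return max(counts.values()) / min(counts.values())

servers = ["server-a", "server-b", "server-c"]
print(load_spread(servers, virtual_nodes=1))    # often far from 1.0
print(load_spread(servers, virtual_nodes=150))  # much closer to 1.0
```

With one position per server, the spread is whatever the three random arcs happen to be; with 150 positions per server, each server's total arc length averages out over many samples and the spread tightens toward 1.0.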
A distributed cache (Redis Cluster, Memcached) must decide which cache node holds a given key. The key-to-node mapping must be consistent — the same key must always go to the same node — and must survive node additions and removals without invalidating most cached data.
This is exactly the load-balancer problem. Redis Cluster uses a variant: it divides the key space into 16,384 fixed hash slots (slot = crc16(key) % 16384). Each node owns a contiguous range of slots. When a node is added or removed, only the affected slots migrate.
```python
# Redis Cluster slot assignment (simplified)
import binascii

def redis_slot(key: str) -> int:
    """Return the hash slot for a key (0–16383)."""
    # If the key contains braces, only hash the content inside {}.
    # This enables co-location: {user:123}:profile and {user:123}:settings
    # will always land on the same slot.
    start = key.find('{')
    end = key.find('}', start)
    if start != -1 and end != -1 and end > start + 1:
        key = key[start + 1:end]
    return binascii.crc_hqx(key.encode(), 0) % 16384
```

The hash tag {} mechanism is a pragmatic escape hatch: if two keys must be on the same node (for atomic operations), wrap the shared identifier in braces. The hash function then only hashes the content inside the braces, guaranteeing co-location.
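Using the slot function above (reproduced here so the snippet runs on its own) with hypothetical keys:

```python
import binascii

def redis_slot(key: str) -> int:
    # Hash only the content inside the first {...}, if present and non-empty
    start = key.find('{')
    end = key.find('}', start)
    if start != -1 and end != -1 and end > start + 1:
        key = key[start + 1:end]
    return binascii.crc_hqx(key.encode(), 0) % 16384

# Both keys hash only "user:123", so they share a slot (and therefore a node)
assert redis_slot("{user:123}:profile") == redis_slot("{user:123}:settings")

# A bare key with the same content lands on that slot too
assert redis_slot("user:123") == redis_slot("{user:123}:profile")
```

This is why Redis Cluster commands that touch multiple keys (MGET, transactions, Lua scripts) require all keys to hash to the same slot; hash tags are how applications satisfy that constraint.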