Distributed Locks

Introduction

At 2:47 AM, two background workers on separate machines both check the job queue. Both see the same pending job. Both claim it. Both execute it. The job runs twice. In one case, that means a user is charged twice for a subscription. In another, two conflicting database writes corrupt a record that will not be noticed until a customer calls.

The underlying problem is not a bug in either worker. Both workers followed their logic correctly. The problem is that mutual exclusion — the guarantee that only one process operates on shared state at a time — stops working when the processes are on different machines. A shared variable protects a critical section on a single machine. Across machines, there is no shared variable. There is only a network.

Distributed locks are the mechanism that restores mutual exclusion across machines. They are not simple. They carry failure modes that do not exist with single-machine locks. Understanding why they are hard, how they are implemented, and when to avoid them entirely is what separates systems that stay correct under failure from systems that behave correctly only when nothing goes wrong.

Thread Activation

In Book 1, Chapter 32, mutual exclusion prevents concurrent modification of shared state. A lock ensures that only one goroutine, thread, or process enters a critical section at a time. On a single machine, this is implemented with a mutex: a shared variable that the operating system helps arbitrate.

A distributed lock is mutual exclusion when the processes are on different machines. The shared variable is now a record in an external store — a database, a cache, or a coordination service. The arbitration is now a distributed protocol. The failure modes expand to include network partitions, clock skew, and process crashes while holding the lock.

Thread T9 (Coordination) runs through this chapter. Coordination is what distributed systems do when they need to behave as a unit despite running on independent machines. A distributed lock is coordination reduced to its simplest form: one process, one turn.

The Concept

A lock has three operations: acquire, hold, release. On a single machine, these are fast, cheap, and guaranteed to work. The OS kernel mediates them. If a thread holding a lock crashes, the kernel knows immediately and releases the lock.

In a distributed system, none of these guarantees hold:

Acquire requires a round-trip to an external service. That service may be slow. The network may drop the response. The acquiring process may not know whether the acquisition succeeded.

Hold has no enforcement mechanism. There is no kernel watching. If the process holding the lock crashes, nothing releases it automatically unless the lock has a time-to-live (TTL). If the lock has a TTL, it may expire while the process is still working — causing two processes to believe they hold the lock simultaneously.

Release requires a round-trip. The process may crash before releasing. The release message may be lost. Another process may have already acquired the lock due to TTL expiry.

These failure modes mean a distributed lock is not a mutex with network overhead. It is a fundamentally different primitive that requires explicit design choices about what happens when things go wrong.

Why the TTL Problem Is Worse Than It Looks

The most common distributed lock design sets a TTL on the lock. If the holder crashes, the lock expires automatically after the TTL. This seems like the right solution. It is also the source of the most dangerous failure mode.

// Process A acquires lock with TTL = 10 seconds
lock = acquire("job-42", ttl=10_seconds)

// Process A starts working
data = load_record("job-42")

// Process A is paused: garbage collector, OS scheduling, network delay
// ... 11 seconds pass ...

// Lock expires. Process B acquires the lock.
// Process B starts working on the same record.

// Process A resumes.
// Process A writes its result — overwriting Process B's work.
// Both A and B believe they completed the job successfully.

The problem is not that process A crashed. Process A is still running. A pause — a garbage collection stop-the-world event, an OS scheduling preemption, a slow network response — made the lock expire before the work completed. Now two processes are in the critical section simultaneously, and neither knows it.

This is the distributed lock’s fundamental hazard. It cannot be eliminated with a longer TTL, only mitigated. The correct mitigation is fencing tokens.

How It Works

Fencing Tokens

A fencing token is a monotonically increasing number issued by the lock service when a lock is acquired. Every lock acquisition gets a higher token than the previous one. The storage layer that the lock protects checks incoming writes against the highest token it has seen and rejects writes with old tokens.

// Lock service issues a fencing token with each lock grant
function acquire_lock(resource):
    token = atomic_increment(lock_service.counter)
    lock_record = {
        holder: this_process_id,
        token: token,
        expires_at: now() + TTL
    }
    success = lock_service.set_if_absent(resource, lock_record)
    if success:
        return Lock(resource=resource, token=token)
    else:
        return None  // lock is held by another process

// The protected storage layer enforces token ordering
function storage_write(resource, data, fencing_token):
    current_max_token = storage.get_max_seen_token(resource)
    if fencing_token <= current_max_token:
        // Reject: this write is from an old lock holder
        // Process A resumed after its lock expired — reject its write
        raise StaleWriteError("token {} rejected, current max is {}".format(
            fencing_token, current_max_token))
    storage.set_max_seen_token(resource, fencing_token)
    storage.write(resource, data)
    return success

// Process A scenario with fencing
lock_a = acquire_lock("job-42")      // A gets token=7
// ... A pauses for 11 seconds, lock expires ...
lock_b = acquire_lock("job-42")      // B gets token=8
storage_write("job-42", b_result, token=8)  // succeeds
// A resumes and tries to write
storage_write("job-42", a_result, token=7)  // REJECTED: token 7 < 8
// A receives StaleWriteError and aborts — data is correct

Fencing tokens shift the safety guarantee from the lock service to the storage layer. Even if the lock service has a bug, even if the TTL is too short, even if a process pauses for minutes, the storage layer rejects stale writes. The invariant — only the most recent lock holder’s writes succeed — is enforced where it matters most.

Redis-Based Locks (Redlock)

Redis is commonly used as a distributed lock service because it is fast and supports atomic operations. The simplest Redis lock uses SET key value NX PX ttl, which sets a key only if it does not exist and sets an expiry.

// Simple Redis lock — single node
function acquire_redis_lock(redis_client, resource, ttl_ms):
    unique_token = generate_uuid()  // unique per acquisition attempt
    result = redis_client.set(
        key=resource,
        value=unique_token,        // used to verify ownership on release
        condition=NOT_EXISTS,
        expiry=ttl_ms
    )
    if result == OK:
        return LockHandle(resource=resource, token=unique_token)
    else:
        return None

// Release must verify ownership — do not release someone else's lock
function release_redis_lock(redis_client, lock_handle):
    // Lua script ensures check-and-delete is atomic
    script = """
        if redis.call("GET", KEYS[1]) == ARGV[1] then
            return redis.call("DEL", KEYS[1])
        else
            return 0
        end
    """
    result = redis_client.eval(
        script=script,
        keys=[lock_handle.resource],
        args=[lock_handle.token]
    )
    return result == 1  // 1 = deleted, 0 = token mismatch (already expired)

The single-node Redis lock has a fatal flaw: if the Redis node fails after granting a lock but before the TTL expires, the new Redis node (from a replica promotion) has no record of the lock. Another process can acquire it immediately.

Redlock addresses this by requiring a majority of independent Redis nodes to agree:

// Redlock: acquire lock on majority of N independent Redis nodes
// N should be odd; recommend N=5
function redlock_acquire(redis_nodes, resource, ttl_ms):
    unique_token = generate_uuid()
    start_time = now_ms()
    acquired_count = 0
    acquired_nodes = []

    for node in redis_nodes:
        try:
            success = acquire_redis_lock(node, resource, ttl_ms)
            if success:
                acquired_count += 1
                acquired_nodes.append(node)
        except NetworkError:
            // Treat unreachable node as lock not acquired on that node
            continue

    elapsed_ms = now_ms() - start_time
    remaining_validity_ms = ttl_ms - elapsed_ms - CLOCK_DRIFT_MARGIN

    majority = len(redis_nodes) / 2 + 1  // floor division, then +1

    if acquired_count >= majority and remaining_validity_ms > 0:
        // Lock acquired on majority of nodes with time remaining
        return RedLock(
            token=unique_token,
            resource=resource,
            acquired_nodes=acquired_nodes,
            validity_ms=remaining_validity_ms
        )
    else:
        // Failed to acquire majority — release all partial acquisitions
        for node in acquired_nodes:
            release_redis_lock(node, LockHandle(resource, unique_token))
        return None

// Redlock release: release on all nodes where lock was acquired
function redlock_release(lock):
    for node in lock.acquired_nodes:
        try:
            release_redis_lock(node, LockHandle(lock.resource, lock.token))
        except NetworkError:
            continue  // Lock will expire via TTL if release fails

Redlock’s correctness depends on the assumption that clock drift across nodes is small relative to the TTL. If clocks drift significantly, the effective validity window may be zero even after a successful majority acquisition. Martin Kleppmann’s critique of Redlock (2016) demonstrates that it cannot guarantee safety in all network partition scenarios. For workloads where lock correctness is absolutely critical, ZooKeeper or etcd-based locks are more appropriate.

ZooKeeper-Based Locks

ZooKeeper provides a coordination service with ephemeral nodes: nodes that are automatically deleted when the client session that created them ends. A crashed client’s session expires, and its ephemeral nodes disappear. This solves the TTL problem: the lock is held for as long as the session is alive, not for a fixed time.

// ZooKeeper lock using ephemeral sequential nodes
// Each lock attempt creates an ephemeral sequential node under /locks/resource/
// ZooKeeper appends a sequence number: /locks/job-42/lock-0000000001

function zk_acquire_lock(zk_client, resource):
    base_path = "/locks/" + resource

    // Create an ephemeral sequential node — ZK assigns sequence number
    my_path = zk_client.create(
        path=base_path + "/lock-",
        data=this_process_id,
        mode=EPHEMERAL_SEQUENTIAL    // deleted when session expires
    )
    // my_path is now something like /locks/job-42/lock-0000000003

    while true:
        // List all lock nodes for this resource
        children = zk_client.get_children(base_path)
        sort(children)  // ZK sequence numbers are sortable

        my_sequence = extract_sequence(my_path)
        min_sequence = extract_sequence(children[0])

        if my_sequence == min_sequence:
            // I am the smallest — I hold the lock
            return ZKLock(path=my_path, resource=resource)

        // I am not the smallest — watch the node immediately before me
        // (not the smallest, to avoid thundering herd)
        predecessor = find_predecessor(children, my_sequence)
        if predecessor does not exist:
            // Predecessor disappeared between listing and watching — retry
            continue

        // Block until predecessor is deleted (lock released or session expired)
        zk_client.wait_for_deletion(
            path=base_path + "/" + predecessor,
            timeout=LOCK_WAIT_TIMEOUT
        )
        // Loop: re-check if I am now the smallest

function zk_release_lock(zk_client, lock):
    // Deleting the ephemeral node releases the lock
    // If the process crashes, ZK session expiry deletes it automatically
    zk_client.delete(lock.path)

The ZooKeeper lock has a key property: there is no TTL to expire prematurely. The lock is held exactly as long as the session is alive. If the process holding the lock is paused for 30 seconds, it still holds the lock — no other process can acquire it. If the process dies, ZooKeeper detects the session expiry (within session timeout, typically 2–10 seconds) and deletes the ephemeral node.

The cost of this design is that ZooKeeper itself must be highly available, which requires running 5 nodes in a quorum configuration. ZooKeeper is a serious operational commitment.

Lock-Free Alternatives: CAS and Optimistic Locking

Many workloads do not need a lock if they can use compare-and-swap (CAS): an atomic operation that updates a value only if it currently matches an expected value.

// Compare-and-swap on a distributed database (e.g., DynamoDB conditional write)
function update_record_with_cas(db, record_id, new_value):
    max_retries = 5
    for attempt in range(max_retries):
        // Read current state including version number
        current = db.get(record_id)
        current_version = current.version

        // Compute new state based on current state
        new_record = compute_new_state(current, new_value)
        new_record.version = current_version + 1

        // Write only if version hasn't changed since we read
        try:
            db.conditional_write(
                record_id=record_id,
                value=new_record,
                condition="version == " + str(current_version)
            )
            return success  // Write succeeded: we were the only writer
        except ConditionalCheckFailed:
            // Another process wrote concurrently — retry with fresh read
            if attempt == max_retries - 1:
                raise TooManyRetriesError()
            backoff(attempt)  // Exponential backoff before retry
            continue

// Optimistic locking: reads don't block; writes fail if concurrent write detected
// Good for: low write contention, short critical sections
// Bad for: high write contention (many retries waste work)

CAS eliminates the need for a lock service entirely. The storage system itself provides the atomicity guarantee. When write contention is low, CAS is faster than acquiring a distributed lock. When contention is high, CAS degrades: many writers retry repeatedly, amplifying load on the database.

The practical rule: use CAS when conflicts are rare and retries are cheap. Use distributed locks when conflicts are common and wasted work is unacceptable.

Tradeoffs

Lock granularity vs contention. A lock on the entire table serialises all writers. A lock per record allows parallel writes to different records but creates more lock records. Finer granularity means less contention but more coordination overhead. The right granularity is the smallest scope that prevents the conflict you care about.

TTL length vs safety window. A long TTL means the lock survives slow operations but leaves the system locked longer when a process crashes. A short TTL reduces crash recovery time but risks expiry during legitimate slow operations. There is no TTL that is always right; the correct value depends on the maximum expected operation duration.

Distributed lock vs application-level serialisation. Sometimes the correct answer is not a distributed lock at all. If work can be partitioned — each worker owns a disjoint subset of records — there is no shared state to protect and no lock is needed. Partition the work before reaching for the lock.

Lock availability vs consistency. AT1: a distributed lock enforces consistency (only one writer at a time) at the cost of availability during failure. If the lock service is unavailable, no work can proceed. A CAS-based approach degrades to retry loops rather than blocking entirely — lower consistency guarantee, higher availability under lock-service failure.

Where It Fails

FM12 — Split-Brain: two processes both believe they hold the lock. This happens when TTL expires while the lock holder is paused. Process A holds the lock, pauses for longer than the TTL, and process B acquires the lock. Process A resumes, does not check whether the lock is still valid, and both A and B are in the critical section simultaneously. Fencing tokens prevent the damage from this scenario; they do not prevent the scenario itself.

FM5 — Latency Amplification from Lock Contention. When many workers compete for the same lock, each waits behind all others. If the lock-holder’s operation takes 500ms and 50 workers are waiting, the last worker waits 25 seconds. Lock contention is latency amplification applied to the entire queue of waiters. Diagnosis: measure lock wait time separately from lock hold time. If wait time dominates, the lock is the bottleneck.

Dead lock across two resources. Process A holds lock-1 and waits for lock-2. Process B holds lock-2 and waits for lock-1. Both wait forever. Distributed deadlock is harder to detect than single-machine deadlock because there is no kernel to inspect the lock graph. Prevention: always acquire multiple locks in a consistent order (lock with smaller ID first).

Lock renewal failure. Long-running operations need to renew the lock before TTL expires. If the renewal request fails — network hiccup, lock service overload — the lock expires mid-operation. Renewal adds complexity and a second category of failure: the operation must handle lock expiry, not just lock acquisition failure.

Real Systems

Redis Redlock is the most widely deployed distributed lock implementation. Its limitations are well-documented; it is appropriate for use cases where the consequence of occasional split-brain is low (idempotent operations, best-effort deduplication). It is not appropriate for use cases where correctness is critical (financial transactions, database record updates).

Apache ZooKeeper provides ephemeral nodes and session-based locking. HBase uses ZooKeeper for region server coordination. Kafka used ZooKeeper for leader election and partition assignment until KRaft replaced it in Kafka 3.x.

etcd (part of Kubernetes) provides a similar abstraction to ZooKeeper: lease-based locks with session expiry. Kubernetes uses etcd for all cluster state, including leader election for the controller manager.

Google Chubby is the internal distributed lock service that inspired ZooKeeper. It provides a coarse-grained locking service built on Paxos. Chubby is used for leader election, name service, and distributed configuration in Google’s infrastructure.

DynamoDB conditional writes implement optimistic locking at the storage layer. DynamoDB’s ConditionExpression allows CAS-style updates: “write only if this attribute equals this value.” This pattern is used in AWS Step Functions and DynamoDB Streams to implement exactly-once processing without an external lock service.

Concept: Distributed Lock

Thread: T9 (Consensus)

Core Idea: Mutual exclusion across machines requires an external arbitration service; the service’s failure modes (TTL expiry, network partition, crash) introduce split-brain risks that single-machine locks do not have.

Tradeoff: AT1 — Consistency/Availability: a distributed lock enforces consistency (one writer) at the cost of availability (workers block when the lock service is unavailable).

Failure Mode: FM12 — Split-Brain: a paused process resumes after TTL expiry while another process has acquired the lock; both believe they are the sole holder.

Signal: When two instances of a background worker produce conflicting results for the same job, or a database record contains partially applied updates from two concurrent writers.

Maps to: AT1, AT9, FM5, FM12

Exercises

Level 2 — Apply

A batch processing system uses a Redis Redlock to ensure only one worker processes each job. Workers sometimes pause for up to 8 seconds during garbage collection. The current TTL is 5 seconds. Describe two separate failure scenarios caused by this TTL, and explain how fencing tokens would change the outcome of each scenario.
A system uses ZooKeeper ephemeral nodes for distributed locking. The ZooKeeper session timeout is set to 2 seconds. A network partition isolates a lock holder for 3 seconds before recovering. Walk through what happens to the lock during and after the partition, and identify which process holds the lock after recovery.
A database has two types of writes: high-frequency low-conflict user profile updates (thousands of users, each user’s record updated independently) and low-frequency high-conflict inventory reservation (hundreds of workers competing for the same stock records). Choose between CAS and distributed locks for each workload and justify the choice in terms of contention and retry cost.

Level 3 — Design

Design a distributed lock service for a payment processing system that cannot tolerate duplicate charges. The system runs on five geographically distributed data centres. Define the lock acquisition protocol, the fencing token mechanism, the TTL strategy, and what the application does when lock acquisition fails. Identify the residual failure mode that your design does not eliminate.

A complete answer will: (1) specify the lock acquisition protocol using a quorum of 3 of 5 data centres (Redlock or equivalent) with a fencing token that increments monotonically on each acquisition, (2) identify FM12 (network partition causing a lock holder to lose quorum and a second client to acquire the lock) as the residual failure mode that the design cannot fully eliminate, (3) address the AT1 tradeoff between lock TTL length (shorter = faster recovery from holder failure, higher risk of lock expiry during slow operations) and TTL that is too long (holder can complete but recovery is slow), and (4) specify what the application does when lock acquisition fails — return an error to the caller rather than waiting or retrying indefinitely — and justify this as the correct behaviour for a payment system.

A job queue has one million pending jobs distributed across one hundred workers. Workers acquire a per-job lock before processing. At peak, 80% of workers are waiting for locks simultaneously. Design a partitioning strategy that eliminates or reduces lock contention. Explain what invariant your design preserves in place of the lock, and what new failure mode it introduces.

A complete answer will: (1) identify the correct partitioning strategy — assign jobs to workers by hashing job_id to worker_id, so each worker owns a disjoint partition and no lock is needed, (2) identify FM1 (single point of failure — if a worker dies, its partition stalls until reassigned) as the new failure mode the partitioning strategy introduces in place of the lock, (3) address the AT5 tradeoff between centralised locking (all workers can process any job, higher contention) and partitioned ownership (no contention, worker failure blocks a partition), and (4) specify the invariant the design preserves — exactly-once processing per job — and describe the rebalancing protocol when a worker joins or leaves.

You are building a distributed rate limiter that enforces a per-user limit of 100 requests per minute across a cluster of 50 API servers. The naive approach uses a distributed lock per user per minute window. Analyse why this design has prohibitive latency under load, then design a token bucket implementation using Redis atomic operations (without a distributed lock) that provides approximate rate limiting with bounded error. Define what “approximate” means in terms of the percentage of requests that could exceed the limit in the worst case.

A complete answer will: (1) correctly explain why the per-user distributed lock creates a serialisation bottleneck — each request must acquire and release a lock on a shared resource, limiting throughput to 1 request per lock RTT per user — and quantify the latency impact at high concurrency, (2) design the Redis token bucket using EVALSHA or a Lua script that atomically reads the bucket count and decrements it in a single round-trip, with the approximate error bounded by the number of API servers that can race before the next Redis sync, (3) identify FM6 (a single Redis shard serving all user buckets becomes a hotspot for high-traffic users) as the failure mode, and (4) define “approximate” precisely — up to N×limit requests may be accepted in a window where N is the number of API servers in the worst-case race, before the bucket is synchronised.

Read in the book →