A lock has three phases: acquire, hold, release. On a single machine these are fast, cheap, and reliable, because the OS kernel mediates them. If the process holding a lock crashes, the kernel knows immediately and can release the lock.
In a distributed system, none of these guarantees hold:
Acquire requires a round-trip to an external service. That service may be slow. The network may drop the response. The acquiring process may not know whether the acquisition succeeded.
Hold has no enforcement mechanism. There is no kernel watching. If the process holding the lock crashes, nothing releases it automatically unless the lock has a time-to-live (TTL). If the lock has a TTL, it may expire while the process is still working — causing two processes to believe they hold the lock simultaneously.
Release requires a round-trip. The process may crash before releasing. The release message may be lost. Another process may have already acquired the lock due to TTL expiry.
These failure modes mean a distributed lock is not a mutex with network overhead. It is a fundamentally different primitive that requires explicit design choices about what happens when things go wrong.
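A minimal sketch makes the shape of this primitive concrete. The in-memory store below stands in for the external lock service; the names (`LockService`, `acquire`, `release`) are illustrative, not any particular library's API:

```python
import time
import uuid

class LockService:
    """Toy stand-in for an external lock service with TTL-based locks.

    Each lock is an (owner token, deadline) pair; a lock whose deadline
    has passed is treated as free.
    """

    def __init__(self):
        self._locks = {}  # name -> (owner_token, expiry_timestamp)

    def acquire(self, name, ttl):
        now = time.monotonic()
        held = self._locks.get(name)
        if held is not None and held[1] > now:
            return None  # someone else holds a live lock
        token = uuid.uuid4().hex  # identifies this holder for release
        self._locks[name] = (token, now + ttl)
        return token

    def release(self, name, token):
        held = self._locks.get(name)
        # Only the current holder may release. A stale token is a no-op,
        # because the lock may have expired and been re-acquired since.
        if held is not None and held[0] == token:
            del self._locks[name]
            return True
        return False

svc = LockService()
t1 = svc.acquire("job-42", ttl=10.0)   # succeeds
t2 = svc.acquire("job-42", ttl=10.0)   # refused while t1's lock is live
```

Even this toy version shows where the network would bite: in a real deployment, each of these calls is a round-trip that can be lost or delayed, and the caller of `acquire` may never learn whether its request landed.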
The most common distributed lock design sets a TTL on the lock. If the holder crashes, the lock expires automatically after the TTL. This seems like the right solution. It is also the source of the most dangerous failure mode.
// Process A acquires lock with TTL = 10 seconds
lock = acquire("job-42", ttl=10_seconds)
// Process A starts working
data = load_record("job-42")
// Process A is paused: garbage collector, OS scheduling, network delay
// ... 11 seconds pass ...
// Lock expires. Process B acquires the lock.
// Process B starts working on the same record.
// Process A resumes.
// Process A writes its result — overwriting Process B's work.
// Both A and B believe they completed the job successfully.
The problem is not that process A crashed. Process A is still running. A pause — a garbage collection stop-the-world event, an OS scheduling preemption, a slow network response — made the lock expire before the work completed. Now two processes are in the critical section simultaneously, and neither knows it.
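The overlap is easy to reproduce with a toy TTL lock and a deliberate `sleep` standing in for the GC or scheduling pause (all names here are illustrative; TTLs are shortened so the demonstration runs in milliseconds):

```python
import time

locks = {}  # name -> expiry timestamp: a toy single-process TTL lock table

def acquire(name, ttl):
    now = time.monotonic()
    if locks.get(name, 0) > now:
        return False          # lock is live, refuse
    locks[name] = now + ttl
    return True

a_holds = acquire("job-42", ttl=0.1)   # Process A: 100 ms TTL
time.sleep(0.15)                       # A is paused for longer than its TTL
b_holds = acquire("job-42", ttl=0.1)   # the lock has expired, so B succeeds
# A never crashed and never released -- yet B now holds the same lock.
print(a_holds, b_holds)  # True True: both believe they are the holder
```

Nothing in this code is buggy in the conventional sense; the overlap follows directly from the TTL design whenever a pause outlasts the TTL.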
This is the distributed lock’s fundamental hazard. It cannot be eliminated by a longer TTL, only made less frequent. The standard mitigation is fencing tokens.