Distributed Locks: How They Work and Why They Fail

Two services. One shared resource. No coordination. The result is not a race condition — it is a guarantee of one.

A distributed lock is the primitive that turns "two processes might collide" into "exactly one process runs at a time." It is one of the most commonly needed and most commonly broken mechanisms in distributed systems. Engineers reach for it instinctively and implement it incorrectly in ways that do not fail loudly — they fail silently, intermittently, and in production.

The Problem a Distributed Lock Solves

In a single-process system, a mutex works. One thread acquires it. Others wait. The operating system enforces the exclusion at the hardware level.

Distributed systems have no shared hardware. Two services running on two machines have no common memory, no common clock, and no way to coordinate directly. Yet they often need to coordinate: one cron job should run, not two. One write to a shared record should proceed, not a concurrent pair that overwrites each other. One payment should process, not a duplicate that charges the user twice.

The distributed lock solves this by introducing a third party — a coordination service — that both processes agree to consult before acting.

How a Distributed Lock Works

The core protocol is simple:

Process A writes a key to a coordination service (Redis, etcd, ZooKeeper) with a short TTL
If the write succeeds, Process A holds the lock
Process B tries the same write — it fails because the key exists
Process B waits and retries
When Process A finishes, it deletes the key
Process B's next retry succeeds

The TTL is the safety valve. If Process A crashes before deleting the key, the lock expires automatically. Without the TTL, a crashed process holds the lock forever.

Process A                   Redis                   Process B
   │                           │                          │
   │─── SET lock:job NX PX 5000 ─▶│                       │
   │◀── OK ─────────────────────│                         │
   │  (holds lock)             │                          │
   │                           │─── SET lock:job NX PX 5000 ◀─│
   │                           │──── (nil) ──────────────▶│
   │                           │  (key exists, fails)     │
   │  ... do work ...          │                          │
   │─── DEL lock:job ──────────▶│                         │
   │                           │                          │
   │                           │─── SET lock:job NX PX 5000 ◀─│
   │                           │──── OK ─────────────────▶│
   │                           │                          │  (holds lock)

The Redis command SET key value NX PX ttl does this atomically. NX means "only if not exists." PX ttl sets the expiry in milliseconds. One round trip. One atomic operation. No race condition on the acquisition itself.

Where It Fails

1. Expiry before work completes

Process A acquires a lock with a 5-second TTL. Its work takes 7 seconds due to a slow database query. At second 5, the lock expires. Process B acquires it. At second 7, Process A finishes and deletes the key — which is now Process B's lock.

Both processes ran simultaneously. The lock provided no exclusion.

Mitigation: Use a lock renewal pattern. Process A periodically extends the TTL while working. If it crashes, the extension stops and the lock expires normally.

2. Deleting someone else's lock

Process A holds a lock. The TTL expires. Process B acquires the lock. Process A finishes and deletes the key — deleting Process B's lock.

Mitigation: Include a unique token in the lock value. Before deleting, verify the token matches. Use a Lua script in Redis to make the check-and-delete atomic:

if redis.call("get", KEYS[1]) == ARGV[1] then
  return redis.call("del", KEYS[1])
else
  return 0
end

3. Network partition

The coordination service becomes unreachable. Process A cannot acquire or release locks. The system halts — or worse, both processes proceed without coordination because they assumed "can't reach Redis" means "safe to proceed."

Mitigation: Fail closed. If the coordination service is unreachable, do not proceed. The alternative — proceeding without a lock — is the exact scenario the lock was designed to prevent.

4. Clock skew

Distributed systems do not share a clock. Two machines' clocks can drift apart by tens of milliseconds or more. A lock that relies on wall-clock time for TTL calculation can expire earlier on one machine than expected.

Mitigation: Use monotonic time locally for lock duration. Let the coordination service manage expiry — its clock is the only one that matters for TTL.

Redlock: When One Redis Instance Is Not Enough

A single Redis instance is a single point of failure. If it goes down, all lock acquisition fails. For high-availability scenarios, the Redlock algorithm acquires a lock on a majority of N independent Redis instances (typically 5). The lock is held only if acquisition succeeds on at least (N/2 + 1) instances within a time bound.

Redlock is controversial. Martin Kleppmann's analysis identified scenarios where it fails under process pauses and clock drift. For most applications, a single Redis instance with replication is sufficient. Redlock is worth the complexity only when lock correctness is a hard safety requirement, not a performance optimization.

Distributed Locks vs. Transactions

A distributed lock is not a transaction. A transaction ensures atomicity across multiple operations on one database. A distributed lock ensures mutual exclusion across multiple processes. They solve different problems.

Using a distributed lock to simulate a transaction is a mistake that produces correctness gaps. Using a transaction where a distributed lock is needed is a mistake that produces unnecessary coupling to one database.

The Right Use Cases

Distributed locks fit a specific pattern: one operation per unit of time, enforced across multiple processes. Common examples:

Cron jobs that should not run concurrently
Cache warming tasks that should run once when cache is cold
Payment processing where double-execution causes real harm
Leader election where one node should coordinate work

They do not fit: rate limiting (use a counter with TTL instead), read access control (use permissions instead), or enforcing data integrity across records (use transactions instead).

How This Connects to the Series

Distributed locks appear in Book 3 (chapter 23) as a coordination primitive for distributed systems. They illustrate the intersection of AT1 (consistency vs. availability — a lock chooses consistency by halting when the coordination service is unreachable), FM1 (the coordination service itself is a single point of failure), and FM4 (the subtle ways processes can disagree on who holds the lock). The series returns to them in Book 4 when discussing system design patterns that require exactly-once execution semantics.