At 2:47 AM, two background workers on separate machines both check the job queue. Both see the same pending job. Both claim it. Both execute it. The job runs twice. In one case, that means a user is charged twice for a subscription. In another, two conflicting database writes corrupt a record in a way that goes unnoticed until a customer calls.
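The failure above comes from a check-then-claim sequence that is not atomic. A minimal sketch (all names hypothetical, with a plain dict standing in for the queue) shows the interleaving: both workers read the job's status before either writes its claim, so both see "pending" and both proceed.

```python
# Hypothetical in-memory job queue standing in for shared storage.
queue = {"job-1": {"status": "pending", "claimed_by": None}}

def check(job_id):
    # Step 1: read. Nothing prevents two workers from both reading
    # "pending" before either has written anything.
    return queue[job_id]["status"] == "pending"

def claim(worker, job_id):
    # Step 2: write. The status is not re-checked here, so a second
    # writer silently overwrites the first worker's claim.
    queue[job_id]["status"] = "claimed"
    queue[job_id]["claimed_by"] = worker

# The interleaving from the 2:47 AM incident: both checks complete
# before either claim runs.
a_sees_pending = check("job-1")
b_sees_pending = check("job-1")
if a_sees_pending:
    claim("worker-a", "job-1")
if b_sees_pending:
    claim("worker-b", "job-1")  # the job is now executed twice

print(a_sees_pending, b_sees_pending)  # → True True
```

On one machine the fix is to make the read and write a single atomic step; the next paragraph is about why that is hard to do across machines.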
The underlying problem is not a bug in either worker. Both workers followed their logic correctly. The problem is that mutual exclusion — the guarantee that only one process operates on shared state at a time — stops working when the processes are on different machines. On a single machine, a lock is ultimately a shared variable in memory that protects a critical section. Across machines, there is no shared memory. There is only a network.
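On a single machine, the standard lock primitive closes the race directly. A sketch using Python's `threading.Lock` (with the same hypothetical queue as in the scenario above) makes check-then-claim one atomic critical section, so exactly one of two concurrent workers wins:

```python
import threading

queue = {"job-1": {"status": "pending", "claimed_by": None}}
lock = threading.Lock()
claims = []  # records which workers actually claimed the job

def worker(name):
    # The lock serializes the whole check-then-claim sequence.
    # Whichever thread enters second sees the job is no longer
    # pending and backs off instead of claiming it again.
    with lock:
        if queue["job-1"]["status"] == "pending":
            queue["job-1"]["status"] = "claimed"
            queue["job-1"]["claimed_by"] = name
            claims.append(name)

threads = [threading.Thread(target=worker, args=(n,))
           for n in ("worker-a", "worker-b")]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(claims))  # → 1: exactly one worker claims the job
```

This works only because both threads share the same process memory where the lock lives. Once the workers run on different machines, there is no such shared location, and something else has to play the lock's role.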
Distributed locks are the mechanism that restores mutual exclusion across machines. They are not simple. They carry failure modes that do not exist with single-machine locks. Understanding why they are hard, how they are implemented, and when to avoid them entirely is what separates systems that stay correct under failure from systems that behave correctly only when nothing goes wrong.