The Computing Series

How It Works

Fencing Tokens

A fencing token is a monotonically increasing number issued by the lock service when a lock is acquired. Every lock acquisition gets a higher token than the previous one. The storage layer that the lock protects checks incoming writes against the highest token it has seen and rejects writes with old tokens.

// Lock service issues a fencing token with each lock grant
function acquire_lock(resource):
    token = atomic_increment(lock_service.counter)
    lock_record = {
        holder: this_process_id,
        token: token,
        expires_at: now() + TTL
    }
    success = lock_service.set_if_absent(resource, lock_record)
    if success:
        return Lock(resource=resource, token=token)
    else:
        return None  // lock is held by another process

// The protected storage layer enforces token ordering
function storage_write(resource, data, fencing_token):
    current_max_token = storage.get_max_seen_token(resource)
    if fencing_token <= current_max_token:
        // Reject: this write is from an old lock holder
        // Process A resumed after its lock expired — reject its write
        raise StaleWriteError("token {} rejected, current max is {}".format(
            fencing_token, current_max_token))
    storage.set_max_seen_token(resource, fencing_token)
    storage.write(resource, data)
    return true

// Process A scenario with fencing
lock_a = acquire_lock("job-42")      // A gets token=7
// ... A pauses for 11 seconds, lock expires ...
lock_b = acquire_lock("job-42")      // B gets token=8
storage_write("job-42", b_result, token=8)  // succeeds
// A resumes and tries to write
storage_write("job-42", a_result, token=7)  // REJECTED: token 7 < 8
// A receives StaleWriteError and aborts — data is correct

Fencing tokens shift the safety guarantee from the lock service to the storage layer. Even if the lock service has a bug, even if the TTL is too short, even if a process pauses for minutes, the storage layer rejects stale writes. The invariant — only the most recent lock holder’s writes succeed — is enforced where it matters most.
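The pseudocode above condenses into a short runnable sketch. This is an in-memory stand-in for the protected storage layer; `FencedStore` and its methods are illustrative names, not a real API.

```python
class StaleWriteError(Exception):
    pass

class FencedStore:
    """Stores one value per resource, guarded by a fencing token."""

    def __init__(self):
        self._data = {}        # resource -> value
        self._max_token = {}   # resource -> highest token seen

    def write(self, resource, value, fencing_token):
        max_seen = self._max_token.get(resource, 0)
        if fencing_token <= max_seen:
            # Reject: this write is from a stale lock holder
            raise StaleWriteError(
                f"token {fencing_token} rejected, current max is {max_seen}")
        self._max_token[resource] = fencing_token
        self._data[resource] = value

    def read(self, resource):
        return self._data.get(resource)

store = FencedStore()
store.write("job-42", "b_result", fencing_token=8)      # B's write succeeds
try:
    store.write("job-42", "a_result", fencing_token=7)  # A's stale write
except StaleWriteError:
    pass                                                # A aborts
print(store.read("job-42"))  # → b_result
```

Note that the token check and the write must be atomic in a real storage system; the single-threaded sketch glosses over that.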

Redis-Based Locks (Redlock)

Redis is commonly used as a distributed lock service because it is fast and supports atomic operations. The simplest Redis lock uses SET key value NX PX ttl, which sets a key only if it does not exist and sets an expiry.

// Simple Redis lock — single node
function acquire_redis_lock(redis_client, resource, ttl_ms):
    unique_token = generate_uuid()  // unique per acquisition attempt
    result = redis_client.set(
        key=resource,
        value=unique_token,        // used to verify ownership on release
        condition=NOT_EXISTS,
        expiry=ttl_ms
    )
    if result == OK:
        return LockHandle(resource=resource, token=unique_token)
    else:
        return None

// Release must verify ownership — do not release someone else's lock
function release_redis_lock(redis_client, lock_handle):
    // Lua script ensures check-and-delete is atomic
    script = """
        if redis.call("GET", KEYS[1]) == ARGV[1] then
            return redis.call("DEL", KEYS[1])
        else
            return 0
        end
    """
    result = redis_client.eval(
        script=script,
        keys=[lock_handle.resource],
        args=[lock_handle.token]
    )
    return result == 1  // 1 = deleted, 0 = token mismatch (already expired)
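The same acquire/verify/release pattern can be exercised without a Redis server. `SimpleLockStore` below is a hypothetical in-memory stand-in for a single Redis node; a real deployment would use a Redis client with `SET ... NX PX` and the Lua script above.

```python
import time
import uuid

class SimpleLockStore:
    """In-memory approximation of a single Redis node's lock behavior."""

    def __init__(self):
        self._locks = {}  # resource -> (token, expires_at)

    def acquire(self, resource, ttl_ms):
        """Roughly: SET resource token NX PX ttl_ms."""
        entry = self._locks.get(resource)
        if entry is not None and entry[1] > time.monotonic():
            return None  # held and not yet expired
        token = str(uuid.uuid4())
        self._locks[resource] = (token, time.monotonic() + ttl_ms / 1000.0)
        return token

    def release(self, resource, token):
        """Atomic check-and-delete: only the owner's token releases."""
        entry = self._locks.get(resource)
        if entry is not None and entry[0] == token:
            del self._locks[resource]
            return True
        return False  # token mismatch: expired, or held by someone else

store = SimpleLockStore()
t1 = store.acquire("job-42", ttl_ms=10_000)
assert store.acquire("job-42", ttl_ms=10_000) is None  # already held
assert store.release("job-42", "wrong-token") is False  # ownership check
assert store.release("job-42", t1) is True
```

The ownership check in `release` is the part the Lua script makes atomic on real Redis; without it, a slow client could delete a lock that has since expired and been granted to someone else.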

The single-node Redis lock has a fatal flaw: Redis replication is asynchronous, so if the node fails after granting a lock but before the TTL expires, the replica promoted to replace it may have no record of the lock. Another process can acquire it immediately.

Redlock addresses this by requiring a majority of independent Redis nodes to agree:

// Redlock: acquire lock on majority of N independent Redis nodes
// N should be odd; recommend N=5
function redlock_acquire(redis_nodes, resource, ttl_ms):
    unique_token = generate_uuid()
    start_time = now_ms()
    acquired_count = 0
    acquired_nodes = []

    for node in redis_nodes:
        try:
            // Use the same token on every node; acquire_redis_lock would
            // generate a fresh token per call, and release could then
            // never verify ownership
            result = node.set(resource, unique_token,
                              condition=NOT_EXISTS, expiry=ttl_ms)
            if result == OK:
                acquired_count += 1
                acquired_nodes.append(node)
        except NetworkError:
            // Treat an unreachable node as lock not acquired on that node
            continue

    elapsed_ms = now_ms() - start_time
    remaining_validity_ms = ttl_ms - elapsed_ms - CLOCK_DRIFT_MARGIN

    majority = len(redis_nodes) // 2 + 1  // floor division, then +1

    if acquired_count >= majority and remaining_validity_ms > 0:
        // Lock acquired on majority of nodes with time remaining
        return RedLock(
            token=unique_token,
            resource=resource,
            acquired_nodes=acquired_nodes,
            validity_ms=remaining_validity_ms
        )
    else:
        // Failed to acquire majority — release all partial acquisitions
        for node in acquired_nodes:
            release_redis_lock(node, LockHandle(resource, unique_token))
        return None

// Redlock release: release on all nodes where lock was acquired
function redlock_release(lock):
    for node in lock.acquired_nodes:
        try:
            release_redis_lock(node, LockHandle(lock.resource, lock.token))
        except NetworkError:
            continue  // Lock will expire via TTL if release fails

Redlock’s correctness depends on the assumption that clock drift across nodes is small relative to the TTL. If clocks drift significantly, the effective validity window may be zero even after a successful majority acquisition. Martin Kleppmann’s critique of Redlock (2016) demonstrates that it cannot guarantee safety in all network partition scenarios. For workloads where lock correctness is absolutely critical, ZooKeeper or etcd-based locks are more appropriate.
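The majority and validity arithmetic from redlock_acquire is small enough to spell out. The drift margin below follows the reference Redlock description (roughly 1% of the TTL plus a couple of milliseconds); treat the exact constants as assumptions.

```python
CLOCK_DRIFT_FACTOR = 0.01  # assumed: 1% of TTL, plus a fixed 2 ms

def majority(node_count):
    """Smallest number of nodes that constitutes a majority."""
    return node_count // 2 + 1

def remaining_validity_ms(ttl_ms, elapsed_ms):
    """How long the lock is still safely usable after acquisition."""
    drift_margin = ttl_ms * CLOCK_DRIFT_FACTOR + 2
    return ttl_ms - elapsed_ms - drift_margin

assert majority(5) == 3
assert remaining_validity_ms(10_000, 150) > 0      # fast acquisition: usable
assert remaining_validity_ms(10_000, 9_950) <= 0   # slow acquisition: give up
```

This is also why a Redlock caller must check validity_ms before doing work: a majority acquisition that took nearly the whole TTL leaves no safe window.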

ZooKeeper-Based Locks

ZooKeeper provides a coordination service with ephemeral nodes: nodes that are automatically deleted when the client session that created them ends. A crashed client’s session expires, and its ephemeral nodes disappear. This solves the TTL problem: the lock is held for as long as the session is alive, not for a fixed time.

// ZooKeeper lock using ephemeral sequential nodes
// Each lock attempt creates an ephemeral sequential node under /locks/resource/
// ZooKeeper appends a sequence number: /locks/job-42/lock-0000000001

function zk_acquire_lock(zk_client, resource):
    base_path = "/locks/" + resource

    // Create an ephemeral sequential node — ZK assigns sequence number
    my_path = zk_client.create(
        path=base_path + "/lock-",
        data=this_process_id,
        mode=EPHEMERAL_SEQUENTIAL    // deleted when session expires
    )
    // my_path is now something like /locks/job-42/lock-0000000003

    while true:
        // List all lock nodes for this resource
        children = zk_client.get_children(base_path)
        sort(children)  // ZK sequence numbers are sortable

        my_sequence = extract_sequence(my_path)
        min_sequence = extract_sequence(children[0])

        if my_sequence == min_sequence:
            // I am the smallest — I hold the lock
            return ZKLock(path=my_path, resource=resource)

        // I am not the smallest — watch the node immediately before me
        // (not the smallest, to avoid thundering herd)
        predecessor = find_predecessor(children, my_sequence)
        if predecessor is None:
            // Predecessor disappeared between listing and watching — retry
            continue

        // Block until predecessor is deleted (lock released or session expired)
        zk_client.wait_for_deletion(
            path=base_path + "/" + predecessor,
            timeout=LOCK_WAIT_TIMEOUT
        )
        // Loop: re-check if I am now the smallest

function zk_release_lock(zk_client, lock):
    // Deleting the ephemeral node releases the lock
    // If the process crashes, ZK session expiry deletes it automatically
    zk_client.delete(lock.path)

The ZooKeeper lock has a key property: there is no TTL to expire prematurely. The lock is held exactly as long as the session is alive. If the process holding the lock is paused for 30 seconds, it still holds the lock — no other process can acquire it. If the process dies, ZooKeeper detects the session expiry (within session timeout, typically 2–10 seconds) and deletes the ephemeral node.

The cost of this design is that ZooKeeper itself must be highly available, which means running an ensemble of three or five nodes in a quorum configuration. ZooKeeper is a serious operational commitment.
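The ordering logic that makes the ZooKeeper lock fair can be modeled in a few lines: each contender gets a monotonically increasing sequence number, and whoever holds the smallest live sequence owns the lock. `LockQueue` is a toy model with illustrative names; a real client would use a library such as kazoo, which implements this recipe.

```python
class LockQueue:
    """Toy model of ephemeral sequential lock nodes under one resource."""

    def __init__(self):
        self._next_seq = 0
        self._live = {}  # sequence number -> owning session

    def enqueue(self, owner):
        """Create an ephemeral sequential node; returns its sequence."""
        seq = self._next_seq
        self._next_seq += 1
        self._live[seq] = owner
        return seq

    def holder(self):
        """Owner of the smallest live sequence node, or None."""
        if not self._live:
            return None
        return self._live[min(self._live)]

    def session_expired(self, owner):
        """Ephemeral nodes vanish when their session ends."""
        for seq in [s for s, o in self._live.items() if o == owner]:
            del self._live[seq]

q = LockQueue()
q.enqueue("A")
q.enqueue("B")
q.enqueue("C")
assert q.holder() == "A"   # smallest sequence holds the lock
q.session_expired("A")     # A crashes; its ephemeral node disappears
assert q.holder() == "B"   # B's watch on A's node fires; B holds the lock
```

Because C watches B rather than A, A's departure wakes only B, which is the thundering-herd avoidance the pseudocode's predecessor watch provides.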

Lock-Free Alternatives: CAS and Optimistic Locking

Many workloads do not need a lock if they can use compare-and-swap (CAS): an atomic operation that updates a value only if it currently matches an expected value.

// Compare-and-swap on a distributed database (e.g., DynamoDB conditional write)
function update_record_with_cas(db, record_id, new_value):
    max_retries = 5
    for attempt in range(max_retries):
        // Read current state including version number
        current = db.get(record_id)
        current_version = current.version

        // Compute new state based on current state
        new_record = compute_new_state(current, new_value)
        new_record.version = current_version + 1

        // Write only if version hasn't changed since we read
        try:
            db.conditional_write(
                record_id=record_id,
                value=new_record,
                condition="version == " + str(current_version)
            )
            return true  // Write succeeded: we were the only writer
        except ConditionalCheckFailed:
            // Another process wrote concurrently — retry with fresh read
            if attempt == max_retries - 1:
                raise TooManyRetriesError()
            backoff(attempt)  // Exponential backoff before retry
            continue

// Optimistic locking: reads don't block; writes fail if concurrent write detected
// Good for: low write contention, short critical sections
// Bad for: high write contention (many retries waste work)

CAS eliminates the need for a lock service entirely. The storage system itself provides the atomicity guarantee. When write contention is low, CAS is faster than acquiring a distributed lock. When contention is high, CAS degrades: many writers retry repeatedly, amplifying load on the database.

The practical rule: use CAS when conflicts are rare and retries are cheap. Use distributed locks when conflicts are common and wasted work is unacceptable.
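The versioned conditional write runs fine against an in-memory dict, which makes the retry loop easy to see end to end. `VersionedStore` and `ConditionalCheckFailed` are illustrative names (DynamoDB's actual exception is ConditionalCheckFailedException).

```python
class ConditionalCheckFailed(Exception):
    pass

class VersionedStore:
    """In-memory stand-in for a database with conditional writes."""

    def __init__(self):
        self._rows = {}  # record_id -> (version, value)

    def get(self, record_id):
        return self._rows.get(record_id, (0, None))

    def conditional_write(self, record_id, value, expected_version):
        version, _ = self.get(record_id)
        if version != expected_version:
            raise ConditionalCheckFailed()  # concurrent writer got here first
        self._rows[record_id] = (version + 1, value)

def cas_update(db, record_id, compute, max_retries=5):
    """Read-modify-write loop: retry with a fresh read on conflict."""
    for attempt in range(max_retries):
        version, current = db.get(record_id)
        try:
            db.conditional_write(record_id, compute(current), version)
            return True
        except ConditionalCheckFailed:
            continue  # another writer won; re-read and retry
    return False

db = VersionedStore()
assert cas_update(db, "job-42", lambda v: (v or 0) + 1)
assert db.get("job-42") == (1, 1)
```

A production version would add exponential backoff between retries, as in the pseudocode above; omitting it here keeps the sketch deterministic.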

