A batch processing system uses Redis Redlock to ensure only one worker processes each job. Workers sometimes pause for up to 8 seconds during garbage collection. The current TTL is 5 seconds. Describe two distinct failure scenarios caused by this TTL, and explain how fencing tokens would change the outcome of each scenario.
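For reference, the fencing-token mechanism the question asks about can be sketched with two toy classes (the names `LockService` and `Storage` are illustrative, not part of any real library): the lock service hands out a monotonically increasing token with each acquisition, and storage rejects any write carrying a token older than the newest it has seen.

```python
import itertools


class LockService:
    """Toy lock service: each acquisition returns a monotonically increasing fencing token."""

    def __init__(self):
        self._counter = itertools.count(1)

    def acquire(self):
        return next(self._counter)


class Storage:
    """Storage rejects writes carrying a fencing token older than the newest seen."""

    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def write(self, token, key, value):
        if token < self.highest_token:
            raise PermissionError(f"stale token {token} < {self.highest_token}")
        self.highest_token = token
        self.data[key] = value


lock = LockService()
store = Storage()

t1 = lock.acquire()            # worker 1 takes the lock (token 1)
# worker 1 pauses for GC; the 5 s TTL expires and worker 2 acquires:
t2 = lock.acquire()            # token 2
store.write(t2, "job-42", "done by worker 2")
try:
    store.write(t1, "job-42", "done by worker 1")  # stale write after the GC pause
except PermissionError as e:
    print("rejected:", e)
```

Without the token check, worker 1's late write would silently clobber worker 2's result; with it, the stale holder is fenced off at the storage layer.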
A system uses ZooKeeper ephemeral nodes for distributed locking. The ZooKeeper session timeout is set to 2 seconds. A network partition isolates a lock holder for 3 seconds before recovering. Walk through what happens to the lock during and after the partition, and identify which process holds the lock after recovery.
A database has two types of writes: high-frequency low-conflict user profile updates (thousands of users, each user’s record updated independently) and low-frequency high-conflict inventory reservation (hundreds of workers competing for the same stock records). Choose between compare-and-swap (CAS) and distributed locks for each workload and justify the choice in terms of contention and retry cost.
A complete answer will: (1) specify the lock acquisition protocol using a quorum of 3 of 5 data centres (Redlock or equivalent) with a fencing token that increments monotonically on each acquisition, (2) identify FM12 (network partition causing a lock holder to lose quorum and a second client to acquire the lock) as the residual failure mode that the design cannot fully eliminate, (3) address the AT1 tradeoff between a short lock TTL (faster recovery from holder failure, but higher risk of the lock expiring during a slow operation) and a long TTL (the holder can complete slow operations, but recovery from holder failure is slow), and (4) specify what the application does when lock acquisition fails — return an error to the caller rather than waiting or retrying indefinitely — and justify this as the correct behaviour for a payment system.
A complete answer will: (1) identify the correct partitioning strategy — assign jobs to workers by hashing job_id to a worker_id, so each worker owns a disjoint partition and no lock is needed, (2) identify FM1 (single point of failure — if a worker dies, its partition stalls until reassigned) as the new failure mode the partitioning strategy introduces in place of the lock, (3) address the AT5 tradeoff between centralised locking (all workers can process any job, higher contention) and partitioned ownership (no contention, worker failure blocks a partition), and (4) specify the invariant the design preserves — exactly-once processing per job — and describe the rebalancing protocol when a worker joins or leaves.
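The hashing in clause (1) is a one-liner in spirit; a minimal sketch (the `owner` function is an illustrative name, and plain modulo is assumed for simplicity):

```python
import hashlib


def owner(job_id: str, workers: list) -> str:
    """Deterministically map a job to exactly one worker, so partitions are
    disjoint and no lock is needed for mutual exclusion."""
    digest = int(hashlib.sha256(job_id.encode()).hexdigest(), 16)
    return workers[digest % len(workers)]


workers = ["w0", "w1", "w2"]
for job in ("job-1", "job-2", "job-3"):
    print(job, "->", owner(job, workers))
```

Note that plain modulo reassigns most jobs whenever the worker list changes, which is exactly why clause (4) asks for a rebalancing protocol — a scheme such as consistent hashing limits movement to the departed worker's partition.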
A complete answer will: (1) correctly explain why the per-user distributed lock creates a serialisation bottleneck — each request must acquire and release a lock on a shared resource, limiting throughput to one request per lock RTT per user — and quantify the latency impact at high concurrency, (2) design the Redis token bucket as a Lua script invoked via EVALSHA that atomically reads the bucket count and decrements it in a single round trip, with the approximation error bounded by the number of API servers that can race before the next Redis sync, (3) identify FM6 (a single Redis shard serving all user buckets becomes a hotspot for high-traffic users) as the failure mode, and (4) define “approximate” precisely — up to N×limit requests may be accepted in a window, where N is the number of API servers in the worst-case race, before the bucket is synchronised.
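The atomic script in clause (2) might look like the sketch below: a Lua refill-and-decrement for Redis (the hash field names `tokens`/`last` are assumptions of this sketch), plus a pure-Python model of the same logic for reasoning about its behaviour without a Redis server. Redis executes the whole script atomically, so read, refill, and decrement happen in one round trip.

```python
# Lua script a server would load once and invoke via EVALSHA (sketch).
# KEYS[1] = bucket key, ARGV[1] = capacity, ARGV[2] = refill per second, ARGV[3] = now
TOKEN_BUCKET_LUA = """
local cap    = tonumber(ARGV[1])
local rate   = tonumber(ARGV[2])
local now    = tonumber(ARGV[3])
local tokens = tonumber(redis.call('HGET', KEYS[1], 'tokens') or cap)
local last   = tonumber(redis.call('HGET', KEYS[1], 'last') or now)
tokens = math.min(cap, tokens + (now - last) * rate)
local allowed = 0
if tokens >= 1 then
    tokens = tokens - 1
    allowed = 1
end
redis.call('HSET', KEYS[1], 'tokens', tokens, 'last', now)
return allowed
"""


def token_bucket(state, capacity, refill_per_s, now):
    """Pure-Python model of the script's logic: refill by elapsed time, then
    take one token if available."""
    tokens = min(capacity,
                 state.get("tokens", capacity)
                 + (now - state.get("last", now)) * refill_per_s)
    allowed = tokens >= 1
    if allowed:
        tokens -= 1
    state["tokens"], state["last"] = tokens, now
    return allowed


state = {}
print([token_bucket(state, capacity=2, refill_per_s=1, now=0) for _ in range(3)])
# → [True, True, False]: capacity 2 with no refill between calls admits two requests
```

The "approximate" part of clause (4) does not come from this script, which is atomic; it comes from API servers admitting requests against locally cached counts between syncs, which is where the N×limit worst case arises.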