The Computing Series

In Practice — Post-Mortem: The Notification Outage

Scenario: A B2B SaaS product sends email and in-app notifications when a report is ready. On a Tuesday afternoon, notifications stopped delivering for 47 minutes. No page fired for the first 23 minutes. This is what happened.

The system: A notification worker pulls jobs from Redis. For each job, it looks up recipient preferences in Postgres, sends the notification, and marks the job done. Redis is a single node. The Postgres connection pool is capped at 100 connections.
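The worker loop described above can be sketched as follows. The names (lookup_preferences, send_notification) and the in-memory queue and dict are hypothetical stand-ins for the real Redis, Postgres, and delivery calls, so the sketch is self-contained:

```python
import queue

# In-memory stand-ins for Redis (jobs) and Postgres (preferences).
jobs = queue.Queue()
preferences = {"acct-1": {"email": True, "in_app": True}}
sent = []

def lookup_preferences(recipient):
    # In the real system this checks out a Postgres connection from a
    # pool capped at 100 — that cap is the shared bottleneck in the incident.
    return preferences.get(recipient, {})

def send_notification(job, prefs):
    if prefs.get("email"):
        sent.append(("email", job["recipient"]))
    if prefs.get("in_app"):
        sent.append(("in_app", job["recipient"]))

def process_one():
    job = jobs.get_nowait()                      # pull a job from the queue
    prefs = lookup_preferences(job["recipient"])  # DB lookup per job
    send_notification(job, prefs)
    jobs.task_done()                              # mark the job done

jobs.put({"recipient": "acct-1", "report": "weekly"})
process_one()
```

Note the per-job DB lookup: every queued job costs a connection-pool checkout, which is why a backlog translates directly into connection demand.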


Timeline:

14:02 — A routine deployment restarts the notification worker fleet. All 40 worker instances come back online simultaneously. Each instance immediately begins polling Redis for queued jobs.

14:02–14:04 — Redis, which had been accumulating jobs during the 90-second deployment downtime, delivers the backlog to all 40 workers at once. Each worker fetches a job and opens a Postgres connection to look up recipient preferences. Within 90 seconds, all 100 connection slots are exhausted; the remaining workers queue for a slot.

14:04 — New notification jobs arrive from the application tier. Workers are stuck waiting for Postgres connections. Jobs queue in Redis. The queue depth begins growing.

14:08 — The application tier starts receiving timeout errors from the notification worker API. It begins retrying. Retry traffic adds to the job queue in Redis. More workers, now processing retries, contend for the same 100 connection slots.

14:14 — Redis memory usage spikes as the job queue grows without a depth limit.

14:21 — Redis begins rejecting new enqueue operations. The application tier’s notification API starts returning 500 errors to callers.

14:25 — An on-call engineer notices elevated error rates in the application tier dashboard. The causal link to the notification worker is not immediately obvious: the worker has no dashboard entry.

14:49 — Root cause identified. The notification worker fleet is restarted with staggered startup delays. Postgres connections are released. The queue drains over the next 11 minutes.


Failure mode identification:

FM7 — Thundering Herd (the trigger): All 40 workers reconnected simultaneously after the deployment and immediately began processing the queued backlog. The spike in Postgres connection demand was not a traffic anomaly — it was a structural consequence of simultaneous restart combined with queued work.

Prevention: Jittered startup delays between worker instances. A warm-up period where each instance processes a fraction of full load for 30 seconds before taking on full capacity.
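A minimal sketch of both mitigations, under the assumptions stated in the text (a 30-second warm-up; the 20-second jitter window is illustrative and would be tuned to the deploy tooling):

```python
import random

JITTER_WINDOW_S = 20.0  # hypothetical spread for first poll after restart
WARMUP_S = 30.0         # warm-up period from the text

def start_delay():
    # Each instance sleeps a random delay before its first Redis poll,
    # so the fleet does not hit Redis and Postgres simultaneously.
    return random.uniform(0.0, JITTER_WINDOW_S)

def warmup_fraction(seconds_since_start):
    # Fraction of full load this instance should take on:
    # a linear ramp from 0 to 1 over the warm-up period.
    return min(1.0, max(0.0, seconds_since_start / WARMUP_S))
```

A worker would multiply its concurrency limit by warmup_fraction, so a restarted fleet ramps connection demand gradually instead of all at once.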

FM3 — Unbounded Resource Consumption (the amplifier): The Postgres connection pool had a maximum of 100, but the Redis job queue had no maximum depth. As workers stalled waiting for connections, the queue grew without limit. When queue memory exceeded Redis’s configured limit, Redis began dropping new writes — affecting a system (the application tier) that had no visibility into queue depth.

Prevention: A maximum depth on the job queue with backpressure to the caller — reject new jobs when the queue exceeds N, returning a retriable error rather than silently failing later.
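A sketch of bounded enqueue with backpressure. The depth limit and QueueFull signal are illustrative; in production the check would wrap the Redis enqueue (e.g., a depth check or a Lua script around the push):

```python
class QueueFull(Exception):
    """Retriable error returned to the caller instead of failing silently later."""

def enqueue(q, job, max_depth):
    # Reject new work when the queue is at capacity; the caller gets an
    # immediate, retriable error rather than a dropped write at 14:21.
    if len(q) >= max_depth:
        raise QueueFull(f"queue depth {len(q)} >= {max_depth}; retry later")
    q.append(job)

q = []
enqueue(q, {"id": 1}, max_depth=2)
enqueue(q, {"id": 2}, max_depth=2)
try:
    enqueue(q, {"id": 3}, max_depth=2)
    rejected = False
except QueueFull:
    rejected = True
```

The point of the explicit error is that backpressure surfaces at enqueue time, where the caller can still make a decision, instead of as a silent drop downstream.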

FM2 — Cascading Failures (the propagation): The notification worker’s internal resource exhaustion propagated outward when it began returning errors to the application tier. The application tier’s retry logic added more load to the already-saturated system. The cascade travelled upstream — from the connection pool, through the worker API, into the application tier, and then to end users who saw failed actions.

Prevention: A circuit breaker on the notification worker client in the application tier. After a threshold of consecutive errors (e.g., 10 in 10 seconds), stop sending new requests and return a synthetic success to callers — log the failure, retry asynchronously. This breaks the cascade path before it reaches the end user.


The compounding dynamic: Each failure mode made the next worse. The thundering herd exhausted the connection pool faster than a gradual ramp would have. The unbounded queue caused Redis to become a secondary failure point, expanding the blast radius. Cascading retries amplified the load precisely when the system was least able to absorb it. Remove any one of these failure modes and the incident either does not occur or self-recovers within minutes.

The FM11 footnote: 23 minutes elapsed before the incident was detected. The notification worker emitted no metrics and had no dashboard. FM11 (Observability Blindness) did not cause the outage — but it doubled the recovery time. Any failure mode combined with FM11 becomes a multi-hour incident.
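A sketch of the minimal telemetry the worker lacked: a periodic gauge for queue depth and a count of workers waiting on the connection pool. The emit target here is just a list; in production it would be StatsD, Prometheus, or similar, and the metric names are hypothetical:

```python
import time

emitted = []  # stand-in for a metrics backend

def emit(metric, value, now=None):
    emitted.append({"metric": metric, "value": value,
                    "ts": now if now is not None else time.time()})

def report_health(queue_depth, db_pool_waiters):
    # Either of these series, on a dashboard with an alert threshold,
    # would have paged within minutes of 14:02 instead of at 14:25.
    emit("notification_worker.queue_depth", queue_depth)
    emit("notification_worker.db_pool_waiters", db_pool_waiters)

report_health(queue_depth=12000, db_pool_waiters=38)
```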

Read in the book →