A notification is a promise. When a user takes an action — places an order, receives a message, gets mentioned in a comment — the platform promises to tell them. Break that promise silently and nobody notices. Keep it twice and the user is annoyed. Keep it at 3 AM when they have quiet hours enabled and the user is angry. And when the promise must travel through four channels — push, email, SMS, in-app badge — each with its own third-party integration, failure mode, and rate limit, the simplicity of "send a notification" evaporates entirely.
The notification system is one of the most integration-heavy components in any platform. Its architecture is dominated by one shape: fan-out.
Why the Obvious Design Fails
The naive approach sends notifications synchronously inside the request handler that triggered the event. An order is placed; the handler calls APNS, sends an email, sends an SMS, then returns to the client. This fails three ways.
First, third-party API calls add hundreds of milliseconds to the request — users experience slow order placement because of notification delivery. Second, if APNS is temporarily down, the order request itself fails — a downstream notification failure has propagated into the primary business operation. Third, synchronous delivery leaves no layer to intercept and hold a notification, so rate limiting and quiet-hours logic have nowhere to live.
The Event-Driven Architecture
The fix decouples the notification from the thing that triggered it. The Order Service publishes an event to Kafka and returns to the client immediately. A separate Notification Service consumes those events asynchronously and fans out to channel-specific workers — a push worker calling APNS/FCM, an email worker calling SMTP, an SMS worker calling Twilio. A failure in notification delivery can no longer block an order confirmation.
This is AT10 (Synchronous vs Asynchronous) chosen deliberately: async delivery decouples notification failures from primary operations. The cost is end-to-end latency — for a non-priority notification, the path event → Kafka → notification service → channel queue → channel worker → APNS can take minutes if any hop is slow.
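The decoupling described above can be sketched in a few lines. This is a minimal illustration, not the platform's implementation: an in-memory `queue.Queue` stands in for the Kafka topic, and the names `place_order`, `notification_service`, and the event fields are assumptions made for the example.

```python
import queue
import threading

# In-memory stand-in for a Kafka topic (assumption: a real system would
# publish through a Kafka client instead).
order_events = queue.Queue()
delivered = []  # records what the hypothetical notification service sent


def place_order(order_id, user_id):
    """Request handler: publish an event and return without waiting for delivery."""
    order_events.put({"type": "order_placed", "order_id": order_id, "user_id": user_id})
    return {"status": "confirmed", "order_id": order_id}


def notification_service():
    """Consumer loop: runs off the request path, so a delivery failure
    here cannot fail place_order."""
    while True:
        event = order_events.get()
        if event is None:  # sentinel used only to stop this demo worker
            return
        delivered.append(f"push:{event['user_id']}:{event['order_id']}")


worker = threading.Thread(target=notification_service)
worker.start()
response = place_order("o-1", "u-42")  # returns before any delivery happens
order_events.put(None)
worker.join()
```

The handler's return value depends only on the publish step; everything after the queue happens on the worker thread.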
Fan-out — one event becomes many deliveries
Order Service
│ publish event
▼
[ Kafka ]
│
Notification Service ── fetch prefs + devices (on-demand)
│
├──▶ push worker ──▶ APNS ──▶ iPhone
├──▶ push worker ──▶ APNS ──▶ iPad
├──▶ email worker ──▶ SMTP ──▶ inbox
└──▶ sms worker ──▶ Twilio ╳ (not in prefs)
One Event, Many Deliveries
The fan-out is not one event to one notification. It is one event to many deliveries. A user with an iPhone and an iPad receives a push on both. If they also want email, that is a third delivery. Channel routing resolves it:
- Fetch user preferences: [push_ios, email] (no SMS).
- Fetch user devices: [iPhone-A, iPad-B].
- Generate deliveries: push→iPhone-A, push→iPad-B, email→address.
- Enqueue each delivery to its channel worker.
Preferences and device tokens are fetched at delivery time, not embedded in the event — that is AT4 (Precomputation vs On-Demand). If a user changes preferences between event publication and delivery, the on-demand fetch gets the current value; an event carrying a stale preference snapshot would not.
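A rough sketch of channel routing with on-demand lookups follows. The dictionaries `PREFERENCES`, `DEVICES`, `EMAILS`, and `PHONES` are hypothetical stand-ins for the preference and device services, and `Delivery` and `fan_out` are names invented for the example.

```python
from dataclasses import dataclass

# Hypothetical in-memory stores standing in for the preference and device services.
PREFERENCES = {"u-42": ["push_ios", "email"]}
DEVICES = {"u-42": ["iPhone-A", "iPad-B"]}
EMAILS = {"u-42": "user@example.com"}
PHONES = {"u-42": "+15550100"}


@dataclass(frozen=True)
class Delivery:
    channel: str
    target: str
    event_id: str


def fan_out(event):
    """Expand one event into per-channel, per-device deliveries."""
    user = event["user_id"]
    prefs = PREFERENCES.get(user, [])  # looked up now, not read from the event
    deliveries = []
    if "push_ios" in prefs:
        for device in DEVICES.get(user, []):
            deliveries.append(Delivery("push", device, event["event_id"]))
    if "email" in prefs and user in EMAILS:
        deliveries.append(Delivery("email", EMAILS[user], event["event_id"]))
    if "sms" in prefs and user in PHONES:
        deliveries.append(Delivery("sms", PHONES[user], event["event_id"]))
    return deliveries


event = {"event_id": "evt-1", "user_id": "u-42"}
first = fan_out(event)             # two pushes and one email, no SMS

PREFERENCES["u-42"] = ["email"]    # user turns push off after the event was published
second = fan_out(event)            # the same event now yields only the email
```

Because the event carries only identifiers, the second call picks up the changed preference without the event being republished.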
At-Least-Once, Without Duplicates
Push APIs do not guarantee delivery: devices go offline, tokens expire, networks time out. At-least-once delivery means retrying on failure, with exponential backoff and a maximum retry count. But a retry must not deliver the notification twice. Each delivery carries a globally unique notification ID; before sending, the channel worker looks that ID up (an idempotency check in Redis) and skips the send if the notification was already delivered.
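One way to combine retry and deduplication is sketched below, under stated assumptions: a Python `set` stands in for the Redis store, `TransientError` is a made-up exception type for retryable failures, and the backoff parameters are illustrative rather than recommended values.

```python
class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, 5xx responses)."""


sent_ids = set()  # stand-in for the Redis set of delivered notification IDs


def backoff_delays(max_retries, base=0.5, cap=30.0):
    """Exponential backoff schedule in seconds: base, 2*base, 4*base, ... up to cap."""
    return [min(cap, base * 2 ** attempt) for attempt in range(max_retries)]


def deliver(notification_id, send, max_retries=4, sleep=lambda seconds: None):
    """Send at most once per notification_id, retrying transient failures."""
    if notification_id in sent_ids:
        return "skipped"  # already delivered by an earlier attempt
    for delay in [0.0] + backoff_delays(max_retries):
        sleep(delay)  # pass time.sleep in real use; a no-op keeps the demo fast
        try:
            send()
        except TransientError:
            continue
        sent_ids.add(notification_id)
        return "sent"
    return "failed"  # retries exhausted


calls = []


def flaky_send():
    """Demo sender that fails twice, then succeeds."""
    calls.append(1)
    if len(calls) < 3:
        raise TransientError("provider timeout")


first_result = deliver("n-123", flaky_send)
second_result = deliver("n-123", flaky_send)  # redelivery of the same ID
```

The check and the later `add` are separate steps here, so a crash between sending and recording could still produce a duplicate. An atomic claim (in Redis, for instance, `SET` with the `NX` option) narrows that window.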
Two more constraints ride on top. Rate limiting (per user, per type, per window) stops a cascading event from dumping 500 notifications on one person in an hour. Quiet hours batch non-urgent notifications and hold them until the quiet window ends, while CRITICAL notifications (fraud alerts, 2FA codes) bypass both quiet hours and rate limits entirely.
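The gating logic might look roughly like this. The limit of 3 per hour, the 22:00 to 07:00 quiet window, and the `decide` function are all assumptions chosen for illustration. A real system would read them from per-user settings and would queue held notifications rather than only labelling them.

```python
from collections import defaultdict, deque
from datetime import time

RATE_LIMIT = 3                 # illustrative: max sends per user and type per window
WINDOW_SECONDS = 3600
QUIET_START, QUIET_END = time(22, 0), time(7, 0)  # illustrative quiet window

_recent = defaultdict(deque)   # (user, type) -> timestamps of recent sends


def in_quiet_hours(local_time):
    """True inside the quiet window, including windows that wrap past midnight."""
    if QUIET_START <= QUIET_END:
        return QUIET_START <= local_time < QUIET_END
    return local_time >= QUIET_START or local_time < QUIET_END


def decide(user, ntype, priority, now_ts, local_time):
    """Return 'send', 'hold' (quiet hours), or 'drop' (rate limited)."""
    if priority == "CRITICAL":
        return "send"          # bypasses both quiet hours and rate limits
    if in_quiet_hours(local_time):
        return "hold"
    window = _recent[(user, ntype)]
    while window and now_ts - window[0] >= WINDOW_SECONDS:
        window.popleft()       # forget sends that fell out of the sliding window
    if len(window) >= RATE_LIMIT:
        return "drop"
    window.append(now_ts)
    return "send"


noon = time(12, 0)
burst = [decide("u-42", "comment", "NORMAL", ts, noon) for ts in (0, 10, 20, 30)]
later = decide("u-42", "comment", "NORMAL", 3600, noon)
night = decide("u-42", "comment", "NORMAL", 4000, time(3, 0))
fraud = decide("u-42", "fraud_alert", "CRITICAL", 4000, time(3, 0))
```

Checking priority first is what lets a fraud alert through at 3 AM even when the user has already hit the limit.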
Where It Fails: The Thundering Herd
The signature failure of a notification system is FM7 (Thundering Herd). A platform-wide event — a popular product launches, a major news event, an outage that itself cascades into alerts — triggers millions of notifications simultaneously. Worker queues overflow; the SMS gateway rate-limits your entire platform. The defence is priority queuing: critical notifications are processed first, and backpressure on non-critical queues stops them starving the critical ones.
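Priority queuing with backpressure can be sketched with a heap. The class name `PriorityDispatcher`, the two priority levels, and the backlog cap of 2 are assumptions for the example; a production system would more likely use separate broker queues per priority.

```python
import heapq
import itertools

CRITICAL, NORMAL = 0, 1        # lower number is served first
MAX_NORMAL_BACKLOG = 2         # illustrative cap on queued non-critical items


class PriorityDispatcher:
    """Serves critical items first and refuses non-critical items when backed up."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker keeps FIFO order within a priority
        self._normal_backlog = 0

    def offer(self, priority, payload):
        """Enqueue a notification; False signals backpressure to the producer."""
        if priority != CRITICAL:
            if self._normal_backlog >= MAX_NORMAL_BACKLOG:
                return False
            self._normal_backlog += 1
        heapq.heappush(self._heap, (priority, next(self._seq), payload))
        return True

    def take(self):
        """Pop the highest-priority, oldest notification."""
        priority, _, payload = heapq.heappop(self._heap)
        if priority != CRITICAL:
            self._normal_backlog -= 1
        return payload


dispatcher = PriorityDispatcher()
accepted = [
    dispatcher.offer(NORMAL, "promo-1"),
    dispatcher.offer(NORMAL, "promo-2"),
    dispatcher.offer(NORMAL, "promo-3"),       # refused: non-critical backlog is full
    dispatcher.offer(CRITICAL, "fraud-alert"),  # always accepted
]
order = [dispatcher.take() for _ in range(3)]
```

The fraud alert arrives last but is taken first, and the refused promo is the producer's problem to retry or discard, not the queue's.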
Two failures travel with it. The first is FM3 (Unbounded Resource Consumption): if APNS is down for hours, every failed delivery is re-enqueued while new deliveries keep arriving, so retry queues grow without limit. A maximum retry count, with exhausted deliveries moved to a dead-letter queue, bounds the growth. The second is FM5 (Latency Amplification): the multi-hop async path means a critical notification can take minutes, which is unacceptable for a 2FA code, so critical notifications need a synchronous fast path that bypasses the async pipeline.
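A bounded retry loop with a dead-letter queue might be sketched as follows. `MAX_ATTEMPTS`, the `attempts` field, and `drain` are hypothetical names, and backoff between attempts is omitted to keep the example short.

```python
from collections import deque

MAX_ATTEMPTS = 3  # illustrative bound on delivery attempts


def drain(retry_queue, dead_letters, send):
    """Work through the retry queue; park deliveries that exhaust their attempts."""
    while retry_queue:
        delivery = retry_queue.popleft()
        delivery["attempts"] += 1
        if send(delivery):
            continue                       # delivered, nothing more to do
        if delivery["attempts"] >= MAX_ATTEMPTS:
            dead_letters.append(delivery)  # stop retrying; keep for inspection
        else:
            retry_queue.append(delivery)   # try again on a later pass


send_calls = []


def provider_down(delivery):
    """Demo sender simulating a push provider outage: every send fails."""
    send_calls.append(delivery["id"])
    return False


retry_queue = deque([{"id": "n-1", "attempts": 0}, {"id": "n-2", "attempts": 0}])
dead_letters = []
drain(retry_queue, dead_letters, provider_down)
```

Even with the provider down for the whole run, the loop terminates: total work is capped at `MAX_ATTEMPTS` sends per delivery, and the failures end up somewhere they can be examined or replayed.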
Real Systems
Firebase Cloud Messaging is the dominant Android push service, offering topic-based fan-out (one send to millions of subscribers) and handling token management and retry. AWS SNS is a managed fan-out service supporting APNS, FCM, email, SMS, and HTTP endpoints from a single API, with built-in retry and dead-lettering. Twilio abstracts global SMS carrier integrations and rate-limits per country, since SMS throughput limits differ by jurisdiction.
The One Sentence
A notification system is one triggering event fanning out to many channel deliveries, and every hard part — async decoupling so a dead push service cannot fail an order, on-demand preference fetch so deliveries are never stale, deduplication so at-least-once never means twice, and priority queuing so a thundering herd cannot bury a fraud alert — exists because "send a notification" was never actually simple.
Concept: A notification system is one triggering event fanning out to many channel-specific deliveries.
Core Idea: An event-driven pipeline — publish to Kafka, consume asynchronously, resolve preferences and devices on-demand, deduplicate by notification ID, enforce priority.
Tradeoff: AT10 — Synchronous vs Asynchronous: async delivery decouples notification failure from the order, at the cost of multi-hop end-to-end latency.
Failure Mode: FM7 — Thundering Herd: a platform-wide event triggers millions of notifications at once, overflowing worker queues.
Signal: When a delivery path can dump 500 notifications on one user, it needs rate limiting, quiet hours, and priority queuing.
Series: Book 4, Ch 15