The Computing Series

Failure Modes in This System

FM9 — Silent Data Corruption: Message loss. A message is stored by the Message Service but never delivered because the connection registry has stale data (the user reconnected to a different server after the registry entry was written). The message is silently lost unless the sender’s client retries. At-least-once delivery, combined with client-generated message IDs and idempotent storage, prevents silent loss: the client retries until it receives an acknowledgement, and the store discards duplicates.
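The retry-plus-deduplication idea can be sketched as follows. This is a minimal illustration, not the system's actual code: `MessageStore`, `send_with_retry`, and the `lost_acks` parameter are hypothetical names, and an in-memory dict stands in for durable storage.

```python
import uuid


class MessageStore:
    """In-memory stand-in for idempotent message storage (hypothetical)."""

    def __init__(self):
        self._messages = {}  # message_id -> (sender, recipient, body)

    def store(self, message_id, sender, recipient, body):
        """Store a message once. Return True if new, False if a duplicate."""
        if message_id in self._messages:
            return False  # a retry of an already-stored message is a no-op
        self._messages[message_id] = (sender, recipient, body)
        return True

    def count(self):
        return len(self._messages)


def send_with_retry(store, sender, recipient, body, lost_acks=2):
    """Client side: generate the ID once and reuse it on every retry.

    `lost_acks` simulates how many acknowledgements are dropped before
    one finally reaches the client (an assumption for this sketch).
    """
    message_id = str(uuid.uuid4())
    attempts = 0
    while True:
        store.store(message_id, sender, recipient, body)
        attempts += 1
        if attempts > lost_acks:  # the acknowledgement arrives
            return message_id, attempts
```

Because the ID is created before the first send, every retry carries the same key, so three attempts still produce a single stored message.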

FM7 — Thundering Herd: A chat server restarts. All clients that were connected to it simultaneously attempt to reconnect. Thousands to millions of simultaneous TCP handshakes and WebSocket upgrades overwhelm the server before it can accept connections. Exponential backoff with jitter on the client side distributes reconnection attempts over time.
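A client-side reconnection delay with exponential backoff and jitter might look like the sketch below. It uses the "full jitter" variant, where the delay is drawn uniformly between zero and an exponentially growing ceiling. The function name `reconnect_delay` and the `base` and `cap` defaults are assumptions chosen for illustration.

```python
import random


def reconnect_delay(attempt, base=1.0, cap=60.0, rng=random):
    """Return a delay in seconds for the given reconnection attempt.

    The ceiling doubles with each attempt (base * 2**attempt) up to `cap`;
    the actual delay is uniform in [0, ceiling] so that clients that
    disconnected at the same moment do not retry at the same moment.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


if __name__ == "__main__":
    for attempt in range(6):
        print(f"attempt {attempt}: wait {reconnect_delay(attempt):.2f}s")
```

The cap keeps a long outage from pushing delays to unreasonable lengths, while the randomization is what actually spreads the reconnection load.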

FM12 — Split-Brain: The connection registry says user D is on Chat Server 3, but Chat Server 3 has restarted and D has not yet reconnected. Messages are routed to Chat Server 3, which has no WebSocket for D. The Message Service must detect this: if Chat Server 3 reports that it cannot deliver to D, the Message Service re-enqueues the message for offline delivery.
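The detect-and-fall-back behaviour can be sketched as below. All names here (`ChatServer`, `route`, the dict-based registry, and the per-user offline queue) are hypothetical stand-ins, and a plain list plays the role of a WebSocket.

```python
from collections import defaultdict, deque


class ChatServer:
    """Toy chat server: a list per user stands in for an open WebSocket."""

    def __init__(self):
        self.sockets = {}  # user_id -> list of delivered messages

    def deliver(self, user_id, message):
        """Return True if the user has a live socket, False otherwise."""
        inbox = self.sockets.get(user_id)
        if inbox is None:
            return False
        inbox.append(message)
        return True


def route(message, user_id, registry, servers, offline_queue):
    """Try the server named in the registry; fall back to the offline queue."""
    server_id = registry.get(user_id)
    server = servers.get(server_id)
    if server is not None and server.deliver(user_id, message):
        return "delivered"
    # The registry entry is stale (or missing). Drop it only if it still
    # points at the server that failed, then queue for offline delivery.
    if registry.get(user_id) == server_id:
        registry.pop(user_id, None)
    offline_queue[user_id].append(message)
    return "queued"
```

For example, with `registry = {"D": 3}` and a freshly restarted server 3 that has no socket for D, `route(...)` returns `"queued"` and clears the stale entry. In a real deployment the entry removal would need to be an atomic compare-and-delete so that it cannot erase a newer registration written by a concurrent reconnect.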
