WhatsApp processes 100 billion messages per day across 2 billion users. Each message must be delivered within seconds of being sent, even when the recipient’s device is offline, in a country with intermittent connectivity, on a battery-constrained phone. The system must maintain persistent connections for billions of devices simultaneously, route messages to the correct connection, and guarantee delivery without duplicates.
Real-time chat sits at the intersection of several distributed systems challenges: persistent connection management, message ordering, delivery guarantees, and presence detection. Each problem looks tractable in isolation. The difficulty is that they interact. Solving ordering breaks delivery guarantees. Solving delivery guarantees creates duplicate risks. The production architecture is a careful balance between these tensions.
Users send and receive messages. Messages must be delivered in order within a conversation. Delivery must be guaranteed — no message silently dropped. Presence must be accurate — users should see whether a contact is online. The system must handle billions of concurrent connections globally and deliver messages within 1–2 seconds.
A naive implementation uses HTTP polling: the client sends a GET request every second asking “any new messages?” This generates billions of unnecessary requests per second. Most of those requests return empty responses. The server cannot push messages — the client must always ask. Latency is bounded by the polling interval.
HTTP long-polling (holding the request open until a message arrives) improves latency but still requires re-establishing a connection after each message. At billions of concurrent users, the connection setup overhead is prohibitive.
WebSockets solve the push problem: a persistent bidirectional TCP connection that allows both client and server to send data at any time without re-establishing the connection.
WebSocket Connection Management
Each client establishes one WebSocket connection to a chat server (connection layer). The connection layer is a fleet of servers, each maintaining tens of thousands of persistent WebSocket connections.
A connection registry maps user IDs to the chat server holding their WebSocket connection:
connection_registry (Redis):
user:A → server:chat-1
user:B → server:chat-1
user:C → server:chat-2
user:D → server:chat-3
Message Routing
When user A sends a message to user D:
If D is offline, the message is stored in a pending delivery queue. When D reconnects, their client syncs missed messages.
Message Ordering and Sequence Numbers
Messages within a conversation must be delivered in order. Each conversation has a monotonically increasing sequence number, managed by the message service.
Conversation [A, D]:
msg@seq=1: "hey" (from A)
msg@seq=2: "hello" (from D)
msg@seq=3: "how are you" (from A)
Clients track the last received sequence number. On reconnection, they request messages from (last_seq + 1) to get missed messages.
Generating sequence numbers in a distributed system without a central counter risks duplicates or gaps (T9, Consensus). Solutions: a single-writer per conversation (serialises ordering, limits throughput), a distributed counter (Snowflake ID with timestamp + machine ID + sequence), or conflict-free ordering using Lamport timestamps.
Offline Message Delivery
Pending queues have a maximum depth. If a user is offline for 30 days, their pending queue may expire. The client detects a gap in sequence numbers and falls back to full sync.
Presence Detection
Presence — whether a user is online — requires heartbeats. Each client sends a heartbeat every 30 seconds. The server updates the user’s presence timestamp in a shared store.
Presence store (Redis):
user:A:last_seen → 1701284400 (Unix timestamp)
user:A:status → "online"
Presence rules:
online = last_seen < 30 seconds ago
away = last_seen 30s–5min ago
offline = last_seen > 5 minutes ago
Presence is eventually consistent — a user who closes the app without sending a final offline notification stays “online” until their heartbeat TTL expires. Designing users to accept ~30s of presence inaccuracy simplifies the architecture significantly.
AT10 — Synchronous/Asynchronous: Message delivery is synchronous from the sender’s perspective (the sender expects a delivery receipt). Message fan-out (to other devices of the same user, to group chat members) is async. Mixing synchronous delivery guarantees with async fan-out reduces the sender’s wait time while still guaranteeing fan-out eventually.
AT9 — Correctness/Performance: Exactly-once message delivery is correct but expensive — it requires distributed transactions to atomically store the message and mark it delivered. At-least-once delivery is simpler: retry on failure, deduplicate on the client using message IDs. Most chat systems use at-least-once with client-side deduplication.
AT5 — Centralisation/Distribution: The connection registry in Redis is a centralised lookup. If Redis is unavailable, no messages can be routed between servers. Redis Cluster distributes the registry, but each lookup still goes through a Redis call. A gossip protocol for connection routing (each server knows which users other servers hold) eliminates this dependency but is harder to implement correctly.
FM9 — Silent Data Corruption: Message loss. A message stored by the Message Service but not delivered because the connection registry has stale data (the user reconnected to a different server after the registry entry was written) — this message is silently lost unless the sender’s client retries. At-least-once delivery with client-generated message IDs and idempotent storage prevents silent loss.
FM7 — Thundering Herd: A chat server restarts. All clients that were connected to it simultaneously attempt to reconnect. Thousands to millions of simultaneous TCP handshakes and WebSocket upgrades overwhelm the server before it can accept connections. Exponential backoff with jitter on the client side distributes reconnection attempts over time.
FM12 — Split-Brain: The connection registry says user D is on Chat Server 3, but Chat Server 3 has restarted and D has not yet reconnected. Messages are routed to Chat Server 3, which has no WebSocket for D. Message Service must detect this: if Chat Server 3 cannot deliver to D, re-enqueue for offline delivery.
At 10×: the connection layer scales to hundreds of chat servers. The connection registry must handle millions of lookups per second — sharding Redis by user ID distributes this load. Message storage moves from a single database to a sharded append-only log per conversation.
At 100×: geographic distribution. Users in different regions connect to the nearest regional chat cluster. Cross-region message routing adds 100–200ms latency. Real-time chat across regions requires careful engineering of the connection registry and the cross-region message path.
WhatsApp uses XMPP (an XML-based messaging protocol) over persistent TCP connections. Its server architecture is built in Erlang, chosen for its ability to maintain millions of concurrent lightweight processes. The messaging layer is a simple store-and-forward — no database write per message, just in-memory delivery with async persistence.
Slack uses a proprietary WebSocket-based protocol. Messages are stored in a PostgreSQL database and indexed in Elasticsearch. The connection layer is separate from the storage layer. Slack’s threading model requires more complex message ordering than simple chat.
Discord serves gaming communities with very large group chats (servers with 100K+ members). Their architecture diverges from simple 1:1 chat — a single message in a large server fan-outs to all active members. They use a combination of WebSockets and WebRTC for voice/video. Discord’s engineering blog describes their move from MongoDB to Cassandra for message storage, motivated by write amplification at scale.
Concept: Real-Time Chat System
Thread: T4 (Queues) ← Ch 5 (Message Queue) → Ch 15 (Notifications); T9 (Consensus) ← Ch 3 (Distributed KV Store) → Ch 17 (Payment Processing)
Core Idea: Persistent WebSocket connections eliminate polling; a connection registry routes messages between servers; monotonic sequence numbers per conversation maintain ordering; offline delivery queues guarantee no message is lost when recipients are offline.
Tradeoff: AT9 — Correctness vs Performance: exactly-once delivery requires distributed transactions; at-least-once with client-side deduplication provides equivalent user experience at lower cost.
Failure Mode: FM7 — Thundering Herd: simultaneous reconnection of all clients after a server restart overwhelms the restarted server; exponential backoff with jitter distributes reconnection load.
Signal: When users expect sub-second message delivery with ordering guarantees and the system must maintain billions of persistent connections simultaneously.
Maps to: Book 0, Framework 6 (System Archetypes), A2 (Social & Communication)
A chat server holds 50,000 active WebSocket connections. The server restarts. Clients use exponential backoff starting at 1 second with a factor of 2, capped at 60 seconds, with ±50% jitter. (a) How many clients attempt reconnection in the first second? (b) After 3 backoff intervals, what is the range of reconnect times for a given client? (c) What FM code describes what happens if all 50,000 clients reconnect without backoff?
A conversation between users A and B has sequence numbers up to seq=100. A sends a message that is stored as seq=101. B’s client last received seq=98. When B reconnects, what sequence numbers does B request? If seq=99 and seq=100 were from B (A never received them), does A’s client need to sync too?
A complete answer will: (1) design a server-to-server fan-out architecture where the sending user’s WebSocket server publishes the message to a pub/sub broker (e.g., Redis Pub/Sub or Kafka), and each regional server subscribed to that channel delivers the message to its locally connected members — contrast this with 1:1 chat where fan-out is always exactly one connection and no pub/sub coordination is needed, (2) name AT5 (Centralisation/Distribution) for the fan-out architecture decision: centralised fan-out through one server creates a bottleneck at 10,000 concurrent channel sends; distributed pub/sub scales but introduces inter-server message hops that add latency — state the latency budget impact and whether the 2-second target is achievable, (3) name FM12 (network partition) for the cross-region routing risk: if the pub/sub broker is partitioned from a regional server, members in that region miss messages during the partition window — specify whether the design chooses to drop messages (availability) or block delivery (consistency) and name the AT1 tradeoff, and (4) describe offline member handling: messages are persisted to a message store and delivered via push notification (mobile) or on next WebSocket connect (desktop) — specify the message ordering guarantee (e.g., sequence numbers or timestamps) and how the client reconciles missed messages on reconnect.