The Computing Series

Architecture Walkthrough

WebSocket Connection Management

Each client establishes one WebSocket connection to a chat server (connection layer). The connection layer is a fleet of servers, each maintaining tens of thousands of persistent WebSocket connections.

A connection registry maps user IDs to the chat server holding their WebSocket connection:

connection_registry (Redis):
  user:A → server:chat-1
  user:B → server:chat-1
  user:C → server:chat-2
  user:D → server:chat-3
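
As a rough illustration of the registry above, the sketch below models it with an in-memory dictionary standing in for Redis. The class name `ConnectionRegistry` and its methods are hypothetical, chosen for this example only; a real deployment would issue the equivalent reads and writes against the shared store.

```python
from typing import Dict, Optional


class ConnectionRegistry:
    """In-memory stand-in for the Redis-backed registry (illustrative only)."""

    def __init__(self) -> None:
        self._servers: Dict[str, str] = {}

    def register(self, user_id: str, server_id: str) -> None:
        # Called when a chat server accepts a client's WebSocket.
        self._servers[f"user:{user_id}"] = f"server:{server_id}"

    def unregister(self, user_id: str) -> None:
        # Called when the WebSocket closes; unknown users are ignored.
        self._servers.pop(f"user:{user_id}", None)

    def lookup(self, user_id: str) -> Optional[str]:
        # Returns None when the user has no live connection.
        return self._servers.get(f"user:{user_id}")


registry = ConnectionRegistry()
registry.register("A", "chat-1")
registry.register("D", "chat-3")
print(registry.lookup("D"))  # server:chat-3
```

A lookup that returns `None` is the signal that the recipient is offline.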

Message Routing

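When user A sends a message to user D, A's chat server looks up D in the connection registry and forwards the message to the server holding D's connection (chat-3 in the registry above), which pushes it to D over the WebSocket.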

If D is offline, the message is stored in a pending delivery queue. When D reconnects, their client syncs missed messages.
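
The routing decision can be sketched as follows, assuming the sending server consults the connection registry and either forwards the message or queues it. The dictionaries and the function names `route_message` and `on_reconnect` are hypothetical in-memory stand-ins, not part of any real API.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List, Optional, Tuple

# Hypothetical in-memory stand-ins for the registry and the pending queues.
registry: Dict[str, str] = {"A": "chat-1", "D": "chat-3"}
pending: Dict[str, Deque[str]] = defaultdict(deque)
# Messages handed to each chat server for delivery over its WebSockets.
delivered: Dict[str, List[Tuple[str, str]]] = defaultdict(list)


def route_message(recipient: str, text: str) -> str:
    """Forward to the recipient's chat server, or queue if they are offline."""
    server: Optional[str] = registry.get(recipient)
    if server is None:
        pending[recipient].append(text)
        return "queued"
    delivered[server].append((recipient, text))
    return f"forwarded to {server}"


def on_reconnect(user: str, server: str) -> List[str]:
    """Register the new connection and drain the user's pending queue."""
    registry[user] = server
    return list(pending.pop(user, ()))


print(route_message("D", "hey"))           # forwarded to chat-3
del registry["D"]                          # D disconnects
print(route_message("D", "still there?"))  # queued
print(on_reconnect("D", "chat-2"))         # ['still there?']
```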

Message Ordering and Sequence Numbers

Messages within a conversation must be delivered in order. Each conversation has a monotonically increasing sequence number, managed by the message service.

Conversation [A, D]:
  msg@seq=1: "hey" (from A)
  msg@seq=2: "hello" (from D)
  msg@seq=3: "how are you" (from A)

Clients track the last sequence number they received. On reconnection, they request messages starting at last_seq + 1 to fetch whatever they missed.
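
A minimal sketch of this scheme, assuming a single process owns the conversation's log. `ConversationLog` and `messages_from` are hypothetical names used only for illustration.

```python
from typing import List, Tuple

Message = Tuple[int, str, str]  # (seq, sender, text)


class ConversationLog:
    """Single-process stand-in for the message service's per-conversation log."""

    def __init__(self) -> None:
        self._messages: List[Message] = []

    def append(self, sender: str, text: str) -> int:
        # Sequence numbers start at 1 and increase by exactly 1 per message.
        seq = len(self._messages) + 1
        self._messages.append((seq, sender, text))
        return seq

    def messages_from(self, start_seq: int) -> List[Message]:
        # seq N is stored at index N - 1, so this returns every seq >= start_seq.
        return self._messages[max(start_seq, 1) - 1:]


log = ConversationLog()
log.append("A", "hey")
log.append("D", "hello")
log.append("A", "how are you")

last_seq = 1  # the client saw only "hey" before disconnecting
print(log.messages_from(last_seq + 1))
# [(2, 'D', 'hello'), (3, 'A', 'how are you')]
```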

Generating sequence numbers in a distributed system without a central counter risks duplicates or gaps (T9, Consensus). Solutions: a single writer per conversation (serialises ordering but limits throughput), a distributed ID generator (Snowflake ID with timestamp + machine ID + sequence, which is unique and roughly time-ordered but not gap-free), or conflict-free ordering using Lamport timestamps.
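
To make the Snowflake option concrete, here is a hedged sketch of a Snowflake-style generator. The bit layout (41-bit millisecond timestamp, 10-bit machine ID, 12-bit sequence) is the commonly cited one and is an assumption here, as are the epoch value, the class name `SnowflakeGenerator`, and the injectable `clock_ms` parameter.

```python
import threading
import time
from typing import Callable


def _wall_clock_ms() -> int:
    return int(time.time() * 1000)


class SnowflakeGenerator:
    """Snowflake-style IDs: ms timestamp | 10-bit machine ID | 12-bit sequence."""

    EPOCH_MS = 1_288_834_974_657  # arbitrary fixed epoch (assumption)
    MACHINE_BITS = 10
    SEQUENCE_BITS = 12
    MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1

    def __init__(self, machine_id: int,
                 clock_ms: Callable[[], int] = _wall_clock_ms) -> None:
        if not 0 <= machine_id < (1 << self.MACHINE_BITS):
            raise ValueError("machine_id out of range")
        self._machine_id = machine_id
        self._clock_ms = clock_ms
        self._last_ms = -1
        self._sequence = 0
        self._lock = threading.Lock()

    def next_id(self) -> int:
        with self._lock:
            now = self._clock_ms()
            if now < self._last_ms:
                raise RuntimeError("clock moved backwards")
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & self.MAX_SEQUENCE
                if self._sequence == 0:
                    # Sequence exhausted for this millisecond: wait for the next.
                    while now <= self._last_ms:
                        now = self._clock_ms()
            else:
                self._sequence = 0
            self._last_ms = now
            return (
                ((now - self.EPOCH_MS) << (self.MACHINE_BITS + self.SEQUENCE_BITS))
                | (self._machine_id << self.SEQUENCE_BITS)
                | self._sequence
            )


gen = SnowflakeGenerator(machine_id=5)
first, second = gen.next_id(), gen.next_id()
print(first < second)  # True
```

IDs produced this way are increasing per machine but not consecutive, so a client cannot detect missed messages by checking for last_seq + 1; that check relies on a dense per-conversation counter.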

Offline Message Delivery

Pending queues have a maximum depth, and entries do not live forever. If a user is offline for 30 days, their pending queue may expire. On reconnection, the client detects a gap in sequence numbers and falls back to a full sync.
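
The fallback decision can be sketched as below. The function `choose_sync`, its parameters, and the tiny `MAX_DEPTH` are hypothetical; the bounded queue uses `collections.deque` with `maxlen`, which silently drops the oldest entry once full.

```python
from collections import deque
from typing import Deque, Optional


def choose_sync(last_seq: int, oldest_pending_seq: Optional[int],
                latest_seq: int) -> str:
    """Pick a catch-up strategy for a reconnecting client."""
    if last_seq >= latest_seq:
        return "up_to_date"
    # Incremental sync only works if the queue still holds last_seq + 1.
    if oldest_pending_seq is not None and oldest_pending_seq <= last_seq + 1:
        return "incremental"
    # The queue expired or overflowed, leaving a gap the client cannot fill.
    return "full_sync"


MAX_DEPTH = 3  # deliberately tiny so the example overflows
pending: Deque[int] = deque(maxlen=MAX_DEPTH)
for seq in range(1, 6):
    pending.append(seq)  # seq 1 and 2 are dropped once the queue is full

print(list(pending))                                    # [3, 4, 5]
print(choose_sync(2, pending[0], latest_seq=5))         # incremental
print(choose_sync(0, pending[0], latest_seq=5))         # full_sync
print(choose_sync(0, None, latest_seq=5))               # full_sync
```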

Presence Detection

Presence — whether a user is online — requires heartbeats. Each client sends a heartbeat every 30 seconds. The server updates the user’s presence timestamp in a shared store.

Presence store (Redis):
  user:A:last_seen → 1701284400  (Unix timestamp)
  user:A:status → "online"

Presence rules:
  online = last_seen < 30 seconds ago
  away   = last_seen 30s–5min ago
  offline = last_seen > 5 minutes ago
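
The rules above translate directly into a small classification function. This is a sketch only: `presence_status` and the constant names are hypothetical, and timestamps are assumed to be Unix seconds as in the presence store.

```python
ONLINE_WINDOW_S = 30       # last heartbeat less than 30 seconds ago
AWAY_WINDOW_S = 5 * 60     # last heartbeat at most 5 minutes ago


def presence_status(last_seen: int, now: int) -> str:
    """Classify a user from the age of their last heartbeat, in seconds."""
    age = now - last_seen
    if age < ONLINE_WINDOW_S:
        return "online"
    if age <= AWAY_WINDOW_S:
        return "away"
    return "offline"


last_seen = 1701284400
print(presence_status(last_seen, last_seen + 10))   # online
print(presence_status(last_seen, last_seen + 120))  # away
print(presence_status(last_seen, last_seen + 600))  # offline
```

Because the heartbeat interval and the online window are both 30 seconds, a healthy client can read as away for a moment just before its next heartbeat lands; a window slightly larger than the interval would avoid that, but the constants here follow the rules as stated.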

Presence is eventually consistent: a user who closes the app without sending a final offline notification stays “online” until their heartbeat TTL expires. Designing the product so that users accept ~30s of presence inaccuracy simplifies the architecture significantly.
