WebSocket Connection Management
Each client establishes one WebSocket connection to a chat server (connection layer). The connection layer is a fleet of servers, each maintaining tens of thousands of persistent WebSocket connections.
A connection registry maps user IDs to the chat server holding their WebSocket connection:
connection_registry (Redis):
user:A → server:chat-1
user:B → server:chat-1
user:C → server:chat-2
user:D → server:chat-3
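A minimal sketch of the registry's behaviour, using an in-memory dictionary as a stand-in for Redis. The class name ConnectionRegistry and its methods are hypothetical, chosen only for illustration; a real deployment would issue the equivalent reads and writes against the shared store.

```python
from typing import Dict, Optional


class ConnectionRegistry:
    """In-memory stand-in for the Redis-backed registry (illustrative only)."""

    def __init__(self) -> None:
        self._user_to_server: Dict[str, str] = {}

    def register(self, user_id: str, server_id: str) -> None:
        # Called when a chat server accepts a user's WebSocket connection.
        self._user_to_server[user_id] = server_id

    def unregister(self, user_id: str, server_id: str) -> None:
        # Remove the entry only if it still points at this server, so a stale
        # disconnect does not erase a newer connection made elsewhere.
        if self._user_to_server.get(user_id) == server_id:
            del self._user_to_server[user_id]

    def lookup(self, user_id: str) -> Optional[str]:
        # None means the user has no live connection (treated as offline).
        return self._user_to_server.get(user_id)


registry = ConnectionRegistry()
registry.register("user:A", "server:chat-1")
registry.register("user:D", "server:chat-3")
```

The conditional delete in unregister is a design choice of this sketch: it guards against a late disconnect event from an old server overwriting the entry of a user who has already reconnected to a different one.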
Message Routing
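When user A sends a message to user D, a lookup in the connection registry identifies the server holding D's connection (chat-3 in the example above), and the message is forwarded there for delivery over D's WebSocket.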
If D is offline, the message is stored in a pending delivery queue. When D reconnects, their client syncs missed messages.
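The routing decision can be sketched as a registry lookup followed by either a forward or an enqueue. Everything here is an assumption made for illustration: the dictionaries stand in for the registry, the inter-server delivery path, and the pending delivery queue, and the function names route_message and on_reconnect are hypothetical.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List, Tuple

# Hypothetical in-memory stand-ins for the shared components.
registry: Dict[str, str] = {"user:A": "server:chat-1", "user:D": "server:chat-3"}
delivered: Dict[str, List[Tuple[str, str]]] = defaultdict(list)  # server -> [(user, text)]
pending: Dict[str, Deque[str]] = defaultdict(deque)              # user -> queued texts


def route_message(recipient: str, text: str) -> str:
    """Forward to the recipient's chat server, or queue if they are offline."""
    server = registry.get(recipient)
    if server is None:
        # No registry entry: the recipient has no live connection.
        pending[recipient].append(text)
        return "queued"
    # In a real system this would be a network call to the target server.
    delivered[server].append((recipient, text))
    return "delivered"


def on_reconnect(user: str, server: str) -> List[str]:
    """Register the new connection and drain the user's pending queue."""
    registry[user] = server
    return list(pending.pop(user, []))
```

For example, route_message("user:D", "hey") forwards to server:chat-3, while a message to a user with no registry entry is held until on_reconnect drains the queue.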
Message Ordering and Sequence Numbers
Messages within a conversation must be delivered in order. Each conversation has a monotonically increasing sequence number, managed by the message service.
Conversation [A, D]:
msg@seq=1: "hey" (from A)
msg@seq=2: "hello" (from D)
msg@seq=3: "how are you" (from A)
Clients track the last received sequence number. On reconnection, they request messages from (last_seq + 1) to get missed messages.
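The sequence-number scheme and the reconnection request can be sketched as follows. This is an illustrative single-process model, not the message service itself; the Conversation class and its append and since methods are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Message = Tuple[int, str, str]  # (seq, sender, text)


@dataclass
class Conversation:
    messages: List[Message] = field(default_factory=list)

    def append(self, sender: str, text: str) -> int:
        # Sequence numbers start at 1 and increase by exactly 1 per message.
        seq = len(self.messages) + 1
        self.messages.append((seq, sender, text))
        return seq

    def since(self, last_seq: int) -> List[Message]:
        # Everything the client has not yet seen: seq >= last_seq + 1.
        return [m for m in self.messages if m[0] > last_seq]


conv = Conversation()
conv.append("A", "hey")
conv.append("D", "hello")
conv.append("A", "how are you")
```

A client that last saw seq=1 would call conv.since(1) on reconnection and receive the messages with seq=2 and seq=3, in order.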
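Generating sequence numbers in a distributed system without a central counter risks duplicates or gaps (T9, Consensus). There are three common options. A single writer per conversation serialises ordering but limits throughput. A distributed ID generator such as a Snowflake ID (timestamp + machine ID + sequence) yields unique, roughly time-ordered IDs, but they are not contiguous, so clients cannot detect gaps by checking for last_seq + 1. Lamport timestamps give an order consistent with causality without coordination, but need a tie-breaker (such as a node ID) for concurrent messages.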
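As one illustration of the distributed-counter option, here is a sketch of a Snowflake-style generator. The bit layout (41-bit millisecond timestamp, 10-bit machine ID, 12-bit sequence) follows the commonly cited Snowflake format; the class name, the injectable clock parameter, and the epoch default are assumptions of this sketch.

```python
import time
from typing import Callable


class SnowflakeGenerator:
    """Packs (timestamp_ms, machine_id, sequence) into one 64-bit integer."""

    def __init__(
        self,
        machine_id: int,
        clock: Callable[[], int] = lambda: int(time.time() * 1000),
        epoch_ms: int = 0,
    ) -> None:
        if not 0 <= machine_id < 1024:  # must fit in 10 bits
            raise ValueError("machine_id must fit in 10 bits")
        self._machine_id = machine_id
        self._clock = clock
        self._epoch_ms = epoch_ms
        self._last_ms = -1
        self._seq = 0

    def next_id(self) -> int:
        now = self._clock()
        if now < self._last_ms:
            raise RuntimeError("clock moved backwards")
        if now == self._last_ms:
            # Same millisecond: bump the 12-bit per-millisecond sequence.
            self._seq = (self._seq + 1) & 0xFFF
            if self._seq == 0:
                # Sequence exhausted: spin until the clock advances.
                while now <= self._last_ms:
                    now = self._clock()
        else:
            self._seq = 0
        self._last_ms = now
        return ((now - self._epoch_ms) << 22) | (self._machine_id << 12) | self._seq
```

IDs produced this way are unique across machines and increase over time on each machine, but consecutive IDs are not consecutive integers, so this option trades away the simple gap check that a per-conversation counter allows.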
Offline Message Delivery
Pending queues have a maximum depth and a retention period. If a user is offline for longer than that period (for example, 30 days), their pending queue may expire. On reconnection, the client detects a gap in sequence numbers and falls back to a full sync.
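Client-side gap detection might look like the sketch below. The function names has_gap and sync_strategy, and the strategy labels they return, are hypothetical; the logic assumes contiguous per-conversation sequence numbers as described earlier.

```python
from typing import Iterable


def has_gap(last_seq: int, received_seqs: Iterable[int]) -> bool:
    """True if the received batch does not continue directly from last_seq."""
    expected = last_seq + 1
    for seq in sorted(set(received_seqs)):
        if seq < expected:
            continue  # already seen; duplicates are harmless
        if seq > expected:
            return True  # at least one message is missing
        expected += 1
    return False


def sync_strategy(last_seq: int, received_seqs: Iterable[int]) -> str:
    # Fall back to a full sync when part of the history is unavailable.
    return "full_sync" if has_gap(last_seq, received_seqs) else "incremental"
```

For instance, a client at last_seq=3 that receives sequence numbers 6 and 7 from an expired queue would conclude that 4 and 5 are missing and request a full sync.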
Presence Detection
Presence — whether a user is online — requires heartbeats. Each client sends a heartbeat every 30 seconds. The server updates the user’s presence timestamp in a shared store.
Presence store (Redis):
user:A:last_seen → 1701284400 (Unix timestamp)
user:A:status → "online"
Presence rules:
online = last_seen < 30 seconds ago
away = last_seen 30s–5min ago
offline = last_seen > 5 minutes ago
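The presence rules above translate into a small classification function. This sketch uses a dictionary in place of the Redis presence store; the names record_heartbeat and presence_status are illustrative, and treating a user with no recorded heartbeat as offline is an assumption.

```python
from typing import Dict

ONLINE_WINDOW_S = 30       # last_seen under 30 seconds ago
AWAY_WINDOW_S = 5 * 60     # last_seen up to 5 minutes ago

# Stand-in for the shared presence store: user ID -> Unix timestamp.
last_seen: Dict[str, int] = {}


def record_heartbeat(user_id: str, now: int) -> None:
    # Called each time a client heartbeat arrives.
    last_seen[user_id] = now


def presence_status(user_id: str, now: int) -> str:
    seen = last_seen.get(user_id)
    if seen is None:
        return "offline"  # assumption: never-seen users count as offline
    age = now - seen
    if age < ONLINE_WINDOW_S:
        return "online"
    if age <= AWAY_WINDOW_S:
        return "away"
    return "offline"
```

Deriving the status from last_seen at read time, rather than storing it, means no background job is needed to flip users from online to away or offline.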
Presence is eventually consistent: a user who closes the app without sending a final offline notification stays “online” until their heartbeat TTL expires. Designing the product so that users accept ~30s of presence inaccuracy simplifies the architecture significantly.