Exercises

Level 2 — Apply

A service has a pool of 5 servers. A load balancer uses round-robin routing. Server 3 goes down. The health check interval is 30 seconds; the unhealthy threshold is 3 consecutive failures.

How long does it take for server 3 to be removed from the pool? How many requests are sent to the failing server during this window? At 1,000 RPS total, how many requests fail?
The team reduces the health check interval to 5 seconds. What is the new failure window? What is the operational cost of more frequent health checks?
After server 3 is removed, traffic is distributed across 4 servers. One server was already at 80% capacity. At 1,000 RPS with round-robin, how much additional load does the 80% server receive?
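The arithmetic behind these three questions can be sketched as below. This is a worked check under stated assumptions, not the only valid reading: detection is taken as the worst case of interval × threshold (the real window varies with where in the check cycle the failure lands, and with check timeouts), round-robin is assumed to split traffic evenly, and the "80% server" is assumed to be serving its even share of 200 RPS at 80% utilisation. The function names are illustrative only.

```python
def failure_window_s(interval_s: float, unhealthy_threshold: int) -> float:
    """Worst-case seconds before the balancer marks a server unhealthy."""
    return interval_s * unhealthy_threshold


def requests_to_dead_server(total_rps: float, pool_size: int, window_s: float) -> float:
    """Requests that even round-robin sends to one dead server during the window."""
    return total_rps / pool_size * window_s


def per_server_rps(total_rps: float, healthy_servers: int) -> float:
    """Even round-robin share for each healthy server."""
    return total_rps / healthy_servers


window_30 = failure_window_s(30, 3)                    # 30 s interval, 3 failures
window_5 = failure_window_s(5, 3)                      # 5 s interval, 3 failures
lost_30 = requests_to_dead_server(1000, 5, window_30)
lost_5 = requests_to_dead_server(1000, 5, window_5)

# Redistribution after server 3 leaves the pool.
extra_rps = per_server_rps(1000, 4) - per_server_rps(1000, 5)

# Assumption: 200 RPS corresponds to 80% utilisation on the busy server.
capacity_rps = 200 / 0.80
new_utilisation = per_server_rps(1000, 4) / capacity_rps

print(f"30 s checks: {window_30:.0f} s window, {lost_30:.0f} failed requests")
print(f"5 s checks:  {window_5:.0f} s window, {lost_5:.0f} failed requests")
print(f"extra load per server: {extra_rps:.0f} RPS, busy server at {new_utilisation:.0%}")
```

Under these assumptions the shorter interval cuts the failure window by a factor of six, and the server that started at 80% is left with no headroom once the pool shrinks to four.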

Level 3 — Design

A global e-commerce platform receives 500,000 RPS at peak. The platform has data centres in 3 regions (US, EU, Asia). Users must be routed to the nearest region to minimise latency. Within each region, load must be distributed across 50 application servers.

Design the two-tier load balancing architecture. What handles global routing? What handles within-region distribution? Name the algorithm used at each tier.
A deployment requires rolling 10 servers out of service simultaneously in each region. Design the drain procedure. How does the remaining capacity change during the drain, and what is the maximum safe drain size per region without exceeding 90% capacity on remaining servers?
A DDoS attack generates 2,000,000 RPS from 10,000 distinct IP addresses. The platform’s capacity is 500,000 RPS. Design a rate-limiting layer that integrates with the load balancer to protect backend servers. Name every tradeoff using AT notation.
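As a starting point for the rate-limiting question, here is a minimal sketch of a per-IP token bucket, one common way to cap each source address. Everything specific in it is an assumption for illustration: the class name `PerIpTokenBucket`, the limit of 20 requests per second per IP, and above all the in-process dictionary, which stands in for the shared counter store a real multi-server deployment would need. The clock is passed in explicitly so the behaviour is deterministic.

```python
from dataclasses import dataclass


@dataclass
class Bucket:
    tokens: float        # tokens currently available
    last_refill: float   # timestamp of the last refill, in seconds


class PerIpTokenBucket:
    """Hypothetical per-IP limiter; state lives in one process only."""

    def __init__(self, rate_per_s: float, burst: float) -> None:
        self.rate_per_s = rate_per_s
        self.burst = burst
        self._buckets: dict[str, Bucket] = {}

    def allow(self, ip: str, now: float) -> bool:
        bucket = self._buckets.get(ip)
        if bucket is None:
            # A new client starts with a full bucket.
            bucket = Bucket(tokens=self.burst, last_refill=now)
            self._buckets[ip] = bucket
        # Refill in proportion to elapsed time, capped at the burst size.
        elapsed = max(0.0, now - bucket.last_refill)
        bucket.tokens = min(self.burst, bucket.tokens + elapsed * self.rate_per_s)
        bucket.last_refill = now
        if bucket.tokens >= 1.0:
            bucket.tokens -= 1.0
            return True
        return False


limiter = PerIpTokenBucket(rate_per_s=20, burst=20)

# One attacker IP at the scenario's average of 200 requests in a single second.
attacker_allowed = sum(limiter.allow("203.0.113.7", now=0.0) for _ in range(200))
# An ordinary client sending 5 requests in the same second.
client_allowed = sum(limiter.allow("198.51.100.9", now=0.0) for _ in range(5))

print(attacker_allowed, client_allowed)
```

With the assumed limit, 10,000 attacker IPs could pass at most about 200,000 RPS in steady state, which fits inside the 500,000 RPS capacity. The same limit also throttles legitimate users who share an IP address, and the counters still have to be shared between load balancer nodes; both points belong in the tradeoff discussion.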

A complete answer will:
(1) correctly design the two-tier architecture (Anycast DNS or GeoDNS for global routing, weighted round-robin or least-connections within each region) and compute the maximum safe drain size numerically: 50 servers × 90% = 45 servers must stay active, so at most 5 can drain per region at peak load, and the 10-server roll in the question has to be split into batches of at most 5;
(2) identify FM3 (resource exhaustion: a DDoS at 4× capacity overwhelms the backend servers) and FM6 (hotspot: traffic from the 10,000 attacker IPs routes to the same region) as the failure modes the rate-limiting layer must address;
(3) address the AT9 tradeoff between per-IP rate limiting (precise, but requires distributed state) and connection-count limiting at the load balancer (coarse and stateless, but blocks legitimate users behind shared IPs);
(4) specify the state storage for rate limit counters, shared across the 50 regional servers, and the propagation mechanism.
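The drain-size figure depends on how busy the servers already are at peak, which the exercise does not state. The sketch below makes that dependency explicit under one possible reading: the 90% limit is a per-server utilisation ceiling, and load spreads evenly over the servers that remain. The function name and the baseline percentages are assumptions chosen for illustration; percentages are integers so the arithmetic stays exact.

```python
def max_safe_drain(servers: int, baseline_pct: int, ceiling_pct: int = 90) -> int:
    """Largest d with baseline_pct * servers / (servers - d) <= ceiling_pct.

    baseline_pct is the per-server utilisation at peak with every server
    in service (an input the exercise leaves open).
    """
    # Integer ceiling division: the minimum number of servers that must stay active.
    min_active = -(-servers * baseline_pct // ceiling_pct)
    return max(0, servers - min_active)


for baseline in (72, 81, 85, 90):
    print(f"baseline {baseline}% -> drain up to {max_safe_drain(50, baseline)} servers")
```

On this reading, draining 5 servers is safe when the fleet runs at about 81% at peak, and draining 10 at once is only safe at 72% or below. A fleet already at 90% cannot drain any server without exceeding the ceiling.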
