Load Balancing

Introduction

In 2016, a single misconfigured DNS record took down a major cloud provider’s load balancer for four hours. During that time, every application behind it was unreachable. The load balancer had been the indirection layer between the internet and thousands of servers — the one component whose failure made every other component irrelevant.

Load balancing solves a straightforward problem: you have more requests than one server can handle, and you have multiple servers. Distribute the requests. The solution is straightforward; the failure modes are not. A load balancer that distributes requests evenly to servers that are already failing makes things worse, not better. A load balancer that is itself a single point of failure defeats the purpose of having multiple servers.

Thread Activation

This chapter activates T11 (Feedback) at the infrastructure level. In Book 1, Chapter 35, feedback loops controlled iterative algorithms. In Book 2, Chapter 19, measurement and adjustment were the basis of benchmarking. A load balancer with health checks is a feedback loop: measure server health → adjust routing → measure again. The chapter also continues T8 (Divide and Conquer): distributing load across N servers is the infrastructure form of dividing a problem into N independent subproblems. In Book 4, the load balancer appears as a standard component in every distributed system design.

The Concept

A load balancer sits between clients and a pool of servers. It receives every incoming request and forwards it to one server in the pool. The forwarding decision is the load balancing algorithm.

Layer 4 load balancing (transport layer): decisions are made based on IP addresses and TCP/UDP ports. The load balancer does not inspect the request content — it routes packets. Fast, low overhead, but cannot make content-aware decisions.

Layer 7 load balancing (application layer): decisions are made based on HTTP headers, URL paths, cookies, or request content. Slower than Layer 4 but enables routing based on request type: /api requests go to one pool, /static requests go to another.
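
The path-based routing described above can be sketched in a few lines of Python. This is a minimal illustration, not a production router: the pool addresses, `POOLS`, `DEFAULT_POOL`, and `choose_pool` are all hypothetical names, and a real Layer 7 balancer would typically also match on hosts, headers, or cookies.

```python
# Hypothetical pools keyed by URL path prefix; addresses are illustrative only.
POOLS = {
    "/api": ["10.0.1.1", "10.0.1.2"],
    "/static": ["10.0.2.1"],
}
DEFAULT_POOL = ["10.0.3.1"]


def choose_pool(path: str) -> list[str]:
    """Return the pool whose prefix matches the request path (longest prefix wins)."""
    # Match whole path segments so that "/apix" does not match the "/api" prefix.
    matches = [p for p in POOLS if path == p or path.startswith(p + "/")]
    if not matches:
        return DEFAULT_POOL
    return POOLS[max(matches, key=len)]
```

Once a pool is chosen, any of the routing algorithms in this chapter can pick the server within it.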

The choice between Layer 4 and Layer 7 is AT6 — Generality vs. Specialisation. Layer 4 is general and fast. Layer 7 is specialised and flexible.

How It Works

Routing Algorithms

// Algorithm 1: Round-robin
// Each request goes to the next server in sequence.
// Simple. Assumes all servers are equally capable and equally loaded.
function round_robin(servers, request):
    index = request_counter % length(servers)
    request_counter += 1
    return servers[index]

// Algorithm 2: Least connections
// Each request goes to the server with fewest active connections.
// Better for heterogeneous workloads where requests have varying duration.
function least_connections(servers, request):
    return server with minimum active_connections in servers
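
As a runnable counterpart to the first two algorithms, here is a small Python sketch. The `Server` dataclass and the `RoundRobin` class are assumptions made for illustration; in particular, the counter is held per balancer rather than as a global, and concurrency control is left out.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Server:
    name: str
    active_connections: int = 0


class RoundRobin:
    """Cycle through the pool in order, one server per request."""

    def __init__(self, servers):
        self.servers = servers
        self._counter = count()  # per-balancer counter instead of a global

    def pick(self):
        return self.servers[next(self._counter) % len(self.servers)]


def least_connections(servers):
    """Pick the server with the fewest active connections (first one wins ties)."""
    return min(servers, key=lambda s: s.active_connections)
```

Round-robin needs no knowledge of server state, while least-connections only works if the balancer keeps `active_connections` up to date as connections open and close.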

// Algorithm 3: IP hash (sticky routing)
// Each request from the same client IP goes to the same server.
// Required when sessions are stored on the server (stateful services).
// Breaks down if the client IP changes (mobile, NAT), a server fails,
// or the pool is resized (the modulo then remaps most clients).
function ip_hash(servers, request):
    index = hash(request.client_ip) % length(servers)
    return servers[index]

// Algorithm 4: Consistent hashing (from Book 2, Ch 15)
// A server membership change reroutes only about K/N of the K keys
// (N = number of servers), not all keys.
// Used for distributed caches: same key always maps to same server,
// minimising cache invalidation when the pool changes.
function consistent_hash(ring, request):
    position = hash(request.key)
    return ring.get_node(position)  // next node clockwise on ring
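
The difference between Algorithms 3 and 4 shows up when the pool changes. The Python sketch below builds a small hash ring and measures how many keys move when one of five servers leaves, compared with plain modulo hashing. Everything here is an assumption for illustration: the `HashRing` class, the use of SHA-256, 100 virtual nodes per server, and the synthetic key names.

```python
import bisect
import hashlib


def stable_hash(value: str) -> int:
    """Deterministic 64-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes, vnodes=100):
        # Each node is placed at `vnodes` points on the ring to even out load.
        self._ring = sorted(
            (stable_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def get_node(self, key: str) -> str:
        # First ring point clockwise from the key's position, wrapping at the end.
        idx = bisect.bisect_right(self._points, stable_hash(key)) % len(self._ring)
        return self._ring[idx][1]


def moved_fraction(assign_before, assign_after, keys):
    """Fraction of keys whose server changes between two assignment functions."""
    return sum(assign_before(k) != assign_after(k) for k in keys) / len(keys)


keys = [f"key-{i}" for i in range(10_000)]
five = ["s1", "s2", "s3", "s4", "s5"]
four = five[:-1]  # s5 leaves the pool

ring_moved = moved_fraction(HashRing(five).get_node, HashRing(four).get_node, keys)
mod_moved = moved_fraction(
    lambda k: five[stable_hash(k) % 5], lambda k: four[stable_hash(k) % 4], keys
)
```

With these inputs, `ring_moved` should come out near one fifth (only keys that lived on the departing server are reassigned), while `mod_moved` should be near four fifths, which is why modulo hashing is a poor fit for caches whose pool size changes.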

Health Checks: The Feedback Loop

A load balancer without health checks is dangerous — it will continue routing to a failing server, sending requests into a black hole.

// Health check loop — runs continuously for each server
function health_check_loop(server, interval_seconds):
    while true:
        sleep(interval_seconds)
        if server.state == UNHEALTHY:
            continue  // recovery_check_loop below decides when to re-add
        result = probe(server, timeout=2s)
        if result == HEALTHY:
            server.consecutive_failures = 0
        else:
            server.consecutive_failures += 1
            if server.consecutive_failures >= 3:
                // Remove and alert once, on the HEALTHY -> UNHEALTHY transition
                server.state = UNHEALTHY
                server.consecutive_successes = 0
                remove_from_pool(server)
                alert("Server " + server.id + " marked unhealthy")

// Probe types:
// TCP probe: can we establish a TCP connection? (Layer 4 check)
// HTTP probe: does GET /health return 200? (Layer 7 check)
// Custom probe: does the application report it is ready to serve?

// Hysteresis: re-adding a server requires N consecutive successes,
// not just one success, to prevent flapping.
function recovery_check_loop(server, interval_seconds):
    while server.state == UNHEALTHY:
        sleep(interval_seconds)
        result = probe(server, timeout=2s)
        if result == HEALTHY:
            server.consecutive_successes += 1
            if server.consecutive_successes >= 3:
                server.state = HEALTHY
                server.consecutive_failures = 0
                add_to_pool(server)
        else:
            server.consecutive_successes = 0  // a failure restarts the count
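
The removal and recovery logic above amounts to a two-state machine driven by probe results. The Python sketch below models only that state machine, with probing, timing, and pool updates left out; `HealthTracker` and its field names are hypothetical, and the thresholds of 3 mirror the pseudocode.

```python
from dataclasses import dataclass

HEALTHY, UNHEALTHY = "HEALTHY", "UNHEALTHY"


@dataclass
class HealthTracker:
    """Tracks one server's state from a stream of probe results."""

    fail_threshold: int = 3     # consecutive failures before removal
    recover_threshold: int = 3  # consecutive successes before re-adding
    state: str = HEALTHY
    failures: int = 0
    successes: int = 0

    def record(self, probe_ok: bool) -> str:
        """Feed one probe result; return the (possibly updated) state."""
        if probe_ok:
            self.failures = 0
            self.successes += 1
            if self.state == UNHEALTHY and self.successes >= self.recover_threshold:
                self.state = HEALTHY
        else:
            self.successes = 0
            self.failures += 1
            if self.state == HEALTHY and self.failures >= self.fail_threshold:
                self.state = UNHEALTHY
        return self.state
```

Because each counter is reset by the opposite result, a single success cannot re-add a flapping server and a single failure cannot remove a healthy one.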

Avoiding the Load Balancer as a SPOF

The load balancer must not itself be a single point of failure. Three patterns:

// Pattern 1: Active-passive pair
// Primary handles all traffic. Secondary monitors primary.
// If primary fails, secondary takes over using a shared virtual IP (VIP).

function primary_health_monitor(primary, secondary, virtual_ip):
    // Stop monitoring once the secondary is active, so failover fires only once
    while secondary.state != ACTIVE:
        if primary is UNHEALTHY:
            reassign_virtual_ip(virtual_ip, to=secondary)
            secondary.state = ACTIVE
            alert("Failover: secondary now active")
        sleep(1s)

// Pattern 2: DNS-level load balancing
// Multiple A records for the same domain.
// Clients connect to different IPs.
// Simpler but DNS TTL means failover is slow (seconds to minutes).
// Used for geographic distribution, not fast failover.

// Pattern 3: Anycast routing
// Multiple servers announce the same IP prefix via BGP.
// Routers direct packets to the topologically nearest server (by BGP path
// selection), which is usually, but not always, the geographically nearest.
// Used by CDNs and DNS providers for global scale.
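
For Pattern 1, the failover decision itself can be sketched in Python. This is a simplified model under stated assumptions: the primary sends heartbeats, the secondary fails over after a silence longer than `heartbeat_timeout`, and there is no automatic failback. `FailoverMonitor` and its methods are hypothetical names, and the actual virtual IP reassignment is represented only by a comment.

```python
import time


class FailoverMonitor:
    """Decides when the secondary should claim the virtual IP."""

    def __init__(self, heartbeat_timeout: float, clock=time.monotonic):
        self.heartbeat_timeout = heartbeat_timeout
        self.clock = clock  # injectable so the logic can be tested without waiting
        self.last_heartbeat = clock()
        self.active = "primary"

    def heartbeat(self) -> None:
        """Called whenever the primary reports in."""
        self.last_heartbeat = self.clock()

    def tick(self) -> str:
        """Called periodically by the secondary; fails over at most once."""
        silent_for = self.clock() - self.last_heartbeat
        if self.active == "primary" and silent_for > self.heartbeat_timeout:
            self.active = "secondary"  # here the VIP would be reassigned
        return self.active
```

Leaving failback manual is a deliberate choice in this sketch: a primary that recovers briefly and fails again would otherwise bounce the virtual IP back and forth.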

Connection Draining

When a server is removed from the pool, existing connections should not be abruptly terminated:

function graceful_remove(server, drain_timeout_seconds):
    server.state = DRAINING
    // No new connections routed to this server
    // Existing connections are allowed to complete
    deadline = now() + drain_timeout_seconds
    while server.active_connections > 0 and now() < deadline:
        sleep(1s)
    // After deadline: forcibly close remaining connections
    close_remaining_connections(server)  // hypothetical helper; no-op if already drained
    server.state = REMOVED
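
A runnable Python version of the drain procedure is sketched below. The `Backend` class, its `close_all` method, and the injectable `clock` and `sleep` parameters are assumptions added so the logic can be exercised without real connections or real waiting.

```python
import time
from dataclasses import dataclass


@dataclass
class Backend:
    active_connections: int
    state: str = "ACTIVE"

    def close_all(self) -> int:
        """Forcibly close whatever is still open; return how many were cut."""
        cut, self.active_connections = self.active_connections, 0
        return cut


def graceful_remove(server, drain_timeout, poll=1.0,
                    clock=time.monotonic, sleep=time.sleep) -> int:
    """Drain a server; return the number of connections forcibly closed."""
    server.state = "DRAINING"  # the router must skip DRAINING servers
    deadline = clock() + drain_timeout
    while server.active_connections > 0 and clock() < deadline:
        sleep(poll)
    cut = server.close_all()  # cuts nothing when the drain finished in time
    server.state = "REMOVED"
    return cut
```

Returning the number of forcibly closed connections gives operators a direct signal for whether the drain timeout is long enough for the workload.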

Tradeoffs

AT6 — Generality vs. Specialisation

Layer 4 load balancing is general: it works for any TCP/UDP protocol and has minimal overhead. Layer 7 is specialised: it understands HTTP and can make content-aware decisions — route /api/v2 to new servers, route /api/v1 to old servers during a migration. The cost of specialisation is overhead: each Layer 7 connection requires terminating TLS, parsing HTTP headers, and making a routing decision before forwarding.

AT5 — Centralisation vs. Distribution

A centralised load balancer is simple to configure and monitor. It is also a bottleneck: all traffic must pass through it. At very high throughput (millions of RPS), a single load balancer becomes the bottleneck. The solution is to distribute: multiple load balancers behind DNS round-robin or anycast. Distributed load balancing eliminates the bottleneck but makes configuration, health state, and session affinity harder to manage consistently.

Where It Fails

FM1 — Single Point of Failure

A load balancer that is not itself redundant becomes the single point of failure it was meant to prevent. The failure mode: the load balancer crashes or becomes unresponsive; all traffic to all servers behind it stops. The symptom is a total outage — all services appear down simultaneously — with no obvious cause in application logs because the requests never reach the servers.

FM6 — Hotspotting

IP-hash routing creates hotspots when client traffic is not uniformly distributed. If 30% of traffic comes from a corporate proxy with one IP, 30% of all requests go to one server. That server is overloaded; others are underloaded. The load balancer is routing evenly by IP hash, but the effective load distribution is highly skewed.
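
The skew is easy to reproduce. The Python sketch below routes a synthetic traffic mix by IP hash and reports each server's share of requests. The traffic numbers, the addresses, and the helper names `server_for` and `load_shares` are all hypothetical; SHA-256 stands in for whatever hash a real balancer uses.

```python
import hashlib
from collections import Counter


def server_for(ip: str, n_servers: int) -> int:
    """Map a client IP to a server index with a deterministic hash."""
    digest = hashlib.sha256(ip.encode()).digest()
    return int.from_bytes(digest[:8], "big") % n_servers


def load_shares(requests_by_ip: dict[str, int], n_servers: int) -> list[float]:
    """Fraction of all requests that each server receives under IP-hash routing."""
    per_server = Counter()
    for ip, n in requests_by_ip.items():
        per_server[server_for(ip, n_servers)] += n
    total = sum(requests_by_ip.values())
    return [per_server[i] / total for i in range(n_servers)]


# Hypothetical traffic: one corporate proxy sends 30% of requests,
# 7,000 other clients send one request each.
traffic = {"203.0.113.10": 3000}
traffic.update({f"10.0.{i // 250}.{i % 250}": 1 for i in range(7000)})
shares = load_shares(traffic, n_servers=5)
```

The server that the proxy's IP hashes to carries its 30% on top of its ordinary share of the remaining clients, so it ends up with several times the load of its peers even though the hash itself is uniform over IPs.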

Real Systems

HAProxy is one of the most widely deployed open-source load balancers. It supports Layer 4 and Layer 7 routing, a range of balancing algorithms, and health checks with configurable thresholds. It is used by GitHub, Stack Overflow, and Tumblr. Its architecture is event-driven (like Nginx): historically a single-threaded event loop, with multithreading available since version 1.8, which gives it very high throughput per core with low latency.

AWS Application Load Balancer (ALB) is a managed Layer 7 load balancer. It routes based on URL path, HTTP headers, query strings, and source IP. It integrates with AWS auto-scaling: as new instances are registered, they are automatically added to the routing pool. Health check failures automatically remove instances. The managed version eliminates the operational burden of maintaining load balancer redundancy.

Kubernetes’ kube-proxy implements service load balancing inside a cluster using iptables or IPVS rules. Every node runs kube-proxy, which maintains routing rules so that traffic to any service ClusterIP is distributed across the pods that back it. This is Layer 4 load balancing inside the cluster, fully distributed: there is no central load balancer.

Concept: Load Balancing

Thread: T11 (Feedback) ← benchmarking and measurement (Book 2, Ch 19) → autoscaler control loops (Book 3, Ch 22); T8 (Divide & Conquer) ← distributing load across N servers is the infrastructure form of divide and conquer

Core Idea: A load balancer distributes incoming requests across a pool of servers using a routing algorithm (round-robin, least-connections, consistent-hashing) and continuously measures server health via probes. The health check loop is a feedback system: unhealthy servers are removed from the pool; recovered servers are re-added after hysteresis. The load balancer must itself be redundant to avoid becoming the SPOF it was designed to prevent.

Tradeoff: AT6 — Generality vs. Specialisation (Layer 4 is fast and general; Layer 7 enables content-aware routing at the cost of per-connection parsing overhead — choose based on whether routing decisions need to inspect request content)

Failure Mode: FM1 — Single Point of Failure (an unreplicated load balancer is the single point of failure for every service behind it; active-passive failover or anycast routing eliminates this; the failure symptom is total simultaneous outage of all downstream services)

Signal: When multiple services go down simultaneously with no error logs — the load balancer or a shared network component has failed. When one server in a pool is consistently receiving 3–5× the load of others — the routing algorithm has a hotspot. When P99 latency spikes after a deployment and then recovers — a server was briefly unhealthy and was not removed from the pool fast enough.

Maps to: Book 0, Framework 8 (Infrastructure Components — Load Balancer), Framework 3 (Failure Modes — Single Point of Failure)

Exercises

Level 2 — Apply

A service has a pool of 5 servers. A load balancer uses round-robin routing. Server 3 goes down. The health check interval is 30 seconds; the unhealthy threshold is 3 consecutive failures.

1. How long does it take for server 3 to be removed from the pool? How many requests are sent to the failing server during this window? At 1,000 RPS total, how many requests fail?
2. The team reduces the health check interval to 5 seconds. What is the new failure window? What is the operational cost of more frequent health checks?
3. After server 3 is removed, traffic is distributed across 4 servers. One server was already at 80% capacity. At 1,000 RPS with round-robin, how much additional load does the 80% server receive?

Level 3 — Design

A global e-commerce platform receives 500,000 RPS at peak. The platform has data centres in 3 regions (US, EU, Asia). Users must be routed to the nearest region to minimise latency. Within each region, load must be distributed across 50 application servers.

1. Design the two-tier load balancing architecture. What handles global routing? What handles within-region distribution? Name the algorithm used at each tier.
2. A deployment requires rolling 10 servers out of service simultaneously in each region. Design the drain procedure. How does the remaining capacity change during the drain, and what is the maximum safe drain size per region without exceeding 90% capacity on remaining servers?
3. A DDoS attack generates 2,000,000 RPS from 10,000 distinct IP addresses. The platform’s capacity is 500,000 RPS. Design a rate-limiting layer that integrates with the load balancer to protect backend servers. Name every tradeoff using AT notation.

A complete answer will: (1) correctly design the two-tier architecture (Anycast DNS or GeoDNS for global routing, weighted round-robin or least-connections within each region) and compute the maximum safe drain size numerically from the pre-drain per-server utilisation u: the n remaining servers each run at 50u/n, which must stay at or below 90%, so n ≥ 50u/0.9 (for example, draining 5 servers per region is safe only if u ≤ 81%, and draining 10 only if u ≤ 72%), (2) identify FM3 (resource exhaustion: a 4× capacity DDoS overwhelming backend servers) and FM6 (hotspot from 10,000 attacker IPs routing to the same region) as the failure modes the rate-limiting layer must address, (3) address the AT9 tradeoff between per-IP rate limiting (precise, distributed state required) and connection-count limiting at the load balancer (coarse, stateless, but blocks legitimate users behind shared IPs), and (4) specify the state storage for rate limit counters, shared across the 50 regional servers, and the propagation mechanism.
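
The drain-size constraint can be checked numerically once a pre-drain utilisation is assumed. The Python helper below is a sketch under that assumption: `max_safe_drain` is a hypothetical name, `utilisation` is the average per-server load before the drain (a figure the exercise leaves open), and load is assumed to spread evenly over the remaining servers.

```python
import math


def max_safe_drain(pool_size: int, utilisation: float, ceiling: float = 0.9) -> int:
    """Largest number of servers that can drain while the rest stay <= ceiling.

    `utilisation` is the assumed average per-server load before the drain (0..1).
    """
    # Total work is pool_size * utilisation; n remaining servers each carry
    # pool_size * utilisation / n, which must not exceed `ceiling`.
    # The small epsilon guards against floating-point noise at exact boundaries.
    min_active = math.ceil(pool_size * utilisation / ceiling - 1e-9)
    return max(0, pool_size - min_active)
```

For a 50-server region this gives a drain budget of 5 servers at 81% utilisation and 10 servers at 72%, so the planned drain of 10 per region is only safe outside peak load.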
