A single-machine web server handles 1,000 concurrent connections. It uses a single database. Latency is 10ms. This works.
At 100,000 concurrent connections, the single machine’s memory is exhausted. The database becomes a bottleneck. The solution is not to buy a faster machine — vertical scaling has a ceiling. The solution is horizontal scaling: multiple machines. But multiple machines introduce coordination problems that a single machine never has.
// Single-machine design: works at 1,000 users
function handle_request(request):
    data = database.query(request.user_id)
    return render(data)
// At 100,000 users: database is the bottleneck
// Adding a second database requires a decision:
// Which database does each user's data live in?
// What happens when a request needs data from both?
// These questions do not exist at 1,000 users.
// Horizontal scale introduces coordination problems:
function handle_request(request):
    shard = hash(request.user_id) % num_shards
    data = databases[shard].query(request.user_id)
    return render(data)
// Now: what happens if one shard is down?
// What happens if you need to add a shard?
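The cost behind "what happens if you need to add a shard?" can be measured directly. A minimal sketch, assuming integer user IDs and the modulo scheme above (note: Python's `hash()` is the identity for small non-negative integers, so the result is deterministic):

```python
# How many users must move when a shard is added under modulo sharding?
def shard_for(user_id, num_shards):
    return hash(user_id) % num_shards  # hash(int) == int for small ints

user_ids = range(10_000)
before = {u: shard_for(u, 4) for u in user_ids}
after = {u: shard_for(u, 5) for u in user_ids}  # one shard added

moved = sum(1 for u in user_ids if before[u] != after[u])
print(f"{moved / len(user_ids):.0%} of users change shards")  # → 80%
```

Going from 4 to 5 shards relocates 80% of all user data, which is why production systems tend to use consistent hashing, which moves only about 1/n of the keys.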
The point: scale changes the class of problem, not just the magnitude.
Every production system can be evaluated against seven questions. Any “no” indicates a scaling vulnerability:
function scale_diagnostic(system):
    questions = [
        "Can any single component be removed without total system failure?",
        "Does throughput increase linearly when you add instances?",
        "Is latency bounded at P99 under 2× peak load?",
        "Is state stored outside the application instances?",
        "Does the system degrade gracefully when a dependency is slow?",
        "Is there a mechanism to shed load when capacity is exceeded?",
        "Is failure observable before users report it?"
    ]
    // A "no" to any question = a scaling failure mode waiting to happen.
    // This is not a checklist for correctness — it is a map of where
    // the system will break first as load increases.
    return questions
// Throughput measurement
throughput_rps = requests_completed / time_window_seconds
// Latency measurement — P50, P95, P99 matter; mean does not
latencies = collect_latency_samples(N)
sort(latencies)
p50 = latencies[floor(N * 0.50)]
p95 = latencies[floor(N * 0.95)]
p99 = latencies[floor(N * 0.99)]
// A system with mean=5ms, p99=500ms is not a "5ms system".
// 1% of users — potentially thousands per second — experience 500ms.
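The mean-versus-P99 gap can be made concrete with a nearest-rank percentile computation, sketched here over a synthetic sample (9,900 requests at 5 ms, 100 at 500 ms; the numbers are illustrative, not measurements):

```python
# Nearest-rank percentiles over a sorted latency sample (milliseconds).
latencies = [5.0] * 9_900 + [500.0] * 100
latencies.sort()

N = len(latencies)
p50 = latencies[int(N * 0.50)]   # index 5000 → 5.0 ms
p95 = latencies[int(N * 0.95)]   # index 9500 → 5.0 ms
p99 = latencies[int(N * 0.99)]   # index 9900 → 500.0 ms
mean = sum(latencies) / N        # 9.95 ms

print(f"mean={mean:.2f}ms  p50={p50}ms  p95={p95}ms  p99={p99}ms")
```

Every percentile up to P98 looks healthy while the mean is already double the median; only P99 exposes the slow tail.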
// Load factor — how close to capacity is the system?
load_factor = current_rps / max_tested_rps
// At load_factor > 0.7, the system is approaching its limit.
// At load_factor > 0.9, latency typically begins to spike non-linearly.
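The non-linear spike can be illustrated with an M/M/1 queueing model (a modeling assumption introduced here for illustration, not something established above), where average latency is W = 1 / (μ − λ) for service rate μ and arrival rate λ:

```python
# Latency vs. load factor under an M/M/1 model (illustrative).
mu = 1000.0  # hypothetical capacity: 1,000 requests/second

latencies_ms = []
for load_factor in (0.5, 0.7, 0.9, 0.99):
    lam = load_factor * mu
    w_ms = 1000.0 / (mu - lam)   # W = 1/(mu - lam), in milliseconds
    latencies_ms.append(w_ms)
    print(f"load={load_factor:.2f}  latency={w_ms:.1f}ms")
```

Latency grows gently from 2 ms at 50% load to 10 ms at 90%, then explodes to 100 ms at 99%: the last tenth of capacity costs ten times the latency of the first nine.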
// Little's Law — relates concurrency, throughput, and latency:
// N = λ × W
// N = average number of requests in the system
// λ = throughput (requests/second)
// W = average latency (seconds)
// Implication: if latency doubles at the same throughput,
// the number of in-flight requests doubles — consuming 2× memory.
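Little's Law can be checked with a small worked example (the throughput and latency figures are hypothetical):

```python
# N = lambda * W: in-flight requests = throughput x latency.
throughput_rps = 2_000   # lambda: requests completed per second
latency_ms = 10          # W: average latency

in_flight = throughput_rps * latency_ms / 1000        # N = 20.0 requests
in_flight_slow = throughput_rps * (2 * latency_ms) / 1000  # latency doubles

print(in_flight, in_flight_slow)  # 20.0 40.0
```

At the same 2,000 rps, doubling latency from 10 ms to 20 ms doubles the in-flight requests from 20 to 40, and each in-flight request holds memory, a connection, and a worker.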
Scale has a fourth axis that constrains the other three: cost. Adding instances increases throughput but costs money. Adding memory reduces latency but costs money. The engineering problem is not “achieve maximum scale” — it is “achieve required scale at acceptable cost.”
// Cost-aware scaling decision
function should_add_instance(current_load_factor, instance_cost, revenue_at_risk):
    if current_load_factor > 0.8:
        // High load: adding an instance prevents outage
        // Outage cost >> instance cost → add instance
        return revenue_at_risk > instance_cost
    else:
        // Low load: adding instance is premature
        return false
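The same decision rule as runnable Python, exercised with hypothetical cost figures:

```python
def should_add_instance(current_load_factor, instance_cost, revenue_at_risk):
    # Scale out only under high load, and only when the revenue
    # protected exceeds the cost of the extra instance.
    if current_load_factor > 0.8:
        return revenue_at_risk > instance_cost
    return False

# Hypothetical: $50/month instance vs. $10,000/month revenue at risk
print(should_add_instance(0.85, 50, 10_000))  # True: outage cost dominates
print(should_add_instance(0.40, 50, 10_000))  # False: load is low, premature
```

The thresholds (0.8 load factor, cost comparison) are the knobs: a revenue-critical system might scale at 0.6, while a batch system might tolerate 0.95.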