A single-machine web server handles 1,000 concurrent connections. It uses a single database. Latency is 10ms. This works.
At 100,000 concurrent connections, the single machine’s memory is exhausted. The database becomes a bottleneck. The solution is not to buy a faster machine — vertical scaling has a ceiling. The solution is horizontal scaling: multiple machines. But multiple machines introduce coordination problems that a single machine never has.
// Single-machine design: works at 1,000 users
function handle_request(request):
    data = database.query(request.user_id)
    return render(data)
// At 100,000 users: database is the bottleneck
// Adding a second database requires a decision:
// Which database does each user's data live in?
// What happens when a request needs data from both?
// These questions do not exist at 1,000 users.
// Horizontal scale introduces coordination problems:
function handle_request(request):
    shard = hash(request.user_id) % num_shards
    data = databases[shard].query(request.user_id)
    return render(data)
// Now: what happens if one shard is down?
// What happens if you need to add a shard?
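The cost behind "what happens if you need to add a shard?" can be measured directly. A minimal sketch, assuming integer user IDs and the modulo scheme above (note: Python's `hash()` is the identity for small non-negative integers, so the result is deterministic):

```python
# How many users must move when a shard is added under modulo sharding?
def shard_for(user_id, num_shards):
    return hash(user_id) % num_shards  # hash(int) == int for small ints

user_ids = range(10_000)
before = {u: shard_for(u, 4) for u in user_ids}
after = {u: shard_for(u, 5) for u in user_ids}  # one shard added

moved = sum(1 for u in user_ids if before[u] != after[u])
print(f"{moved / len(user_ids):.0%} of users change shards")  # → 80%
```

Going from 4 to 5 shards relocates 80% of all user data, which is why production systems tend to use consistent hashing, which moves only about 1/n of the keys.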
The point: scale changes the class of problem, not just the magnitude.
Every production system can be evaluated against seven questions. Any “no” indicates a scaling vulnerability:
function scale_diagnostic(system):
    questions = [
        "Can any single component be removed without total system failure?",
        "Does throughput increase linearly when you add instances?",
        "Is latency bounded at P99 under 2× peak load?",
        "Is state stored outside the application instances?",
        "Does the system degrade gracefully when a dependency is slow?",
        "Is there a mechanism to shed load when capacity is exceeded?",
        "Is failure observable before users report it?"
    ]
    // A "no" to any question = a scaling failure mode waiting to happen.
    // This is not a checklist for correctness — it is a map of where
    // the system will break first as load increases.
    return questions
// Throughput measurement
throughput_rps = requests_completed / time_window_seconds
// Latency measurement — P50, P95, P99 matter; mean does not
latencies = collect_latency_samples(N)
sort(latencies)
p50 = latencies[floor(N * 0.50)]
p95 = latencies[floor(N * 0.95)]
p99 = latencies[floor(N * 0.99)]
// A system with mean=5ms, p99=500ms is not a "5ms system".
// 1% of users — potentially thousands per second — experience 500ms.
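The mean-versus-P99 gap can be made concrete with a nearest-rank percentile computation, sketched here over a synthetic sample (9,900 requests at 5 ms, 100 at 500 ms; the numbers are illustrative, not measurements):

```python
# Nearest-rank percentiles over a sorted latency sample (milliseconds).
latencies = [5.0] * 9_900 + [500.0] * 100
latencies.sort()

N = len(latencies)
p50 = latencies[int(N * 0.50)]   # index 5000 → 5.0 ms
p95 = latencies[int(N * 0.95)]   # index 9500 → 5.0 ms
p99 = latencies[int(N * 0.99)]   # index 9900 → 500.0 ms
mean = sum(latencies) / N        # 9.95 ms

print(f"mean={mean:.2f}ms  p50={p50}ms  p95={p95}ms  p99={p99}ms")
```

Every percentile up to P98 looks healthy while the mean is already double the median; only P99 exposes the slow tail.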
// Load factor — how close to capacity is the system?
load_factor = current_rps / max_tested_rps
// At load_factor > 0.7, the system is approaching its limit.
// At load_factor > 0.9, latency typically begins to spike non-linearly.
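The non-linear spike can be illustrated with an M/M/1 queueing model (a modeling assumption introduced here for illustration, not something established above), where average latency is W = 1 / (μ − λ) for service rate μ and arrival rate λ:

```python
# Latency vs. load factor under an M/M/1 model (illustrative).
mu = 1000.0  # hypothetical capacity: 1,000 requests/second

latencies_ms = []
for load_factor in (0.5, 0.7, 0.9, 0.99):
    lam = load_factor * mu
    w_ms = 1000.0 / (mu - lam)   # W = 1/(mu - lam), in milliseconds
    latencies_ms.append(w_ms)
    print(f"load={load_factor:.2f}  latency={w_ms:.1f}ms")
```

Latency grows gently from 2 ms at 50% load to 10 ms at 90%, then explodes to 100 ms at 99%: the last tenth of capacity costs ten times the latency of the first nine.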
// Little's Law — relates concurrency, throughput, and latency:
// N = λ × W
// N = average number of requests in the system
// λ = throughput (requests/second)
// W = average latency (seconds)
// Implication: if latency doubles at the same throughput,
// the number of in-flight requests doubles — consuming 2× memory.
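Little's Law can be checked with a small worked example (the throughput and latency figures are hypothetical):

```python
# N = lambda * W: in-flight requests = throughput x latency.
throughput_rps = 2_000   # lambda: requests completed per second
latency_ms = 10          # W: average latency

in_flight = throughput_rps * latency_ms / 1000        # N = 20.0 requests
in_flight_slow = throughput_rps * (2 * latency_ms) / 1000  # latency doubles

print(in_flight, in_flight_slow)  # 20.0 40.0
```

At the same 2,000 rps, doubling latency from 10 ms to 20 ms doubles the in-flight requests from 20 to 40, and each in-flight request holds memory, a connection, and a worker.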
Scale has a fourth axis that constrains the other three: cost. Adding instances increases throughput but costs money. Adding memory reduces latency but costs money. The engineering problem is not “achieve maximum scale” — it is “achieve required scale at acceptable cost.”
// Cost-aware scaling decision
function should_add_instance(current_load_factor, instance_cost, revenue_at_risk):
    if current_load_factor > 0.8:
        // High load: adding an instance prevents outage
        // Outage cost >> instance cost → add instance
        return revenue_at_risk > instance_cost
    else:
        // Low load: adding instance is premature
        return false
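The same decision rule as runnable Python, exercised with hypothetical cost figures:

```python
def should_add_instance(current_load_factor, instance_cost, revenue_at_risk):
    # Scale out only under high load, and only when the revenue
    # protected exceeds the cost of the extra instance.
    if current_load_factor > 0.8:
        return revenue_at_risk > instance_cost
    return False

# Hypothetical: $50/month instance vs. $10,000/month revenue at risk
print(should_add_instance(0.85, 50, 10_000))  # True: outage cost dominates
print(should_add_instance(0.40, 50, 10_000))  # False: load is low, premature
```

The thresholds (0.8 load factor, cost comparison) are the knobs: a revenue-critical system might scale at 0.6, while a batch system might tolerate 0.95.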