AT2 Latency vs Throughput: The Tradeoff

A team measured their API at a 5-millisecond median latency and called the system fast. Then a colleague pulled the P99 number: 2,000 milliseconds. One request in a hundred took two full seconds or longer. The system was not a "5ms system" and never had been. It was a system with a 5ms median and a 2-second tail, and those two numbers describe different machines.

Scale has three independent axes — throughput, latency, and storage. Throughput and latency are the two ways a system can be "fast," and past a certain point they pull against each other. Engineers who do not name which one they are buying end up buying neither.

Two Different Questions

Throughput is the number of requests a system handles per unit time, usually stated in requests per second (RPS). It answers: how many users can the system serve at once? A system that handles 1,000 RPS at normal load may not handle 10,000 RPS without architectural changes.

Latency is the time to handle one request, from initiation to completion, in milliseconds. And the metric that matters is not the mean — it is P99, the 99th percentile. It answers: how long does the unlucky user wait?

Latency:    one request, start → finish
            |--------- 5 ms ---------|

Throughput: requests completed per second
            ▓ ▓ ▓ ▓ ▓ ▓ ▓ ▓ ▓ ▓  →  2,000 / sec

These are independent axes. A single-threaded server answering each request in 5 ms but handling one at a time caps near 200 RPS — low latency, low throughput. A batch pipeline taking ten minutes per job but running ten thousand in parallel — high latency, high throughput. The mistake is treating "fast" as a single number.
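The two examples above reduce to the same arithmetic, sketched here. The function name `max_rps` and its parameters are illustrative assumptions, and the model assumes every worker is always busy and handles exactly one request at a time.

```python
def max_rps(service_time_ms: float, workers: int = 1) -> float:
    """Upper bound on completions per second (assumed model).

    Assumes every worker is always busy and handles one request at a
    time, so each worker completes 1000 / service_time_ms requests/sec.
    """
    if service_time_ms <= 0:
        raise ValueError("service_time_ms must be positive")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return workers * 1000.0 / service_time_ms


# Single-threaded server, 5 ms per request: low latency, low throughput.
print(max_rps(5))  # 200.0

# Batch pipeline, 10 minutes per job, 10,000 jobs in parallel:
# high latency, yet roughly 16.7 jobs complete every second.
print(round(max_rps(10 * 60 * 1000, workers=10_000), 1))  # 16.7
```

The latency of one item and the completion rate of the whole system come out of different inputs, which is why neither number predicts the other.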

Why The Mean Lies

A system with a 5ms median and a 2,000ms P99 is not a 5ms system. That is the single most important sentence about measuring latency, and it is the one most often skipped.

If the system handles 10,000 RPS, a 2,000ms P99 means about 100 requests every second take two seconds or longer. That is not a rounding error. That is potentially thousands of users per minute getting a response so slow they assume the system is broken. A single average hides them: the median ignores the tail entirely, and the mean dilutes it across the 99 fast requests.

collect latency samples, sort them:
  p50 = sample at 50%   → 5 ms     "typical"
  p95 = sample at 95%   → 180 ms   "getting slow"
  p99 = sample at 99%   → 2,000 ms "this is the real system"

Measure the tail, or you are measuring a system that does not exist.
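A minimal sketch of that measurement, using the nearest-rank definition of a percentile. The sample data is synthetic and chosen only to reproduce the p50, p95, and p99 figures above; the helper name `percentile` is an assumption, not a library call.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: the value at rank ceil(p% of n) after sorting."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Synthetic, assumed samples in ms: 94% fast, 4% slow, 2% in the tail.
latencies_ms = [5.0] * 940 + [180.0] * 40 + [2000.0] * 20

for p in (50, 95, 99):
    print(f"p{p} = {percentile(latencies_ms, p):.0f} ms")
print(f"mean = {statistics.mean(latencies_ms):.1f} ms")

# At 10,000 RPS, the slowest 1% is about 100 requests every second.
rps = 10_000
print(f"requests/sec at or beyond p99: {rps * (100 - 99) // 100}")
```

With these assumed samples the mean lands near 52 ms, a figure that describes neither the typical request (5 ms) nor the tail (2,000 ms). Only the percentiles report what individual users experienced.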

Why They Trade Off

The tradeoff appears the moment a resource is shared.

Take a thread pool. Each request needs a thread. With 100 threads and 40ms per request, the system tops out near 2,500 RPS, and only if every thread can run without waiting for a core. Want more throughput? Add threads. But threads contend for CPU, cache, and memory bandwidth. Past the core count, more threads mean context switching, and every request now spends time waiting to be scheduled instead of running. Latency climbs.

Threads = cores (8):   latency 40 ms,  throughput  200/s
Threads = 64:          latency 55 ms,  throughput 1,150/s
Threads = 512:         latency 340 ms, throughput 1,500/s
                              ↑               ↑
                       latency wrecked   throughput barely moved

This is the shape of AT2. Each unit of throughput past the saturation point costs disproportionately more latency. The system at 512 threads is not broken — it is doing exactly what you asked: maximising concurrent work at the expense of per-request speed.

Little's Law makes the coupling precise. The number of in-flight requests N relates to throughput λ and latency W as N = λ × W. Hold throughput constant and double latency, and the number of in-flight requests doubles — consuming twice the memory. You cannot move one axis without paying somewhere.
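As a quick check, Little's Law can be applied to the thread-pool table above: multiplying each row's throughput by its latency should give roughly the number of threads in use. The function name `in_flight` and the per-request memory figure are assumptions made for illustration.

```python
def in_flight(throughput_rps: float, latency_ms: float) -> float:
    """Little's Law, N = λ × W, with W converted from ms to seconds."""
    return throughput_rps * latency_ms / 1000.0


# (threads, latency in ms, throughput in requests/sec) from the table above.
rows = [(8, 40, 200), (64, 55, 1150), (512, 340, 1500)]
for threads, latency_ms, rps in rows:
    n = in_flight(rps, latency_ms)
    print(f"{threads:>3} threads -> about {n:g} requests in flight")

# Hypothetical per-request footprint, only to illustrate the memory claim.
BYTES_PER_REQUEST = 256 * 1024
base = in_flight(1500, 340) * BYTES_PER_REQUEST
doubled = in_flight(1500, 680) * BYTES_PER_REQUEST
print(doubled / base)  # 2.0: same throughput, double latency, double memory
```

The rows come out at 8, 63.25, and 510 in-flight requests, close to the 8, 64, and 512 threads configured. The pool is full in every case; only the time each request spends inside it changes.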

Latency Amplification: The Failure Mode

The danger in chasing throughput is FM5 — latency amplification. Throughput tuning tends to add queues, buffers, and batch stages. Each one is a place a request can wait.

A request crossing five services, each holding it in a buffer for up to 50ms, can accumulate 250ms of pure waiting before any real work happens. Under load, those buffers fill, and the wait grows. P50 still looks fine. P99 explodes, because the unlucky request sat behind a full buffer at every hop.

Throughput-optimised pipeline under load:
  service A buffer → B buffer → C buffer → D buffer
       full           full        full       full
  P50: 120 ms                     P99: 2,400 ms

A throughput-tuned system that nobody load-tested for tail latency passes every benchmark and still feels broken to the unlucky user. The amplification is invisible until the buffers are full.
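One way to see why the tail degrades faster than the median is to count how often a request meets a full buffer somewhere along its path. The sketch below is a simplified model, not a measurement: it assumes each hop is full independently with the same probability, and the 1% figure and both function names are hypothetical.

```python
def p_any_full_buffer(p_full_per_hop: float, hops: int) -> float:
    """Chance that a request meets at least one full buffer on its path.

    Assumes each hop is full independently with the same probability.
    """
    if not 0.0 <= p_full_per_hop <= 1.0:
        raise ValueError("p_full_per_hop must be between 0 and 1")
    if hops < 0:
        raise ValueError("hops must not be negative")
    return 1.0 - (1.0 - p_full_per_hop) ** hops


def worst_case_wait_ms(buffer_wait_ms: float, hops: int) -> float:
    """Total buffer wait if the request waits the full time at every hop."""
    return buffer_wait_ms * hops


# Five services, 50 ms of buffering each, as in the example above.
print(worst_case_wait_ms(50, 5))  # 250

# Hypothetical: each buffer is full 1% of the time.
for hops in (1, 5, 20):
    print(hops, f"{p_any_full_buffer(0.01, hops):.1%}")
```

Under these assumptions a delay that affects 1% of requests at a single hop affects roughly 5% of requests across five hops. Each added stage widens the tail even when every stage looks healthy on its own.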

How to Choose

The decision is not technical first. It is about who is waiting.

Optimise for latency when a human is blocked on the result. Page loads, search, checkout, autocomplete. Here a 2-second request is a failure even if the system handles a million per second. Keep queues shallow. Avoid batching on the request path. Accept lower hardware utilisation as the price.

Optimise for throughput when no one waits on an individual item. Log ingestion, analytics rollups, video encoding, nightly reconciliation. Here per-item latency is irrelevant; cost per million items is everything. Batch aggressively. Run deep queues. Saturate every core.
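The cost of batching on each side of this choice can be sketched with a simple assumed model: a fixed overhead per batch is shared across its items, while the first item in a batch has to wait for the rest to arrive. The function names and every number below are hypothetical, and arrivals are assumed to be evenly spaced.

```python
def per_item_cost_ms(fixed_overhead_ms: float, per_item_ms: float,
                     batch_size: int) -> float:
    """Processing cost per item when a fixed overhead is shared by a batch."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return fixed_overhead_ms / batch_size + per_item_ms


def first_item_wait_ms(arrival_rate_per_s: float, batch_size: int) -> float:
    """How long the first item waits for the batch to fill (steady arrivals)."""
    if arrival_rate_per_s <= 0:
        raise ValueError("arrival_rate_per_s must be positive")
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return (batch_size - 1) * 1000.0 / arrival_rate_per_s


# Hypothetical workload: 10 ms overhead per batch, 0.1 ms per item,
# items arriving at 1,000 per second.
for batch_size in (1, 10, 100):
    cost = per_item_cost_ms(10, 0.1, batch_size)
    wait = first_item_wait_ms(1000, batch_size)
    print(f"batch {batch_size:>3}: {cost:.2f} ms/item, first item waits {wait:.0f} ms")
```

In this model a batch of 100 cuts the per-item cost from 10.1 ms to 0.2 ms, and the first item pays 99 ms of waiting for it. That is a good trade for log ingestion and a poor one for autocomplete.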

Scale does not create this tradeoff — it forces you to make it explicitly. A single-machine system can silently accept poor latency and low throughput because at 1,000 users neither matters. At 1,000,000 users, each matters enough to break the system. Scale removes the option of deferring the decision.

The One Sentence

You cannot make a request both arrive instantly and share the machine with ten thousand others: past saturation, every extra unit of throughput is paid for in latency, and every millisecond of latency saved is paid for in throughput.

Before tuning anything, answer one question: is a human waiting on this exact request? If yes, you are buying latency — and you measure P99, not the mean. If no, you are buying throughput. Name the purchase before you make it, because the system will charge you for the other one regardless.