The payment service was not down. Requests were getting through. Responses were just taking twelve seconds instead of two hundred milliseconds. Within four minutes, the checkout service had no free threads. Within six, the API gateway followed. The entire platform went offline because one upstream service slowed down.
Nobody was paged for the payment service. It was healthy by every metric the team tracked. The cascade killed three services that had nothing wrong with them.
Dead Is Better Than Slow
This sounds backwards. It is not.
A dead service fails immediately. The caller gets an error. It moves on. It frees the thread. A slow service keeps the connection open. The caller waits. The thread blocks. No other request can use that thread.
Dead upstream:
Caller → [connection refused in ~10ms] → error returned → thread freed
Slow upstream:
Caller → waits 12 seconds → thread blocked for 12 seconds
→ 50 other requests queue behind it
→ thread pool exhausted
→ caller now looks dead to its callers
A dead service fails fast. A slow service takes everything else down with it.
That is the mechanism. Not load. Not a bug in the downstream services. Resource starvation caused by holding connections to something that refuses to respond quickly.
Thread Exhaustion: The Actual Kill
Most web services use a thread pool or connection pool to handle requests. The pool has a fixed size — say, two hundred threads. Each thread handles one request at a time.
Under normal conditions, threads process requests quickly and return to the pool. Under FM2 conditions, threads block waiting for the slow upstream. They do not return. New requests arrive. They grab threads. Those threads also block. The pool fills.
Thread Pool (200 threads)
Normal:
[▓▓▓▓▓░░░░░░░░░░░░░░░] 20 in use, 180 free
After 2 min of slow upstream:
[▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓] 200 in use, 0 free
New request arrives:
→ no thread available
→ queued or rejected
→ caller sees service as unresponsive
Once the pool is exhausted, the service appears down. Its callers start waiting on it. Their thread pools exhaust. The failure moves upstream.
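The fill rate is easy to simulate. The sketch below is a rough model, not a measurement. It assumes evenly spaced arrivals at a hypothetical 100 requests per second (the rate that keeps about 20 of 200 threads busy at 200 ms per request) and a constant upstream latency. The function name `time_to_exhaustion` is made up for illustration.

```python
from collections import deque


def time_to_exhaustion(pool_size, arrivals_per_sec, latency_sec, horizon_sec=60.0):
    """Return the time (seconds) at which a request first finds no free thread,
    or None if the pool never runs dry within the simulated horizon."""
    busy = deque()  # release times of threads in use, oldest first
    total_requests = int(horizon_sec * arrivals_per_sec)
    for i in range(total_requests):
        now = i / arrivals_per_sec
        # Return finished threads to the pool.
        while busy and busy[0] <= now:
            busy.popleft()
        if len(busy) >= pool_size:
            return now  # no thread available for this request
        busy.append(now + latency_sec)
    return None


print(time_to_exhaustion(200, 100, 0.2))   # None
print(time_to_exhaustion(200, 100, 12.0))  # 2.0
```

Under these assumptions the pool holds indefinitely at 200 ms per call and runs dry two seconds after every call starts taking twelve. In practice only a fraction of requests touch the slow dependency and latency degrades gradually, so real pools usually take longer to drain.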
How It Spreads
FM2 does not require a direct dependency. It propagates through any chain where a slow response causes resource starvation.
Payment (slow) ←─────────── Checkout ←────── API Gateway ←──── User
Step 1: Payment responds slowly. Checkout threads block.
Step 2: Checkout looks slow. API Gateway threads block.
Step 3: API Gateway looks unresponsive. Users see 502 or timeout.
Checkout and API Gateway are functionally healthy.
Only Payment is degraded.
The cascade spreads because coupling exists without protection. Each service trusts its upstream to respond in bounded time. None of them enforce that bound.
Timeouts Help But Are Not Enough
Adding timeouts is the obvious first step. If payment takes more than two seconds, give up. That frees the thread after two seconds instead of twelve. The damage rate slows.
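As a minimal sketch of that bound, here is one way to do it in Python with `asyncio.wait_for`, which cancels the awaited call when the budget runs out. It uses coroutines rather than threads, and `call_with_timeout`, `UpstreamTimeout`, and `slow_payment` are hypothetical names standing in for a real client. A blocking client would set the equivalent socket or request timeout instead.

```python
import asyncio


class UpstreamTimeout(Exception):
    """Raised when the upstream does not answer within the time budget."""


async def call_with_timeout(make_call, timeout_sec=2.0):
    """Await make_call(), giving up after timeout_sec seconds.

    make_call is a zero-argument function that returns an awaitable.
    """
    try:
        return await asyncio.wait_for(make_call(), timeout=timeout_sec)
    except asyncio.TimeoutError:
        raise UpstreamTimeout(f"no response within {timeout_sec}s") from None


async def slow_payment():
    # Stand-in for the degraded payment call from the incident.
    await asyncio.sleep(12)
    return "charged"
```

Running `asyncio.run(call_with_timeout(slow_payment))` raises `UpstreamTimeout` after about two seconds instead of returning after twelve. The caller still has to decide what to do with that error.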
But timeouts alone do not stop the cascade. If payment is slow for ten minutes and every checkout request times out after two seconds, checkout is still spending two seconds on every request that touches payment. If traffic is high, threads still exhaust. Slower, but the same outcome.
The number of threads held at any moment is roughly (requests per second) × (seconds per request). With a two-second timeout, fifty requests per second against a slow payment service still keep one hundred threads blocked at once. Double the traffic and a two-hundred-thread pool is full again.
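This relationship is Little's law: average concurrency equals arrival rate times time in the system. A short worked version follows, using numbers from this article (200 threads, 50 requests per second, a 2-second timeout, a 12-second slow response). The function names are illustrative.

```python
def threads_held(requests_per_sec, seconds_per_request):
    """Little's law: average number of threads busy at once."""
    return requests_per_sec * seconds_per_request


def max_rate_before_exhaustion(pool_size, seconds_per_request):
    """Highest steady arrival rate the pool can absorb."""
    return pool_size / seconds_per_request


POOL_SIZE = 200

print(threads_held(50, 12))  # 600, three times the pool with no timeout
print(threads_held(50, 2))   # 100, half the pool with a 2 s timeout
print(max_rate_before_exhaustion(POOL_SIZE, 2))  # 100.0 requests per second
```

The timeout moves the breaking point from about 17 requests per second to 100. It does not remove it.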
Circuit Breakers: The Structural Fix
A circuit breaker sits between the caller and the upstream. It tracks how requests are going. When failures or timeouts exceed a threshold, the breaker opens. Subsequent calls to the upstream fail immediately — no connection attempt, no wait.
Circuit Breaker States
CLOSED (normal):
Caller → [breaker: closed] → Upstream
All calls pass through. Failures counted.
OPEN (tripped):
Caller → [breaker: open] → immediate error returned
No calls reach upstream. Thread freed instantly.
HALF-OPEN (recovery probe):
One call let through. If it succeeds, breaker closes.
If it fails, breaker stays open.
When payment slows, the breaker trips after a few failures. All subsequent checkout requests to payment return immediately with an error. Checkout threads are free in microseconds. The cascade stops.
The breaker does not fix payment. It isolates the failure. Checkout can return a degraded response — skip payment, show an error, queue the transaction for retry — instead of exhausting its thread pool.
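The three states fit in a few dozen lines. What follows is a minimal single-threaded sketch, not a production breaker: the class and exception names, the default threshold of five consecutive failures, and the 30-second reset window are all assumptions. It has no locking, so under concurrency more than one probe could slip through in the half-open state.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without contacting the upstream."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_after_sec=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before tripping
        self.reset_after_sec = reset_after_sec      # how long to stay open before probing
        self.clock = clock                          # injectable for tests
        self.failures = 0
        self.opened_at = None                       # None means CLOSED

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_after_sec:
            return "half-open"
        return "open"

    def call(self, func, *args, **kwargs):
        state = self.state
        if state == "open":
            # Fail fast: the upstream is never contacted.
            raise CircuitOpenError("upstream marked unhealthy; failing fast")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._record_failure(state)
            raise
        # Success closes the breaker and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result

    def _record_failure(self, state):
        self.failures += 1
        # A failed probe reopens immediately; otherwise trip at the threshold.
        if state == "half-open" or self.failures >= self.failure_threshold:
            self.opened_at = self.clock()
```

A caller would wrap the upstream call, for example `breaker.call(charge_card, order)` with a hypothetical `charge_card`, and treat `CircuitOpenError` as the signal to return the degraded response. Timeouts from the wrapped call count as failures because they surface as exceptions.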
The Principle
Every service-to-service call is a trust relationship. You trust the upstream to respond in bounded time. FM2 happens when that trust is unprotected.
Timeouts bound how long you wait. Circuit breakers stop you from waiting at all when the upstream is sick. Bulkheads — separate thread pools per upstream dependency — contain the starvation to one section of your capacity.
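A bulkhead can be sketched without a second thread pool. The version below swaps in a semaphore that caps concurrent calls per upstream, which limits capacity in a similar way. `Bulkhead`, `BulkheadFullError`, and the limit of 50 slots are hypothetical choices for illustration.

```python
import threading
from contextlib import contextmanager


class BulkheadFullError(Exception):
    """Raised when every slot reserved for an upstream is already in use."""


class Bulkhead:
    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    @contextmanager
    def slot(self):
        # Non-blocking acquire: reject at once instead of queueing behind slow calls.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"{self.name}: all slots busy")
        try:
            yield
        finally:
            self._slots.release()


# Assumed split: payment may hold at most 50 of the service's 200 threads.
payment_bulkhead = Bulkhead("payment", max_concurrent=50)
```

Wrapping each payment call in `with payment_bulkhead.slot():` means a slow payment service can tie up at most 50 threads. The other 150 keep serving requests that never touch payment.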
None of these fix a slow service. They protect the services that depend on it.
The System That Survives
Cascading failures like this one are not caused by load. They are caused by coupling without protection. The services that died in that checkout cascade were not overwhelmed. They were starved: waiting on something that would not answer, with no mechanism to stop waiting.
For each upstream dependency: what is the timeout? What trips the circuit breaker? What does the caller do when it trips? Answer those three questions for every dependency edge in your system, and FM2 loses most of its teeth.
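One way to keep those answers honest is to write them down as data and check for gaps. This is a sketch under assumptions: `EdgePolicy`, `unanswered`, the edge names, and every value except the two-second payment timeout are hypothetical.

```python
from dataclasses import dataclass, fields
from typing import Dict, List, Optional


@dataclass(frozen=True)
class EdgePolicy:
    timeout_sec: Optional[float] = None        # what is the timeout?
    breaker_trips_after: Optional[int] = None  # what trips the circuit breaker?
    on_open: Optional[str] = None              # what does the caller do when it trips?


def unanswered(policies: Dict[str, EdgePolicy]) -> Dict[str, List[str]]:
    """Return the questions still unanswered for each dependency edge."""
    gaps = {}
    for edge, policy in policies.items():
        missing = [f.name for f in fields(policy) if getattr(policy, f.name) is None]
        if missing:
            gaps[edge] = missing
    return gaps


POLICIES = {
    "checkout -> payment": EdgePolicy(
        timeout_sec=2.0, breaker_trips_after=5, on_open="queue for retry"
    ),
    "gateway -> checkout": EdgePolicy(timeout_sec=3.0),
}

print(unanswered(POLICIES))
# {'gateway -> checkout': ['breaker_trips_after', 'on_open']}
```

Any edge that shows up in the output is a place where a slow upstream can still starve its caller.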