The hardest outages to debug look like everything failed at once. The database, the cache, the API, the frontend — all red at the same time. The temptation is to declare a widespread incident and start fixing everything simultaneously. The reality is almost always simpler: one thing failed, and everything downstream failed because of it.
This is the dependency chain. Understanding it is the difference between an engineer who panics during an outage and one who stays calm, identifies the root cause, and stops the cascade before it spreads.
What Is a Dependency Chain?
A dependency chain is the sequence of services, components, or processes that depend on each other to function. If Service A calls Service B, and Service B calls Service C, then A → B → C is a dependency chain.
When C fails, B fails (because C cannot respond). When B fails, A fails (because B cannot respond). The failure travels up the chain. From the outside, A, B, and C all appear to be failing at the same time. The root cause is C alone.
Client → Service A → Service B → Service C → Database
Database goes down:
Service C: timeout (dependency unavailable)
Service B: timeout (C not responding)
Service A: timeout (B not responding)
Client: error (A not responding)
Symptom: "everything is down"
Root cause: database
Why Dependency Chains Cause Cascading Failures
A single failure becomes a cascading failure through three mechanisms:
1. Resource exhaustion upstream
When Service C is slow (not dead — slow), requests to it pile up waiting for responses. Service B holds open connections to C while waiting. Those connections consume threads or goroutines. If B runs out of threads, new requests from A cannot be processed. B becomes unavailable, not because C is unavailable, but because C is slow. Slowness is contagious in a way that hard failures often are not.
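A minimal Go sketch of that mechanism, with invented names (`pool`, `callC`, `handle`): Service B caps itself at ten concurrent calls to C, a slow C holds every slot, and new requests have nowhere to go.

```go
package main

import (
	"fmt"
	"time"
)

// pool caps Service B at 10 concurrent calls to C; each in-flight call holds a slot.
var pool = make(chan struct{}, 10)

// callC stands in for a call to Service C that is slow rather than failing outright.
func callC() string {
	time.Sleep(30 * time.Second)
	return "ok"
}

// handle is B's request handler. Because callC has no timeout, every acquired
// slot is held for the full 30 seconds.
func handle(id int) {
	select {
	case pool <- struct{}{}: // acquire a slot
		defer func() { <-pool }()
		fmt.Printf("request %d: waiting on C\n", id)
		callC()
	default:
		// Pool exhausted: B rejects new work even though B itself is healthy.
		fmt.Printf("request %d: rejected, no capacity\n", id)
	}
}

func main() {
	for i := 0; i < 20; i++ {
		go handle(i)
	}
	time.Sleep(2 * time.Second) // roughly half the requests are rejected almost immediately
}
```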
2. Retry amplification
When C fails, B retries. When B fails, A retries. When A fails, the client retries. Retries multiply at each layer: if every layer retries three times, one failed request at C can become dozens of requests in aggregate. If each retry arrives while C is in the process of recovering, the burst of retried traffic can prevent recovery entirely. This is the thundering herd: simultaneous retries that overwhelm a system trying to come back online.
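One way to keep retries from amplifying, sketched below with a placeholder `callOnce`: a small retry budget, exponential backoff, and jitter, so retries from many callers spread out instead of arriving as a single burst.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// callOnce is a placeholder for a single attempt against the dependency.
func callOnce() error {
	return errors.New("C unavailable")
}

// callWithRetry spends a small, fixed retry budget with exponential backoff
// and full jitter, rather than retrying until the dependency gives in.
func callWithRetry() error {
	const maxAttempts = 3
	backoff := 100 * time.Millisecond

	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = callOnce(); err == nil {
			return nil
		}
		if attempt == maxAttempts {
			break
		}
		// Sleep a random duration up to the current ceiling, then double it.
		time.Sleep(time.Duration(rand.Int63n(int64(backoff))))
		backoff *= 2
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}

func main() {
	fmt.Println(callWithRetry())
}
```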
3. Cache avalanche
If C is a database and B is a caching layer in front of it, C's failure removes the cache's ability to revalidate or load-through. When the cache evicts entries (due to TTL or memory pressure), there is nothing to fall back to. The cache goes cold. When C recovers, all cache misses hit C simultaneously.
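A sketch of two common mitigations, assuming golang.org/x/sync/singleflight is available and using a placeholder `loadFromDB`: coalesce concurrent misses for the same key into one query, and jitter TTLs so a warm cache does not expire in lockstep.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"

	"golang.org/x/sync/singleflight"
)

var group singleflight.Group

// loadFromDB is a stand-in for the real query against C.
func loadFromDB(key string) (string, error) {
	time.Sleep(50 * time.Millisecond)
	return "value-for-" + key, nil
}

// getWithCoalescing ensures that N concurrent misses on the same key produce
// one database query, not N.
func getWithCoalescing(key string) (string, error) {
	v, err, _ := group.Do(key, func() (interface{}, error) {
		return loadFromDB(key)
	})
	if err != nil {
		return "", err
	}
	return v.(string), nil
}

// jitteredTTL spreads expirations over roughly +/-10% of the base TTL so the
// cache does not go cold all at once.
func jitteredTTL(base time.Duration) time.Duration {
	jitter := time.Duration(rand.Int63n(int64(base / 5)))
	return base - base/10 + jitter
}

func main() {
	v, _ := getWithCoalescing("user:42")
	fmt.Println(v, jitteredTTL(5*time.Minute))
}
```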
Identifying the Root Cause in a Dependency Chain
The pattern for diagnosis is the same every time: follow the dependency chain from the symptom to the source.
- Start at the outermost failure (what the user sees)
- Identify what that service depends on
- Check whether those dependencies are healthy
- Move inward until you find the first unhealthy component with healthy dependencies
The first unhealthy component with healthy dependencies is the root cause.
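The walk can be expressed directly as code. A sketch with an invented dependency map and health data: find the unhealthy components whose own dependencies are all healthy.

```go
package main

import "fmt"

// deps maps each service to what it calls (illustrative data).
var deps = map[string][]string{
	"client":    {"service-a"},
	"service-a": {"service-b"},
	"service-b": {"service-c"},
	"service-c": {"database"},
	"database":  {},
}

// healthy is what monitoring reports: everything looks red at once.
var healthy = map[string]bool{
	"client": false, "service-a": false, "service-b": false,
	"service-c": false, "database": false,
}

// rootCauses returns the unhealthy components whose dependencies are all
// healthy (or that have no dependencies): the deepest failures in the chain.
func rootCauses() []string {
	var roots []string
	for svc, ok := range healthy {
		if ok {
			continue
		}
		depsHealthy := true
		for _, d := range deps[svc] {
			if !healthy[d] {
				depsHealthy = false
				break
			}
		}
		if depsHealthy {
			roots = append(roots, svc)
		}
	}
	return roots
}

func main() {
	fmt.Println(rootCauses()) // [database]: every other failure is downstream of it
}
```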
This principle is regularly ignored during incidents because monitoring surfaces all failures simultaneously, and the instinct is to fix the most visible one, which is usually the outermost symptom rather than the root.
Breaking the Chain: Isolation Patterns
A dependency chain is a design choice. Some coupling is unavoidable, but the strength of that coupling can be reduced.
Circuit breakers
A circuit breaker monitors calls to a dependency. When the failure rate exceeds a threshold, it "opens" — subsequent calls fail immediately without attempting the dependency. This stops the thread exhaustion and retry amplification. After a cooldown, which gives the dependency time to recover, the breaker lets a trial call through (the half-open state); if the trial succeeds, the breaker closes and normal calls resume.
Normal: A → B → C (C healthy, requests flow)
Degraded: A → B → C (C slow, B threads exhausted)
Breaker open: A → B → [fast fail] (B stops calling C, serves cached/default response)
Recovered: A → B → C (C healthy, breaker closes)
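A minimal sketch of the idea, not a production implementation (the threshold, cooldown, and single-trial behaviour are simplified and the values are arbitrary):

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

type state int

const (
	closed state = iota
	open
	halfOpen
)

// ErrOpen is returned without touching the dependency while the breaker is open.
var ErrOpen = errors.New("circuit open: failing fast")

type Breaker struct {
	mu        sync.Mutex
	state     state
	failures  int
	openUntil time.Time
}

// Call wraps one call to the dependency. Five consecutive failures open the
// breaker for 10 seconds; after that a trial call is allowed through, and a
// success closes the breaker again.
func (b *Breaker) Call(fn func() error) error {
	b.mu.Lock()
	if b.state == open {
		if time.Now().Before(b.openUntil) {
			b.mu.Unlock()
			return ErrOpen
		}
		b.state = halfOpen // cooldown elapsed: permit a trial call
	}
	b.mu.Unlock()

	err := fn()

	b.mu.Lock()
	defer b.mu.Unlock()
	if err != nil {
		b.failures++
		if b.state == halfOpen || b.failures >= 5 {
			b.state = open
			b.openUntil = time.Now().Add(10 * time.Second)
			b.failures = 0
		}
		return err
	}
	b.failures = 0
	b.state = closed
	return nil
}

func main() {
	var b Breaker
	flaky := func() error { return errors.New("C timed out") }
	for i := 0; i < 7; i++ {
		fmt.Println(b.Call(flaky)) // after five failures, calls fail fast with ErrOpen
	}
}
```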
Timeouts
Every call to an external dependency needs an explicit timeout. Without one, threads wait indefinitely. Even an explicit timeout can be too generous: a 30-second timeout on a database call that normally takes 5ms will hold a thread for 30 seconds when the database is down. Multiply by the number of concurrent requests and the connection pool exhausts within seconds.
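A sketch using Go's context package, with a placeholder internal URL: the caller sets the budget, and the goroutine is released as soon as the deadline passes.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

// fetchFromC budgets 200ms for a call that normally takes a few milliseconds.
// When C is down, the goroutine is freed after 200ms instead of waiting indefinitely.
func fetchFromC(ctx context.Context) error {
	ctx, cancel := context.WithTimeout(ctx, 200*time.Millisecond)
	defer cancel()

	// Placeholder internal URL for Service C.
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, "http://service-c.internal/items", nil)
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return fmt.Errorf("call to C failed or timed out: %w", err)
	}
	defer resp.Body.Close()
	return nil
}

func main() {
	fmt.Println(fetchFromC(context.Background()))
}
```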
Bulkheads
Isolate dependency calls in separate thread pools or connection pools. If Service A calls both B and C, use a separate pool for each. When C is slow and exhausts its pool, B's pool is unaffected. The failure is contained to the part of A that depends on C.
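A bulkhead sketch using one bounded slot pool per dependency (the `callB` and `callC` bodies are placeholders): saturating C's pool leaves B's pool untouched.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var (
	poolB = make(chan struct{}, 20) // capacity reserved for calls to B
	poolC = make(chan struct{}, 20) // capacity reserved for calls to C
)

var ErrNoCapacity = errors.New("bulkhead full: dependency saturated")

// withBulkhead runs fn only if its dependency's pool has a free slot;
// otherwise it rejects immediately rather than queueing behind a slow dependency.
func withBulkhead(pool chan struct{}, fn func() error) error {
	select {
	case pool <- struct{}{}:
		defer func() { <-pool }()
		return fn()
	default:
		return ErrNoCapacity
	}
}

func callB() error { return nil }                               // B is healthy
func callC() error { time.Sleep(30 * time.Second); return nil } // C is slow

func main() {
	for i := 0; i < 25; i++ {
		go withBulkhead(poolC, callC) // saturates poolC; the extras are rejected
	}
	time.Sleep(100 * time.Millisecond)
	fmt.Println(withBulkhead(poolB, callB)) // <nil>: B's pool is unaffected
}
```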
Fallbacks
Design for the case where a dependency is unavailable. Return a cached result, a default value, or a degraded response. A recommendation engine that cannot reach its model server can return trending items instead of personalised ones. The user sees a worse experience. The user does not see an error.
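A sketch of the recommendation example, with illustrative function names: the fallback turns a dependency failure into a degraded response rather than an error.

```go
package main

import (
	"errors"
	"fmt"
)

// trending is a cheap, precomputed fallback list, refreshed out of band.
var trending = []string{"item-1", "item-2", "item-3"}

// personalized stands in for the call to the model server.
func personalized(userID string) ([]string, error) {
	return nil, errors.New("model server unreachable")
}

// recommendations degrades instead of failing: worse results, no error page.
func recommendations(userID string) []string {
	items, err := personalized(userID)
	if err != nil {
		// Log and fall back; the user never sees the dependency failure.
		fmt.Println("falling back to trending:", err)
		return trending
	}
	return items
}

func main() {
	fmt.Println(recommendations("user-42"))
}
```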
The Dependency Map as a First-Class Artifact
The dependency chain is only useful for diagnosis if it is known before the incident. A system whose dependencies are not documented or not monitored will produce outages where the diagnosis takes longer than the recovery.
Every service should have an explicit dependency map: what it calls, what it accepts calls from, and what the failure mode of each dependency is. This map is the starting point for both runbooks and circuit breaker configuration. Minimum useful content: service name, dependencies, expected latency, failure behaviour, and who owns it.
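One possible shape for such an entry, sketched as a Go struct with invented services and values (in practice this often lives in YAML or a service catalogue rather than code):

```go
package main

import (
	"fmt"
	"time"
)

// Dependency records what a service calls and what it does when that call fails.
type Dependency struct {
	Name            string
	ExpectedLatency time.Duration
	FailureMode     string // behaviour when this dependency is unavailable
}

// ServiceEntry is one row of the dependency map.
type ServiceEntry struct {
	Name         string
	Owner        string
	Dependencies []Dependency
}

// Illustrative entry for a hypothetical checkout service.
var checkoutService = ServiceEntry{
	Name:  "checkout",
	Owner: "payments-team",
	Dependencies: []Dependency{
		{"inventory-api", 20 * time.Millisecond, "fail the order with a retryable error"},
		{"pricing-cache", 2 * time.Millisecond, "fall back to pricing-db"},
		{"pricing-db", 15 * time.Millisecond, "circuit-break, serve last known price"},
	},
}

func main() {
	fmt.Printf("%+v\n", checkoutService)
}
```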
How This Connects to the Series
Dependency chains appear in Book 3 (distributed system design), Book 4 (system design interviews), and the Reference Book (Chapter 4, The Nine Frameworks). They connect to FM2 (Cascading Failures) and FM3 (Unbounded Resource Consumption).
The dependency chain does not just explain how systems fail. It explains the order in which they fail — and that order is the diagnosis.