Circuit Breakers

Introduction

On August 14, 2003, a software bug in Ohio caused a power company’s alarm system to fail. Without alarms, operators did not notice when several high-voltage lines overloaded and tripped offline. Normal power grids have circuit breakers that isolate failures — when one line trips, it is disconnected from the rest of the grid so the failure cannot propagate. But the manual interventions that should have followed the initial failure did not happen, because no alarm sounded. Within two hours, a cascade of failures swept from Ohio through eight US states and Canada, leaving 55 million people without power for up to two days. The failure mode was not the initial overload. It was the absence of automatic isolation.

Software circuit breakers borrow the electrical concept directly: when a downstream dependency begins failing, the circuit breaker trips open and stops sending requests to it. Requests are rejected immediately rather than waiting for a timeout. The downstream service gets time to recover without being further overwhelmed by traffic it cannot handle. The upstream service returns errors quickly rather than exhausting its thread pool waiting for connections that will never succeed.

The mechanism sounds simple. Its value lies in what it prevents: the cascading failure where one slow dependency makes every service that depends on it slow, which makes every service that depends on those slow, which collapses the entire system.


Thread Activation

This chapter activates T7 (State Machines) and T11 (Feedback).

T7 began in Book 1, Chapter 1, where logical statements established the binary values (True/False) from which all computational state is built. In Book 1, Chapter 29, state machines formalised how systems transition between discrete states based on inputs. A circuit breaker is exactly a state machine: three states (CLOSED, OPEN, HALF-OPEN), transition rules based on failure counts and timeouts, and actions associated with each state (pass requests through, reject requests, probe the dependency).

T11 (Feedback) continues from Chapter 5 (rate limiting as a feedback loop) and Chapter 13 (queues as flow control). The circuit breaker is another instance of T11: it measures a signal (failure rate), applies a threshold, and changes behaviour when the threshold is exceeded — a feedback loop that adapts the system’s behaviour to prevent further damage. The pattern recurs throughout the reliability landscape because feedback loops are the fundamental mechanism for maintaining stable behaviour in the presence of disturbances.


The Concept

A circuit breaker is a stateful proxy that sits between a caller and a dependency. It monitors calls to the dependency and transitions between three states based on the failures it observes:

State      | Behaviour                                          | Transition out
CLOSED     | Requests pass through normally                     | N consecutive failures → OPEN
OPEN       | All requests are rejected immediately (fast-fail)  | After timeout period → HALF-OPEN
HALF-OPEN  | One probe request passes through                   | Success → CLOSED; Failure → OPEN

The circuit breaker does not fix the failing dependency. It protects the caller from being blocked by the dependency’s failure. When the dependency recovers, the circuit breaker detects recovery via the HALF-OPEN probe and restores normal operation.
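
The three states and their transitions can be written down as a lookup table. The following is a minimal Python sketch, not a full breaker: the event names (`failure_threshold_reached`, `reset_timeout_elapsed`, `probe_succeeded`, `probe_failed`) are hypothetical labels chosen here for the triggers listed above, and any (state, event) pair that is not listed is assumed to leave the state unchanged.

```python
from enum import Enum


class State(Enum):
    CLOSED = "CLOSED"
    OPEN = "OPEN"
    HALF_OPEN = "HALF_OPEN"


# Event names are illustrative assumptions, not part of any library.
TRANSITIONS = {
    (State.CLOSED, "failure_threshold_reached"): State.OPEN,
    (State.OPEN, "reset_timeout_elapsed"): State.HALF_OPEN,
    (State.HALF_OPEN, "probe_succeeded"): State.CLOSED,
    (State.HALF_OPEN, "probe_failed"): State.OPEN,
}


def next_state(state: State, event: str) -> State:
    """Return the state after `event`; unlisted pairs leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)
```

Keeping the transitions in one table makes it easy to check that there is no direct path from OPEN to CLOSED: recovery always passes through a HALF-OPEN probe.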


How It Works

State Machine Implementation

struct CircuitBreaker:
    name: string                  // dependency name, used in logs and metrics
    state: "CLOSED" | "OPEN" | "HALF_OPEN"
    failure_count: int
    failure_threshold: int        // N consecutive failures → trip to OPEN
    success_count: int
    success_threshold: int        // M successes in HALF_OPEN → CLOSED
    open_at: timestamp            // when did circuit open?
    reset_timeout: duration       // how long before trying HALF_OPEN?

function call_through(cb, dependency_fn):
    if cb.state == "OPEN":
        // Check if it's time to try HALF_OPEN
        if current_time() - cb.open_at >= cb.reset_timeout:
            transition_to_half_open(cb)
        else:
            raise CircuitOpenError("dependency unavailable — fast-fail")

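    if cb.state == "HALF_OPEN":
        // Probe the dependency. As written, every call that arrives while
        // HALF_OPEN is let through; limit this to one in-flight probe
        // if the dependency must not see a burst during recovery.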
        try:
            result = dependency_fn()
            on_half_open_success(cb)
            return result
        except Exception as e:
            on_half_open_failure(cb)
            raise

    // State: CLOSED — pass through normally
    try:
        result = dependency_fn()
        on_success(cb)
        return result
    except Exception as e:
        on_failure(cb)
        raise

function on_failure(cb):
    cb.failure_count += 1
    if cb.failure_count >= cb.failure_threshold:
        transition_to_open(cb)

function on_success(cb):
    cb.failure_count = 0   // reset on success in CLOSED state

function transition_to_open(cb):
    cb.state = "OPEN"
    cb.open_at = current_time()
    log("Circuit breaker OPENED for dependency: " + cb.name)
    emit_metric("circuit_breaker.opened", cb.name)

function transition_to_half_open(cb):
    cb.state = "HALF_OPEN"
    cb.success_count = 0
    log("Circuit breaker transitioning to HALF_OPEN: " + cb.name)
    emit_metric("circuit_breaker.half_open", cb.name)

function on_half_open_success(cb):
    cb.success_count += 1
    if cb.success_count >= cb.success_threshold:
        cb.state = "CLOSED"
        cb.failure_count = 0
        log("Circuit breaker CLOSED: dependency recovered")
        emit_metric("circuit_breaker.closed", cb.name)

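function on_half_open_failure(cb):
    // Probe failed: reopen and restart the reset timer
    cb.state = "OPEN"
    cb.open_at = current_time()
    log("Circuit breaker returned to OPEN: probe failed: " + cb.name)
    emit_metric("circuit_breaker.opened", cb.name)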
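
The pseudocode above can be condensed into a small runnable class. This is a sketch under stated assumptions rather than a production implementation: it is single-threaded (no locking), it omits logging and metrics, and the injectable `clock` parameter is an addition made here so that the reset timeout can be exercised without sleeping.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is OPEN."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, success_threshold: int = 1,
                 reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.state = "CLOSED"
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.reset_timeout = reset_timeout  # seconds
        self.clock = clock                  # assumption: injectable for testing
        self.failure_count = 0
        self.success_count = 0
        self.open_at = 0.0

    def call(self, fn: Callable[[], T]) -> T:
        if self.state == "OPEN":
            if self.clock() - self.open_at < self.reset_timeout:
                raise CircuitOpenError("dependency unavailable: fast-fail")
            # Reset timeout elapsed: let this call through as a probe.
            self.state = "HALF_OPEN"
            self.success_count = 0
        try:
            result = fn()
        except Exception:
            self._record_failure()
            raise
        self._record_success()
        return result

    def _record_failure(self) -> None:
        if self.state == "HALF_OPEN":
            self._open()  # a failed probe reopens immediately
            return
        self.failure_count += 1
        if self.failure_count >= self.failure_threshold:
            self._open()

    def _record_success(self) -> None:
        if self.state == "HALF_OPEN":
            self.success_count += 1
            if self.success_count >= self.success_threshold:
                self.state = "CLOSED"
                self.failure_count = 0
        else:
            self.failure_count = 0  # a success in CLOSED resets the streak

    def _open(self) -> None:
        self.state = "OPEN"
        self.open_at = self.clock()
```

A caller wraps each remote call as `breaker.call(lambda: client.get(...))` and treats `CircuitOpenError` as the signal to fail fast or fall back.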

Sliding Window Failure Rate

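// A raw failure count is naive: 10 failures in 10 requests = 100% failure rate,
// but 10 failures in 10,000 requests = 0.1% failure rate.
// A count threshold does not distinguish between these cases.
// Counting only consecutive failures (as above) avoids that, but a single
// success resets the count, so an intermittently failing dependency can
// take a long time to trip the breaker.

// Better: use a sliding window to measure failure rate, not count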

struct SlidingWindowCB:
    state: "CLOSED" | "OPEN" | "HALF_OPEN"   // same states as CircuitBreaker
    window: RingBuffer[bool]   // true = success, false = failure
    window_size: int           // e.g., 100 most recent calls
    failure_rate_threshold: float   // e.g., 0.5 = 50% failure rate
    minimum_calls: int         // e.g., 5: don't trip on the first few calls of a cold start

function measure_failure_rate(cb):
    if len(cb.window) < cb.minimum_calls:
        return 0.0    // insufficient data
    failures = count(cb.window, where=false)
    return failures / len(cb.window)

function on_call_result(cb, success: bool):
    cb.window.push(success)  // replaces oldest entry if full

    if measure_failure_rate(cb) >= cb.failure_rate_threshold:
        if cb.state == "CLOSED":
            transition_to_open(cb)
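
The failure-rate measurement can be sketched in Python with a bounded `collections.deque` standing in for the ring buffer. The class and method names (`SlidingWindowFailureRate`, `record`, `should_trip`) are chosen here for illustration; the sketch covers only the measurement, not the state transitions.

```python
from collections import deque


class SlidingWindowFailureRate:
    def __init__(self, window_size: int = 100,
                 failure_rate_threshold: float = 0.5,
                 minimum_calls: int = 10):
        # True = success, False = failure; maxlen makes the deque a ring buffer.
        self.window = deque(maxlen=window_size)
        self.failure_rate_threshold = failure_rate_threshold
        self.minimum_calls = minimum_calls

    def record(self, success: bool) -> None:
        self.window.append(success)  # the oldest entry is dropped when full

    def failure_rate(self) -> float:
        if len(self.window) < self.minimum_calls:
            return 0.0  # insufficient data
        return self.window.count(False) / len(self.window)

    def should_trip(self) -> bool:
        return self.failure_rate() >= self.failure_rate_threshold
```

One design point the sketch leaves open is what happens to the window when the circuit closes again. Clearing it at that moment avoids an immediate re-trip caused by failures recorded before the outage ended.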

Circuit Breaker per Dependency

// One circuit breaker per dependency — not one global breaker
// A slow payments service should not affect the recommendations service

struct ServiceClient:
    circuit_breakers: HashMap[dependency_name -> CircuitBreaker]

function call_dependency(client, dependency_name, fn):
    cb = client.circuit_breakers.get_or_create(
        dependency_name,
        CircuitBreaker(name=dependency_name, failure_threshold=5, reset_timeout=30s)
    )
    return call_through(cb, fn)

// Example: payment service failing should not affect user profile reads
function checkout(client, user_id, cart):
    // These calls have independent circuit breakers
    try:
        user_profile = call_dependency(client, "user_service", () -> user_service.get(user_id))
    except CircuitOpenError:
        raise CheckoutError("user service unavailable")

    try:
        payment = call_dependency(client, "payment_service", () -> payment_service.charge(cart.total))
    except CircuitOpenError:
        raise CheckoutError("payment service unavailable")

    // Recommendations failure should not block checkout
    try:
        upsells = call_dependency(client, "recommendation_service",
                                  () -> recommendation_service.get(user_id))
    except Exception:
        // graceful degradation: skip upsells if the circuit is open,
        // or if the call itself fails before the breaker has tripped
        upsells = []

    return (user_profile, payment, upsells)
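
The per-dependency registry can be sketched in Python as follows. To keep the example self-contained it uses a deliberately minimal stand-in breaker (`SimpleBreaker`, which trips after N consecutive failures and has no reset timeout); that class, `breaker_for`, and `call_optional` are hypothetical names introduced here, not part of the pseudocode above.

```python
from typing import Any, Callable, Dict


class CircuitOpenError(Exception):
    """Raised when a breaker rejects a call without trying it."""


class SimpleBreaker:
    """Minimal stand-in: trips after N consecutive failures and stays open."""

    def __init__(self, failure_threshold: int = 5):
        self.failure_threshold = failure_threshold
        self.failures = 0
        self.is_open = False

    def call(self, fn: Callable[[], Any]) -> Any:
        if self.is_open:
            raise CircuitOpenError()
        try:
            result = fn()
        except Exception:
            self.failures += 1
            self.is_open = self.failures >= self.failure_threshold
            raise
        self.failures = 0
        return result


class ServiceClient:
    def __init__(self, failure_threshold: int = 5):
        self._failure_threshold = failure_threshold
        self._breakers: Dict[str, SimpleBreaker] = {}

    def breaker_for(self, dependency: str) -> SimpleBreaker:
        # Create a breaker only the first time a dependency is seen.
        if dependency not in self._breakers:
            self._breakers[dependency] = SimpleBreaker(self._failure_threshold)
        return self._breakers[dependency]

    def call(self, dependency: str, fn: Callable[[], Any]) -> Any:
        return self.breaker_for(dependency).call(fn)

    def call_optional(self, dependency: str, fn: Callable[[], Any],
                      default: Any) -> Any:
        """For optional dependencies: degrade to `default` on any failure."""
        try:
            return self.call(dependency, fn)
        except Exception:
            return default
```

Because each dependency name maps to its own breaker, tripping the recommendations breaker leaves calls to the user service untouched.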

Bulkhead Complement

// Circuit breaker: detects failure and stops sending requests (time-based isolation)
// Bulkhead: limits the resources any one dependency can consume (resource isolation)
// Together: comprehensive protection against dependency failures

// Circuit breaker alone: a slow dependency keeps threads waiting until circuit opens
// During the N failures needed to trip the breaker, threads are held
// Bulkhead limits how many threads are held simultaneously

struct BulkheadedCircuitBreaker:
    circuit_breaker: CircuitBreaker
    semaphore: Semaphore    // limits concurrent calls (e.g., max 10 concurrent)

function call_with_bulkhead_and_cb(bcb, fn):
    cb = bcb.circuit_breaker

    // Fast-fail first (cheap: no slot held), but only while the reset timeout
    // is still running. Once it has elapsed, fall through so that call_through
    // can move the breaker to HALF_OPEN and send a probe.
    if cb.state == "OPEN" and current_time() - cb.open_at < cb.reset_timeout:
        raise CircuitOpenError()

    // Try to acquire a slot in the bulkhead
    if not bcb.semaphore.try_acquire(timeout=10ms):
        raise BulkheadFullError("too many concurrent requests")

    try:
        return call_through(cb, fn)
    finally:
        bcb.semaphore.release()
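
The bulkhead half of this pairing maps directly onto a bounded semaphore. The sketch below uses Python's `threading.BoundedSemaphore` and shows only the concurrency limit; the class name `Bulkhead` and the default values are assumptions made for illustration. In practice the callable passed to `call` would itself go through a circuit breaker.

```python
import threading
from typing import Callable, TypeVar

T = TypeVar("T")


class BulkheadFullError(Exception):
    """Raised when no concurrency slot frees up within the acquire timeout."""


class Bulkhead:
    def __init__(self, max_concurrent: int = 10, acquire_timeout: float = 0.01):
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self._acquire_timeout = acquire_timeout  # seconds

    def call(self, fn: Callable[[], T]) -> T:
        # Wait briefly for a slot; reject rather than queue indefinitely.
        if not self._slots.acquire(timeout=self._acquire_timeout):
            raise BulkheadFullError("too many concurrent requests")
        try:
            return fn()
        finally:
            self._slots.release()  # always free the slot, even on error
```

The short acquire timeout is the important choice: a caller that cannot get a slot is rejected within milliseconds instead of joining the set of threads already stuck on the slow dependency.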

Tradeoffs

AT7 — Automation/Control: A circuit breaker automates the detection and isolation of dependency failures. Without it, engineers must manually identify failing dependencies, manually configure load balancers to route around them, and manually restore traffic after recovery. The circuit breaker does all three automatically: it detects failure by counting errors, isolates by rejecting requests without manual intervention, and restores by probing the dependency and closing when recovery is confirmed. The control cost is configuration complexity: the failure threshold, reset timeout, and HALF-OPEN probe count all require tuning to the specific characteristics of the dependency. A threshold that is too low trips on transient failures (network blips, slow requests during garbage collection pauses). A threshold that is too high allows too many requests to reach a failing dependency before tripping. Correct calibration requires production traffic data and ongoing adjustment as dependencies’ failure patterns change.
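
To get a feel for the "too low" side of this calibration, the chance of a false trip can be estimated with a small calculation. The sketch below computes the probability that at least one run of `n` consecutive failures occurs within a given number of calls, assuming each call fails independently with probability `p`. That independence assumption is a simplification made here: real transient failures tend to arrive in bursts, so treat the result as a rough lower bound on false trips, not a prediction.

```python
def prob_false_trip(p: float, n: int, calls: int) -> float:
    """Probability of at least one run of n consecutive failures in `calls`
    independent calls, each failing with probability p."""
    # run[k] = probability that no trip has happened yet and the current
    # streak of consecutive failures has length k (0 <= k < n).
    run = [0.0] * n
    run[0] = 1.0
    tripped = 0.0
    for _ in range(calls):
        nxt = [0.0] * n
        for k, mass in enumerate(run):
            nxt[0] += mass * (1.0 - p)   # a success resets the streak
            if k + 1 == n:
                tripped += mass * p      # the nth consecutive failure trips
            else:
                nxt[k + 1] += mass * p   # the streak grows by one
        run = nxt
    return tripped
```

Comparing, say, `prob_false_trip(0.01, 2, 1000)` with `prob_false_trip(0.01, 5, 1000)` shows how quickly a higher threshold suppresses false trips at a healthy baseline error rate, which is the quantity to weigh against the extra failing requests a higher threshold lets through during a real outage.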

AT1 — Consistency/Availability: When a circuit breaker is OPEN, all requests to the protected dependency return an error immediately. From the caller’s perspective, the dependency is unavailable — not slow, but instantly failing. This is an improvement over waiting for timeouts (better for the caller’s latency), but it means the caller must handle an increased error rate during the open window. If the caller cannot degrade gracefully — if it requires a successful response from the dependency to serve its own request — the open circuit breaker surfaces the downstream failure to the end user rather than hiding it. The circuit breaker trades partial failure (some requests to the dependency fail) for complete isolation (all requests to the dependency fail while the circuit is open). This is correct for system stability but requires callers to implement graceful degradation (Chapter 21) to avoid propagating the error to users.


Where It Fails

FM2 — Cascading Failures Prevented by Circuit Breaker: A circuit breaker directly addresses FM2. Without circuit breakers, a slow or failing dependency causes callers to wait for responses that never arrive. Caller threads are held while waiting. Each held thread represents a slot in the caller’s thread pool. When all thread pool slots are held by waiting requests, the caller cannot process new requests — it effectively becomes as unavailable as the dependency. Callers of the caller experience the same problem. The failure cascades upward through the dependency tree. With circuit breakers at each service boundary, a failing dependency trips its circuit breaker after N failures, causing subsequent requests to be rejected immediately rather than waiting. Thread pools are not exhausted. Callers of callers continue to function normally (with degraded or missing features from the unavailable dependency, but not complete failure).

FM11 — Observability Blindness When Circuit Opens Silently: A circuit breaker that opens without alerting can be worse than no circuit breaker, because it turns a loud failure into a quiet one. From the user’s perspective, a feature stops working. From the engineer’s perspective, the upstream service shows no timeouts and no latency spike (it is returning fast errors, not slow ones), and no alert fires. The root cause, a failing downstream dependency, stays invisible unless the circuit breaker state is exported as a metric and monitored. Every state transition (to OPEN, HALF-OPEN, or CLOSED) must be emitted as a metric, and every transition to OPEN must trigger an alert. The current state of each breaker, keyed by dependency name, must be visible in the service’s health dashboard. A circuit breaker that opens without notification is an observability failure.
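
One way to wire this up is a small reporter that every breaker notifies on each state change. The sketch below is illustrative only: `BreakerStateReporter`, `on_transition`, and `health` are hypothetical names, the transition list stands in for a real metrics client, and the `alert` callback stands in for whatever paging or alerting system is in use.

```python
from typing import Callable, Dict, List, Tuple

Transition = Tuple[str, str, str]  # (dependency, old_state, new_state)


class BreakerStateReporter:
    def __init__(self, alert: Callable[[str], None]):
        self.alert = alert                         # stand-in for a pager/alert hook
        self.current: Dict[str, str] = {}          # dependency -> current state
        self.transitions: List[Transition] = []    # stand-in for emitted metrics

    def on_transition(self, dependency: str, old_state: str, new_state: str) -> None:
        # Record every transition, not only the trip to OPEN.
        self.transitions.append((dependency, old_state, new_state))
        self.current[dependency] = new_state
        if new_state == "OPEN":
            self.alert(f"circuit breaker OPEN for {dependency}")

    def health(self) -> Dict[str, str]:
        """Snapshot for a health endpoint: breaker state per dependency."""
        return dict(self.current)
```

With this in place, the health endpoint can answer the first question an on-call engineer will ask: which dependency's breaker is open right now.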


Real Systems

Netflix Hystrix was the circuit breaker library that popularised the pattern in microservices architectures. Open-sourced by Netflix in 2012, Hystrix wrapped remote calls in a circuit breaker with configurable failure thresholds, fallback functions, and a real-time dashboard (the Hystrix Dashboard) that showed circuit state across services. Hystrix is now in maintenance mode, and its maintainers recommend resilience4j for new projects, but its design influenced many later circuit breaker implementations.

resilience4j is the modern Java circuit breaker library. It implements sliding-window failure rate (count-based or time-based windows), configurable slow-call rate thresholds (not just errors — slow responses also trip the circuit), and a Bulkhead implementation using either semaphores or a separate thread pool. resilience4j integrates with Spring Boot Actuator to expose circuit state as health endpoints and Micrometer metrics.

Polly is the .NET equivalent: a resilience and transient-fault-handling library that provides circuit breakers, retry policies, bulkheads, and fallbacks as composable policy objects. Polly policies are stacked: a call can be wrapped in a retry policy (retry on transient errors) inside a circuit breaker (stop retrying if failure rate exceeds threshold) inside a bulkhead (limit concurrency). The composition of policies enables fine-grained control over the failure behaviour of each remote call.


Concept: Circuit Breakers

Thread: T7 (State Machines) ← Book 1, Ch 29 (State Machine Transitions) → Book 5, Ch 11 (Resilience Patterns)

Core Idea: A circuit breaker is a 3-state machine that fast-fails requests to a failing dependency, preventing thread pool exhaustion and cascading failures while allowing automatic recovery detection.

Tradeoff: AT7 — Automation/Control (circuit breaker automates isolation and recovery detection; threshold miscalibration causes false trips on transient failures or insufficient protection against slow dependencies)

Failure Mode: FM11 — Observability Blindness (a circuit breaker that opens without alerting hides the downstream failure — features stop working silently with no visible error in the upstream service)

Signal: When a service’s error rate rises while its P99 latency simultaneously drops (errors are being returned faster than before), a circuit breaker guarding one of its dependencies has probably opened: fast-failing is replacing slow timeouts.

Maps to: Book 0, Framework 2 (Failure Modes) and Framework 7 (Automation/Control)


Exercises

Level 2 — Apply

A user-facing API service calls three downstream services: AuthService (authentication), ProductService (product data), and RecommendationService (personalised suggestions). All three calls are made on every page load. The circuit breaker configuration is: failure threshold=5, reset timeout=30 seconds.

  1. AuthService begins returning errors for 30% of requests. After 5 consecutive failures, the circuit opens. Draw the state transition diagram and label the transitions with the triggering conditions. What does the API service return to users during the OPEN window?

  2. With the circuit open, the HALF-OPEN probe fires after 30 seconds. The probe succeeds. The circuit closes. But AuthService starts failing again 10 seconds later. How does the circuit breaker detect the second failure wave? What is the minimum time the circuit is open during the second failure wave?

  3. The RecommendationService is optional — the page can load without personalised recommendations (just show defaults). Modify the checkout pseudocode example to implement graceful degradation when the RecommendationService circuit is open. What does the fallback return?

  4. All three circuit breakers are OPEN simultaneously (a network partition affecting the data center). The API service returns errors for all requests. How does the health endpoint help operations engineers understand what is happening? What specific metrics and state information should the health endpoint expose?

Level 3 — Design

Design the circuit breaker configuration for a payment processing service that calls: (1) FraudDetectionService — P99 latency 200ms, expected failure rate < 0.1%; (2) PaymentGateway — P99 latency 500ms, expected failure rate < 0.5%; (3) NotificationService — P99 latency 100ms, expected failure rate < 1%; (4) AuditService — P99 latency 50ms, expected failure rate < 0.1%.

  1. For each dependency, specify the circuit breaker configuration: failure threshold, failure rate threshold, slow-call threshold, reset timeout, and whether the call is required or optional. Justify each parameter choice based on the dependency’s characteristics.

  2. The PaymentGateway begins returning HTTP 503 errors during a maintenance window. The circuit breaker trips open. During the open window, incoming payment requests must not be silently dropped — they must be queued for retry when the circuit closes. Design the queue-based fallback mechanism. Specify: queue storage, retry trigger (how does the circuit closing trigger queue processing?), idempotency handling, and maximum queue depth before overflow.

  3. A “slow circuit breaker” variant opens when a dependency’s P99 latency exceeds a threshold rather than when errors exceed a threshold. The PaymentGateway is returning 200 OK but with 3-second latency (6× normal). Describe why a standard error-based circuit breaker would not trip in this scenario. Specify the slow-call configuration that would detect this situation and trip the breaker.

  4. The AuditService is mandatory for regulatory compliance — every payment must generate an audit record. Unlike other dependencies, the AuditService circuit breaker cannot use graceful degradation. Design the failure behaviour when AuditService is unavailable. Should the circuit breaker still be used? If so, what is the fallback? If payments are blocked until AuditService recovers, how do you prevent the payment service from accumulating a backlog of waiting requests?

A complete answer will: (1) specify distinct circuit breaker configurations for each dependency with each parameter justified, with FraudDetectionService and AuditService configured as required and NotificationService as optional with a graceful degradation fallback, (2) identify FM2 (cascading failure: the PaymentGateway queue overflows and backs up into the payment service thread pool) as the failure mode the queue depth limit in part 2 must prevent, (3) address the AT3 tradeoff between the slow-call circuit breaker (catches latency degradation before it causes P99 SLA violations) and the standard error-based breaker (misses slow-but-successful responses), and (4) design the AuditService failure mode (block new payments and write to a local transactional outbox for guaranteed delivery when AuditService recovers) with a backlog prevention mechanism (reject incoming payments with HTTP 503 when the outbox queue reaches a limit).