The Computing Series

Exercises

Level 2 — Apply

A user-facing API service calls three downstream services: AuthService (authentication), ProductService (product data), and RecommendationService (personalised suggestions). All three calls are required on every page load. Each call is wrapped in its own circuit breaker with the same configuration: failure threshold = 5 consecutive failures, reset timeout = 30 seconds.

  1. AuthService begins returning errors for 30% of requests. After 5 consecutive failures, the circuit opens. Draw the state transition diagram and label the transitions with the triggering conditions. What does the API service return to users during the OPEN window?

  2. With the circuit open, the HALF-OPEN probe fires after 30 seconds. The probe succeeds and the circuit closes, but AuthService starts failing again 10 seconds later. How does the circuit breaker detect the second failure wave? What is the minimum time the circuit is open during the second failure wave?

  3. Suppose the RecommendationService is optional: the page can load without personalised recommendations and show defaults instead. Modify the checkout pseudocode example to implement graceful degradation when the RecommendationService circuit is open. What does the fallback return?

  4. All three circuit breakers are OPEN simultaneously (a network partition affecting the data center). The API service returns errors for all requests. How does the health endpoint help operations engineers understand what is happening? What specific metrics and state information should the health endpoint expose?
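
As a starting point for the exercises above, the following is a minimal single-threaded sketch of the three-state breaker with the stated configuration (5 consecutive failures, 30-second reset timeout). It is not the book's checkout pseudocode; the names `CircuitBreaker`, `CircuitOpenError`, the `fallback` parameter, and the injectable `clock` are all assumptions made for illustration.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "CLOSED"
    OPEN = "OPEN"
    HALF_OPEN = "HALF_OPEN"


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    # Hypothetical sketch: not thread-safe, counts consecutive failures only.
    def __init__(self, failure_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock  # injectable so timing can be simulated in tests
        self.state = State.CLOSED
        self.consecutive_failures = 0
        self.opened_at = None

    def call(self, fn, fallback=None):
        if self.state is State.OPEN:
            if self.clock() - self.opened_at >= self.reset_timeout:
                self.state = State.HALF_OPEN  # let this call through as the probe
            elif fallback is not None:
                return fallback()  # optional dependency: degrade gracefully
            else:
                raise CircuitOpenError("circuit is open")
        try:
            result = fn()
        except Exception:
            self._on_failure()
            if fallback is not None:
                return fallback()
            raise
        self._on_success()
        return result

    def _on_success(self):
        # Any success resets the failure count and closes the circuit.
        self.consecutive_failures = 0
        self.state = State.CLOSED

    def _on_failure(self):
        self.consecutive_failures += 1
        # A failed probe reopens immediately; otherwise wait for the threshold.
        if self.state is State.HALF_OPEN or self.consecutive_failures >= self.failure_threshold:
            self.state = State.OPEN
            self.opened_at = self.clock()
```

A required dependency would call `breaker.call(fn)` and surface `CircuitOpenError` to the caller, while an optional one could pass something like `fallback=lambda: DEFAULT_RECOMMENDATIONS` (a hypothetical constant). Fields such as `state`, `consecutive_failures`, and `opened_at` are the kind of information a health endpoint could report.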

Level 3 — Design

Design the circuit breaker configuration for a payment processing service that calls: (1) FraudDetectionService — P99 latency 200ms, expected failure rate < 0.1%; (2) PaymentGateway — P99 latency 500ms, expected failure rate < 0.5%; (3) NotificationService — P99 latency 100ms, expected failure rate < 1%; (4) AuditService — P99 latency 50ms, expected failure rate < 0.1%.

  1. For each dependency, specify the circuit breaker configuration: failure threshold, failure rate threshold, slow-call threshold, reset timeout, and whether the call is required or optional. Justify each parameter choice based on the dependency’s characteristics.

  2. The PaymentGateway begins returning HTTP 503 errors during a maintenance window. The circuit breaker trips open. During the open window, incoming payment requests must not be silently dropped — they must be queued for retry when the circuit closes. Design the queue-based fallback mechanism. Specify: queue storage, retry trigger (how does the circuit closing trigger queue processing?), idempotency handling, and maximum queue depth before overflow.

  3. A “slow circuit breaker” variant opens when a dependency’s P99 latency exceeds a threshold rather than when errors exceed a threshold. The PaymentGateway is returning 200 OK but with 3-second latency (6× normal). Describe why a standard error-based circuit breaker would not trip in this scenario. Specify the slow-call configuration that would detect this situation and trip the breaker.

  4. The AuditService is mandatory for regulatory compliance — every payment must generate an audit record. Unlike other dependencies, the AuditService circuit breaker cannot use graceful degradation. Design the failure behaviour when AuditService is unavailable. Should the circuit breaker still be used? If so, what is the fallback? If payments are blocked until AuditService recovers, how do you prevent the payment service from accumulating a backlog of waiting requests?
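
To make the difference between error-based and slow-call tripping concrete, here is a small sketch of a count-based sliding window that tracks both rates. The class names and every threshold value are illustrative assumptions, not recommended settings and not the API of any particular library.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class BreakerConfig:
    # All values are assumptions chosen for illustration only.
    window_size: int = 10                    # number of recent calls considered
    failure_rate_threshold: float = 0.5      # trip when >= 50% of calls failed
    slow_call_duration_s: float = 1.0        # calls slower than this count as slow
    slow_call_rate_threshold: float = 0.5    # trip when >= 50% of calls were slow


class CallWindow:
    def __init__(self, config: BreakerConfig):
        self.config = config
        self._calls = deque(maxlen=config.window_size)  # (duration_s, succeeded)

    def record(self, duration_s: float, succeeded: bool) -> None:
        self._calls.append((duration_s, succeeded))

    def failure_rate(self) -> float:
        if not self._calls:
            return 0.0
        return sum(1 for _, ok in self._calls if not ok) / len(self._calls)

    def slow_call_rate(self) -> float:
        if not self._calls:
            return 0.0
        slow = sum(1 for d, _ in self._calls if d > self.config.slow_call_duration_s)
        return slow / len(self._calls)

    def should_trip(self, include_slow_calls: bool = True) -> bool:
        # Wait for a full window so a single early call cannot trip the breaker.
        if len(self._calls) < self.config.window_size:
            return False
        if self.failure_rate() >= self.config.failure_rate_threshold:
            return True
        return include_slow_calls and (
            self.slow_call_rate() >= self.config.slow_call_rate_threshold
        )


# Ten calls that all return 200 OK but take 3 seconds each.
window = CallWindow(BreakerConfig())
for _ in range(10):
    window.record(duration_s=3.0, succeeded=True)
error_only_trips = window.should_trip(include_slow_calls=False)
slow_aware_trips = window.should_trip()
```

In this sketch `error_only_trips` is false because no call failed, while `slow_aware_trips` is true because every call in the window exceeded the assumed 1-second slow-call duration.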

A complete answer will: (1) specify distinct circuit breaker configurations for each dependency, with each parameter justified, configuring FraudDetectionService and AuditService as required and NotificationService as optional with a graceful degradation fallback; (2) identify FM2 (cascading failure: the PaymentGateway queue overflows and backs up into the payment service thread pool) as the failure mode the queue depth limit in part 2 must prevent; (3) address the AT3 tradeoff between the slow-call circuit breaker (catches latency degradation before it causes P99 SLA violations) and the standard error-based breaker (misses slow-but-successful responses); and (4) design the AuditService failure mode (block new payments and write to a local transactional outbox for guaranteed delivery when AuditService recovers) with a backlog prevention mechanism (reject incoming payments with HTTP 503 when the outbox queue reaches a limit).
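
The backlog prevention mechanism mentioned above can be sketched as a depth-limited queue that rejects new work once it is full. This is a rough illustration only: an in-memory `deque` stands in for a durable transactional outbox, and `BoundedOutbox`, `handle_payment`, and the 202 status returned for accepted requests are assumptions, not part of the exercise.

```python
from collections import deque


class BoundedOutbox:
    # In-memory stand-in; a real outbox would be written in the same
    # database transaction as the payment record so it survives restarts.
    def __init__(self, max_depth):
        self.max_depth = max_depth
        self._records = deque()

    def try_enqueue(self, record):
        """Return False instead of growing past max_depth."""
        if len(self._records) >= self.max_depth:
            return False
        self._records.append(record)
        return True

    def drain(self, deliver):
        """Deliver queued records in order; a record is removed only after
        deliver() returns, so a failed delivery leaves it queued."""
        delivered = 0
        while self._records:
            deliver(self._records[0])
            self._records.popleft()
            delivered += 1
        return delivered


def handle_payment(outbox, payment_id):
    # Hypothetical handler used while the downstream circuit is open.
    if not outbox.try_enqueue({"payment_id": payment_id}):
        return 503  # shed load rather than let the backlog grow
    return 202      # assumed status for "accepted, delivery pending"
```

Calling `drain` when the circuit closes is one possible retry trigger. Because delivery can be retried after a failure, the receiving service would need to treat `payment_id` as an idempotency key.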
