The Computing Series

Introduction

On August 14, 2003, a software bug in Ohio caused a power company’s alarm system to fail. Without alarms, operators did not notice when several high-voltage lines overloaded and tripped offline. Power grids have circuit breakers that isolate failures: when one line trips, it is disconnected from the rest of the grid so the failure cannot propagate. But disconnecting a line shifts its load onto neighboring lines, and the manual interventions that should have relieved them did not happen, because no alarm sounded. Within two hours, a cascade of failures swept from Ohio through eight US states and into Canada, leaving an estimated 55 million people without power, many for up to two days. The failure mode was not the initial overload. It was the absence of automatic isolation.

Software circuit breakers borrow the electrical concept directly: when a downstream dependency begins failing, the circuit breaker trips open and stops sending requests to it. Requests are rejected immediately rather than waiting for a timeout. The downstream service gets time to recover without being further overwhelmed by traffic it cannot handle. The upstream service returns errors quickly rather than exhausting its thread pool waiting for connections that will never succeed.
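The behavior described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the names `CircuitBreaker`, `CircuitOpenError`, `failure_threshold`, and `recovery_timeout` are hypothetical, and the choice to let a single trial request through once the timeout elapses (often called a half-open state) is an assumption that goes beyond the description here.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    """Illustrative breaker; not thread-safe, all names are hypothetical."""

    def __init__(self, failure_threshold=3, recovery_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before tripping
        self.recovery_timeout = recovery_timeout    # seconds to stay open
        self.clock = clock                          # injectable so tests can control time
        self.failures = 0
        self.opened_at = None                       # None means the breaker is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.recovery_timeout:
                # Open: reject immediately instead of waiting on a timeout.
                raise CircuitOpenError("dependency unavailable; request rejected")
            # Timeout elapsed: fall through and allow one trial call (assumption).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                # Trip the breaker, or re-trip it after a failed trial call.
                self.opened_at = self.clock()
            raise
        # Success: close the breaker and reset the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each request to the dependency, for example `breaker.call(fetch_profile, user_id)` where `fetch_profile` is a placeholder for any function that contacts the downstream service, and treat `CircuitOpenError` as a fast failure that it can answer with an error or a fallback.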

The mechanism sounds simple. Its value lies in what it prevents: the cascading failure in which one slow dependency makes every service that depends on it slow, which in turn slows every service that depends on those, until the entire system collapses.

