Failure Modes in This System

FM4 — Data Consistency Failure: Training-serving skew is the most insidious failure in ML systems. If the features used during training are computed differently from those used during serving (different bucketing, different normalisation, different time windows), the model receives inputs in production that do not match what it was trained on, and its production accuracy falls short of its validation performance. Features must therefore be computed identically in training and serving, which is why the feature store is the central component.
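One way to picture the "computed identically" requirement is a single feature function that both the training pipeline and the serving path call, plus a check that compares the two outputs for the same entity. The sketch below is illustrative only: `compute_features`, `skew_report`, the bucket edges, and the 7-day window are hypothetical names and values, not taken from the system described here.

```python
import math
from datetime import datetime, timedelta

# Assumed values for illustration; a real system would define these once
# in the feature store, not in application code.
PRICE_BUCKET_EDGES = [10.0, 50.0, 200.0]
CLICK_WINDOW = timedelta(days=7)


def compute_features(price, click_times, now):
    """Single definition of the feature logic, called by training and serving."""
    # Bucket index = number of edges the price has reached.
    bucket = sum(price >= edge for edge in PRICE_BUCKET_EDGES)
    # Count only clicks that fall inside the shared look-back window.
    recent = sum(1 for t in click_times if now - CLICK_WINDOW <= t <= now)
    return {"price_bucket": bucket, "log_recent_clicks": math.log1p(recent)}


def skew_report(training_row, serving_row, tol=1e-9):
    """Return the names of features that are missing or differ at serving time."""
    skewed = []
    for name, train_value in training_row.items():
        serve_value = serving_row.get(name)
        if serve_value is None or abs(train_value - serve_value) > tol:
            skewed.append(name)
    return sorted(skewed)
```

Running `skew_report` on a sample of logged serving requests against the features recomputed offline is one inexpensive way to surface a drifted time window or bucket boundary before it shows up as degraded predictions.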

FM11 — Observability Blindness: Recommendation quality degrades without raising errors. If the feedback loop collapses — no new interaction data is processed, model weights stop updating — recommendations become stale and engagement drops. The signal is falling engagement metrics, not system errors. Click-through rate, session length, and explicit feedback are the observability instruments.
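Because this failure raises no errors, detection has to come from the engagement and freshness signals themselves. The following is a minimal sketch of such a check, assuming a 20% relative click-through drop and a 24-hour staleness limit as alert thresholds; `EngagementSnapshot`, `engagement_alerts`, and both thresholds are hypothetical and would need tuning against real traffic.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class EngagementSnapshot:
    impressions: int
    clicks: int
    last_model_update: datetime
    last_interaction_ingested: datetime


def engagement_alerts(snapshot, baseline_ctr, now,
                      max_ctr_drop=0.2, max_staleness=timedelta(hours=24)):
    """Return alert names derived from engagement and freshness, not error logs."""
    alerts = []
    ctr = snapshot.clicks / snapshot.impressions if snapshot.impressions else 0.0
    # Relative drop against a baseline CTR measured during a healthy period.
    if ctr < baseline_ctr * (1 - max_ctr_drop):
        alerts.append("ctr_drop")
    # Model weights have not been refreshed recently.
    if now - snapshot.last_model_update > max_staleness:
        alerts.append("model_stale")
    # No new interaction data has reached the pipeline recently.
    if now - snapshot.last_interaction_ingested > max_staleness:
        alerts.append("feedback_stalled")
    return alerts
```

Session length and explicit feedback could be added as further conditions in the same style. The freshness checks are included because they tend to fire earlier than the engagement drop they eventually cause.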

FM3 — Unbounded Resource Consumption: The candidate generation index grows as the catalogue grows. If the embeddings of deleted or unlisted items are never removed, the ANN index's memory footprint grows without bound, because entries accumulate for items that can no longer be recommended. Explicit lifecycle management (removing an item from the index when it is unlisted) is required.
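The lifecycle rule can be sketched with a toy in-memory index. This is not a real ANN implementation: `CandidateIndex` is a hypothetical class, and its `search` is a brute-force dot product standing in for the approximate lookup. The point is the pairing of an event-driven `remove` with a periodic `reconcile` against the live catalogue, which catches any unlisting events that were missed.

```python
class CandidateIndex:
    """Toy stand-in for an ANN index, keyed by item id."""

    def __init__(self):
        self._vectors = {}

    def upsert(self, item_id, embedding):
        self._vectors[item_id] = list(embedding)

    def remove(self, item_id):
        # Called when an item is deleted or unlisted; a no-op if already absent.
        self._vectors.pop(item_id, None)

    def reconcile(self, live_item_ids):
        """Drop every entry not in the live catalogue; return how many were dropped."""
        stale = set(self._vectors) - set(live_item_ids)
        for item_id in stale:
            del self._vectors[item_id]
        return len(stale)

    def search(self, query, k):
        # Brute-force ranking by dot product, highest first.
        def score(item_id):
            return sum(q * v for q, v in zip(query, self._vectors[item_id]))
        return sorted(self._vectors, key=score, reverse=True)[:k]

    def __len__(self):
        return len(self._vectors)
```

How deletion is actually carried out depends on the ANN library in use: some support removing or marking individual entries, while others require rebuilding the index periodically from the live catalogue.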
