AI products without evaluation loops ship blindly. The team adds a new feature, changes a prompt, or upgrades a model without measuring the impact on output quality. The product changes in ways that are invisible to the team and visible to users.
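A minimal sketch of what such a feedback loop looks like in practice: a small labelled "golden set" and a scorer run over every candidate change before it ships. All names here (`GOLDEN_SET`, `score_output`, `evaluate`) are illustrative, and the word-overlap metric is a toy stand-in for an LLM judge or task-specific scorer.

```python
# Minimal evaluation-gate sketch: run before any prompt change or model
# upgrade is deployed, so quality impact is measured rather than guessed.

GOLDEN_SET = [
    {"input": "Summarise: the cat sat on the mat.",
     "reference": "A cat sat on a mat."},
    # ... more labelled examples covering the product's main use cases
]

def score_output(candidate: str, reference: str) -> float:
    """Toy quality metric: word overlap with the reference answer."""
    cand = set(candidate.lower().split())
    ref = set(reference.lower().split())
    return len(cand & ref) / max(len(ref), 1)

def evaluate(generate, golden_set, threshold=0.5):
    """Run the candidate system over the golden set.

    Returns (passed, mean_score); the deploy is blocked when passed is False.
    """
    scores = [score_output(generate(ex["input"]), ex["reference"])
              for ex in golden_set]
    mean = sum(scores) / len(scores)
    return mean >= threshold, mean
```

The point is not the metric (which would be task-specific) but the gate: a deploy is compared against a fixed baseline set, so regressions become observable events rather than silent drift.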
Over-reliance on the LLM API as the only architectural component produces fragile systems. When the API provider changes model behaviour, updates pricing, or experiences an outage, the product has no fallback. A system that depends entirely on a single external AI API has a single point of failure; the structural fragility is the same as in any other such system, just a failure mode most engineers have not yet encountered.
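One common mitigation is a fallback chain: wrap each provider in a callable and try them in order, so a single provider outage degrades the product instead of taking it down. This is a sketch under that assumption; the provider names and error handling are hypothetical, not a specific vendor's API.

```python
# Fallback chain across LLM providers. Each provider is a (name, callable)
# pair where the callable raises on timeout, rate limit, or outage.

class AllProvidersFailed(Exception):
    """Raised when every provider in the chain has failed."""

def complete_with_fallback(prompt, providers):
    """Try each provider in order; return (provider_name, completion)
    from the first one that succeeds."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # outage, rate limit, timeout, bad response
            errors.append((name, exc))
    raise AllProvidersFailed(errors)
```

Note the tradeoff this encodes: the backup provider will usually differ in quality, latency, and cost from the primary, so a fallback is a managed degradation, not a free redundancy.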
Concept: An AI product has five architectural layers — data, model, serving, application, evaluation — each with distinct requirements. Output quality is probabilistic. Evaluation infrastructure is not optional.
Thread: T12 (Tradeoffs) ← every AI product decision involves a tradeoff between quality, latency, cost, and controllability → no choice optimises all four simultaneously
Core Idea: The 500ms latency ceiling constrains model selection for synchronous user-facing products. Inference cost dominates operating costs at scale. Evaluation is a first-class engineering deliverable.
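The latency ceiling is checkable rather than aspirational: measure a candidate model's tail latency and compare it against the budget before selecting it for a synchronous path. A sketch, with illustrative numbers and a hypothetical `fits_sync_budget` helper (p95 is used here as the tail statistic; the specific percentile is an assumption):

```python
# Gate model selection on tail latency against the 500 ms synchronous budget.

LATENCY_BUDGET_MS = 500

def p95(samples_ms):
    """95th-percentile latency from a list of per-request measurements (ms)."""
    s = sorted(samples_ms)
    return s[int(0.95 * (len(s) - 1))]

def fits_sync_budget(samples_ms, budget_ms=LATENCY_BUDGET_MS):
    """True when the candidate's tail latency fits the synchronous budget."""
    return p95(samples_ms) <= budget_ms
```

A larger model that fails this check is not disqualified outright; it is a candidate for serving optimisation (batching, quantisation, caching) or for asynchronous flows where the 500ms ceiling does not apply.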
Tradeoff: AT9 (Speed vs Quality) — larger models produce higher-quality outputs but exceed latency budgets for synchronous interactions; the tradeoff cannot be avoided, only managed through serving optimisation
Failure Mode: FM11 (Observability Blindness) — an AI product without evaluation infrastructure cannot detect quality degradation; the product gets worse without any observable failure event in traditional monitoring
Signal: When a prompt change or model upgrade is deployed without a prior evaluation of quality impact, the team is operating the AI product without a feedback loop
Maps to: Reference Book, Framework 4 (Architecture Tradeoffs)