AI Product Architecture

Introduction

An AI product is not a traditional software product with a model attached. It is a system where every layer has distinct engineering requirements, the quality of output is probabilistic rather than deterministic, and failure modes are different in kind from those in conventional software.

Technical leaders who approach AI products with the mental models of traditional software design will make the right structural decisions for the wrong system. The architecture of an AI product — from data management to serving to evaluation — requires a different set of design principles, and those principles follow from understanding what makes AI systems fundamentally different.


Thread Activation

You have seen tradeoffs as a recurring theme across every book in this series: time versus space in algorithms, consistency versus availability in distributed systems, coupling versus cohesion in code architecture, autonomy versus coordination in engineering leadership. At each layer, the tradeoff framework is the same — identify what you gain, identify what you give up, define the conditions under which the exchange is correct. This chapter applies it to the AI product layer, where the relevant dimensions are quality, latency, cost, and controllability. No architecture optimises all four simultaneously. Every design choice is a point in this four-dimensional tradeoff space, and understanding the space is the prerequisite for navigating it.


The Concept

The AI product stack has five layers, each with distinct requirements.

Data is the foundation. The quality, coverage, and freshness of training data determine the ceiling of what the model can do. No serving optimisation, prompt engineering, or fine-tuning can compensate for training data that is systematically biased, incomplete, or stale. Data management for AI products is an ongoing engineering concern, not a one-time collection exercise.

The model sits on top of the data. Model selection involves tradeoffs between quality, latency, cost, and the ability to fine-tune. A large general-purpose language model produces higher-quality outputs on diverse tasks but has higher inference latency and cost. A smaller task-specific model runs faster and cheaper but fails on inputs outside its training distribution.

Serving is the infrastructure that delivers model outputs to users with acceptable latency and availability. For most user-facing products, 500 milliseconds is the upper bound on acceptable response time for synchronous interactions. This constraint is severe: it excludes the largest and highest-quality models from synchronous user-facing deployment unless aggressive optimisation is applied.

The application layer is where product logic lives. Prompt construction, output parsing, context management, and fallback handling are application-layer concerns. The application layer is where the product’s behaviour is defined; changes to it can dramatically change the user experience without changing the model.

Evaluation is the layer that is most commonly missing in early AI products and most consequential in mature ones. Without evaluation infrastructure, the team cannot measure whether the product is working, cannot detect when it starts working worse, and cannot make principled decisions about model upgrades or prompt changes.


The Technical Grounding

Before the five-layer stack can be designed, the product pattern must be chosen. Four distinct AI product patterns have different architectural requirements.

An assistant responds to explicit user requests. It may be single-turn or multi-turn, but the user is always in the loop — each response is the output of a user-initiated interaction. ChatGPT and GitHub Copilot autocomplete are assistants. Architecturally, assistants require session state for multi-turn context and fast synchronous inference for interactive latency.

An agent completes multi-step tasks autonomously. The user specifies a goal; the agent determines the steps, executes them, and returns the result. Devin and AutoGPT-style workflows are agents. Architecturally, agents require durable task state across multiple model calls, tool orchestration infrastructure, and retry logic for partial failures. A failed step in an agent workflow is not the same as a failed request — it is a failed task that may require rollback.
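The difference between a failed request and a failed task can be made concrete with a minimal sketch. Everything named here (Step, TaskState, run_task, the retry limit of three) is hypothetical and chosen for illustration; a real agent runtime would persist the task state durably rather than hold it in memory.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Step:
    name: str
    run: Callable[[], str]                     # performs the step (model call, tool call)
    undo: Optional[Callable[[], None]] = None  # compensating action, if the step has one


@dataclass
class TaskState:
    # In a real system this record would live in durable storage so that
    # a crashed worker can resume or roll back the task.
    completed: List[str] = field(default_factory=list)
    status: str = "pending"


def run_task(steps: List[Step], state: TaskState, max_attempts: int = 3) -> TaskState:
    """Run steps in order, retrying each one; roll back if a step keeps failing."""
    done: List[Step] = []
    for step in steps:
        for attempt in range(1, max_attempts + 1):
            try:
                step.run()
                state.completed.append(step.name)
                done.append(step)
                break
            except Exception:
                if attempt == max_attempts:
                    # Retries exhausted: undo finished steps in reverse order.
                    for finished in reversed(done):
                        if finished.undo is not None:
                            finished.undo()
                    state.completed.clear()
                    state.status = "rolled_back"
                    return state
    state.status = "succeeded"
    return state
```

A transient failure is absorbed by the retry loop, while a persistent failure unwinds the whole task. That is why the agent pattern needs per-step compensating actions and not only request-level error handling.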

Augmentation enhances human work without replacing human judgment. The human sees the AI output and decides what to do with it. Grammarly, AI code review, and AI-assisted medical diagnosis are augmentation products. Architecturally, augmentation requires human-in-the-loop at the right latency: fast enough that the human is not waiting, slow enough that the human can process and respond to the output meaningfully.

Automation replaces a specific human task entirely, at scale. Email classification, content moderation, and fraud detection are automation products. Architecturally, automation requires evaluation infrastructure to catch errors at scale, because there is no human in the loop to notice and correct individual errors. A content moderation system that makes 10 million decisions per day and has a 1% error rate is making 100,000 incorrect decisions per day. Without evaluation infrastructure, the team has no way to know this.
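The arithmetic in the content moderation example, together with a rough estimate of how many decisions a team would need to audit to measure the error rate, can be sketched as follows. The function names, the 95% confidence level, and the margin of plus or minus 0.2 percentage points are assumptions for illustration; the sample-size formula is the standard normal approximation for a proportion and is only a starting point.

```python
import math


def expected_errors_per_day(decisions_per_day: int, error_rate: float) -> float:
    """Expected number of incorrect decisions per day."""
    return decisions_per_day * error_rate


def audit_sample_size(expected_rate: float, margin: float, z: float = 1.96) -> int:
    """Decisions to label by hand so the measured error rate is within
    +/- margin of the true rate (normal approximation, z=1.96 for ~95%)."""
    if not 0 < expected_rate < 1 or margin <= 0:
        raise ValueError("expected_rate must be in (0, 1) and margin positive")
    return math.ceil(z * z * expected_rate * (1 - expected_rate) / (margin * margin))


# 10 million decisions per day at a 1% error rate: about 100,000 errors per day.
errors = expected_errors_per_day(10_000_000, 0.01)

# Auditing roughly 9,500 randomly sampled decisions would pin a ~1% error
# rate down to about +/- 0.2 percentage points under these assumptions.
sample = audit_sample_size(expected_rate=0.01, margin=0.002)
```

The audit sample is tiny relative to the daily volume, which is what makes evaluation at scale affordable.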

The choice of product pattern is the first architectural decision for an AI product, because it determines what infrastructure must be built before the product can ship.

The latency versus quality tradeoff specific to AI is structurally different from the latency-quality tradeoffs in traditional software. In traditional software, latency is a property of the system’s architecture — improving latency typically requires architectural changes. In AI systems, latency is a direct function of model size and inference computation. Larger models produce higher-quality outputs but require more computation, which increases latency and cost.

The practical ceiling for synchronous user-facing inference is approximately 500 milliseconds of wall-clock time. At this threshold, most users perceive responsiveness as acceptable. Above it, users notice the delay and the interaction feels slow. Many high-quality models have inference times above this threshold at full precision. The engineering responses are: model quantisation (reducing numerical precision to speed up computation at some quality cost), speculative decoding (a small draft model proposes several tokens ahead and the large model verifies them in a single pass, which reduces sequential computation), caching (storing the outputs of common inputs so that repeated requests skip inference entirely), and streaming (returning tokens as they are generated rather than waiting for the full output, which changes the user’s perception of latency without changing the actual computation time).
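The effect of streaming on perceived latency can be shown with a small calculation. The timing figures below (150 ms to process the prompt, 20 ms per generated token, 100 output tokens) are invented for illustration, and the simple linear model ignores batching and network effects.

```python
def streaming_latency(prefill_ms: int, per_token_ms: int, output_tokens: int) -> dict:
    """Return time to first token and total generation time in milliseconds."""
    if output_tokens < 1:
        raise ValueError("output_tokens must be at least 1")
    first_token_ms = prefill_ms + per_token_ms            # what the user waits for when streaming
    total_ms = prefill_ms + per_token_ms * output_tokens  # what the user waits for otherwise
    return {"first_token_ms": first_token_ms, "total_ms": total_ms}


BUDGET_MS = 500  # the synchronous ceiling discussed in the text

timing = streaming_latency(prefill_ms=150, per_token_ms=20, output_tokens=100)
feels_responsive = timing["first_token_ms"] <= BUDGET_MS  # True: 170 ms
finishes_in_budget = timing["total_ms"] <= BUDGET_MS      # False: 2150 ms
```

The total computation is unchanged; only the moment at which the user first sees output moves inside the budget.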

Designing for output variability is the most underappreciated architectural challenge in AI products. Deterministic systems fail in deterministic ways: the same input always produces the same failure. AI systems fail probabilistically: the same input sometimes produces an excellent output, sometimes an acceptable one, and occasionally a wrong one. Quality assurance in AI therefore requires statistical evaluation, which measures the distribution of output quality over many inputs, rather than case-by-case testing. A unit test that passes on a specific input does not guarantee that the model performs acceptably on the distribution of inputs it will encounter in production.
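One way to make statistical evaluation concrete is to report a pass rate with a confidence interval instead of a single pass or fail. The sketch below uses the Wilson score interval, which is one common choice and not the only one; the function name and the 95% level are assumptions for illustration.

```python
import math
from typing import Tuple


def wilson_interval(passes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a pass rate (z=1.96 gives roughly 95%)."""
    if trials <= 0 or not 0 <= passes <= trials:
        raise ValueError("need trials > 0 and 0 <= passes <= trials")
    p = passes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# 90 acceptable outputs out of 100 sampled inputs: roughly 0.83 to 0.94.
low, high = wilson_interval(90, 100)

# A single passing test case says very little about the distribution:
# the lower bound on the true pass rate is only about 0.2.
single_low, single_high = wilson_interval(1, 1)
```

The second call is the statistical version of the point about unit tests: one passing input is compatible with a model that fails most of the time.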

The cost model for AI products at scale is distinct from traditional software. AI inference is pay-per-call for compute and pay-per-token for LLM API costs. At modest scale, these costs are negligible. At significant scale, inference cost dominates the system’s operating cost profile. A product that makes 10 million LLM API calls per month at $0.01 per 1,000 tokens, with an average of 500 tokens per call, incurs $50,000 per month in LLM API costs alone. These costs must be designed into the product’s unit economics from the start.
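The figure above can be reproduced with a few lines of arithmetic. The function name is hypothetical, and the sketch assumes a single blended price per 1,000 tokens, whereas real pricing often distinguishes input tokens from output tokens.

```python
def monthly_token_cost(calls_per_month: int, tokens_per_call: int,
                       usd_per_1k_tokens: float) -> float:
    """Monthly LLM API cost in US dollars under a flat per-token price."""
    total_tokens = calls_per_month * tokens_per_call
    return total_tokens / 1000 * usd_per_1k_tokens


# 10 million calls x 500 tokens = 5 billion tokens per month.
# At $0.01 per 1,000 tokens this comes to about $50,000 per month.
cost = monthly_token_cost(10_000_000, 500, 0.01)
```

Keeping the calculation as a function makes it easy to rerun as the average token count or the price changes, which is how the cost enters the product's unit economics.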

The context window is a product constraint that determines which products are possible at a given moment in AI capability development.

At 4K tokens, single-document Q&A and code completion are tractable. The model can hold one document or one function in context, reason about it, and produce a useful output. At 32K tokens, entire codebase sections and long-form document analysis become tractable. At 128K tokens and above, a small codebase, a full product requirements document, or a multi-hour meeting transcript can be processed in a single call.

Architectural patterns for different context needs: RAG (retrieval-augmented generation) retrieves the most relevant context at query time and appends it to the prompt. This works for any window size and is the correct pattern when the knowledge base is larger than any context window. Fine-tuning bakes domain knowledge directly into the model’s weights. Use it when retrieval precision is insufficient — when the relevant knowledge is distributed across many documents in ways that retrieval cannot reliably surface. Tool use lets the model access live data at inference time. This is correct when the data changes faster than fine-tuning cadence allows — real-time stock prices, live inventory, current user state.
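The RAG pattern described above can be sketched in a few functions. This is a toy: word-count vectors and cosine similarity stand in for a real embedding model and vector index, and every name (word_vector, retrieve, build_prompt) is hypothetical.

```python
import math
import re
from collections import Counter
from typing import List


def word_vector(text: str) -> Counter:
    """Toy stand-in for an embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[term] * b[term] for term in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, documents: List[str], k: int = 2) -> List[str]:
    """Return the k documents most similar to the query."""
    q = word_vector(query)
    ranked = sorted(documents, key=lambda d: cosine(q, word_vector(d)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, documents: List[str], k: int = 2) -> str:
    """Append the retrieved context to the user's question."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents, k))
    return f"{query}\n\nRelevant context:\n{context}"
```

Only the top k documents reach the model, so the knowledge base can grow without the prompt growing with it. The quality of the answer then depends on retrieval precision, which is the limitation that motivates fine-tuning in the cases the text describes.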

Cost management for AI products at scale is not an afterthought. The unit economics are explicit: a product making 1 million LLM calls per day at $0.01 per call incurs $10,000 per day, or about $300,000 per 30-day month, in inference costs. At 10 million calls per day, that is about $3 million per month. These numbers do not require unusual scale: 1 million calls per day averages roughly 12 calls per second, achievable by a product with tens of thousands of active users.

Caching strategies reduce this cost for products with repeated or similar queries. Exact-match caching stores the response for an identical prompt and returns it on the next identical request. This works well for FAQ-style patterns where many users ask the same question. Semantic caching stores responses indexed by embedding similarity, so queries that are semantically similar but not lexically identical can be served from cache. This requires an embedding similarity threshold below which a new model call is made rather than returning the cached response.
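The two caching strategies can be combined in one small class. This is a sketch under loud assumptions: toy_embed is a word-count stand-in for a real embedding model, the linear scan stands in for a vector index, and the threshold of 0.8 is arbitrary and would need tuning against real traffic.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict, List, Optional, Tuple


def toy_embed(text: str) -> Counter:
    """Stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[term] * b[term] for term in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ResponseCache:
    def __init__(self, threshold: float = 0.8) -> None:
        self.exact: Dict[str, str] = {}
        self.semantic: List[Tuple[Counter, str]] = []
        self.threshold = threshold

    def get(self, prompt: str) -> Optional[str]:
        if prompt in self.exact:                  # exact-match hit
            return self.exact[prompt]
        query = toy_embed(prompt)
        best_score, best_response = 0.0, None
        for vector, response in self.semantic:    # linear scan; an index in practice
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_response = score, response
        # Below the threshold, report a miss so the caller makes a model call.
        return best_response if best_score >= self.threshold else None

    def put(self, prompt: str, response: str) -> None:
        self.exact[prompt] = response
        self.semantic.append((toy_embed(prompt), response))


def answer(prompt: str, cache: ResponseCache, call_model: Callable[[str], str]) -> str:
    cached = cache.get(prompt)
    if cached is not None:
        return cached
    response = call_model(prompt)
    cache.put(prompt, response)
    return response
```

The threshold is the design decision that matters. Set it too low and users receive answers to questions they did not ask; set it too high and the semantic cache degenerates into an exact-match cache.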

Fine-tuning a smaller model can lower the per-call cost enough to repay its upfront cost. If a fine-tuned smaller model reduces the cost per call from $0.01 to $0.001, each call saves $0.009, and the fine-tuning investment breaks even once the number of calls exceeds the fine-tuning cost divided by that saving. The arithmetic is worth doing explicitly before committing to fine-tuning infrastructure: at what call volume does the per-call saving justify the upfront investment?
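The break-even arithmetic is short enough to write down. The per-call prices come from the text; the $20,000 upfront figure is a made-up placeholder, since the real fine-tuning cost depends on the model, the data, and the engineering time involved.

```python
import math


def break_even_calls(upfront_usd: float, old_cost_per_call: float,
                     new_cost_per_call: float) -> int:
    """Number of calls after which the per-call saving repays the upfront cost."""
    saving = old_cost_per_call - new_cost_per_call
    if saving <= 0:
        raise ValueError("the new per-call cost must be lower than the old one")
    return math.ceil(upfront_usd / saving)


# Hypothetical $20,000 fine-tuning investment, $0.01 -> $0.001 per call:
# about 2.2 million calls to break even.
calls_needed = break_even_calls(20_000, 0.01, 0.001)
```

At 1 million calls per day that volume is reached in under three days; at 10,000 calls per day it takes more than seven months, which may be longer than the fine-tuned model stays current.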

The human-in-the-loop spectrum is a design dimension that must be chosen explicitly for each AI feature, not defaulted to fully automated.

Level                 | Description                            | When Appropriate                                   | Error Cost
----------------------|----------------------------------------|----------------------------------------------------|---------------------
Fully automated       | AI acts without human review           | High volume, low stakes, fast correction possible  | Low and correctable
Human review          | Human reviews AI output before delivery | Medium stakes, audit trail required                | Medium
Human approval        | Human approves before action is taken  | High stakes, irreversible actions                  | High or irreversible
Human-led, AI assists | Human leads, AI suggests               | Expert domain, AI confidence low                   | Very high
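Choosing the level explicitly can be as simple as encoding the table above as a function that every new AI feature must pass through. The sketch below is one possible encoding; the parameter names, the three-value stakes scale, and the order of the checks are assumptions, not a rule taken from the table.

```python
from enum import Enum


class Oversight(Enum):
    FULLY_AUTOMATED = "fully automated"
    HUMAN_REVIEW = "human review"
    HUMAN_APPROVAL = "human approval"
    HUMAN_LED = "human-led, AI assists"


def choose_oversight(stakes: str, reversible: bool,
                     low_ai_confidence: bool = False,
                     audit_required: bool = False) -> Oversight:
    """Map a feature's risk profile to a level of human involvement.

    stakes is one of "low", "medium", "high" (an illustrative scale).
    """
    if stakes not in ("low", "medium", "high"):
        raise ValueError("stakes must be 'low', 'medium', or 'high'")
    if low_ai_confidence:
        return Oversight.HUMAN_LED        # expert domain, AI only suggests
    if stakes == "high" or not reversible:
        return Oversight.HUMAN_APPROVAL   # a human signs off before the action
    if stakes == "medium" or audit_required:
        return Oversight.HUMAN_REVIEW     # a human checks output before delivery
    return Oversight.FULLY_AUTOMATED      # low stakes and correctable
```

The function has no default that returns fully automated without inspecting the inputs, which is the point: automation is reached only by ruling out the other levels.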

Evaluation-driven development is the practice of defining the evaluation framework before the product framework. Traditional software has deterministic tests. An AI product has non-deterministic outputs; the same input will not always produce the same output, and the distribution of outputs matters as much as any individual output.

The minimum viable eval for a new AI feature has three components. A correctness eval measures: given 100 known inputs with known correct outputs, how many of the model’s outputs are correct? This requires a labeled dataset. A safety eval measures: given 50 adversarial inputs designed to elicit harmful outputs, how many outputs violate the product’s safety requirements? A UX eval measures: given 20 real user tasks, what fraction of users complete the task successfully with AI assistance? The ship rule is simple: do not ship an AI feature without a correctness eval score. Without a baseline score, the team cannot know whether a subsequent change has improved or degraded the product.
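The correctness and safety components, together with the ship rule, fit in a small harness. This is a sketch with stated simplifications: correctness is scored by exact string match, which suits classification-style outputs but not free-form text, and the violates callback is a hypothetical stand-in for whatever safety check the product defines. The UX eval involves real users and is not shown.

```python
from typing import Callable, List, Optional, Tuple

Model = Callable[[str], str]


def correctness_score(model: Model, labeled: List[Tuple[str, str]]) -> float:
    """Fraction of labeled inputs whose output exactly matches the expected output."""
    if not labeled:
        raise ValueError("the labeled dataset is empty")
    correct = sum(1 for prompt, expected in labeled if model(prompt).strip() == expected)
    return correct / len(labeled)


def violation_rate(model: Model, adversarial: List[str],
                   violates: Callable[[str], bool]) -> float:
    """Fraction of adversarial inputs whose output breaks a safety requirement."""
    if not adversarial:
        raise ValueError("the adversarial set is empty")
    return sum(1 for prompt in adversarial if violates(model(prompt))) / len(adversarial)


def can_ship(correctness: Optional[float]) -> bool:
    """The ship rule: no correctness eval score, no release."""
    return correctness is not None


def regressed(baseline: float, candidate: float, tolerance: float = 0.0) -> bool:
    """True if a prompt change or model upgrade scores below the recorded baseline."""
    return candidate < baseline - tolerance
```

Recording the baseline score at launch is what gives regressed something to compare against when the prompt or the model later changes.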

AI product design principles that differ from traditional software: expect uncertainty in output quality as a permanent state, not a bug to be fixed. Design for failure disclosure — when the system is unsure, it should say so rather than producing a confident but wrong answer. Apply human-in-the-loop design for high-stakes decisions where the cost of a wrong AI output is high. Evaluate continuously rather than at release boundaries, because model behaviour changes when input distribution shifts even if the model itself does not change.


Real-World Examples

GitHub Copilot’s architecture illustrates the latency-quality tradeoff in a production AI product. Copilot must produce code suggestions while the user is typing — a latency requirement measured in hundreds of milliseconds. Achieving this with large language models required aggressive optimisation: model distillation to produce smaller, faster models; speculative completion to reduce sequential computation; and streaming outputs to provide perceived responsiveness before the full suggestion is generated. The product’s quality is a direct function of the engineering investment in serving optimisation.

OpenAI’s production infrastructure for ChatGPT demonstrates cost model challenges at extreme scale. Serving large language models to millions of users requires GPU clusters that represent billions of dollars in infrastructure investment. The cost per query is non-trivial. The product’s pricing model must recover this cost while remaining competitive. At scale, the cost model is the architecture.

Midjourney’s approach to output variability — presenting multiple variations of each generation and allowing users to select and iterate — is a product design response to the probabilistic nature of diffusion model outputs. Rather than presenting a single output and asking users to prompt again if it is wrong, the product presents the probabilistic nature of the system’s output as a feature. Users can select the variation that best matches their intent. This design turns the model’s variance from a bug into a differentiating interaction model.


The Tradeoffs

The tradeoff between model size and deployment constraints is pervasive. Teams that evaluate AI products using large API-hosted models often discover that the model they want to deploy cannot meet their latency requirements, cost requirements, or data privacy requirements when running at scale in their own infrastructure. The evaluation model and the production model are different systems, and the gap between them requires engineering investment.

Evaluation infrastructure has a cost. Building evaluation pipelines, collecting human labels, and running statistical quality checks require significant engineering investment before they produce value. Teams that skip evaluation ship AI products without knowing whether they are working.


What Goes Wrong

AI products without evaluation loops ship blindly. The team adds a new feature, changes a prompt, or upgrades a model without measuring the impact on output quality. The product changes in ways that are invisible to the team and visible to users.

Over-reliance on the LLM API as the only architectural component produces fragile systems. When the API provider changes model behaviour, updates pricing, or experiences an outage, the product has no fallback. A system that depends entirely on a single external AI API has the same structural fragility as a system with a single point of failure — it is just a failure mode that most engineers have not encountered yet.
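A fallback chain is one way to remove the single point of failure. The sketch below tries providers in order and degrades to a static response if all of them fail; the provider callables are hypothetical placeholders for real API clients, and a production version would add timeouts and logging, and would catch only the errors that are safe to retry.

```python
from typing import Callable, List, Tuple

Provider = Tuple[str, Callable[[str], str]]  # (name, function that calls the API)


def generate_with_fallback(prompt: str, providers: List[Provider],
                           static_fallback: str) -> Tuple[str, str]:
    """Return (source, response), trying each provider in order."""
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception:
            # Outage, rate limit, or timeout: move on to the next provider.
            continue
    # Every provider failed: degrade gracefully instead of erroring out.
    return "static", static_fallback
```

Returning the source alongside the response lets the application layer record how often the fallback path is taken, which doubles as an early signal that the primary provider is degrading.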

Concept: An AI product has five architectural layers — data, model, serving, application, evaluation — each with distinct requirements. Output quality is probabilistic. Evaluation infrastructure is not optional.

Thread: T12 (Tradeoffs) ← every AI product decision involves a tradeoff between quality, latency, cost, and controllability → no choice optimises all four simultaneously

Core Idea: The 500ms latency ceiling constrains model selection for synchronous user-facing products. Inference cost dominates operating costs at scale. Evaluation is a first-class engineering deliverable.

Tradeoff: AT9 (Speed vs Quality) — larger models produce higher-quality outputs but exceed latency budgets for synchronous interactions; the tradeoff cannot be avoided, only managed through serving optimisation

Failure Mode: FM11 (Observability Blindness) — an AI product without evaluation infrastructure cannot detect quality degradation; the product gets worse without any observable failure event in traditional monitoring

Signal: When a prompt change or model upgrade is deployed without a prior evaluation of quality impact, the team is operating the AI product without a feedback loop

Maps to: Book 0, Framework 4 (Architecture Tradeoffs)

Reflection Questions

  1. Map an AI product you have worked on or used to the five-layer stack. Which layers are most developed? Which are weakest? What would closing the weakest layer’s gaps require?

  2. The chapter argues that 500ms is the practical ceiling for synchronous user-facing AI inference. Identify an AI product feature you would like to build. What is the inference latency of the model you would use? If it exceeds 500ms, what optimisation strategies would you apply?

  3. Design a cost model for an AI product with one million monthly active users, each making an average of 20 AI requests per day. What are the inference costs at different pricing tiers? At what scale does inference cost dominate operating costs?

  4. The chapter argues that AI output quality is probabilistic and requires statistical evaluation rather than case-by-case testing. What does an evaluation pipeline for a code completion product look like? What metrics would you use? How would you collect ground truth?