The Computing Series

The Technical Grounding

Before the five-layer stack can be designed, the product pattern must be chosen. There are four distinct AI product patterns, each with different architectural requirements.

An assistant responds to explicit user requests. It may be single-turn or multi-turn, but the user is always in the loop — each response is the output of a user-initiated interaction. ChatGPT and GitHub Copilot autocomplete are assistants. Architecturally, assistants require session state for multi-turn context and fast synchronous inference for interactive latency.

An agent completes multi-step tasks autonomously. The user specifies a goal; the agent determines the steps, executes them, and returns the result. Devin and AutoGPT-style workflows are agents. Architecturally, agents require durable task state across multiple model calls, tool orchestration infrastructure, and retry logic for partial failures. A failed step in an agent workflow is not the same as a failed request — it is a failed task that may require rollback.

Augmentation enhances human work without replacing human judgment. The human sees the AI output and decides what to do with it. Grammarly, AI code review, and AI-assisted medical diagnosis are augmentation products. Architecturally, augmentation requires human-in-the-loop at the right latency: fast enough that the human is not waiting, slow enough that the human can process and respond to the output meaningfully.

Automation replaces a specific human task entirely, at scale. Email classification, content moderation, and fraud detection are automation products. Architecturally, automation requires evaluation infrastructure to catch errors at scale, because there is no human in the loop to notice and correct individual errors. A content moderation system that makes 10 million decisions per day and has a 1% error rate is making 100,000 incorrect decisions per day. Without evaluation infrastructure, the team has no way to know this.
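The scale argument above can be checked with trivial arithmetic; the figures are the ones stated in the text:

```python
# Scale of undetected errors in a fully automated pipeline.
decisions_per_day = 10_000_000
error_rate = 0.01  # 1% of decisions are wrong

errors_per_day = int(decisions_per_day * error_rate)
print(errors_per_day)  # 100000 incorrect decisions per day
```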

The choice of product pattern is the first architectural decision for an AI product, because it determines what infrastructure must be built before the product can ship.

The latency versus quality tradeoff specific to AI is structurally different from the latency-quality tradeoffs in traditional software. In traditional software, latency is a property of the system’s architecture — improving latency typically requires architectural changes. In AI systems, latency is a direct function of model size and inference computation. Larger models produce higher-quality outputs but require more computation, which increases latency and cost.

The practical ceiling for synchronous user-facing inference is approximately 500 milliseconds of wall-clock time. At this threshold, most users perceive responsiveness as acceptable. Above it, users notice the delay and the interaction feels slow. Many high-quality models have inference times above this threshold at full precision. The engineering responses are: model quantisation (reducing numerical precision to speed up computation at some quality cost), speculative decoding (predicting likely continuations to reduce sequential computation), caching (storing the outputs of common inputs), and streaming (returning tokens as they are generated rather than waiting for the full output, which changes the user’s perception of latency without changing the actual computation time).
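Streaming's effect on perceived latency can be illustrated with a simulated generator. This is a sketch with assumed timings (20 ms per token, 25 tokens), not a real inference loop; the point is that time-to-first-token, not total time, is what the user feels:

```python
import time

def generate_tokens(n_tokens, per_token_s=0.02):
    """Simulated model that emits one token every 20 ms (stand-in values)."""
    for i in range(n_tokens):
        time.sleep(per_token_s)
        yield f"tok{i}"

# Blocking call: the user waits for the whole output (~500 ms here).
start = time.monotonic()
full_output = list(generate_tokens(25))
total_latency = time.monotonic() - start

# Streaming: the user sees the first token almost immediately (~20 ms).
start = time.monotonic()
stream = generate_tokens(25)
first_token = next(stream)
first_token_latency = time.monotonic() - start

print(f"total: {total_latency:.2f}s, first token: {first_token_latency:.3f}s")
```

The total computation time is unchanged; only the moment the user starts reading moves earlier.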

How to design for AI output variability is the most underappreciated architectural challenge in AI products. Deterministic systems fail in deterministic ways: the same input always produces the same failure. AI systems fail probabilistically: the same input sometimes produces an excellent output, sometimes an acceptable one, and occasionally a wrong one. Quality assurance in AI requires statistical evaluation — measuring the distribution of output quality over many inputs — rather than case-by-case testing. A unit test that passes on a specific input does not guarantee that the model performs acceptably on the distribution of inputs it will encounter in production.
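What "statistical evaluation" means in practice can be sketched as follows. The quality scores here are randomly generated stand-ins; a real system would score actual model outputs against references or a rubric. The structure — a distribution summary rather than a pass/fail check — is the point:

```python
import random
import statistics

random.seed(0)

def model_quality(prompt):
    """Stand-in for a judged quality score in [0, 1]."""
    return min(max(random.gauss(0.85, 0.1), 0.0), 1.0)

scores = [model_quality(p) for p in range(1000)]

mean = statistics.mean(scores)
p5 = sorted(scores)[len(scores) // 20]  # 5th percentile: the tail matters
acceptable_rate = sum(s >= 0.7 for s in scores) / len(scores)

print(f"mean={mean:.3f}  p5={p5:.3f}  acceptable_rate={acceptable_rate:.1%}")
```

A single passing example says nothing about `p5` or `acceptable_rate`, which is why unit-test thinking fails here.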

The cost model for AI products at scale is distinct from traditional software. AI inference is pay-per-call for compute and pay-per-token for LLM API costs. At modest scale, these costs are negligible. At significant scale, inference cost dominates the system’s operating cost profile. A product that makes 10 million LLM API calls per month at $0.01 per 1,000 tokens, with an average of 500 tokens per call, incurs $50,000 per month in LLM API costs alone. These costs must be designed into the product’s unit economics from the start.
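The cost figure above follows directly from the stated inputs:

```python
calls_per_month = 10_000_000
tokens_per_call = 500
price_per_1k_tokens = 0.01  # USD, the illustrative rate from the text

monthly_tokens = calls_per_month * tokens_per_call
monthly_cost = monthly_tokens / 1000 * price_per_1k_tokens
print(f"${monthly_cost:,.0f}/month")  # $50,000/month
```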

The context window is a product constraint that determines which products are possible at a given moment in AI capability development.

At 4K tokens, single-document Q&A and code completion are tractable. The model can hold one document or one function in context, reason about it, and produce a useful output. At 32K tokens, entire codebase sections and long-form document analysis become tractable. At 128K tokens and above, an entire codebase, a full product requirements document, or a multi-hour meeting transcript can be processed in a single call.

Architectural patterns for different context needs: RAG (retrieval-augmented generation) retrieves the most relevant context at query time and appends it to the prompt. This works for any window size and is the correct pattern when the knowledge base is larger than any context window. Fine-tuning bakes domain knowledge directly into the model’s weights. Use it when retrieval precision is insufficient — when the relevant knowledge is distributed across many documents in ways that retrieval cannot reliably surface. Tool use lets the model access live data at inference time. This is correct when the data changes faster than fine-tuning cadence allows — real-time stock prices, live inventory, current user state.
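The RAG pattern can be reduced to a minimal sketch: retrieve the top-k relevant chunks at query time, then assemble the prompt. The word-overlap scorer and the three-document knowledge base are toy assumptions; a production system would use embeddings and a vector index in their place:

```python
# Toy knowledge base standing in for a document store.
knowledge_base = [
    "Refunds are processed within 5 business days.",
    "Premium plans include priority support.",
    "Passwords must be at least 12 characters.",
]

def overlap(query, doc):
    """Toy relevance score: count of shared lowercase words."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def build_prompt(query, k=2):
    """Retrieve the k most relevant chunks and prepend them to the prompt."""
    top_k = sorted(knowledge_base, key=lambda d: overlap(query, d), reverse=True)[:k]
    context = "\n".join(top_k)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

print(build_prompt("How long do refunds take?"))
```

The key property holds regardless of the retriever: the model only ever sees a window-sized slice of a knowledge base that may be far larger than any context window.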

Cost management for AI products at scale is not an afterthought. The unit economics are explicit: a product making 1 million LLM calls per day at $0.01 per call incurs $300,000 per month in inference costs. At 10 million calls per day, that is $3 million per month. These numbers do not require unusual scale — 1 million calls per day is 12 calls per second, achievable by a product with tens of thousands of active users.

Caching strategies reduce this cost for products with repeated or similar queries. Exact-match caching stores the response for an identical prompt and returns it on the next identical request. This works well for FAQ-style patterns where many users ask the same question. Semantic caching stores responses indexed by embedding similarity, so queries that are semantically similar but not lexically identical can be served from cache. This requires an embedding similarity threshold below which a new model call is made rather than returning the cached response.
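Exact-match caching is a few lines of code; a minimal sketch, with a fake model standing in for the real inference call:

```python
import hashlib

exact_cache = {}
calls = []  # records actual model invocations, to show the cache working

def fake_model(prompt):
    """Placeholder for a real inference call."""
    calls.append(prompt)
    return f"answer to: {prompt}"

def cached_call(prompt, model_fn):
    """Exact-match cache: identical prompts hit the cache; anything else
    falls through to the model. A semantic cache would replace the hash
    key with an embedding lookup gated by a similarity threshold."""
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key not in exact_cache:
        exact_cache[key] = model_fn(prompt)
    return exact_cache[key]

cached_call("What is your refund policy?", fake_model)
cached_call("What is your refund policy?", fake_model)  # served from cache
print(len(calls))  # 1 model call for 2 requests
```

The semantic variant trades the exactness of the hash key for broader hit rates, at the cost of tuning the similarity threshold below which a fresh model call is made.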

Fine-tuning a smaller model can reduce per-call cost enough to pay back its upfront investment. If a fine-tuned smaller model reduces the cost per call from $0.01 to $0.001, the fine-tuning investment breaks even after a number of calls that depends on the fine-tuning compute cost. The arithmetic is worth doing explicitly before committing to fine-tuning infrastructure: at what call volume does the per-call saving justify the upfront investment?

The human-in-the-loop spectrum is a design dimension that must be chosen explicitly for each AI feature, not defaulted to fully automated.

Level                 | Description                              | When Appropriate                                  | Error Cost
Fully automated       | AI acts without human review             | High volume, low stakes, fast correction possible | Low and correctable
Human review          | Human reviews AI output before delivery  | Medium stakes, audit trail required               | Medium
Human approval        | Human approves before action is taken    | High stakes, irreversible actions                 | High or irreversible
Human-led, AI assists | Human leads, AI suggests                 | Expert domain, AI confidence low                  | Very high
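Making the choice explicit in code forces the decision out of defaults. A sketch of a routing policy over these levels; the thresholds and input signals are assumptions for illustration, not prescriptions:

```python
from enum import Enum

class Oversight(Enum):
    FULLY_AUTOMATED = "fully_automated"
    HUMAN_REVIEW = "human_review"
    HUMAN_APPROVAL = "human_approval"
    HUMAN_LED = "human_led"

def choose_oversight(stakes, reversible, ai_confidence):
    """Illustrative policy mapping stakes, reversibility, and model
    confidence to an oversight level."""
    if stakes == "high" and not reversible:
        return Oversight.HUMAN_APPROVAL   # irreversible, high stakes
    if ai_confidence < 0.5:
        return Oversight.HUMAN_LED        # AI assists, human decides
    if stakes == "medium":
        return Oversight.HUMAN_REVIEW     # review before delivery
    return Oversight.FULLY_AUTOMATED      # high volume, low stakes

print(choose_oversight("high", reversible=False, ai_confidence=0.9))
```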

Evaluation-driven development is the practice of defining the evaluation framework before building the product itself. Traditional software has deterministic tests. An AI product has non-deterministic outputs; the same input will not always produce the same output, and the distribution of outputs matters as much as any individual output.

The minimum viable eval for a new AI feature has three components. A correctness eval measures: given 100 known inputs with known correct outputs, how many of the model’s outputs are correct? This requires a labeled dataset. A safety eval measures: given 50 adversarial inputs designed to elicit harmful outputs, how many outputs violate the product’s safety requirements? A UX eval measures: given 20 real user tasks, what fraction of users complete the task successfully with AI assistance? The ship rule is simple: do not ship an AI feature without a correctness eval score. Without a baseline score, the team cannot know whether a subsequent change has improved or degraded the product.
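The correctness component reduces to a labeled dataset and a scoring loop. A minimal sketch, with a placeholder model and a three-item toy dataset (the text's 100-input set is the realistic floor):

```python
# Labeled inputs with known correct outputs.
labeled_set = [
    ("2+2", "4"),
    ("capital of France", "Paris"),
    ("3*3", "9"),
]

def run_model(prompt):
    """Placeholder; in practice this calls the deployed system."""
    return {"2+2": "4", "capital of France": "Paris", "3*3": "6"}[prompt]

correct = sum(run_model(x) == y for x, y in labeled_set)
score = correct / len(labeled_set)
print(f"correctness: {score:.0%} ({correct}/{len(labeled_set)})")  # 67% (2/3)
```

The printed score is the baseline the ship rule requires: every subsequent model or prompt change is measured against it.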

AI product design principles that differ from traditional software: expect uncertainty in output quality as a permanent state, not a bug to be fixed. Design for failure disclosure — when the system is unsure, it should say so rather than producing a confident but wrong answer. Apply human-in-the-loop design for high-stakes decisions where the cost of a wrong AI output is high. Evaluate continuously rather than at release boundaries, because model behaviour changes when input distribution shifts even if the model itself does not change.

