AI Product Architecture: RAG vs Fine-Tune vs Tool Use

An AI product is not a traditional software product with a model bolted on. It is a system where every layer has distinct engineering requirements, output quality is probabilistic rather than deterministic, and the failure modes are different in kind from those in conventional software. Technical leaders who approach AI products with the mental models of traditional software design will make the right structural decisions for the wrong system.

The reason this matters: the decisions you take in the first month of building an AI product compound. If you skip evaluation infrastructure, you ship a product whose quality you cannot measure. If you skip cost modelling, you launch a product whose unit economics break at the scale you hoped to reach. If you skip the RAG vs fine-tune vs tool use decision, you spend six months optimising a pattern that was wrong for the shape of your knowledge. The architecture of an AI product follows from understanding what makes AI systems fundamentally different.

The Five Layers Every AI Product Has

Every AI product has the same five-layer stack — data, model, serving, application, evaluation. Each layer has distinct requirements and ignoring any one of them creates a failure mode the other four cannot compensate for.

Data is the foundation. The quality, coverage, and freshness of training data determine the ceiling of what the model can do. No serving optimisation, prompt engineering, or fine-tuning can compensate for training data that is systematically biased, incomplete, or stale.

Model sits on top of the data. Model selection involves tradeoffs between quality, latency, cost, and the ability to fine-tune. A large general-purpose language model produces higher-quality outputs but has higher inference latency and cost. A smaller task-specific model runs faster and cheaper but fails on inputs outside its training distribution.

Serving is the infrastructure that delivers model outputs to users with acceptable latency. For most user-facing products, 500 milliseconds is the upper bound on acceptable response time for synchronous interactions. This constraint is severe: it excludes the largest, highest-quality models from synchronous user-facing deployment unless aggressive optimisation is applied.

Application is where product logic lives. Prompt construction, output parsing, context management, and fallback handling are application-layer concerns. The application layer is where the product's behaviour is defined; changes to it can dramatically change the user experience without changing the model.

Evaluation is the layer most commonly missing in early AI products and most consequential in mature ones. Without evaluation infrastructure, the team cannot measure whether the product is working, cannot detect when it starts working worse, and cannot make principled decisions about model upgrades or prompt changes.

Four AI Product Patterns — Choose Before You Architect

Before the five-layer stack can be designed, the product pattern must be chosen. Four distinct AI product patterns have different architectural requirements, and conflating them produces systems that fail predictably.

An assistant responds to explicit user requests — single-turn or multi-turn, but the user is always in the loop. ChatGPT and GitHub Copilot autocomplete are assistants. Architecturally, assistants require session state for multi-turn context and fast synchronous inference for interactive latency.

An agent completes multi-step tasks autonomously. The user specifies a goal; the agent determines the steps, executes them, and returns the result. Devin and AutoGPT-style workflows are agents. Architecturally, agents require durable task state across multiple model calls, tool orchestration infrastructure, and retry logic for partial failures. A failed step in an agent workflow is not the same as a failed request — it is a failed task that may require rollback.
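The difference between a failed request and a failed task can be sketched in a few lines. The following is a minimal illustration, not a production orchestrator: `Step`, `TaskState`, and `run_task` are hypothetical names, the state lives in memory rather than in durable storage, and every step is assumed to have a usable `undo`.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Step:
    name: str
    run: Callable[[], None]   # performs the step; raises on failure
    undo: Callable[[], None]  # reverses the step's side effects


@dataclass
class TaskState:
    # In a real system this record would be persisted so the task survives restarts.
    completed: List[str] = field(default_factory=list)
    status: str = "pending"


def run_task(steps: List[Step], state: TaskState, max_attempts: int = 3) -> TaskState:
    """Run steps in order, retrying each one, and roll back if a step keeps failing."""
    finished: List[Step] = []
    for step in steps:
        for attempt in range(1, max_attempts + 1):
            try:
                step.run()
                break  # step succeeded, move on to the next one
            except Exception:
                if attempt == max_attempts:
                    # Retries exhausted: undo completed steps in reverse order.
                    for done in reversed(finished):
                        done.undo()
                    state.completed.clear()
                    state.status = "rolled_back"
                    return state
        finished.append(step)
        state.completed.append(step.name)
    state.status = "succeeded"
    return state
```

The point of the sketch is that the unit of failure is the whole task: when one step cannot be completed, the earlier steps have to be accounted for as well.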

Augmentation enhances human work without replacing human judgment. The human sees the AI output and decides what to do with it. Grammarly, AI code review, and AI-assisted medical diagnosis are augmentation products. Architecturally, augmentation requires human-in-the-loop at the right latency: fast enough that the human is not waiting, slow enough that the human can process and respond to the output meaningfully.

Automation replaces a specific human task entirely, at scale. Email classification, content moderation, and fraud detection are automation products. Architecturally, automation requires evaluation infrastructure to catch errors at scale, because there is no human in the loop to notice and correct individual errors. A content moderation system making 10 million decisions per day at a 1% error rate is making 100,000 incorrect decisions per day. Without evaluation, the team has no way to know this.

RAG vs Fine-Tune vs Tool Use — The Decision Matrix

Once the pattern is chosen, the next question is how the model accesses the knowledge it needs. There are three architectural patterns, and the right choice depends on the shape of the knowledge, not preference.

RAG (retrieval-augmented generation) retrieves the most relevant context at query time and appends it to the prompt. This works for any context window size and is the correct pattern when the knowledge base is larger than any context window. Use it when your knowledge is text-shaped, indexable, and changes often enough that retraining is impractical.
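As a rough illustration of "retrieve, then append to the prompt", here is a toy sketch. It ranks documents by word overlap, which is only a stand-in for the embedding or vector search a real system would use; `tokenize`, `retrieve`, and `build_rag_prompt` are hypothetical names, and the prompt wording is an assumption.

```python
import re
from typing import List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and split it into a set of alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, documents: List[str], k: int = 2) -> List[str]:
    """Return the k documents sharing the most terms with the query."""
    query_terms = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_terms & tokenize(doc)),
        reverse=True,
    )
    return ranked[:k]


def build_rag_prompt(query: str, documents: List[str], k: int = 2) -> str:
    """Append the retrieved context to the user's question."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )
```

Only the top k documents reach the prompt, which is why the size of the knowledge base does not have to fit in the context window.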

Fine-tuning bakes domain knowledge directly into the model's weights. Use it when retrieval precision is insufficient — when the relevant knowledge is distributed across many documents in ways that retrieval cannot reliably surface. Fine-tuning is also the lever for cost reduction at scale: a fine-tuned smaller model can reduce inference cost by 10× while preserving quality on the narrow task it was tuned for.

Tool use lets the model access live data at inference time through structured function calls. Use it when the data changes faster than fine-tuning cadence allows — real-time stock prices, live inventory, current user state. The model holds the reasoning; the tools hold the facts.
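A minimal sketch of the application side of tool use follows. It assumes the model emits a JSON object with `name` and `arguments` fields; that shape, the `get_inventory` tool, and its stubbed data are all hypothetical and do not represent any particular provider's wire format.

```python
import json
from typing import Any, Callable, Dict


def get_inventory(sku: str) -> int:
    """Hypothetical tool: a real one would query a live inventory system."""
    stock = {"A1": 7, "B2": 0}
    return stock.get(sku, 0)


TOOLS: Dict[str, Callable[..., Any]] = {"get_inventory": get_inventory}


def dispatch_tool_call(call_json: str, tools: Dict[str, Callable[..., Any]]) -> str:
    """Execute a structured function call and return the result as JSON."""
    call = json.loads(call_json)
    name = call["name"]
    if name not in tools:
        raise KeyError(f"unknown tool: {name}")
    result = tools[name](**call.get("arguments", {}))
    # The serialised result is what gets passed back to the model as a fact.
    return json.dumps({"name": name, "result": result})
```

The model decides which tool to call and with what arguments; the facts themselves come from the tool at request time.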

These are not exclusive. A mature product uses all three: RAG for the knowledge corpus, fine-tuning for the task-specific reasoning, tool use for live data. The mistake is starting with all three at once.
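The criteria above can be written down as a small decision helper. This is a deliberate simplification: the three boolean fields, the order of the checks, and the names `KnowledgeShape` and `recommend_starting_pattern` are assumptions made for illustration, and a real decision would weigh cost and latency too.

```python
from dataclasses import dataclass


@dataclass
class KnowledgeShape:
    needs_live_data: bool            # e.g. prices, inventory, current user state
    text_shaped_and_indexable: bool  # can the knowledge be chunked and indexed?
    retrieval_precise_enough: bool   # does retrieval reliably surface what is needed?


def recommend_starting_pattern(shape: KnowledgeShape) -> str:
    """Suggest which single pattern to start with, not the final architecture."""
    if shape.needs_live_data:
        return "tool use"
    if shape.text_shaped_and_indexable and shape.retrieval_precise_enough:
        return "RAG"
    return "fine-tuning"
```

Returning one pattern rather than a combination reflects the advice to begin with a single pattern and add the others as the product matures.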

The 500ms Ceiling and What It Forces

The latency-versus-quality tradeoff specific to AI is structurally different from the latency-quality tradeoffs in traditional software. In traditional software, latency is a property of the system's architecture — improving it requires architectural changes. In AI, latency is a direct function of model size and inference computation. Larger models produce higher-quality outputs but require more computation.

The practical ceiling for synchronous user-facing inference is approximately 500 milliseconds. At this threshold, most users perceive responsiveness as acceptable. Above it, users notice the delay and the interaction feels slow. Many high-quality models exceed this threshold at full precision.

The engineering responses are: model quantisation (reducing numerical precision to speed up computation at some quality cost); speculative decoding (predicting likely continuations to reduce sequential computation); caching (storing outputs of common inputs); and streaming (returning tokens as they are generated, which changes the user's perception of latency without changing actual computation time).
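To see why streaming helps without making anything faster, a simple latency model is enough. The sketch below assumes total latency is the time to the first token plus a fixed time per generated token; that model, the function names, and the example numbers are illustrative assumptions, not measurements.

```python
BUDGET_MS = 500  # the synchronous ceiling discussed in this section


def total_latency_ms(time_to_first_token_ms: float, output_tokens: int, ms_per_token: float) -> float:
    """Time until the full response has been generated."""
    return time_to_first_token_ms + output_tokens * ms_per_token


def perceived_latency_ms(
    time_to_first_token_ms: float,
    output_tokens: int,
    ms_per_token: float,
    streaming: bool,
) -> float:
    """With streaming the user sees output at the first token; without it, at the end."""
    if streaming:
        return time_to_first_token_ms
    return total_latency_ms(time_to_first_token_ms, output_tokens, ms_per_token)


def within_budget(latency_ms: float, budget_ms: float = BUDGET_MS) -> bool:
    return latency_ms <= budget_ms
```

With hypothetical figures of 200 ms to the first token and 100 tokens at 20 ms each, the full response takes 2,200 ms either way, but the streamed version starts showing output at 200 ms.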

GitHub Copilot's architecture illustrates this in production. Copilot must produce code suggestions while the user is typing — latency measured in hundreds of milliseconds. Achieving this with large language models required model distillation to produce smaller, faster models; speculative completion to reduce sequential computation; and streaming outputs to provide perceived responsiveness before the full suggestion is generated. The product's quality is a direct function of the engineering investment in serving optimisation.

The Cost Model You Must Build on Day One

Cost management for AI products at scale is not an afterthought. The unit economics are explicit: a product making 1 million LLM calls per day at $0.01 per call incurs $300,000 per month in inference costs. At 10 million calls per day, that is $3 million per month. These numbers do not require unusual scale: 1 million calls per day is about 12 calls per second, achievable by a product with tens of thousands of active users.
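The arithmetic is simple enough to keep in a script and rerun whenever pricing or traffic assumptions change. A minimal sketch, assuming a 30-day month and a flat per-call price (both simplifications; real pricing is usually per token):

```python
SECONDS_PER_DAY = 86_400


def monthly_inference_cost(calls_per_day: float, cost_per_call: float, days_per_month: int = 30) -> float:
    """Monthly spend in dollars for a flat per-call price."""
    return calls_per_day * cost_per_call * days_per_month


def average_calls_per_second(calls_per_day: float) -> float:
    """Average request rate implied by a daily call volume."""
    return calls_per_day / SECONDS_PER_DAY


for volume in (1_000_000, 10_000_000):
    cost = monthly_inference_cost(volume, 0.01)
    rate = average_calls_per_second(volume)
    print(f"{volume:,} calls/day: ${cost:,.0f}/month at {rate:.1f} calls/second on average")
```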

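Three levers bring that number down: exact-match caching, semantic caching, and fine-tuning a smaller model. Exact-match caching stores the response for an identical prompt and returns it on the next identical request. This works well for FAQ-style patterns where many users ask the same question.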
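A minimal in-memory sketch of exact-match caching is shown below. `ExactMatchCache` and the injected `call_model` function are hypothetical; a production version would typically use a shared store and an expiry policy, neither of which is shown here.

```python
import hashlib
from typing import Callable, Dict


class ExactMatchCache:
    """Return the stored response when the exact same prompt is seen again."""

    def __init__(self, call_model: Callable[[str], str]) -> None:
        self._call_model = call_model
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def get(self, prompt: str) -> str:
        # Hash the prompt so keys have a fixed size regardless of prompt length.
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        response = self._call_model(prompt)  # the only place a paid call happens
        self._store[key] = response
        return response
```

The hit and miss counters make the saving measurable: the hit rate is the fraction of calls that were never sent to the model.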

Semantic caching stores responses indexed by embedding similarity, so queries that are semantically similar but not lexically identical can be served from cache. This requires a similarity threshold below which a new model call is made rather than returning the cached response.
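The threshold logic can be sketched as follows. The embedding function is injected because the choice of embedding model is outside the scope of this article; `SemanticCache`, the linear scan over entries, and the default threshold of 0.9 are illustrative assumptions, and a real system would use a vector index and a threshold tuned on its own data.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Serve a cached response when a stored query is similar enough."""

    def __init__(
        self,
        embed: Callable[[str], Sequence[float]],
        call_model: Callable[[str], str],
        threshold: float = 0.9,
    ) -> None:
        self._embed = embed
        self._call_model = call_model
        self._threshold = threshold
        self._entries: List[Tuple[Sequence[float], str]] = []

    def get(self, prompt: str) -> str:
        query = self._embed(prompt)
        best_score = -1.0
        best_response: Optional[str] = None
        for vector, response in self._entries:
            score = cosine_similarity(query, vector)
            if score > best_score:
                best_score, best_response = score, response
        if best_response is not None and best_score >= self._threshold:
            return best_response  # similar enough: no model call
        response = self._call_model(prompt)  # below threshold: pay for a new call
        self._entries.append((query, response))
        return response
```

Setting the threshold is a quality decision as much as a cost decision: too low and users receive answers to questions they did not ask, too high and the cache rarely hits.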

Fine-tuning trades an upfront investment for a lower per-call cost. If a fine-tuned smaller model reduces the cost per call from $0.01 to $0.001, the fine-tuning investment breaks even after a number of calls that depends on the upfront compute cost. The arithmetic is worth doing explicitly before committing to fine-tuning infrastructure: at what call volume does the per-call saving justify the upfront investment?
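The break-even point is the upfront cost divided by the per-call saving. A minimal sketch follows; the $50,000 upfront figure in the usage lines is a made-up placeholder, not a typical fine-tuning cost, and ongoing costs such as retraining and hosting are ignored.

```python
import math


def break_even_calls(upfront_cost: float, cost_per_call_before: float, cost_per_call_after: float) -> int:
    """Number of calls after which the per-call saving has repaid the upfront cost."""
    saving_per_call = cost_per_call_before - cost_per_call_after
    if saving_per_call <= 0:
        raise ValueError("fine-tuning must reduce the per-call cost to break even")
    return math.ceil(upfront_cost / saving_per_call)


# Hypothetical upfront cost; the per-call prices are the ones used in the text.
calls_needed = break_even_calls(50_000, 0.01, 0.001)
days_at_1m_per_day = calls_needed / 1_000_000
print(f"break-even after {calls_needed:,} calls, about {days_at_1m_per_day:.1f} days at 1M calls/day")
```

At low call volumes the same calculation can return a break-even point that is years away, which is the case the explicit arithmetic is meant to catch.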

Evaluation-Driven Development

The minimum viable eval for a new AI feature has three components.

A correctness eval measures: given 100 known inputs with known correct outputs, how many of the model's outputs are correct? This requires a labeled dataset.

A safety eval measures: given 50 adversarial inputs designed to elicit harmful outputs, how many outputs violate the product's safety requirements?

A UX eval measures: given 20 real user tasks, what fraction of users complete the task successfully with AI assistance?

The ship rule: do not ship an AI feature without a correctness eval score. Without a baseline, the team cannot know whether a subsequent change has improved or degraded the product.
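A correctness eval and a baseline comparison fit in a few lines. The sketch below assumes exact string match is an adequate notion of "correct", which holds for classification-style outputs but not for free-form text; `correctness_score` and `is_regression` are hypothetical names.

```python
from typing import Callable, List, Tuple


def correctness_score(model: Callable[[str], str], examples: List[Tuple[str, str]]) -> float:
    """Fraction of labeled examples for which the model output matches the label."""
    if not examples:
        raise ValueError("a correctness eval needs at least one labeled example")
    correct = sum(
        1 for prompt, expected in examples
        if model(prompt).strip() == expected.strip()
    )
    return correct / len(examples)


def is_regression(baseline_score: float, candidate_score: float, tolerance: float = 0.0) -> bool:
    """True when a prompt or model change scores worse than the recorded baseline."""
    return candidate_score < baseline_score - tolerance
```

Recording the score before the first release is what makes the second function usable: every later prompt change or model upgrade can be compared against it.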

This matters because AI systems fail differently than deterministic systems. Deterministic systems fail in deterministic ways: the same input always produces the same failure. AI systems fail probabilistically: the same input sometimes produces an excellent output, sometimes an acceptable one, and occasionally a wrong one. A unit test that passes on a specific input does not guarantee acceptable performance on the distribution of inputs the model will encounter in production.

What Goes Wrong

Two failure patterns recur across early AI products.

Shipping without evaluation loops. The team adds a new feature, changes a prompt, or upgrades a model without measuring the impact on output quality. The product changes in ways that are invisible to the team and visible to users. This is FM11 — Observability Blindness, applied to a probabilistic system: the failure mode that most engineering monitoring is not built to detect.

Over-reliance on a single LLM API. When the API provider changes model behaviour, updates pricing, or experiences an outage, the product has no fallback. A system that depends entirely on a single external AI API has the same structural fragility as a system with a single point of failure (FM1) — it is just a failure mode that most engineers have not encountered yet.
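One way to remove the single point of failure is to try providers in order. This is a minimal sketch: `call_with_fallback` is a hypothetical helper, each provider is represented by a plain function, and it ignores the fact that different models may need different prompts to produce comparable output.

```python
from typing import Callable, List, Tuple


def call_with_fallback(prompt: str, providers: List[Tuple[str, Callable[[str], str]]]) -> Tuple[str, str]:
    """Try each provider in order and return (provider_name, response) from the first success."""
    errors: List[str] = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:
            # Record the failure and move on to the next provider.
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

Returning the provider name alongside the response lets the evaluation layer track quality per provider, since a fallback model may not behave like the primary one.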

This article draws from Book 7, Chapter 12 — AI Product Architecture in The Computing Series. The book covers the full five-layer stack with each layer's design principles, the four product patterns end-to-end, the cost model arithmetic, the human-in-the-loop spectrum, and the evaluation-driven development practice in working depth.

The framework codes referenced here — AT9 (Speed vs Quality), FM1 (Single Point of Failure), FM11 (Observability Blindness), and the T12 (Tradeoffs) thread that runs through every book in the series — are part of the controlled engineering vocabulary the series builds.
