GitHub Copilot’s architecture illustrates the latency-quality tradeoff in a production AI product. Copilot must produce code suggestions while the user is typing — a latency requirement measured in hundreds of milliseconds. Achieving this with large language models required aggressive optimisation: model distillation to produce smaller, faster models; speculative completion to reduce sequential computation; and streaming outputs to provide perceived responsiveness before the full suggestion is generated. The product’s quality is a direct function of the engineering investment in serving optimisation.
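The streaming idea can be sketched in a few lines: what drives perceived responsiveness is time-to-first-token, not total generation time, so the client renders partial text as soon as the first token arrives. The sketch below is a simplification under loud assumptions — `generate_tokens` is a stand-in for a real decoder, and whitespace-split words stand in for model tokens; none of this is Copilot's actual serving stack.

```python
import time
from typing import Iterator, Tuple

def generate_tokens(suggestion: str) -> Iterator[str]:
    """Stand-in for a model decoding one token at a time
    (here, a 'token' is just a whitespace-separated word)."""
    for token in suggestion.split(" "):
        yield token + " "

def stream_suggestion(suggestion: str) -> Tuple[str, float]:
    """Consume tokens as they arrive, recording time-to-first-token.
    An editor would display `rendered` incrementally rather than
    waiting for the full suggestion."""
    rendered = ""
    start = time.monotonic()
    first_token_latency = 0.0
    for token in generate_tokens(suggestion):
        if not rendered:  # first token has just arrived
            first_token_latency = time.monotonic() - start
        rendered += token  # partial text is already displayable here
    return rendered.strip(), first_token_latency

text, ttft = stream_suggestion("def add(a, b): return a + b")
```

Even when the full suggestion takes hundreds of milliseconds to decode, a low time-to-first-token makes the product feel immediate — which is the point of streaming as a latency strategy.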
OpenAI’s production infrastructure for ChatGPT demonstrates cost model challenges at extreme scale. Inference for large language models at millions of users requires GPU clusters that represent billions of dollars in infrastructure investment. The cost per query is non-trivial. The product’s pricing model must recover this cost while remaining competitive. At scale, the cost model is the architecture.
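The arithmetic behind "the cost model is the architecture" is easy to make concrete. The figures below are purely illustrative — they are not OpenAI's actual GPU costs or throughput — but the structure is general: per-query cost is GPU cost divided by query throughput, and the price floor follows from the target margin.

```python
def cost_per_query(gpu_hour_cost: float, queries_per_gpu_hour: float) -> float:
    """Serving cost attributable to a single query:
    amortise the GPU's hourly cost over the queries it can serve."""
    return gpu_hour_cost / queries_per_gpu_hour

def breakeven_price(cost: float, gross_margin: float) -> float:
    """Minimum price per query that achieves the target gross margin
    on inference cost alone (ignoring training, staff, etc.)."""
    return cost / (1.0 - gross_margin)

# Hypothetical numbers: a $2.50/hour GPU serving 1,000 queries/hour.
c = cost_per_query(gpu_hour_cost=2.50, queries_per_gpu_hour=1000)
p = breakeven_price(c, gross_margin=0.5)
```

With these made-up inputs the per-query cost is a quarter of a cent — small in isolation, but multiplied by billions of queries it dictates which model sizes, batching strategies, and hardware choices are viable. That is why the cost model ends up shaping the architecture rather than merely pricing it.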
Midjourney’s approach to output variability — presenting multiple variations of each generation and allowing users to select and iterate — is a product design response to the probabilistic nature of diffusion model outputs. Rather than presenting a single output and asking users to prompt again if it is wrong, the product surfaces that variability as a feature: users select the variation that best matches their intent. This design turns the model’s variance from a bug into a differentiating interaction model.
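The interaction pattern itself — sample the same prompt under several seeds, show all results, let the user resolve the ambiguity — can be sketched without any real model. The `generate_variations` function below is a hypothetical stand-in: a seeded shuffle of a colour palette plays the role of a distinct diffusion sample, and nothing here reflects Midjourney's implementation.

```python
import random
from typing import List

def generate_variations(prompt: str, n: int = 4, base_seed: int = 0) -> List[str]:
    """Stand-in for a diffusion model: the same prompt sampled under
    different seeds yields different outputs, mimicking output variance."""
    variations = []
    for i in range(n):
        rng = random.Random(base_seed + i)  # each variation gets its own seed
        palette = ["red", "blue", "green", "gold"]
        rng.shuffle(palette)  # a toy 'sample' standing in for a distinct image
        variations.append(f"{prompt} [{'/'.join(palette)}]")
    return variations

options = generate_variations("a lighthouse at dusk")
chosen = options[1]  # the user, not the system, resolves the ambiguity
```

Seeding each sample makes the grid reproducible, so "upscale variation 2" or "make more like variation 2" can be implemented by reusing that variation's seed — selection and iteration become cheap operations on top of the model's variance.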