Recommendation Engine

Introduction

Netflix once ran a $1M competition to improve its recommendation algorithm by 10%. Hundreds of teams competed for three years. The winning team improved the algorithm by 10.06% — barely over the threshold. Netflix never fully deployed the winning system. The winning algorithm was an ensemble of 107 models that required too much compute time to run in production. The recommendation system deployed was simpler, less accurate, and fast enough.

That contest outcome captures the central tension of recommendation systems: accuracy and latency are in constant tension. The most accurate models require the most compute. Production systems must serve recommendations in under 100ms. The architecture of a recommendation engine is the story of how to get accuracy close to the theoretical maximum within an engineering budget.

The Problem

Given a user and a context — what they are watching, what they have liked, what similar users have engaged with — return a ranked list of N items predicted to maximise engagement. The system must serve recommendations within 100ms for millions of concurrent users. The underlying models must be trained on petabytes of interaction data. Item embeddings must be updated as new items are added and old items become stale.

Naive Approach and Why It Fails

A naive approach runs the full ranking model on all items for each request. If the catalogue has 10M items and the model takes 1ms per item, a single recommendation request requires 10,000 seconds of compute. This fails immediately.

The second naive approach is pure collaborative filtering at query time: compute cosine similarity between the requesting user and all other users, find the most similar users, return items those users liked. With 100M users, pairwise similarity computation takes hours, not milliseconds.

The production solution is a multi-stage pipeline: candidate generation narrows 10M items to 1,000; scoring ranks those 1,000 items; re-ranking applies business rules. Each stage is progressively more expensive and progressively more accurate.

Architecture Walkthrough

Multi-Stage Pipeline

Candidate Generation

Two approaches dominate:

Collaborative filtering: items similar to what the user has interacted with (item-to-item similarity), computed using embeddings. Pre-compute item embedding vectors offline; at query time, find nearest neighbours using approximate nearest neighbour search (T2 — a tree-indexed vector space).

Content-based filtering: items similar in attributes to what the user prefers. A user who watches action films gets action films. This does not require interaction data — it works for new users (cold start).

Matrix factorisation: decompose the user-item interaction matrix into user embeddings and item embeddings. Each user and item is a dense vector; the dot product of a user vector and an item vector predicts the interaction probability. Pre-compute all user and item embeddings offline.

Feature Store

Candidate scoring requires features: user demographic features, item attributes, interaction history, contextual signals (time of day, device type). Features come from two stores:

Offline feature store: a data warehouse with historical features. Used for model training and for features that do not need to be fresh (user age, item category).

Online feature store: a low-latency key-value store (Redis) with pre-computed features that must be fresh for serving. User’s recent interactions, item’s current popularity, time-since-last-viewed.

Feature store architecture:
  Offline:  Data warehouse (Snowflake, BigQuery) ← batch pipelines
  Online:   Redis cluster ← streaming pipelines (Kafka → Flink → Redis)
  Serving:  Feature retrieval joins online + offline at query time

Scoring Model

The scoring model is typically a neural network or gradient boosted tree that takes features as input and outputs a predicted engagement probability. It runs on every candidate in the shortlist (~1,000 items) at query time. Batching all 1,000 candidates into a single model call amortises the model loading cost.

Re-ranking

Business rules applied after scoring: diversity (avoid 10 consecutive items from the same creator), freshness boost (prefer newer items over equally scored older items), policy filters (age-gated content, sponsored items), serendipity injection (surface a fraction of items outside predicted preferences).

Feedback Loop

User interactions — clicks, views, likes, skips — are fed back into the training pipeline. More interactions → better model → better recommendations → more interactions. This is T11 (Feedback) operating at the system level.

Recommendation → User interaction → Interaction log → Training data
→ Updated model → Better recommendation (loop)

Key Design Decisions

AT4 — Precomputation/On-Demand: Candidate generation is precomputed (item embeddings are pre-built, ANN index is pre-built). Scoring is on-demand (features retrieved and model scored at query time). Re-ranking is on-demand. The boundary between precomputed and on-demand sits between candidate generation and scoring.

AT2 — Latency/Throughput: The online feature store (Redis) provides sub-millisecond feature retrieval for individual users. The offline feature store provides high-throughput training data. Training and serving use different stores because latency and throughput requirements are incompatible in a single system.

AT9 — Correctness/Performance: Approximate nearest neighbour search (HNSW, FAISS) finds candidates in O(log N) with >95% recall, not O(N) with 100% recall. The recall drop is acceptable; the latency reduction (from seconds to milliseconds) is not optional.

Failure Modes in This System

FM4 — Data Consistency Failure: Training-serving skew is the most insidious failure in ML systems. If the features used during training are computed differently from the features used during serving — different bucketing, different normalisation, different time windows — the model’s predictions in production will differ from its validation performance. Features must be computed identically in training and serving, which is why the feature store is the central component.

FM11 — Observability Blindness: Recommendation quality degrades without raising errors. If the feedback loop collapses — no new interaction data is processed, model weights stop updating — recommendations become stale and engagement drops. The signal is falling engagement metrics, not system errors. Click-through rate, session length, and explicit feedback are the observability instruments.

FM3 — Unbounded Resource Consumption: The candidate generation index grows as the catalogue grows. If item embeddings are not expired for deleted or unlisted items, the ANN index consumes unbounded memory. Explicit lifecycle management — removing items from the index when they are unlisted — is required.

How It Evolves at Scale

At 10×: the candidate generation index grows 10×. ANN index quantisation (reducing embedding precision from 32-bit float to 8-bit integer) reduces memory by 4×. The scoring model may be distilled — a smaller student model trained to approximate a larger teacher model, matching 95% of the quality at 20% of the compute.

At 100×: real-time personalisation with bandits — reinforcement learning that explores the recommendation space while exploiting known user preferences. The feedback loop runs in near-real-time (seconds, not hours), enabling rapid adaptation to user mood and context.

Real-World Variants

Netflix uses a multi-stage pipeline with separate models for different recommendation surfaces (home page, because-you-watched, similar items). The home page ranking uses tens of features; the similar items surface uses primarily item-embedding similarity.

YouTube published the 2016 paper “Deep Neural Networks for YouTube Recommendations” — the canonical reference for multi-stage recommendation. Two neural networks: one for candidate generation (user embedding to item embedding), one for ranking (many features, including impression history).

Spotify emphasises audio feature similarity alongside collaborative filtering. Track embeddings are built from audio analysis (tempo, key, energy) combined with listening behaviour embeddings.

Amazon uses item-to-item collaborative filtering at massive scale. “Customers who bought X also bought Y” is a precomputed similarity table, updated continuously as purchases occur.

Concept: Recommendation Engine

Thread: T11 (Feedback) ← Ch 6 (API Gateway) → Ch 18 (Ride-Sharing); T5 (Caching) ← Ch 8 (Autocomplete) → Ch 22 (ML Feature Store)

Core Idea: A multi-stage pipeline — candidate generation, scoring, re-ranking — makes recommendation tractable at scale by narrowing 10M items to 1,000 before applying expensive per-item scoring. The feature store separates training-time correctness from serving-time latency.

Tradeoff: AT4 — Precomputation vs On-Demand: item embeddings and ANN indexes are precomputed; per-user feature retrieval and model scoring happen on-demand at query time.

Failure Mode: FM4 — Data Consistency Failure: training-serving skew, where features are computed differently during training and serving, causes silent degradation of recommendation quality in production.

Signal: When a catalogue of more than 10,000 items must be personalised for millions of users in under 100ms, with engagement as the primary quality metric.

Maps to: Book 0, Framework 6 (System Archetypes), A5 (Data Intelligence)

Exercises

Level 1 — Understand

Name the three stages of the multi-stage recommendation pipeline and describe the primary purpose of each stage.
What is training-serving skew, and what FM code describes the failure it produces in a recommendation system?
What is the difference between the online feature store and the offline feature store in a recommendation system? Why can a single database not serve both?

Level 2 — Apply

A recommendation system has 5M items. Candidate generation returns 500 items. The scoring model takes 0.5ms to score a single item. (a) If items are scored sequentially, what is the scoring latency? (b) If all 500 items are batched into a single model call that takes 50ms, what is the total recommendation latency? (c) What AT code describes the tradeoff between batch size and latency?
A feature store records user’s last 10 interactions with TTL of 24 hours. A user interacts with a product, then immediately refreshes recommendations. (a) How fresh is the interaction feature? (b) If the feature is in the offline store with a 6-hour update pipeline, what FM code describes the staleness? (c) What store type and pipeline would you use to serve sub-second freshness?

Level 3 — Design

A music streaming platform has 60M tracks and 50M monthly active users. Design a recommendation system for the “Daily Mix” feature (6 personalised playlists, generated nightly for each user). Requirements: playlists generated within a 4-hour nightly window, each playlist personalised to a specific mood or genre cluster, must include both known favourites and new discoveries. Design the candidate generation, scoring, and re-ranking pipeline. How do you handle FM4 training-serving skew? What is the feedback loop (T11) and how do you detect if it is degrading?

A complete answer will: (1) design a three-stage pipeline — candidate generation (ANN retrieval from user embedding against 60M track embeddings, retrieving top-1000 candidates per playlist), scoring (a ranking model applied to candidates using user history features and track metadata), and re-ranking (diversity injection to ensure new discoveries meet a stated fraction of each playlist, e.g., 20% new tracks) — with a concrete 4-hour budget estimate across 50M users showing the pipeline must parallelise across a compute cluster, (2) name FM4 (stale data / training-serving skew) and identify its specific form here: the training data distribution differs from serving distribution because the model trained on historical plays, which are biased toward tracks that were already popular — the mitigation is logging exploration plays separately and retraining on a debiased dataset using inverse propensity scoring, (3) describe the feedback loop and its degradation signal: if the model is retrained on its own recommendations, it amplifies whatever genres it initially surfaced, narrowing diversity over time — detectable by tracking the entropy of genre distribution in generated playlists across weekly model retraining cycles, and (4) propose a concrete discovery mechanism with its AT9 tradeoff: injecting tracks from underexplored genres at re-ranking improves discovery but degrades short-term engagement (measured by skip rate) — the answer must state an explicit exploration rate and how it is tuned against engagement metrics.

Read in the book →