Designing a Recommendation Engine: Lessons

In 2006, Netflix offered one million dollars to anyone who could improve their recommendations by 10%. Three years later, a team called BellKor's Pragmatic Chaos won. They improved accuracy by 10.06%. Netflix never deployed the winning algorithm.

The model was too slow for production. It blended over 100 sub-models. Training took days. Scoring a single user took seconds. The research was brilliant. The engineering constraints made it unusable.

This is the gap between recommendation science and recommendation engineering. Every recommendation engine lives in that gap.

Three Design Decisions

Decision 1: Collaborative Filtering vs Content-Based

Collaborative filtering finds users who behave like you. It recommends what they liked. It requires no understanding of the content itself. Content-based filtering analyzes item attributes. It recommends items similar to what you already consumed.

Collaborative filtering needs no domain knowledge, but it struggles with new users and new items that have no interaction history yet (the cold-start problem). Content-based filtering needs feature engineering per domain but produces more explainable recommendations.

Most production systems use both. Netflix combines collaborative signals with content metadata. Spotify blends listening patterns with audio features. The hybrid trades simplicity for better recommendations.
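A minimal sketch of the two approaches, assuming a toy ratings table and hand-written genre tags. The data and the function names are hypothetical, not taken from any production system. The collaborative path only looks at who rated what. The content-based path only looks at item attributes.

```python
import math

# Hypothetical toy data: user -> {item: rating}
RATINGS = {
    "ana": {"alien": 5, "heat": 4},
    "ben": {"alien": 5, "heat": 4, "ronin": 5},
    "cy": {"amelie": 5, "heat": 1},
}

# Hypothetical item attributes, used only by the content-based path
TAGS = {
    "alien": {"scifi", "horror"},
    "heat": {"crime", "thriller"},
    "ronin": {"crime", "thriller", "action"},
    "amelie": {"romance", "comedy"},
}


def cosine(a, b):
    """Cosine similarity between two sparse rating vectors."""
    shared = set(a) & set(b)
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def collaborative(user):
    """Rank unseen items by similarity-weighted ratings from other users."""
    seen = RATINGS[user]
    scores = {}
    for other, theirs in RATINGS.items():
        if other == user:
            continue
        sim = cosine(seen, theirs)
        for item, rating in theirs.items():
            if item not in seen:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)


def jaccard(a, b):
    """Overlap between two tag sets."""
    return len(a & b) / len(a | b)


def content_based(user):
    """Rank unseen items by tag overlap with items the user already rated."""
    seen = RATINGS[user]
    scores = {}
    for item, tags in TAGS.items():
        if item in seen:
            continue
        scores[item] = sum(r * jaccard(tags, TAGS[s]) for s, r in seen.items())
    return sorted(scores, key=scores.get, reverse=True)
```

For "ana", both functions put "ronin" first, but for different reasons: the collaborative path because "ben" rates like her, the content-based path because "ronin" shares tags with "heat". A hybrid would blend the two score dictionaries before sorting.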

Decision 2: Precomputed Candidates vs Real-Time Scoring

You have 10 million items. A user opens the app. You have 200 milliseconds to show recommendations. You cannot score all 10 million items in 200 milliseconds.
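The arithmetic behind that claim can be sketched as follows. The per-item scoring cost of 20 microseconds is an assumed figure for illustration, not a number from the source. Real costs depend on the model and hardware.

```python
def scoring_time_ms(n_items: int, per_item_us: float) -> float:
    """Time to score n_items one after another, in milliseconds."""
    return n_items * per_item_us / 1000.0


def max_candidates(budget_ms: float, per_item_us: float) -> int:
    """Largest candidate set that fits in the budget at a given per-item cost."""
    return int(budget_ms * 1000.0 / per_item_us)


PER_ITEM_US = 20.0  # assumed cost of one model inference, for illustration only
```

Under that assumption, scoring all 10 million items takes 200 seconds, and the whole 200ms budget covers about 10,000 items. The candidate set has to be orders of magnitude smaller than the catalog.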

The solution: a two-stage pipeline. Stage one precomputes a candidate set offline. Stage two scores that smaller set in real time. The precomputed candidates are stale by definition. The real-time scoring is fresh but limited to what stage one selected.

This tradeoff is permanent. Precompute too aggressively and you miss emerging content. Score too much in real time and you blow your latency budget.

Decision 3: Batch Training vs Real-Time Updates

Models trained on yesterday's data miss today's trends. A song goes viral at noon. Batch-trained models will not recommend it until the next training run finishes tomorrow morning. Users see stale suggestions for roughly 18 hours.

Real-time model updates fix staleness but introduce instability. A burst of bot activity can poison recommendations within minutes. Production systems use a hybrid: batch training for the base model, real-time features for session context.
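One way to picture the hybrid is a sketch in which the batch model supplies a base score per item and the live session contributes a bounded boost. The scores, genres, and the `weight` parameter below are hypothetical. Capping the session contribution is one simple guard against a burst of activity swamping the base model.

```python
from collections import Counter

# Hypothetical output of last night's batch model: item -> base score
BASE_SCORES = {"song_a": 0.80, "song_b": 0.60, "song_c": 0.45}
# Hypothetical item metadata
ITEM_GENRE = {"song_a": "rock", "song_b": "jazz", "song_c": "jazz"}


def session_adjusted(base_scores, item_genre, session_genres, weight=0.3):
    """Rank items by batch score plus a session-affinity boost capped at `weight`."""
    counts = Counter(session_genres)
    total = len(session_genres)
    adjusted = {}
    for item, base in base_scores.items():
        # Share of this session spent on the item's genre, in [0, 1]
        affinity = counts[item_genre[item]] / total if total else 0.0
        adjusted[item] = base + weight * affinity
    return sorted(adjusted, key=adjusted.get, reverse=True)
```

With an empty session the batch order is returned unchanged. After three jazz plays, "song_b" moves ahead of "song_a", but no item can gain more than `weight`.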

The Three-Stage Pipeline

User opens app → 10M items → 200ms budget

Stage 1: Candidate Generation   (8ms)
         ANN index
         10M items → 1,000 candidates

Stage 2: Ranking                (25ms)
         ML model
         Features: watch history, time of day, device, freshness
         1,000 candidates → top 100

Stage 3: Re-Ranking             (10ms)
         Business rules
         Deduplicate series, boost new releases,
         enforce diversity, demote dismissed items
         100 → 30 shown

Total: ~43ms  ←  within 200ms budget
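The pipeline above can be sketched in a few functions. This is an illustration under loud assumptions: a brute-force dot product stands in for the ANN index, a hand-written score stands in for the ML model, and the `Item` fields are hypothetical. Only the stage sizes (1,000, 100, 30) come from the diagram.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    item_id: str
    series: str
    embedding: tuple
    is_new: bool = False


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def generate_candidates(user_vec, catalog, k):
    """Stage 1: narrow the catalog. Brute force here; a real system would query an ANN index."""
    return sorted(catalog, key=lambda it: dot(user_vec, it.embedding), reverse=True)[:k]


def rank(user_vec, candidates, k, freshness_boost=0.1):
    """Stage 2: score candidates. A toy formula stands in for the ranking model."""
    def score(it):
        return dot(user_vec, it.embedding) + (freshness_boost if it.is_new else 0.0)
    return sorted(candidates, key=score, reverse=True)[:k]


def rerank(ranked, dismissed, k):
    """Stage 3: business rules. Keeps one item per series and, as a
    simplification, drops dismissed items instead of demoting them."""
    shown, seen_series = [], set()
    for it in ranked:
        if it.item_id in dismissed or it.series in seen_series:
            continue
        seen_series.add(it.series)
        shown.append(it)
        if len(shown) == k:
            break
    return shown


def recommend(user_vec, catalog, dismissed=frozenset()):
    candidates = generate_candidates(user_vec, catalog, k=1000)
    ranked = rank(user_vec, candidates, k=100)
    return rerank(ranked, dismissed, k=30)
```

Each stage sees fewer items than the one before it, so the more expensive logic runs on the smaller sets.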

Two Failure Modes

Latency Amplification in the Scoring Path

A recommendation request touches multiple services. Candidate generation calls an embedding index. Scoring calls a feature store. Re-ranking calls a business rules engine. Each service adds latency.

Small delays add up across the chain. Candidate generation: 8ms. Scoring: 25ms. Re-ranking: 10ms. Feature store lookup: 12ms. Total: 55ms on a good day. One slow feature store response pushes the total past 200ms.

The fix: parallelize independent calls. Set hard timeouts on each stage. Return partial results rather than waiting for stragglers.
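A minimal sketch of that fix using Python's asyncio. The two fetch functions are hypothetical stand-ins for feature store calls, with the delay simulated by a sleep. The 50ms timeout is an assumed value.

```python
import asyncio


async def call_with_timeout(coro, timeout_s, fallback):
    """Await a call, returning `fallback` if it exceeds the hard timeout."""
    try:
        return await asyncio.wait_for(coro, timeout=timeout_s)
    except asyncio.TimeoutError:
        return fallback


async def fetch_user_features(delay_s):
    """Hypothetical feature store lookup; delay_s simulates service latency."""
    await asyncio.sleep(delay_s)
    return {"watch_history": ["a", "b"]}


async def fetch_item_features(delay_s):
    """Hypothetical second lookup that does not depend on the first."""
    await asyncio.sleep(delay_s)
    return {"freshness": 0.7}


async def gather_features(user_delay_s, item_delay_s, timeout_s=0.05):
    # The lookups are independent, so they run concurrently rather than in sequence.
    user, item = await asyncio.gather(
        call_with_timeout(fetch_user_features(user_delay_s), timeout_s, fallback={}),
        call_with_timeout(fetch_item_features(item_delay_s), timeout_s, fallback={}),
    )
    # A timed-out lookup contributes nothing; the caller gets partial features.
    return {**user, **item}
```

Calling `asyncio.run(gather_features(0.0, 0.5))` returns only the user features, because the slow item lookup is cut off at the timeout. The ranking stage then needs sensible defaults for whatever is missing.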

Hotspotting on Popular Items

Popular items attract disproportionate engagement signals. The model learns to recommend them more. More recommendations generate more engagement. The feedback loop concentrates traffic on a tiny set of items.

Hotspotting manifests as load imbalance on the serving layer. The embedding vectors for popular items get requested thousands of times more than long-tail items. Some shards run hot while others idle.

The fix: exploration budgets that deliberately inject less-popular items, combined with popularity-weighted sampling that discounts items above a threshold.
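Both ideas can be sketched in a few lines. The logarithmic discount, the impression threshold of 1,000, and the 10% exploration fraction are assumptions chosen for illustration. The source does not prescribe specific formulas.

```python
import math
import random


def discounted_score(score, impressions, threshold=1000):
    """Damp the score of items whose impression count exceeds `threshold`."""
    if impressions <= threshold:
        return score
    return score / (1.0 + math.log(impressions / threshold))


def with_exploration(ranked, long_tail, slots, explore_fraction=0.1, rng=None):
    """Fill `slots` positions, reserving a fraction of them for long-tail items."""
    rng = rng or random.Random()
    n_explore = min(int(slots * explore_fraction), len(long_tail))
    explore = rng.sample(long_tail, n_explore)
    # Top-ranked items fill the remaining slots, skipping anything already picked.
    exploit = [item for item in ranked if item not in explore][: slots - n_explore]
    # Simplification: exploration items go last; a real system might interleave them.
    return exploit + explore
```

The discount weakens the feedback loop at scoring time. The exploration slots guarantee that some long-tail items collect engagement signals at all.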

Evolution at 10x Scale

At 10x users, the candidate generation index no longer fits on one machine. You shard it. Sharding introduces consistency delays when new items are added.
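A sharded index is typically queried by scatter-gather: each shard returns its local top-k, and a coordinator merges them. The sketch below assumes shards are plain dictionaries of item embeddings searched by brute force. The function names are hypothetical.

```python
import heapq


def shard_top_k(shard, user_vec, k):
    """Top-k (score, item_id) pairs from one shard, by brute-force dot product."""
    scored = (
        (sum(u * v for u, v in zip(user_vec, vec)), item_id)
        for item_id, vec in shard.items()
    )
    return heapq.nlargest(k, scored)


def global_top_k(shards, user_vec, k):
    """Query every shard, then merge the per-shard results into a global top-k."""
    partials = [shard_top_k(shard, user_vec, k) for shard in shards]
    merged = heapq.nlargest(k, (pair for part in partials for pair in part))
    return [item_id for _, item_id in merged]
```

Each shard must return k results, not k divided by the number of shards, because the global top-k could all live on one shard. An item that has not yet reached its shard is invisible to the merge, which is where the consistency delay shows up.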

At 10x items, the ranking model's feature space explodes. You move to a feature store with pre-materialized features. The feature store becomes a new latency bottleneck.

At 10x request rate, the ANN index needs read replicas. Each replica serves a slightly different version of the index during updates. Users on different replicas see different recommendations for the same query.

The recommendation engine becomes the most complex subsystem in the product.

Concept: Recommendation engine — the three-stage pipeline

Tradeoff: AT4 — precomputed candidates are stale but fast; real-time scoring is fresh but limited to what stage one selected

Failure Mode: FM5 — each pipeline stage adds latency; one slow service pushes the total past the response budget

Signal: You need to personalize results from a catalog too large to score in real time

Series: Book 4, Ch 9