The Computing Series

Exercises

Level 1 — Understand

  1. Name the three stages of the multi-stage recommendation pipeline and describe the primary purpose of each stage.
  2. What is training-serving skew, and what FM code describes the failure it produces in a recommendation system?
  3. What is the difference between the online feature store and the offline feature store in a recommendation system? Why can a single database not serve both?

Level 2 — Apply

  1. A recommendation system has 5M items. Candidate generation returns 500 items. The scoring model takes 0.5ms to score a single item. (a) If items are scored sequentially, what is the scoring latency? (b) If all 500 items are batched into a single model call that takes 50ms, what is the total recommendation latency? (c) What AT code describes the tradeoff between batch size and latency?
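A quick sanity check of the arithmetic in parts (a) and (b) can be sketched as follows; the 0.5 ms per-item cost and 50 ms batch cost are the figures given in the exercise, and the variable names are illustrative:

```python
# Latency arithmetic for the scoring stage of the exercise above.
ITEMS = 500          # candidates returned by candidate generation
PER_ITEM_MS = 0.5    # sequential scoring cost per item (given)
BATCH_MS = 50.0      # one batched model call for all 500 items (given)

sequential_ms = ITEMS * PER_ITEM_MS  # (a) 500 * 0.5 = 250 ms
batched_ms = BATCH_MS                # (b) 50 ms for the scoring stage

print(sequential_ms)  # 250.0
print(batched_ms)     # 50.0
```

The 5x gap between the two numbers is what makes batching the default choice in serving systems, at the cost of the batch-size/latency tradeoff the exercise asks about in part (c).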

  2. A feature store records a user’s last 10 interactions with a TTL of 24 hours. A user interacts with a product, then immediately refreshes their recommendations. (a) How fresh is the interaction feature? (b) If the feature lives in the offline store behind a 6-hour update pipeline, what FM code describes the staleness? (c) What store type and pipeline would you use to serve sub-second freshness?
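The online-store side of this exercise can be sketched as a toy in-memory structure. This is a minimal illustration, not the book's design: the class name, the `record`/`last_interactions` methods, and the in-memory dict are all assumptions made for the sketch.

```python
import time
from collections import deque

class OnlineFeatureStore:
    """Toy online store: keeps a user's last-N interactions,
    expiring anything older than a TTL. Illustrative only."""

    def __init__(self, max_events=10, ttl_seconds=24 * 3600):
        self.max_events = max_events
        self.ttl = ttl_seconds
        self._events = {}  # user_id -> deque of (timestamp, item_id)

    def record(self, user_id, item_id, now=None):
        now = time.time() if now is None else now
        q = self._events.setdefault(user_id, deque(maxlen=self.max_events))
        q.append((now, item_id))  # deque(maxlen=N) evicts the oldest event

    def last_interactions(self, user_id, now=None):
        now = time.time() if now is None else now
        q = self._events.get(user_id, deque())
        # Filter out events older than the TTL before serving.
        return [item for ts, item in q if now - ts <= self.ttl]

store = OnlineFeatureStore()
store.record("u1", "productA", now=1000.0)
print(store.last_interactions("u1", now=1000.5))  # ['productA']
```

Because writes land in memory on the interaction path, the feature is fresh on the very next read; an offline store fed by a 6-hour batch pipeline cannot make that guarantee, which is the contrast the exercise is probing.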

Level 3 — Design

  1. A music streaming platform has 60M tracks and 50M monthly active users. Design a recommendation system for the “Daily Mix” feature (6 personalised playlists, generated nightly for each user). Requirements: playlists generated within a 4-hour nightly window, each playlist personalised to a specific mood or genre cluster, must include both known favourites and new discoveries. Design the candidate generation, scoring, and re-ranking pipeline. How do you handle FM4 training-serving skew? What is the feedback loop (T11) and how do you detect if it is degrading?

A complete answer will:

  1. Design a three-stage pipeline — candidate generation (ANN retrieval from the user embedding against 60M track embeddings, retrieving the top-1000 candidates per playlist), scoring (a ranking model applied to candidates using user-history features and track metadata), and re-ranking (diversity injection to ensure new discoveries meet a stated fraction of each playlist, e.g. 20% new tracks) — with a concrete 4-hour budget estimate across 50M users showing that the pipeline must parallelise across a compute cluster.
  2. Name FM4 (stale data / training-serving skew) and identify its specific form here: the training data distribution differs from the serving distribution because the model trained on historical plays, which are biased toward tracks that were already popular. The mitigation is logging exploration plays separately and retraining on a debiased dataset using inverse propensity scoring.
  3. Describe the feedback loop and its degradation signal: if the model is retrained on its own recommendations, it amplifies whatever genres it initially surfaced, narrowing diversity over time. This is detectable by tracking the entropy of the genre distribution in generated playlists across weekly model retraining cycles.
  4. Propose a concrete discovery mechanism with its AT9 tradeoff: injecting tracks from underexplored genres at re-ranking improves discovery but degrades short-term engagement (measured by skip rate). The answer must state an explicit exploration rate and how it is tuned against engagement metrics.
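The degradation signal named in the model answer, falling entropy of the genre distribution, can be computed directly. A minimal sketch, with illustrative genre data (the weekly samples below are invented for the example, not taken from any real system):

```python
import math
from collections import Counter

def genre_entropy(playlist_genres):
    """Shannon entropy (in bits) of the genre distribution across
    generated playlists. A falling weekly trend signals that the
    feedback loop is narrowing diversity."""
    counts = Counter(playlist_genres)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Illustrative data: a balanced week vs. a narrowed week.
week1 = ["rock", "jazz", "pop", "folk"] * 25          # uniform over 4 genres
week8 = ["rock"] * 85 + ["jazz", "pop", "folk"] * 5   # rock dominates

print(round(genre_entropy(week1), 2))  # 2.0 (= log2(4), maximum for 4 genres)
print(round(genre_entropy(week8), 2))  # ~0.85, well below the week-1 baseline
```

In practice the metric would be computed per retraining cycle and alerted on when it drops below a chosen fraction of its baseline; the threshold, like the exploration rate in point 4, is a tuning knob set against engagement metrics.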
