Netflix once ran a $1M competition to improve the accuracy of its recommendation algorithm by 10%. Hundreds of teams competed for three years, and the winning team improved the algorithm by 10.06% — barely over the threshold. Yet Netflix never fully deployed the winning system: it was an ensemble of 107 models that required too much compute time to run in production. The recommendation system Netflix actually deployed was simpler, less accurate, and fast enough.
That contest outcome captures the central trade-off of recommendation systems: accuracy and latency pull against each other. The most accurate models require the most compute, yet production systems must serve recommendations in under 100ms. The architecture of a recommendation engine is the story of getting accuracy as close as possible to the theoretical maximum within that engineering budget.
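The arithmetic behind that budget can be sketched in a few lines. This is a toy back-of-envelope model, not anything from Netflix's actual system: the per-model cost of 1 ms is an assumed figure, chosen only to show that an additive blend's serving latency grows linearly with ensemble size, so a 107-model ensemble blows a 100 ms budget that a single model fits easily.

```python
# Hypothetical cost model: assume each component model takes ~1 ms to
# score one (user, item) candidate at serving time.
PER_MODEL_COST_MS = 1.0
LATENCY_BUDGET_MS = 100.0

def serving_cost_ms(n_models: int) -> float:
    # An additive blend must run every component model on each request
    # before averaging, so latency scales linearly with ensemble size.
    return n_models * PER_MODEL_COST_MS

def fits_budget(n_models: int) -> bool:
    return serving_cost_ms(n_models) <= LATENCY_BUDGET_MS

print(serving_cost_ms(107), fits_budget(107))  # the Prize-winning blend
print(serving_cost_ms(1), fits_budget(1))      # a single deployed model
```

Under these assumptions the 107-model blend costs 107 ms and misses the budget, while a single model uses 1 ms and leaves headroom for candidate retrieval and ranking — the same shape of trade-off the contest outcome illustrates.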