The Computing Series

Exercises

Level 2 — Apply

A social network stores user profiles in a single PostgreSQL table with 200 million rows. The primary key is an auto-incrementing integer. The team decides to shard the table across 8 PostgreSQL nodes using modulo hash partitioning (shard = user_id % 8).
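The routing rule above can be sketched in a few lines; this is a minimal illustration of the modulo scheme, not the team's actual code:

```python
SHARD_COUNT = 8  # the number of nodes in the scenario above

def shard_for(user_id: int) -> int:
    """Route a row to a node by modulo hash: shard = user_id % 8."""
    return user_id % SHARD_COUNT

# Every lookup by primary key resolves to exactly one shard.
assert shard_for(12) == 4
assert shard_for(8) == 0
```

Note that the shard assignment depends directly on `SHARD_COUNT`, which is what makes question 1 interesting: changing the divisor changes the target shard for most keys.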

  1. The team needs to add a ninth shard. Estimate what fraction of rows must move. Write pseudocode for the migration process that keeps the system available during the move.

  2. The team discovers that 80% of reads are for users created in the last 30 days. Design an alternative partitioning scheme that exploits this access pattern. What query patterns does your scheme support that modulo hashing cannot?

  3. A “power user” account generates 10,000 API requests per minute. All requests hash to shard 4. Describe two mitigations. For each, specify what additional data structure or coordination mechanism is required.

  4. The team wants to support the query “find all users registered between 2023-01-01 and 2023-06-30” efficiently. Which partitioning scheme would you choose? What is the tradeoff you accept?
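As background for questions 2 and 4, here is a hedged sketch of one alternative, range partitioning by registration date. The boundary dates and shard layout are illustrative assumptions, not a prescribed answer:

```python
import bisect
from datetime import date

# Each shard owns a contiguous registration-date range; boundaries are
# illustrative. Dates before the first boundary fall on shard 0.
BOUNDARIES = [date(2023, 1, 1), date(2023, 7, 1), date(2024, 1, 1)]

def shard_for(registered: date) -> int:
    """Return the index of the shard whose range contains this date."""
    return bisect.bisect_right(BOUNDARIES, registered)

def shards_for_range(start: date, end: date) -> range:
    """A date-range query touches only the shards overlapping the range."""
    return range(shard_for(start), shard_for(end) + 1)
```

Under modulo hashing, a date-range query must visit every shard; here it visits only the shards whose ranges overlap the query, which is the kind of contrast question 2 asks you to articulate.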

Level 3 — Design

Design the partitioning strategy for a real-time analytics system that ingests 1 million events per second and must answer two types of queries: (1) “what happened to user X in the last 5 minutes?” and (2) “how many events of type Y occurred across all users in the last hour?”

  1. Explain why no single partitioning key satisfies both query types simultaneously. What does this tell you about the system design?

  2. Propose an architecture that handles both query types without requiring a full-cluster scan for either. Your design may use multiple physical data stores with different partitioning strategies. Specify the partition key, the routing mechanism, and the synchronisation mechanism for each store.

  3. The system must add nodes during peak traffic without pausing ingest. Specify the rebalancing protocol. What invariants must hold during the rebalancing window? How would you verify them?

  4. Three months after launch, the team notices that events of type “purchase” are 50× more frequent than any other event type, and all purchase events partition to the same shard under your original design. Describe the hotspot this creates and propose a modification to your design that addresses it without changing the query API.

A complete answer will:

  1. correctly explain why user_id and event_type cannot both be partition keys for the two query types, and what dual-store architecture this necessitates;
  2. identify FM6 (hotspot cascade on the purchase-type shard) as the failure mode and quantify the imbalance (50× normal load);
  3. address the AT5 tradeoff between sub-partitioning purchase events (distributes load, requires query-time fan-out) and routing all queries through a central aggregation tier (simpler, but creates a bottleneck);
  4. propose a concrete modification — such as salting the partition key for high-frequency event types — that spreads load without requiring the query API to change.
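Key salting, mentioned above as one acceptable modification, can be sketched as follows. The salt bucket count, shard count, and hash choice are illustrative assumptions:

```python
import zlib

SHARD_COUNT = 16   # illustrative cluster size
SALT_BUCKETS = 8   # purchase events fan out across this many sub-keys

def partition_key(event_type: str, sequence: int) -> str:
    """Append a rotating salt to the hot type so its writes spread
    across SALT_BUCKETS sub-keys instead of landing on one shard."""
    if event_type == "purchase":
        return f"{event_type}#{sequence % SALT_BUCKETS}"
    return event_type

def shard_for(key: str) -> int:
    # crc32 as a stand-in for any stable hash function
    return zlib.crc32(key.encode()) % SHARD_COUNT

def keys_to_scan(event_type: str) -> list[str]:
    """Query side: a salted type must fan out over all its sub-keys,
    hidden behind the unchanged query API."""
    if event_type == "purchase":
        return [f"{event_type}#{i}" for i in range(SALT_BUCKETS)]
    return [event_type]
```

The write path spreads load; the cost surfaces on the read path as query-time fan-out over the salted sub-keys, which is exactly the AT5 tradeoff a complete answer should weigh.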
