Exercises

Level 1 — Understand

Why is a B-tree’s branching factor (order) much larger than 2? What property of disk storage drives this design decision?
What is the difference between an index scan and a sequential table scan? Under what conditions does each win?
What traversal order does a stack-based interpreter use to evaluate an arithmetic expression AST? Why this order?

Level 2 — Apply

A PostgreSQL table has 10 million rows. The B-tree index on the primary key uses 8 KB pages and stores 400 entries per page.

What is the height of this B-tree?
How many disk reads does a lookup by primary key require?
A query selects rows where status = 'active' and 5% of rows have status = 'active'. There is an index on status. Should PostgreSQL use the index? Justify with complexity analysis.

Level 3 — Design

You are designing a file metadata service for a cloud storage system. The service stores metadata (name, size, owner, permissions, timestamps) for 100 billion files across 10 million directories. Directories can contain up to 1 million files each.

Choose a data structure for directory lookups (finding a file by name within a directory). Justify your choice given the directory size distribution.
The system needs to support prefix search: list all files whose name starts with “report_2024”. How does this change your data structure choice? Name the tradeoff (AT notation).
The system requires that listing all files in a directory returns results in alphabetical order. How does this constraint affect your design? What is the cost?

A complete answer will: (1) select and justify a data structure for directory lookups that handles both hash map–speed lookups and prefix search (e.g., B-tree or trie variant), named with its time complexity for each operation, (2) identify FM6 (hotspot cascade) as the risk when million-file directories are scanned and sorted on read, (3) address the AT2 tradeoff between keeping files sorted at write time (O(log n) insert cost, O(n) listing free) versus sorting at read time (O(1) insert, O(n log n) listing cost), and (4) propose a concrete storage layout that satisfies both prefix search and alphabetical listing efficiently at the 100-billion-file scale.

Read in the book →

← Real Systems