Tree Patterns Everywhere

Introduction

Open a PostgreSQL table with a million rows and query for a single value. PostgreSQL finds it in under a millisecond. Not because it scanned all million rows — because it maintained a sorted tree of keys on disk, and each comparison eliminates half the remaining candidates.

Open a directory with 100,000 files and look up a filename. The operating system finds it in microseconds. Not by scanning — by traversing a tree structure that organises filenames by their position in a sorted order.

Parse a Python expression and evaluate it. The interpreter builds a tree of nodes, traverses it in post-order, and evaluates each node as it goes.

Three different domains. Three different systems. One structure: a tree. The specific variant — B-tree, directory tree, parse tree — differs by the constraints of the storage medium. The traversal logic is the same in all three.

Thread Activation

In Book 1, you built binary search trees (Ch 25) and saw why B-trees minimise disk reads (Ch 26). The core insight of Ch 26: a disk read has a fixed cost regardless of how much data you read within a page boundary. A B-tree node is sized to match a disk page, so each level of the tree costs exactly one disk read.

This chapter traces Thread 2 (Trees) into three production systems. In each case, the specific tree variant is determined by the same analysis: what are the constraints of the storage medium, and what tree structure matches those constraints? The thread continues in Book 3, Ch 9, where B-tree indexes become the foundation of query optimisation.

The Concept

A tree is a connected, acyclic graph. Its two fundamental operations are insertion (place a new node in its correct position) and search (find a node by key). Both operations traverse from the root toward a leaf, making a comparison at each node to determine which child to follow.

The performance of these operations depends on tree height. A balanced tree with N nodes has height O(log N). An unbalanced tree (a pathological case where each node has only one child) has height O(N) — equivalent to a linked list.

Three factors vary across tree variants: - Branching factor — how many children a node can have (binary = 2, B-tree = hundreds) - Balance guarantee — whether the tree maintains equal height across branches (BST: none by default; AVL, Red-Black, B-tree: guaranteed) - Node size — how much data fits in a node (BST: one key; B-tree: hundreds of keys, sized to match a disk page)

Each production tree variant chooses these factors based on the constraints of where it lives.

How It Works

PostgreSQL: The B-Tree Index

A PostgreSQL index on a column stores the column values in a B-tree. Each B-tree node is one disk page (8 KB by default). A page holds ~400 index entries. With 400 entries per page and a balanced tree, a table with 400³ = 64 million rows requires a B-tree with only 3 levels. Three disk reads to find any row in a 64-million-row table.

The B-tree index page layout:

A heap pointer is a (block_number, tuple_offset) pair that tells PostgreSQL where the actual row lives in the table’s heap storage. The index stores keys; the heap stores full rows. An index scan first searches the B-tree for the key (O(log N) disk reads), then fetches the heap page (one more disk read).

# Simulating B-tree search logic (simplified, in-memory)
class BTreeNode:
    def __init__(self, is_leaf: bool = True, order: int = 4):
        self.is_leaf = is_leaf
        self.keys: list = []           # sorted keys in this node
        self.children: list = []        # child nodes (internal) or heap pointers (leaf)
        self.order = order              # max keys per node

    def search(self, key) -> tuple | None:
        """Binary search within this node's keys."""
        lo, hi = 0, len(self.keys) - 1
        while lo <= hi:
            mid = (lo + hi) // 2
            if self.keys[mid] == key:
                return (self, mid)
            elif self.keys[mid] < key:
                lo = mid + 1
            else:
                hi = mid - 1
        return None

def btree_search(node: BTreeNode, key) -> str | None:
    """
    Search for key in B-tree rooted at node.
    Returns heap pointer (simulated as string) or None.
    """
    result = node.search(key)
    if result:
        _, idx = result
        return node.children[idx]  # leaf: heap pointer

    if node.is_leaf:
        return None

    # Find correct child: first key > search key determines which subtree
    for i, k in enumerate(node.keys):
        if key < k:
            return btree_search(node.children[i], key)
    return btree_search(node.children[-1], key)

The critical property: at every level, the number of pages read is one (one disk I/O to load a node). The tree height is log_order(N). With order=400, height = log₄₀₀(N). For N = 64 million: height = 3.

This is why B-trees dominate database indexing. Not because binary search trees are wrong — but because B-trees are binary search trees redesigned for a storage medium where reading one byte costs the same as reading 8 KB.

File Systems: Directory Trees

Every major file system (ext4, APFS, NTFS, HFS+) stores directory entries in a tree. The tree is traversed whenever you resolve a path.

When you open /usr/local/bin/python, the operating system: 1. Starts at the root directory inode 2. Looks up usr in the root directory’s tree of entries → gets the inode for /usr 3. Looks up local in /usr’s tree → gets the inode for /usr/local 4. Looks up bin → inode for /usr/local/bin 5. Looks up python → inode for the file

Each step is a tree lookup in the directory structure for that path component. On ext4, directories with more than a few entries use an HTree (a hash tree): directory entries are hashed and stored in a B-tree-like structure, giving O(log N) lookup per directory component.

# Simplified path resolution
def resolve_path(path: str, filesystem: dict) -> int | None:
    """
    Resolve an absolute path to an inode number.
    filesystem: dict mapping inode -> {name: inode} for directories
    Returns inode number or None if not found.
    """
    components = [c for c in path.split('/') if c]
    current_inode = filesystem['root']

    for component in components:
        directory = filesystem.get(current_inode, {})
        if component not in directory:
            return None
        current_inode = directory[component]

    return current_inode

The number of disk reads for a path resolution is proportional to the path depth (number of / separators) multiplied by the height of each directory’s internal tree. A file at depth 5 with 10,000 files per directory requires 5 × O(log 10,000) = 5 × ~14 = ~70 comparisons. Still fast — but deep paths in crowded directories are slower than shallow paths in sparse directories.

Parsers: Abstract Syntax Trees

Every language interpreter and compiler converts source code into an abstract syntax tree (AST) before execution or compilation. The AST represents the logical structure of the program, stripped of syntactic noise (parentheses, commas, semicolons).

Consider the expression (a + b) * (c - d). The AST is:

       *
      / \
     +   -
    / \ / \
   a  b c  d

Parsing builds this tree; evaluation traverses it. The standard traversal order for arithmetic expressions is post-order: process both children before the parent. Post-order traversal of the tree above visits: a, b, +, c, d, -, * — which is exactly the order a stack-based interpreter evaluates the expression.

from dataclasses import dataclass
from typing import Union

@dataclass
class BinaryOp:
    op: str
    left: 'Node'
    right: 'Node'

@dataclass
class Literal:
    value: float

Node = Union[BinaryOp, Literal]

def evaluate(node: Node) -> float:
    """
    Post-order traversal of an AST.
    Children are evaluated before the operator is applied.
    """
    if isinstance(node, Literal):
        return node.value

    left_val = evaluate(node.left)   # evaluate left subtree first
    right_val = evaluate(node.right) # then right subtree

    # Apply operator at this node (post-order: parent processed last)
    ops = {'+': lambda a, b: a + b,
           '-': lambda a, b: a - b,
           '*': lambda a, b: a * b,
           '/': lambda a, b: a / b}
    return ops[node.op](left_val, right_val)

# (a + b) * (c - d) with a=3, b=4, c=10, d=2
tree = BinaryOp('*',
    BinaryOp('+', Literal(3), Literal(4)),
    BinaryOp('-', Literal(10), Literal(2)))

result = evaluate(tree)  # ((3+4) * (10-2)) = 56

The AST’s height is the nesting depth of the expression. Deep nesting (long chains of function calls, deeply nested conditionals) produces tall trees. Recursion-based traversal on a tall tree risks stack overflow — a practical constraint that compilers address by transforming tail recursion or by using iterative traversal with an explicit stack.

Tradeoffs

AT4 — Precomputation vs. On-Demand

A B-tree index is precomputed at write time: every INSERT or UPDATE must update the index. The benefit arrives at query time: searches that would scan the entire table instead walk the tree. The tradeoff is explicit in PostgreSQL’s query planner: for small tables or queries that touch most rows, a sequential scan is faster than an index scan (because the heap fetches following each pointer cause random I/O, which is slower than sequential I/O for large result sets). For large tables and selective queries, the index wins.

AT3 — Simplicity vs. Flexibility

A binary search tree is simple: two children per node, straightforward to implement. A B-tree is complex: variable numbers of children, node splits and merges on insertion and deletion, careful page management. The flexibility a B-tree provides — performance on disk-based storage — requires substantially more implementation complexity. PostgreSQL’s B-tree index code spans thousands of lines. The binary search tree from Book 1 is fewer than 100.

Where It Fails

FM5 — Latency Amplification

An index lookup is faster than a full table scan only when the number of heap fetches is small relative to the table size. For a query that returns 30% of a table, the index requires 30% of N heap page fetches, each a random I/O. A sequential table scan reads pages sequentially. Sequential I/O is 10–100× faster than random I/O on spinning disks; the sequential scan wins. PostgreSQL’s query planner estimates this and chooses accordingly. Forcing an index scan on a poorly selective query amplifies latency.

FM8 — Schema/Contract Violation

B-tree indexes impose an ordering on keys. Changing a column’s data type (e.g., from integer to string) requires rebuilding the index, because the ordering changes. This is a schema migration that takes time proportional to table size. In large tables, this migration can take hours and locks the table. The B-tree’s dependence on ordering makes schema changes expensive — a contract between the column type and the index structure.

Real Systems

MySQL InnoDB uses B+ trees (a variant where all data is stored in leaf nodes, with leaf nodes linked in a doubly-linked list for efficient range scans). The clustered index is the primary key B+ tree; secondary indexes point to the primary key, not directly to the row. A secondary index lookup requires two B-tree traversals: one to find the primary key, one to find the row.

Linux ext4 uses HTree indexes for directories with more than ~10 entries. The HTree is a two-level radix tree based on a hash of the filename. Lookup is O(1) average (one hash computation, two tree levels). Large directories (/usr/bin with thousands of entries) benefit from this design; the alternative would be linear scan of directory blocks.

Python’s ast module exposes the AST for any Python expression or statement. Linters (pylint, flake8), static type checkers (mypy), and code formatters (black) work by traversing this AST and applying rules at each node. The AST’s structure makes it easy to find all function calls, all variable assignments, or all return statements without writing a custom parser.

Concept: B-Tree Index (and tree structures in production systems)

Core Idea: A B-tree is a BST redesigned for disk-based storage: nodes hold hundreds of keys (sized to match a disk page), so tree height is log₄₀₀(N) instead of log₂(N). Three disk reads can search 64 million rows. The same principle appears in file system directories and parse trees — each tree variant matches its branching factor and node size to the constraints of its storage medium.

Tradeoff: AT4 — Precomputation vs. On-Demand (the index is precomputed at write time; it pays off only for selective queries; sequential scans beat index scans when many rows match)

Failure Mode: FM5 — Latency Amplification (index lookup triggers random heap fetches; for low-selectivity queries this is slower than sequential scan)

Signal: When query latency is high and the query filters on a column — check if an index exists and whether the query’s selectivity justifies using it. When table inserts are slow — check write amplification from index maintenance.

Exercises

Level 1 — Understand

Why is a B-tree’s branching factor (order) much larger than 2? What property of disk storage drives this design decision?
What is the difference between an index scan and a sequential table scan? Under what conditions does each win?
What traversal order does a stack-based interpreter use to evaluate an arithmetic expression AST? Why this order?

Level 2 — Apply

A PostgreSQL table has 10 million rows. The B-tree index on the primary key uses 8 KB pages and stores 400 entries per page.

What is the height of this B-tree?
How many disk reads does a lookup by primary key require?
A query selects rows where status = 'active' and 5% of rows have status = 'active'. There is an index on status. Should PostgreSQL use the index? Justify with complexity analysis.

Level 3 — Design

You are designing a file metadata service for a cloud storage system. The service stores metadata (name, size, owner, permissions, timestamps) for 100 billion files across 10 million directories. Directories can contain up to 1 million files each.

Choose a data structure for directory lookups (finding a file by name within a directory). Justify your choice given the directory size distribution.
The system needs to support prefix search: list all files whose name starts with “report_2024”. How does this change your data structure choice? Name the tradeoff (AT notation).
The system requires that listing all files in a directory returns results in alphabetical order. How does this constraint affect your design? What is the cost?

A complete answer will: walk the reasoning step by step, cite the relevant tradeoff and failure-mode codes, and show the calculation or invariant that supports the conclusion.

Read in the book →