A web crawler’s job is to discover and index the web. The web is a directed graph: pages are nodes; hyperlinks are directed edges. The crawler starts at seed URLs and expands outward, discovering new pages by following links.
BFS is the natural traversal here because it explores pages closest to the seeds first. This matters for relevance: pages linked directly from authoritative starting points tend to be more relevant than pages reachable only through long chains of links. A BFS crawler stays close to the seed set for longer; a DFS crawler might dive into one remote corner of the web and not return to the main body for a long time.
from collections import deque
from urllib.parse import urlparse
import time


class WebCrawler:
    """
    BFS web crawler with politeness delays and domain rate limiting.
    """

    def __init__(self, seed_urls: list[str], max_pages: int = 1000):
        self.frontier: deque[str] = deque(seed_urls)
        self.visited: set[str] = set(seed_urls)
        self.max_pages = max_pages
        self._last_visit: dict[str, float] = {}  # domain -> last fetch time

    def _is_polite(self, url: str, delay_seconds: float = 1.0) -> bool:
        """
        Politeness: do not fetch from the same domain more than once per
        delay_seconds. This prevents overloading individual servers.
        """
        domain = urlparse(url).netloc
        now = time.time()
        last = self._last_visit.get(domain, 0.0)
        return now - last >= delay_seconds

    def _fetch_and_extract_links(self, url: str) -> list[str]:
        """Fetch page and extract links. A real implementation uses an HTTP client and an HTML parser."""
        self._last_visit[urlparse(url).netloc] = time.time()
        # In production: HTTP GET, parse HTML, extract <a href="...">
        return []  # placeholder

    def crawl(self) -> list[str]:
        """BFS crawl. Returns the list of visited URLs in BFS order."""
        visited_order: list[str] = []
        while self.frontier and len(visited_order) < self.max_pages:
            url = self.frontier.popleft()  # FIFO: process oldest-discovered URLs first
            if not self._is_polite(url):
                # Requeue for later; a real crawler would pull from another
                # domain or sleep rather than spin on one impolite URL.
                self.frontier.append(url)
                continue
            links = self._fetch_and_extract_links(url)
            visited_order.append(url)
            for link in links:
                if link not in self.visited:
                    self.visited.add(link)
                    self.frontier.append(link)  # add to the back of the queue (BFS)
        return visited_order

The visited set is the critical component. Without it, BFS on a graph with cycles never terminates — the crawler would follow A→B→A→B forever. The visited set ensures each node is processed once: O(V + E) total work.
The frontier can grow very large. A BFS that fans out aggressively can have millions of URLs in the frontier at once. Memory management of the frontier is a practical constraint that textbook BFS ignores: production crawlers use a combination of in-memory queues and on-disk queues (priority queues sorted by page rank estimate or domain).
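One way to picture the in-memory/on-disk split is a two-tier frontier: a bounded FIFO for the hot tier, spilling overflow into a priority-ordered backing store. The sketch below is illustrative, not a production design — `TwoTierFrontier`, its `memory_limit` parameter, and the use of an in-process heap as a stand-in for the on-disk queue are all assumptions for the demo.

```python
import heapq
from collections import deque


class TwoTierFrontier:
    """Sketch of a bounded in-memory FIFO that spills overflow to a
    priority-ordered backing store (a heap stands in for disk here)."""

    def __init__(self, memory_limit: int = 1000):
        self.memory_limit = memory_limit
        self.hot: deque[str] = deque()                     # in-memory FIFO tier
        self.cold: list[tuple[float, int, str]] = []       # (priority, seq, url) heap
        self._seq = 0                                      # tie-breaker for equal priorities

    def push(self, url: str, priority: float = 0.0) -> None:
        if len(self.hot) < self.memory_limit:
            self.hot.append(url)
        else:
            # Spill to the backing store; lower priority value = crawl sooner.
            heapq.heappush(self.cold, (priority, self._seq, url))
            self._seq += 1

    def pop(self) -> str:
        if not self.hot and self.cold:
            # Refill the hot tier from the backing store, best-priority first.
            while self.cold and len(self.hot) < self.memory_limit:
                _, _, url = heapq.heappop(self.cold)
                self.hot.append(url)
        return self.hot.popleft()

    def __len__(self) -> int:
        return len(self.hot) + len(self.cold)
```

URLs pushed while the hot tier has room come back in plain FIFO (BFS) order; once the limit is hit, spilled URLs come back by priority estimate instead, which is exactly the departure from textbook BFS that production crawlers accept.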
A package manager must install all transitive dependencies of a requested package and determine the correct installation order (each dependency before the packages that depend on it). This is DFS on a dependency graph, combined with topological sort (Chapter 8).
DFS is appropriate here because we need complete dependency trees — we want to follow one path to its leaf (a package with no further dependencies) before backtracking. BFS would give us all level-1 dependencies before level-2 dependencies; DFS gives us a complete path, which is what recursive installation needs.
def resolve_dependencies(
    package: str,
    registry: dict[str, list[str]]
) -> list[str]:
    """
    DFS dependency resolution.
    registry: package_name -> [direct_dependencies]
    Returns installation order: dependencies before the packages that need them.
    Raises ValueError on circular dependency.
    """
    resolved: list[str] = []
    visiting: set[str] = set()  # currently in the DFS call stack (cycle detection)
    visited: set[str] = set()   # fully resolved

    def dfs(pkg: str) -> None:
        if pkg in visited:
            return
        if pkg in visiting:
            raise ValueError(f"Circular dependency detected: {pkg}")
        visiting.add(pkg)
        for dep in registry.get(pkg, []):
            dfs(dep)  # resolve the dependency first (DFS goes deep before backtracking)
        visiting.discard(pkg)
        visited.add(pkg)
        resolved.append(pkg)  # add after all dependencies are resolved (post-order)

    dfs(package)
    return resolved


# Example: flask -> werkzeug, click; click -> colorama
registry = {
    'flask': ['werkzeug', 'click'],
    'click': ['colorama'],
    'werkzeug': [],
    'colorama': []
}
order = resolve_dependencies('flask', registry)
# order: ['werkzeug', 'colorama', 'click', 'flask']
# Dependencies always appear before the packages that need them

The two-set cycle detection (visiting vs visited) is essential. visiting tracks the current DFS path: if we encounter a node already in visiting, we have found a back edge — a cycle. visited tracks fully processed nodes: we do not re-process them.
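To see both sets earn their keep, run the resolver on a deliberately cyclic registry and on a diamond-shaped one. The snippet below is a condensed, standalone restatement of resolve_dependencies (renamed `install_order` so it runs on its own); the `cyclic` and `diamond` registries are made-up examples.

```python
def install_order(package: str, registry: dict[str, list[str]]) -> list[str]:
    # Condensed restatement of resolve_dependencies, for a runnable demo.
    resolved, visiting, visited = [], set(), set()

    def dfs(pkg: str) -> None:
        if pkg in visited:
            return
        if pkg in visiting:
            raise ValueError(f"Circular dependency detected: {pkg}")
        visiting.add(pkg)
        for dep in registry.get(pkg, []):
            dfs(dep)
        visiting.discard(pkg)
        visited.add(pkg)
        resolved.append(pkg)

    dfs(package)
    return resolved


# A genuine cycle: a depends on b, b depends back on a. visiting catches it.
cyclic = {'a': ['b'], 'b': ['a']}
try:
    install_order('a', cyclic)
except ValueError as e:
    print(e)  # Circular dependency detected: a

# A diamond (shared dependency) is NOT a cycle: visited ensures c is
# resolved once, while visiting never sees the same node twice on one path.
diamond = {'app': ['x', 'y'], 'x': ['c'], 'y': ['c'], 'c': []}
print(install_order('app', diamond))  # ['c', 'x', 'y', 'app']
```

The diamond case is the one that trips up a single-set implementation: with only visiting, the second edge into c would be misreported as a cycle; with only visited, a true cycle would recurse forever.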
A garbage collector identifies unreachable objects and reclaims their memory. The memory graph: every object is a node; every object reference (a pointer from one object to another) is a directed edge.
The mark phase of a mark-and-sweep GC is BFS or DFS starting from the GC roots — a set of guaranteed-live objects: global variables, stack-allocated variables, registers. Any object reachable from a root is live; anything not reachable is garbage.
from collections import deque


class Object:
    def __init__(self, name: str):
        self.name = name
        self.references: list['Object'] = []  # outgoing references
        self.marked = False


def mark_phase(roots: list[Object]) -> set[Object]:
    """
    BFS mark phase: traverses the object reference graph from GC roots.
    Marks all reachable objects as live.
    Returns the set of live objects.
    """
    live: set[Object] = set()
    frontier = deque(roots)
    while frontier:
        obj = frontier.popleft()
        if obj in live:
            continue
        live.add(obj)
        obj.marked = True
        for referenced in obj.references:
            if referenced not in live:
                frontier.append(referenced)
    return live


def sweep_phase(all_objects: list[Object], live: set[Object]) -> list[Object]:
    """
    Sweep phase: free all objects not marked as live.
    Returns the list of collected (freed) objects.
    """
    garbage = [obj for obj in all_objects if obj not in live]
    # In a real GC, freed memory is returned to the allocator
    return garbage

The GC traversal must over-approximate liveness: it marks everything reachable, even if some reachable objects will never be accessed again by the program. This is the fundamental limitation of tracing GC — it cannot distinguish “reachable but will never be used” (a memory leak) from “reachable and needed.” This is why long-lived data structures that accumulate references cause memory leaks even in garbage-collected languages.
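A tiny reachability demo makes the mark/sweep split concrete, and shows one strength of tracing collectors: an unreachable cycle is collected, where pure reference counting would leak it. The snippet condenses the definitions above into one standalone piece (`Obj`, `mark`, `sweep`, and the five-object heap are illustrative names for this demo, not a real allocator).

```python
from collections import deque


class Obj:
    # Condensed stand-in for the Object class above.
    def __init__(self, name: str):
        self.name = name
        self.references: list['Obj'] = []


def mark(roots: list[Obj]) -> set[Obj]:
    # BFS over the reference graph from the GC roots.
    live: set[Obj] = set()
    frontier = deque(roots)
    while frontier:
        obj = frontier.popleft()
        if obj in live:
            continue
        live.add(obj)
        frontier.extend(r for r in obj.references if r not in live)
    return live


def sweep(all_objects: list[Obj], live: set[Obj]) -> list[Obj]:
    return [o for o in all_objects if o not in live]


# Heap: root -> a -> b, plus an orphaned cycle c <-> d with no path from a root.
root, a, b, c, d = (Obj(n) for n in ['root', 'a', 'b', 'c', 'd'])
root.references = [a]
a.references = [b]
c.references = [d]
d.references = [c]  # cycle, but unreachable from any root

live = mark([root])
garbage = sweep([root, a, b, c, d], live)
print(sorted(o.name for o in garbage))  # ['c', 'd']: the unreachable cycle is collected
```

Note that c and d reference each other, so their reference counts would never drop to zero; the mark phase ignores reference counts entirely and asks only one question, reachability from the roots.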