BFS and DFS in Production

Introduction

In 1998, Sergey Brin and Larry Page described the algorithm they used to map the web. It was not sophisticated. It maintained a frontier of URLs to visit, fetched each page, extracted links, and added unvisited links to the frontier. The frontier was a queue. The traversal was breadth-first.

That same year, npm was a decade away from existing. When npm arrived, it needed to determine which packages to install when you added a dependency. It traversed the dependency graph starting from your package, visited each dependency recursively, and detected cycles to prevent infinite loops. The traversal used a stack (implicitly, through recursion). It was depth-first.

The same two graph traversal algorithms — BFS with a queue, DFS with a stack — appear in different forms across every layer of the computing stack. The structure of the problem determines which traversal you need. The engineering around that traversal (politeness delays, cycle detection, visited sets) is what distinguishes a working system from a homework assignment.

Thread Activation

In Book 1, Ch 28, you implemented BFS and DFS on adjacency lists. BFS uses a queue and explores level by level. DFS uses a stack (or recursive call stack) and explores one path fully before backtracking.

This chapter traces T3 (Graphs) using T4 (Queues): the frontier queue of BFS and the explicit or implicit stack of DFS are the data structures that define the traversal order. In Book 3, the graph thread continues with network topology, service mesh routing, and social graph storage at scale.

The Concept

BFS and DFS answer the same question — “which nodes are reachable from a starting node?” — but in different orders and with different properties.

Property	BFS	DFS
Data structure	Queue (FIFO)	Stack (LIFO) or recursion
Traversal order	Level by level (closest nodes first)	One path fully before backtracking
Finds shortest path?	Yes (in unweighted graphs)	No
Memory usage	O(width of graph)	O(depth of graph)
Cycle detection	Via visited set	Via visited set

The choice between BFS and DFS is a choice about which order you need results. BFS gives you the nearest results first. DFS gives you complete paths before backtracking.

How It Works

Web Crawler: BFS on the Web Graph

A web crawler’s job is to discover and index the web. The web is a directed graph: pages are nodes; hyperlinks are directed edges. The crawler starts at seed URLs and expands outward, discovering new pages by following links.

BFS is the correct traversal because BFS explores pages closest to the seeds first. This matters for relevance: pages linked from authoritative starting points are more important than pages reachable only through long chains. A BFS crawler stays close to the seed set for longer; a DFS crawler might dive into one remote corner of the web and not return to the main body for a long time.

from collections import deque
from urllib.parse import urlparse
import time

class WebCrawler:
    """
    BFS web crawler with politeness delays and domain rate limiting.
    """
    def __init__(self, seed_urls: list[str], max_pages: int = 1000):
        self.frontier: deque[str] = deque(seed_urls)
        self.visited: set[str] = set(seed_urls)
        self.max_pages = max_pages
        self._last_visit: dict[str, float] = {}  # domain -> last fetch time

    def _is_polite(self, url: str, delay_seconds: float = 1.0) -> bool:
        """
        Politeness: do not fetch from the same domain more than once per delay_seconds.
        This prevents overloading individual servers.
        """
        domain = urlparse(url).netloc
        now = time.time()
        last = self._last_visit.get(domain, 0)
        return now - last >= delay_seconds

    def _fetch_and_extract_links(self, url: str) -> list[str]:
        """Fetch page and extract links. Real implementation uses HTTP client + HTML parser."""
        self._last_visit[urlparse(url).netloc] = time.time()
        # In production: HTTP GET, parse HTML, extract <a href="...">
        return []  # placeholder

    def crawl(self) -> list[str]:
        """BFS crawl. Returns list of visited URLs in BFS order."""
        visited_order = []

        while self.frontier and len(visited_order) < self.max_pages:
            url = self.frontier.popleft()  # FIFO: process oldest-discovered URLs first

            if not self._is_polite(url):
                self.frontier.append(url)  # requeue for later
                continue

            links = self._fetch_and_extract_links(url)
            visited_order.append(url)

            for link in links:
                if link not in self.visited:
                    self.visited.add(link)
                    self.frontier.append(link)  # add to back of queue (BFS)

        return visited_order

The visited set is the critical component. Without it, BFS on a graph with cycles is infinite — the crawler would follow A→B→A→B forever. The visited set ensures each node is processed once: O(V + E) total work.

The frontier can grow very large. A BFS that fans out aggressively can have millions of URLs in the frontier at once. Memory management of the frontier is a practical constraint that textbook BFS ignores: production crawlers use a combination of in-memory queues and on-disk queues (priority queues sorted by page rank estimate or domain).

Dependency Resolution: DFS on the Dependency DAG

A package manager must install all transitive dependencies of a requested package and determine the correct installation order (each dependency before the packages that depend on it). This is DFS on a dependency graph, combined with topological sort (Chapter 8).

DFS is appropriate here because we need complete dependency trees — we want to follow one path to its leaf (a package with no further dependencies) before backtracking. BFS would give us all level-1 dependencies before level-2 dependencies; DFS gives us a complete path, which is what recursive installation needs.

def resolve_dependencies(
    package: str,
    registry: dict[str, list[str]]
) -> list[str]:
    """
    DFS dependency resolution.
    registry: package_name -> [direct_dependencies]
    Returns installation order: dependencies before the packages that need them.
    Raises ValueError on circular dependency.
    """
    resolved: list[str] = []
    visiting: set[str] = set()   # currently in the DFS call stack (cycle detection)
    visited: set[str] = set()    # fully resolved

    def dfs(pkg: str) -> None:
        if pkg in visited:
            return
        if pkg in visiting:
            raise ValueError(f"Circular dependency detected: {pkg}")

        visiting.add(pkg)

        for dep in registry.get(pkg, []):
            dfs(dep)  # Resolve dependency first (DFS goes deep before backtracking)

        visiting.discard(pkg)
        visited.add(pkg)
        resolved.append(pkg)  # Add after all dependencies are resolved (post-order)

    dfs(package)
    return resolved

# Example: flask -> werkzeug, click; click -> colorama
registry = {
    'flask': ['werkzeug', 'click'],
    'click': ['colorama'],
    'werkzeug': [],
    'colorama': []
}

order = resolve_dependencies('flask', registry)
# order: ['werkzeug', 'colorama', 'click', 'flask']
# Dependencies always appear before the packages that need them

The two-set cycle detection (visiting vs visited) is essential. visiting tracks the current DFS path: if we encounter a node already in visiting, we have found a back edge — a cycle. visited tracks fully processed nodes: we do not re-process them.

Garbage Collection: BFS/DFS from GC Roots

A garbage collector identifies unreachable objects and reclaims their memory. The memory graph: every object is a node; every object reference (a pointer from one object to another) is a directed edge.

The mark phase of a mark-and-sweep GC is BFS or DFS starting from the GC roots — a set of guaranteed-live objects: global variables, stack-allocated variables, registers. Any object reachable from a root is live; anything not reachable is garbage.

class Object:
    def __init__(self, name: str):
        self.name = name
        self.references: list['Object'] = []  # outgoing references
        self.marked = False

def mark_phase(roots: list[Object]) -> set[Object]:
    """
    BFS mark phase: traverses the object reference graph from GC roots.
    Marks all reachable objects as live.
    Returns the set of live objects.
    """
    live = set()
    frontier = deque(roots)

    while frontier:
        obj = frontier.popleft()
        if obj in live:
            continue
        live.add(obj)
        obj.marked = True

        for referenced in obj.references:
            if referenced not in live:
                frontier.append(referenced)

    return live

def sweep_phase(all_objects: list[Object], live: set[Object]) -> list[Object]:
    """
    Sweep phase: free all objects not marked as live.
    Returns the list of collected (freed) objects.
    """
    garbage = [obj for obj in all_objects if obj not in live]
    # In a real GC, freed memory is returned to the allocator
    return garbage

The GC traversal must be conservative: it marks everything reachable, even if some reachable objects will never be accessed again by the program. This is the fundamental limitation of tracing GC — it cannot distinguish “reachable but will never be used” (a memory leak) from “reachable and needed.” This is why long-lived data structures that accumulate references cause memory leaks even in garbage-collected languages.

Tradeoffs

AT3 — Simplicity vs. Flexibility

BFS with an explicit queue is simple and correct but memory-intensive for wide graphs. DFS with recursion is simple but limited by the call stack depth — a dependency graph with 10,000 transitive dependencies will overflow the default Python call stack (default limit: 1,000 frames). Production implementations use iterative DFS with an explicit stack. The recursive version is easier to understand; the iterative version is safer.

AT8 — Coupling vs. Cohesion

A web crawler must couple BFS traversal with domain politeness, URL normalisation, deduplication, and priority ranking. Separating these concerns cleanly — the traversal logic does not know about politeness; the scheduler decides ordering — makes the system testable and modifiable. Coupling them in one class makes the traversal faster to implement but harder to tune.

Where It Fails

FM3 — Unbounded Resource Consumption

BFS stores the entire frontier in memory. On a graph with high out-degree (web pages with hundreds of links, social graphs with high-follower accounts), the frontier grows rapidly. A BFS from a popular page can add millions of URLs to the frontier in the first few levels. Production crawlers cap frontier size and use disk-backed queues.

FM11 — Observability Blindness

A cycle in a dependency graph that is not detected will cause infinite recursion in DFS (stack overflow in most languages). The classic symptom is a dependency resolution command that hangs or crashes with a stack overflow rather than a clear error message. The fix is the two-set cycle detection shown above. The failure is nearly always from a dependency graph that includes a two-package mutual dependency that was introduced by a version upgrade.

Real Systems

Googlebot (Google’s web crawler) maintains a URL frontier of billions of URLs, prioritised by estimated page importance and crawl freshness. It is BFS in structure but with a priority queue rather than a plain queue — not pure BFS, but BFS-shaped.

npm and pip both use DFS for dependency resolution. The resolution algorithm also handles version constraints (not just graph connectivity), which converts the problem from O(V+E) to NP-hard in the worst case (SAT problem). Both package managers use heuristics to avoid the worst case.

Python’s garbage collector uses a combination of reference counting (immediate collection when count drops to zero) and a cyclic GC (BFS/DFS to collect reference cycles that reference counting cannot handle). The cyclic GC runs periodically, not on every allocation.

Concept: BFS and DFS in Production Systems

Core Idea: BFS (queue-based, level-by-level) finds shortest paths and explores nearest nodes first. DFS (stack-based, depth-first) explores complete paths before backtracking. Both require a visited set to handle cycles. Production BFS adds prioritisation, politeness, and frontier management; production DFS adds cycle detection via two sets (visiting + visited) and iterative implementation to avoid stack overflow.

Tradeoff: AT3 — Simplicity vs. Flexibility (recursive DFS is simple but stack-limited; iterative DFS is safe but more complex; BFS is always iterative but requires explicit frontier management)

Failure Mode: FM3 — Unbounded Resource Consumption (BFS frontier grows with graph width; DFS call stack grows with graph depth; both need explicit limits in production)

Signal: When you need the nearest or most important nodes first — BFS. When you need complete paths before backtracking, or need topological ordering — DFS. When the graph has cycles — both require a visited set.

Exercises

Level 1 — Understand

What data structure does BFS use? What data structure does DFS use? How does the choice of data structure determine the traversal order?
What is a GC root? Why does a garbage collector start its traversal from roots rather than from all objects?
A dependency graph has a cycle: A depends on B and B depends on A. What happens when DFS traverses this graph without cycle detection? How does the two-set approach detect the cycle?

Level 2 — Apply

A social network has 1 million users. Each user has on average 200 connections. You need to find all users reachable from user X within 3 “degrees of separation” (3 hops).

Which traversal would you use? Justify with reference to the properties of BFS vs DFS.
What is the maximum number of nodes in the BFS frontier at depth 3?
The frontier at depth 3 exceeds available memory. Propose a modification to the traversal that bounds memory usage while still returning results.

Level 3 — Design

You are building a dependency analysis service for a large monorepo with 50,000 modules. Engineers want to know: (a) what is the full transitive dependency set of module X, and (b) which modules transitively depend on module Y (reverse dependency lookup).

Choose a traversal for each query. Design the data structures for forward and reverse dependency graphs.
The dependency graph changes ~500 times per day as engineers modify module dependencies. How do you keep the forward and reverse graphs consistent on change?
Query (a) is called 10,000 times per day; the same modules are queried repeatedly. Design a caching strategy for dependency sets. Name every tradeoff using AT notation.

A complete answer will: separate the claim from its justification, name the tradeoff and failure mode by code, and state the assumption that the argument depends on.