The Computing Series

Each Archetype in Detail

A1 — Search & Discovery

What defines it: Users are trying to find something they cannot directly address — a document, a product, a person, a place. The system’s value is its ability to return relevant results quickly from a large corpus.

Primary concern: Relevance and latency. Users rely less on a search that takes 2 seconds than on one that takes 200 ms, even if the slower result is marginally more relevant.

Characteristic infrastructure: Inverted index, trie or prefix tree for autocomplete, vector index for semantic search, ranking pipeline (features → model → score), CDN for result caching, offline indexing pipeline (crawl → parse → index → rank).
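To illustrate the prefix tree mentioned for autocomplete, here is a minimal sketch. The class names `TrieNode` and `Trie`, the `complete` method, and the `limit` parameter are illustrative assumptions, and a production autocomplete would typically also rank completions by popularity rather than alphabetically.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # character -> TrieNode
        self.is_word = False  # True if a stored word ends at this node


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word: str) -> None:
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def complete(self, prefix: str, limit: int = 5) -> list[str]:
        """Return up to `limit` stored words starting with `prefix`, in alphabetical order."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        results = []
        stack = [(node, prefix)]
        while stack and len(results) < limit:
            current, text = stack.pop()
            if current.is_word:
                results.append(text)
            # Push children in reverse order so the smallest character is popped first.
            for ch in sorted(current.children, reverse=True):
                stack.append((current.children[ch], text + ch))
        return results


trie = Trie()
for term in ["car", "card", "care", "cat", "dog"]:
    trie.insert(term)
```

The lookup cost depends on the prefix length and the number of completions returned, not on the size of the corpus, which is why the structure suits per-keystroke queries.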

Characteristic tradeoffs:
- Precomputation vs On-Demand (AT4): the inverted index is heavily precomputed; ranking computation is bounded at query time
- Correctness vs Performance (AT9): approximate nearest-neighbour search (HNSW) over exact search
- Consistency vs Availability (AT1): slight index staleness (documents indexed seconds to minutes after ingestion) is acceptable
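The precomputation tradeoff can be made concrete with a toy inverted index. This is a sketch under simplifying assumptions: whitespace tokenisation, AND semantics, and no ranking. The function names `build_inverted_index` and `search` and the sample documents are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Offline step: map each lowercase token to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return dict(index)


def search(index: dict[str, set[int]], query: str) -> set[int]:
    """Query-time step: intersect the posting sets of all query tokens."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    postings = [index.get(token, set()) for token in tokens]
    return set.intersection(*postings)


docs = {1: "red running shoes", 2: "blue running jacket", 3: "red rain jacket"}
index = build_inverted_index(docs)
```

All of the scanning work happens in `build_inverted_index`; `search` only touches the posting sets for the query terms, which keeps query-time work bounded.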

Common failure modes: FM6 (Hotspotting) on high-frequency query terms. FM8 (Schema Violation) when the index schema changes without reindexing all documents. FM9 (Silent Data Corruption) when ranking features are silently stale.

Examples: Web search, product search, autocomplete, recommendation engines, document retrieval, RAG systems.


A2 — Social & Communication

What defines it: Users interact with each other in real time or near-real time. The system must route content between users at massive scale while maintaining ordering guarantees and presence awareness.

Primary concern: Delivery latency and fan-out scalability. A chat message that arrives 2 seconds late is acceptable; a payment notification that arrives 2 minutes late is not.

Characteristic infrastructure: WebSocket servers for real-time connections, message queue for async delivery, presence service, notification fan-out pipeline, graph database or adjacency list for the social graph, content storage with CDN for media attachments.

Characteristic tradeoffs:
- Synchronous vs Asynchronous (AT10): real-time delivery via WebSocket is synchronous; notification fan-out is asynchronous
- Precomputation vs On-Demand (AT4): push fan-out (precomputed per recipient) vs pull fan-out (assembled at read time) is the defining architectural choice
- Centralisation vs Distribution (AT5): a per-region message routing service vs global routing
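The push vs pull fan-out choice is easier to see side by side. The sketch below uses in-memory dictionaries in place of real storage; the names `followers`, `following`, `inboxes`, and `outboxes`, the sample users, and the sequence counter used for ordering are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import count

followers = {"alice": ["bob", "carol"], "dave": ["bob"]}    # author -> readers
following = {"bob": ["alice", "dave"], "carol": ["alice"]}  # reader -> authors
_seq = count()  # stand-in for a timestamp; gives posts a total order

inboxes = defaultdict(list)   # push model: one precomputed timeline per reader
outboxes = defaultdict(list)  # pull model: one list of posts per author


def publish_push(author: str, text: str) -> None:
    """Write cost grows with follower count: one inbox write per follower."""
    post = (next(_seq), author, text)
    for reader in followers.get(author, []):
        inboxes[reader].append(post)


def read_push(reader: str) -> list:
    """Read is a single lookup of the precomputed inbox."""
    return list(inboxes.get(reader, []))


def publish_pull(author: str, text: str) -> None:
    """Write cost is constant: one outbox write regardless of follower count."""
    outboxes[author].append((next(_seq), author, text))


def read_pull(reader: str) -> list:
    """Read cost grows with the number of followed authors: merge their outboxes."""
    timeline = []
    for author in following.get(reader, []):
        timeline.extend(outboxes.get(author, []))
    return sorted(timeline)
```

Both models produce the same timeline; they differ in where the work lands. Push pays at write time, which is what makes accounts with very large follower counts expensive, while pull pays on every read.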

Common failure modes: FM3 (Unbounded Resource Consumption) when WebSocket connections accumulate beyond server capacity. FM6 (Hotspotting) on celebrity accounts when fan-out is push-based. FM7 (Thundering Herd) on service restarts when all long-lived connections reconnect simultaneously.
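One common mitigation for the reconnect stampede is exponential backoff with jitter on the client. The source does not prescribe a mitigation, so the sketch below is an assumption: the function name `reconnect_delay` and the `base` and `cap` defaults are illustrative choices.

```python
import random


def reconnect_delay(attempt: int, base: float = 1.0, cap: float = 60.0, rng=random) -> float:
    """Full-jitter backoff: a uniform delay in [0, min(cap, base * 2**attempt)] seconds."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


# Each client draws its own delay, so reconnects spread across the window
# instead of arriving at the restarted server in the same instant.
delays = [reconnect_delay(attempt=3) for _ in range(5)]
```

Without the random component every client would wait exactly the same time and the herd would simply arrive later, still together.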

Examples: Chat systems (Slack, WhatsApp), social networks (Twitter, Instagram), notification systems, collaborative editing, live streaming with chat.


A3 — Marketplace & Transaction

What defines it: The system matches parties who want to exchange value and facilitates the exchange reliably. No double charges. No lost transactions. Exactly-once semantics on money movement.

Primary concern: Correctness and consistency. A payment system that charges a customer twice is not a latency problem — it is a correctness problem that destroys user trust immediately.

Characteristic infrastructure: Payment gateway integration, idempotency key store, saga orchestrator or event-driven compensation, inventory reservation service, distributed lock manager (for seat reservations, inventory), audit log.
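The saga orchestrator in this list can be sketched as a loop that runs each step and, on failure, runs the compensations of the completed steps in reverse order. The function `run_saga` and the step names are hypothetical; a real orchestrator would also persist its progress so that it can resume after a crash.

```python
def run_saga(steps) -> bool:
    """Run (action, compensation) pairs in order; undo completed steps if one fails."""
    completed = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            # Roll back in reverse order of completion.
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensate)
    return True


log = []


def reserve_inventory():
    log.append("reserve inventory")


def release_inventory():
    log.append("release inventory")


def charge_card():
    raise RuntimeError("card declined")


def refund_card():
    log.append("refund card")


ok = run_saga([(reserve_inventory, release_inventory), (charge_card, refund_card)])
```

Here the charge fails, so the reservation is released and no refund is attempted, because the charge never completed.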

Characteristic tradeoffs:
- Consistency vs Availability (AT1): set firmly toward consistency for financial operations, because the cost of a double charge exceeds the cost of reduced availability
- Synchronous vs Asynchronous (AT10): payment confirmation is synchronous; inventory updates downstream are asynchronous
- Correctness vs Performance (AT9): idempotency checks add a database lookup to every operation; the correctness guarantee is non-negotiable
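The idempotency check described above amounts to one lookup before the side effect. In this sketch the class `PaymentService`, its `charge` method, and the in-memory dictionary are assumptions; a real store would need an atomic check-and-insert (for example a unique constraint on the key) so that two concurrent retries cannot both pass the check.

```python
class PaymentService:
    def __init__(self):
        self._results = {}  # idempotency key -> response already returned
        self.charges = []   # stand-in for the real ledger

    def charge(self, idempotency_key: str, customer: str, amount_cents: int) -> dict:
        # The extra lookup on every operation: replay the stored response on a retry.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges.append((customer, amount_cents))
        result = {
            "status": "charged",
            "customer": customer,
            "amount_cents": amount_cents,
            "charge_no": len(self.charges),
        }
        self._results[idempotency_key] = result
        return result


service = PaymentService()
first = service.charge("order-1001", "cust-7", 2500)
retry = service.charge("order-1001", "cust-7", 2500)  # e.g. client timed out and retried
```

The retry returns the original response and the ledger records a single charge, which is the behaviour that prevents double charging when a client cannot tell whether its first request succeeded.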

Common failure modes: FM9 (Silent Data Corruption) when idempotency is not enforced, which results in duplicate payments. FM12 (Split-Brain) in the saga coordinator. FM8 (Schema Violation) when payment API changes break existing callers.

Examples: Payment processors (Stripe), ride-sharing (Uber), e-commerce (checkout flow), seat booking, stock trading.


A4 — Media Delivery

What defines it: The system stores, processes, and serves large binary assets — video, audio, images — to users globally at low latency. The bottleneck is network bandwidth and storage cost, not compute.

Primary concern: CDN hit rate and storage cost. A media system that serves 99% of requests from CDN edge nodes is cheap to run and fast for users. One that falls through to origin for 30% of requests is expensive and slow.
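The gap between those two hit rates is larger than it looks, because origin load scales with the miss rate rather than the hit rate. A short calculation makes this explicit; the function name `origin_requests` and the request volume of one million are illustrative assumptions.

```python
def origin_requests(total_requests: int, hit_rate: float) -> int:
    """Requests that miss the CDN and fall through to origin."""
    return round(total_requests * (1.0 - hit_rate))


total = 1_000_000
at_99 = origin_requests(total, 0.99)  # 1% miss rate
at_70 = origin_requests(total, 0.70)  # 30% miss rate
ratio = at_70 / at_99
```

For the same traffic, the system with a 70% hit rate sends 30 times as many requests to origin as the one with a 99% hit rate.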

Characteristic infrastructure: Object storage (S3-compatible), transcoding pipeline (video), CDN with tiered caching, adaptive bitrate streaming (HLS/DASH), content-addressed storage (hash of content = storage key), storage tiering (hot/warm/cold).
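Content-addressed storage, where the hash of the content is the storage key, can be sketched in a few lines. The class `ContentStore` and the choice of SHA-256 are assumptions made for illustration; the source does not name a hash function.

```python
import hashlib


class ContentStore:
    def __init__(self):
        self._blobs = {}  # hex digest -> bytes; stands in for object storage

    def put(self, data: bytes) -> str:
        """Store the bytes under their own SHA-256 digest and return that key."""
        key = hashlib.sha256(data).hexdigest()
        self._blobs.setdefault(key, data)  # identical content is stored once
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]

    def __len__(self) -> int:
        return len(self._blobs)


store = ContentStore()
key_a = store.put(b"frame-data")
key_b = store.put(b"frame-data")   # duplicate upload
key_c = store.put(b"other-frame")
```

Two properties follow from the key being derived from the content: duplicate uploads deduplicate automatically, and an object under a given key never changes, so it can be cached indefinitely at the edge.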

Characteristic tradeoffs:
- Precomputation vs On-Demand (AT4): transcode to all resolutions upfront vs transcode on first request
- Storage vs Performance (AT4): store all quality variants (fast, expensive) vs store source only (cheap, slower first play)
- Centralisation vs Distribution (AT5): CDN is distribution at its most literal, with copies of content at edge nodes globally

Common failure modes: FM6 (Hotspotting) on a single piece of viral content that overwhelms the CDN or origin before it is cached widely. FM8 (Schema Violation) when the encoding format or manifest format changes and older clients cannot parse it. FM11 (Observability Blindness) on CDN hit rates — if the metric is not measured, cache inefficiency is invisible.

Examples: Video streaming (YouTube, Netflix), image hosting (Instagram, Imgur), audio streaming (Spotify), CDN itself (Cloudflare, Fastly).


A5 — Data Intelligence

What defines it: The system ingests user or operational events, processes them (batch or streaming), stores the results, and makes them queryable for analytics, ML training, or real-time intelligence.

Primary concern: Data quality and pipeline reliability. An analytics system that silently drops 1% of events produces numbers that cannot be trusted — and may not be detected as incorrect for weeks.

Characteristic infrastructure: Event ingestion API (often via a message queue for buffering), stream processor (Flink, Spark Streaming), data warehouse (BigQuery, Snowflake), feature store (for ML), vector database (for embeddings), dashboard/query layer.

Characteristic tradeoffs:
- Latency vs Correctness (AT2 × AT9): Lambda architecture is the architectural answer, with a streaming layer for fast approximate results and a batch layer for slow exact results
- Storage vs Reprocessability (AT4): store raw events (enables full reprocessing) vs store aggregates only (cheaper, but cannot recompute with different business logic)
- Synchronous vs Asynchronous (AT10): all data pipelines are inherently asynchronous; the question is the latency target (seconds, minutes, hours)

Common failure modes: FM8 (Schema Violation) when event schemas change without updating consumers. FM9 (Silent Data Corruption) from late-arriving events that corrupt time-window aggregations. FM4 (Data Consistency Failure) between the real-time and batch layers.
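The late-arrival problem can be shown with a tumbling-window counter that tracks a watermark. This is a simplified sketch, not how any particular stream processor implements it: the class `TumblingWindowCounter`, the `allowed_lateness_s` parameter, and the policy of setting late events aside rather than silently merging or dropping them are all assumptions.

```python
from collections import defaultdict


class TumblingWindowCounter:
    def __init__(self, window_s: int, allowed_lateness_s: int):
        self.window_s = window_s
        self.allowed_lateness_s = allowed_lateness_s
        self.counts = defaultdict(int)  # window start time -> event count
        self.max_event_time = 0
        self.late_events = []           # kept visible for later reprocessing

    def add(self, event_time: int) -> bool:
        """Count the event unless its window closed before the current watermark."""
        watermark = self.max_event_time - self.allowed_lateness_s
        window_start = event_time - (event_time % self.window_s)
        window_end = window_start + self.window_s
        if window_end <= watermark:
            self.late_events.append(event_time)
            return False
        self.counts[window_start] += 1
        self.max_event_time = max(self.max_event_time, event_time)
        return True


counter = TumblingWindowCounter(window_s=60, allowed_lateness_s=30)
accepted = [counter.add(t) for t in [10, 70, 50, 130, 20]]
```

The event at time 50 arrives out of order but within the allowed lateness, so it is counted in its window. The event at time 20 arrives after its window has closed and is recorded in `late_events`, which makes the discrepancy measurable instead of silent.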

Examples: Analytics pipelines, log aggregation (ELK stack), ML feature stores, A/B testing platforms, fraud detection systems, observability platforms.


A6 — Platform & API

What defines it: The system exposes infrastructure or capability as a service for other engineers to build on. The “user” is another engineer, not an end user. Developer experience is the product quality metric.

Primary concern: Reliability and backwards compatibility. A platform that changes its API in breaking ways costs its users weeks of migration work. One that is unavailable costs its users availability proportional to how deeply they depend on it.

Characteristic infrastructure: API gateway (rate limiting, auth, routing), SDK client libraries, versioning infrastructure (schema registry, API version management), health check and status page, webhook delivery, idempotency key support.

Characteristic tradeoffs:
- Consistency vs Availability (AT1): a platform must be more reliable than its consumers; the availability target is usually 99.99% or higher
- Generality vs Specialisation (AT6): a general-purpose platform serves more use cases but is harder to optimise; specialised platforms (e.g., a payments platform) can be highly optimised for their domain
- Coupling vs Cohesion (AT8): a platform with a clean, stable API boundary allows consumers to evolve independently
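The reason a platform must be more reliable than its consumers shows up in simple arithmetic. The sketch below assumes failures are independent and that the consumer cannot serve requests while the platform is down; the function names and the 99.95% consumer figure are illustrative.

```python
def downtime_minutes_per_year(availability: float) -> float:
    """Allowed downtime per 365-day year at the given availability."""
    return (1.0 - availability) * 365 * 24 * 60


def composite_availability(*components: float) -> float:
    """Availability of a chain of hard dependencies, assuming independent failures."""
    result = 1.0
    for availability in components:
        result *= availability
    return result


platform_budget = downtime_minutes_per_year(0.9999)  # about 52.56 minutes per year
consumer_alone = 0.9995
consumer_on_platform = composite_availability(consumer_alone, 0.9999)
```

A 99.99% target leaves under an hour of downtime per year, and every consumer built on the platform ends up with lower availability than either component alone, since the two figures multiply.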

Common failure modes: FM8 (Schema / Contract Violation) — Hyrum’s Law (F9 #16) means every observable behaviour becomes a contract; platform changes break consumers. FM2 (Cascading Failures) — if the platform is unavailable, all consumers fail simultaneously.

Examples: AWS (infrastructure platform), Stripe (payments platform), Twilio (communications platform), internal developer platforms, API gateways.

