What defines it: Users are trying to find something they cannot directly address — a document, a product, a person, a place. The system’s value is its ability to return relevant results quickly from a large corpus.
Primary concern: Relevance and latency. A search that takes 2 seconds gets used less than one that takes 200 ms, even if the slower results are marginally more relevant.
Characteristic infrastructure: Inverted index, trie or prefix tree for autocomplete, vector index for semantic search, ranking pipeline (features → model → score), CDN for result caching, offline indexing pipeline (crawl → parse → index → rank).
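As a rough illustration of the first item in that list, an inverted index can be sketched in a few lines of Python. The names `tokenize`, `build_inverted_index`, and `search`, and the sample documents, are hypothetical; a production index would add stemming, ranking, and compressed postings lists.

```python
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase and split on whitespace; real tokenizers do far more."""
    return text.lower().split()


def build_inverted_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Precompute a map from each term to the ids of documents containing it."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index: dict[str, set[int]], query: str) -> set[int]:
    """AND query: ids of documents that contain every query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Hypothetical product catalogue.
docs = {
    1: "red running shoes",
    2: "blue running jacket",
    3: "red rain jacket",
}
index = build_inverted_index(docs)
print(search(index, "red jacket"))  # {3}
print(search(index, "running"))     # {1, 2}
```

The expensive work (scanning every document) happens once at indexing time; a query only intersects a few precomputed sets.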
Characteristic tradeoffs:
- Precomputation vs On-Demand (AT4): the inverted index is heavily precomputed; ranking computation is bounded at query time
- Correctness vs Performance (AT9): approximate nearest-neighbour search (HNSW) over exact search
- Consistency vs Availability (AT1): slight index staleness (documents indexed seconds to minutes after ingestion) is acceptable
Common failure modes: FM6 (Hotspotting) on high-frequency query terms. FM8 (Schema Violation) when the index schema changes without reindexing all documents. FM9 (Silent Data Corruption) when ranking features are silently stale.
Examples: Web search, product search, autocomplete, recommendation engines, document retrieval, RAG systems.
What defines it: The system matches parties who want to exchange value and facilitates the exchange reliably. No double charges. No lost transactions. Exactly-once semantics on money movement.
Primary concern: Correctness and consistency. A payment system that charges a customer twice does not have a latency problem; it has a correctness problem that destroys user trust immediately.
Characteristic infrastructure: Payment gateway integration, idempotency key store, saga orchestrator or event-driven compensation, inventory reservation service, distributed lock manager (for seat reservations, inventory), audit log.
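The saga orchestrator in that list can be sketched as a loop that runs each step and, on failure, runs the compensations of the steps already completed in reverse order. This is a minimal in-process sketch; `run_saga`, the `Step` alias, and the booking steps are hypothetical, and a real orchestrator would persist its progress so it can resume after a crash.

```python
from typing import Callable

# (name, action, compensation); all three are illustrative placeholders.
Step = tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: list[Step]) -> bool:
    """Run steps in order; on failure, undo completed steps in reverse."""
    completed: list[Step] = []
    for step in steps:
        _name, action, _compensate = step
        try:
            action()
        except Exception:
            # The failed step is assumed to have had no effect, so only
            # the steps that completed before it are compensated.
            for _, _, compensate in reversed(completed):
                compensate()
            return False
        completed.append(step)
    return True


log: list[str] = []


def fail_charge() -> None:
    raise RuntimeError("card declined")


ok = run_saga([
    ("reserve_seat", lambda: log.append("seat reserved"),
     lambda: log.append("seat released")),
    ("charge_card", fail_charge,
     lambda: log.append("charge refunded")),
    ("send_ticket", lambda: log.append("ticket sent"),
     lambda: log.append("ticket revoked")),
])
print(ok, log)  # False ['seat reserved', 'seat released']
```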
Characteristic tradeoffs:
- Consistency vs Availability (AT1): set firmly toward consistency for financial operations; the cost of a double charge exceeds the cost of reduced availability
- Synchronous vs Asynchronous (AT10): payment confirmation is synchronous; inventory updates downstream are asynchronous
- Correctness vs Performance (AT9): idempotency checks add a database lookup to every operation; the correctness guarantee is non-negotiable
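The idempotency check in the last tradeoff can be sketched as one lookup in front of the charge. The `PaymentService` class and its fields are hypothetical, and an in-memory dict stands in for the idempotency key store; in a real system the check-and-insert would need to be atomic (for example, a unique constraint on the key) so that concurrent retries cannot both pass the lookup.

```python
class PaymentService:
    """Toy payment service; dicts and lists stand in for real storage."""

    def __init__(self) -> None:
        self._results: dict[str, dict] = {}        # idempotency key -> response
        self.charges: list[tuple[str, int]] = []   # stand-in for the ledger

    def charge(self, idempotency_key: str, customer: str, amount_cents: int) -> dict:
        # The extra lookup on every operation: has this key been seen?
        previous = self._results.get(idempotency_key)
        if previous is not None:
            return previous  # replay the stored response, move no money

        self.charges.append((customer, amount_cents))
        response = {"status": "charged", "customer": customer,
                    "amount_cents": amount_cents}
        self._results[idempotency_key] = response
        return response


service = PaymentService()
first = service.charge("order-42", "alice", 1999)
retry = service.charge("order-42", "alice", 1999)  # e.g. client timeout retry
print(len(service.charges))  # 1
```

The retry returns the same response as the first call, so the client cannot tell a replay from the original, and the ledger records a single charge.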
Common failure modes: FM9 (Silent Data Corruption) when idempotency is not enforced, resulting in duplicate payments. FM12 (Split-Brain) in the saga coordinator. FM8 (Schema Violation) when payment API changes break existing callers.
Examples: Payment processors (Stripe), ride-sharing (Uber), e-commerce (checkout flow), seat booking, stock trading.
What defines it: The system stores, processes, and serves large binary assets — video, audio, images — to users globally at low latency. The bottleneck is network bandwidth and storage cost, not compute.
Primary concern: CDN hit rate and storage cost. A media system that serves 99% of requests from CDN edge nodes is cheap to run and fast for users. One that falls through to origin for 30% of requests is expensive and slow.
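To put numbers on that comparison, here is a small calculation of origin load at the two hit rates mentioned. The function name and the figure of one million requests are assumptions chosen for illustration.

```python
def origin_requests(total_requests: int, hit_rate: float) -> int:
    """Requests that miss the CDN and fall through to origin."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return round(total_requests * (1.0 - hit_rate))


total = 1_000_000  # hypothetical request volume
print(origin_requests(total, 0.99))  # 10000
print(origin_requests(total, 0.70))  # 300000
```

Under these assumptions, dropping from a 99% to a 70% hit rate multiplies origin traffic by 30, which is why the hit rate dominates both cost and latency.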
Characteristic infrastructure: Object storage (S3-compatible), transcoding pipeline (video), CDN with tiered caching, adaptive bitrate streaming (HLS/DASH), content-addressed storage (hash of content = storage key), storage tiering (hot/warm/cold).
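Content-addressed storage, where the hash of the content is the storage key, can be sketched as follows. The `ContentAddressedStore` class is hypothetical and keeps blobs in memory; SHA-256 is an assumed choice of hash, and a real system would write to object storage instead.

```python
import hashlib


class ContentAddressedStore:
    """Toy blob store keyed by the SHA-256 digest of the content."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        # Identical content maps to the same key, so it is stored once.
        self._blobs.setdefault(key, data)
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]

    def __len__(self) -> int:
        return len(self._blobs)


store = ContentAddressedStore()
key_a = store.put(b"same image bytes")
key_b = store.put(b"same image bytes")   # duplicate upload
key_c = store.put(b"different image bytes")
print(key_a == key_b, len(store))  # True 2
```

Because the key is derived from the bytes, duplicate uploads are deduplicated for free and a stored object never changes under its key, which makes it safe to cache indefinitely.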
Characteristic tradeoffs:
- Precomputation vs On-Demand (AT4): transcode to all resolutions upfront vs transcode on first request
- Storage vs Performance (AT4): store all quality variants (fast, expensive) vs store source only (cheap, slower first play)
- Centralisation vs Distribution (AT5): CDN is distribution at its most literal: copies of content at edge nodes globally
Common failure modes: FM6 (Hotspotting) on a single piece of viral content that overwhelms the CDN or origin before it is cached widely. FM8 (Schema Violation) when the encoding format or manifest format changes and older clients cannot parse it. FM11 (Observability Blindness) on CDN hit rates — if the metric is not measured, cache inefficiency is invisible.
Examples: Video streaming (YouTube, Netflix), image hosting (Instagram, Imgur), audio streaming (Spotify), CDN itself (Cloudflare, Fastly).
What defines it: The system ingests user or operational events, processes them (batch or streaming), stores the results, and makes them queryable for analytics, ML training, or real-time intelligence.
Primary concern: Data quality and pipeline reliability. An analytics system that silently drops 1% of events produces numbers that cannot be trusted — and may not be detected as incorrect for weeks.
Characteristic infrastructure: Event ingestion API (often via a message queue for buffering), stream processor (Flink, Spark Streaming), data warehouse (BigQuery, Snowflake), feature store (for ML), vector database (for embeddings), dashboard/query layer.
Characteristic tradeoffs:
- Latency vs Correctness (AT2 × AT9): the Lambda architecture is the standard answer, with a streaming layer for fast approximate results and a batch layer for slow exact results
- Storage vs Reprocessability (AT4): store raw events (enables full reprocessing) vs store aggregates only (cheaper, but cannot recompute with different business logic)
- Synchronous vs Asynchronous (AT10): all data pipelines are inherently asynchronous; the question is the latency target (seconds, minutes, hours)
Common failure modes: FM8 (Schema Violation) when event schemas change without updating consumers. FM9 (Silent Data Corruption) from late-arriving events that corrupt time-window aggregations. FM4 (Data Consistency Failure) between the real-time and batch layers.
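The late-arrival problem can be made concrete with a toy tumbling-window counter. This is a simplified sketch, not how any particular stream processor implements it: `WindowedCounter`, the 60-second window, and the use of the highest event time seen as the watermark are all assumptions for illustration.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # assumed tumbling-window size


def window_start(event_time: int) -> int:
    """Start of the tumbling window that contains event_time."""
    return event_time - (event_time % WINDOW_SECONDS)


class WindowedCounter:
    def __init__(self, allowed_lateness: int) -> None:
        self.allowed_lateness = allowed_lateness
        self.counts: dict[int, int] = defaultdict(int)
        self.watermark = 0      # simplification: highest event time seen
        self.dropped_late = 0   # counted so the loss is visible, not silent

    def add(self, event_time: int) -> None:
        self.watermark = max(self.watermark, event_time)
        window_end = window_start(event_time) + WINDOW_SECONDS
        if window_end + self.allowed_lateness <= self.watermark:
            # The window has already closed; the event cannot be counted.
            self.dropped_late += 1
            return
        self.counts[window_start(event_time)] += 1


event_times = [5, 30, 70, 130, 50]  # the last event arrives out of order

strict = WindowedCounter(allowed_lateness=0)
lenient = WindowedCounter(allowed_lateness=120)
for t in event_times:
    strict.add(t)
    lenient.add(t)

print(strict.counts[0], strict.dropped_late)    # 2 1
print(lenient.counts[0], lenient.dropped_late)  # 3 0
```

With no allowed lateness, the first window reports 2 events when 3 actually occurred. Exposing `dropped_late` as a metric is what separates a known approximation from silent corruption.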
Examples: Analytics pipelines, log aggregation (ELK stack), ML feature stores, A/B testing platforms, fraud detection systems, observability platforms.
What defines it: The system exposes infrastructure or capability as a service for other engineers to build on. The “user” is another engineer, not an end user. Developer experience is the product quality metric.
Primary concern: Reliability and backwards compatibility. A platform that changes its API in breaking ways costs its users weeks of migration work. One that is unavailable costs its users availability proportional to how deeply they depend on it.
Characteristic infrastructure: API gateway (rate limiting, auth, routing), SDK client libraries, versioning infrastructure (schema registry, API version management), health check and status page, webhook delivery, idempotency key support.
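Rate limiting at the gateway is often implemented with a token bucket, sketched below as one possible approach. The `TokenBucket` class and its parameters are hypothetical; the clock is injectable only so the example is deterministic.

```python
import time
from typing import Callable


class TokenBucket:
    """Allows bursts up to `capacity`, refilling at `rate_per_sec`."""

    def __init__(self, rate_per_sec: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity   # start with a full bucket
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill for the elapsed time, never beyond capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


fake_now = [0.0]  # fake clock for a repeatable demonstration
bucket = TokenBucket(rate_per_sec=1.0, capacity=2.0, clock=lambda: fake_now[0])

burst = [bucket.allow() for _ in range(3)]
print(burst)           # [True, True, False]

fake_now[0] = 1.0      # one second later, one token has been refilled
after_refill = bucket.allow()
print(after_refill)    # True
```

In a gateway there would typically be one bucket per API key or client, kept in a shared store so that all gateway instances enforce the same limit.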
Characteristic tradeoffs:
- Consistency vs Availability (AT1): a platform must be more reliable than its consumers; the availability target is usually 99.99% or higher
- Generality vs Specialisation (AT6): a general-purpose platform serves more use cases but is harder to optimise; specialised platforms (e.g., a payments platform) can be highly optimised for their domain
- Coupling vs Cohesion (AT8): a platform with a clean, stable API boundary allows consumers to evolve independently
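The arithmetic behind the 99.99% target can be worked through directly. The sketch below assumes the consumer has a hard dependency on the platform and that failures are independent, so availabilities multiply; both function names are hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def downtime_minutes_per_year(availability: float) -> float:
    """Expected unavailable minutes per year at a given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def serial_availability(*availabilities: float) -> float:
    """Availability of a chain of hard dependencies (independence assumed)."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


# A 99.99% platform is down for roughly 52.6 minutes per year.
print(round(downtime_minutes_per_year(0.9999), 2))

# A consumer that is itself 99.99% available, built on a 99.99% platform,
# ends up below 99.99% overall.
print(serial_availability(0.9999, 0.9999) < 0.9999)  # True
```

Under these assumptions a consumer can never be more available than the platform it depends on, which is why the platform's target has to sit above that of its consumers.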
Common failure modes: FM8 (Schema / Contract Violation) — Hyrum’s Law (F9 #16) means every observable behaviour becomes a contract; platform changes break consumers. FM2 (Cascading Failures) — if the platform is unavailable, all consumers fail simultaneously.
Examples: AWS (infrastructure platform), Stripe (payments platform), Twilio (communications platform), internal developer platforms, API gateways.