A team has 3,000 end-to-end tests and 200 unit tests. The test suite takes 45 minutes to run. Flakiness is high: 15% of runs fail for reasons unrelated to code changes. Developers run tests locally only before pushing, and even then only sometimes. The test suite catches regressions rarely — not because the tests are wrong, but because they are so expensive to run that engineers avoid them.
A different team has 8,000 unit tests, 400 integration tests, and 50 end-to-end tests. Their suite runs in 4 minutes. Flakiness is below 1%. Developers run the full suite before every push. The first team ships more regressions despite having sixty times as many end-to-end tests. Testing strategy, not the volume of end-to-end coverage, determines effectiveness.
You have already seen consensus (T9) in Book 4, where distributed nodes agree on a value through a defined protocol. Tests are consensus at the code level: they express agreement between the specification (what the code should do) and the implementation (what the code does). A test that passes is a vote from the implementation that it matches the specification. You have also seen feedback loops (T11) throughout this book: the test suite is the feedback loop that makes regressions visible before they reach production. The speed of the feedback loop determines its effectiveness.
The test pyramid defines the proportion and purpose of each test layer.
Unit tests are the base of the pyramid: many tests,
fast execution, testing one unit of behaviour in isolation. “Isolation”
means dependencies are replaced with test doubles (stubs, fakes, mocks).
A unit test for calculateDiscount provides an
Order object and a DiscountCode and asserts on
the returned Money — no database, no network. Unit tests
are fast (milliseconds each), reliable (no external state), and targeted
(a failure points to a specific function). They cover pure logic and
business rules.
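The difference between a stub and a mock is easier to see in code. The following is a minimal Python sketch, not part of the chapter's running example: `loyalty_discount`, `StubLoyaltyService`, and the 5% gold-tier rule are hypothetical names and rules invented for illustration. The stub supplies a canned answer; the mock additionally lets the test assert on how the dependency was called.

```python
from dataclasses import dataclass
from unittest.mock import Mock


@dataclass
class Order:
    customer_id: str
    total: int  # minor units (cents); an assumption to avoid float rounding


class StubLoyaltyService:
    """Stub: returns a canned tier instead of calling a real service."""

    def __init__(self, tier: str):
        self._tier = tier

    def tier_for(self, customer_id: str) -> str:
        return self._tier


def loyalty_discount(order: Order, loyalty_service) -> int:
    """Unit under test (hypothetical rule): gold customers get 5% off."""
    if loyalty_service.tier_for(order.customer_id) == "gold":
        return order.total * 5 // 100
    return 0


def test_gold_customer_gets_discount():
    order = Order(customer_id="c-1", total=10_000)
    assert loyalty_discount(order, StubLoyaltyService("gold")) == 500


def test_dependency_is_queried_with_customer_id():
    # Mock: records calls so the test can assert on the interaction itself.
    service = Mock()
    service.tier_for.return_value = "silver"
    order = Order(customer_id="c-2", total=10_000)
    assert loyalty_discount(order, service) == 0
    service.tier_for.assert_called_once_with("c-2")


test_gold_customer_gets_discount()
test_dependency_is_queried_with_customer_id()
```

Neither test touches a network or a database, which is what keeps this layer fast and deterministic.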
Integration tests test the behaviour of two or more
components together, or the behaviour of one component against real
infrastructure (a database, a message queue). An integration test for
PostgresOrderRepository.save creates a test database, calls
save, queries the database directly, and asserts on the
stored values. Integration tests are slower than unit tests (100 ms to
1 s each), but they catch bugs that unit tests cannot: ORM mapping
errors, SQL constraint violations, transaction isolation issues.
Contract tests verify that a service’s published
interface matches what consumers expect. Consumer-driven contract tests
are defined by the consumer: “I expect the orders service to respond
with {id, status, items, total} when I call
GET /orders/:id.” The orders service runs these tests as
part of its own test suite, so it can verify that a change does not
break its consumers without needing to know how each consumer is
implemented. Contract tests catch FM8 (Schema/Contract Violation)
before deployment, not after.
End-to-end tests exercise the full system through its user-facing interface. They catch failures that only appear when all components are running together: configuration mismatches, environment-specific bugs, workflow failures. They are slow (seconds to minutes each), flaky (network, timing, external service state), and expensive to maintain. Their value is high for critical user journeys; their cost makes them unsuitable as the primary test layer.
The test pyramid implies: write many unit tests, some integration tests, a few contract tests, very few end-to-end tests. The inverted pyramid (often called the test ice-cream cone), with more end-to-end tests than unit tests, produces the situation described in the introduction: slow, flaky tests that developers avoid.
When to use each layer:
- Unit tests: all pure functions, all business rules, all edge cases and error conditions
- Integration tests: all database operations, all queue interactions, all external API clients
- Contract tests: all service-to-service interfaces, all public API endpoints
- End-to-end tests: critical user journeys (sign up, purchase, payment), smoke tests for deployment verification
Unit test for pure logic:
function testDiscountApplied():
    order = buildTestOrder(items=[{sku: "A", price: 100, qty: 1}])
    discountCode = DiscountCode(percentage=10, minOrderValue=50)
    result = calculateDiscount(order, discountCode)
    assert result.discountAmount == 10
    assert result.finalTotal == 90

function testDiscountNotAppliedBelowMinimum():
    order = buildTestOrder(items=[{sku: "A", price: 40, qty: 1}])
    discountCode = DiscountCode(percentage=10, minOrderValue=50)
    result = calculateDiscount(order, discountCode)
    assert result.discountAmount == 0
    assert result.finalTotal == 40
No infrastructure. These run in under one millisecond.
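For readers who want something executable, here is one possible Python rendering of the same two tests together with a candidate implementation. It is a sketch under stated assumptions: the order is reduced to its total in whole currency units, the minimum order value is treated as inclusive, and `DiscountResult` is a name introduced here, none of which the chapter specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiscountCode:
    percentage: int
    min_order_value: int


@dataclass(frozen=True)
class DiscountResult:
    discount_amount: int
    final_total: int


def calculate_discount(order_total: int, code: DiscountCode) -> DiscountResult:
    """Apply a percentage discount only when the order meets the minimum."""
    if order_total < code.min_order_value:
        return DiscountResult(discount_amount=0, final_total=order_total)
    discount = order_total * code.percentage // 100
    return DiscountResult(discount_amount=discount, final_total=order_total - discount)


def test_discount_applied():
    result = calculate_discount(100, DiscountCode(percentage=10, min_order_value=50))
    assert result.discount_amount == 10
    assert result.final_total == 90


def test_discount_not_applied_below_minimum():
    result = calculate_discount(40, DiscountCode(percentage=10, min_order_value=50))
    assert result.discount_amount == 0
    assert result.final_total == 40


def test_discount_applied_at_exact_minimum():
    # Boundary case; inclusive minimum is an assumption of this sketch.
    result = calculate_discount(50, DiscountCode(percentage=10, min_order_value=50))
    assert result.discount_amount == 5
    assert result.final_total == 45


test_discount_applied()
test_discount_not_applied_below_minimum()
test_discount_applied_at_exact_minimum()
```

The boundary test is the kind of edge case that is cheap to cover at this layer and expensive to cover anywhere higher in the pyramid.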
Integration test for database interaction:
function testOrderRepositorySave():
    db = createTestDatabase()  # spins up a test DB or uses a transaction that rolls back
    repo = PostgresOrderRepository(db)
    order = Order.create(testCart, testCustomer)
    repo.save(order)
    storedOrder = db.queryOne("SELECT * FROM orders WHERE id = ?", order.id)
    assert storedOrder.status == "pending"
    assert storedOrder.total == order.totalPrice()
    assert storedOrder.customer_id == testCustomer.id
    db.rollback()  # clean up
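The same shape can be run without any external setup by substituting an in-memory SQLite database. This is a hedged sketch only: SQLite stands in for PostgreSQL so the example is self-contained, `SqliteOrderRepository` and the table layout are invented here, and a real suite would target a disposable PostgreSQL instance because SQLite will not reproduce PostgreSQL-specific behaviour. The `try`/`finally` ensures cleanup runs even when an assertion fails.

```python
import sqlite3


class SqliteOrderRepository:
    """Hypothetical repository; stands in for PostgresOrderRepository."""

    def __init__(self, conn: sqlite3.Connection):
        self._conn = conn

    def save(self, order_id: str, customer_id: str, status: str, total: int) -> None:
        self._conn.execute(
            "INSERT INTO orders (id, customer_id, status, total) VALUES (?, ?, ?, ?)",
            (order_id, customer_id, status, total),
        )


def create_test_database() -> sqlite3.Connection:
    # A fresh in-memory database per test, discarded when the connection closes.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders ("
        " id TEXT PRIMARY KEY,"
        " customer_id TEXT NOT NULL,"
        " status TEXT NOT NULL CHECK (status IN ('pending', 'confirmed', 'shipped')),"
        " total INTEGER NOT NULL)"
    )
    return conn


def test_order_repository_save():
    conn = create_test_database()
    try:
        SqliteOrderRepository(conn).save("o-1", "c-1", "pending", 9000)
        # Query the table directly rather than through the repository.
        row = conn.execute(
            "SELECT customer_id, status, total FROM orders WHERE id = ?", ("o-1",)
        ).fetchone()
        assert row == ("c-1", "pending", 9000)
    finally:
        conn.close()


def test_constraint_violation_is_surfaced():
    conn = create_test_database()
    try:
        try:
            SqliteOrderRepository(conn).save("o-2", "c-1", "lost", 100)
        except sqlite3.IntegrityError:
            pass  # the CHECK constraint rejected the unknown status
        else:
            raise AssertionError("expected the CHECK constraint to reject the row")
    finally:
        conn.close()


test_order_repository_save()
test_constraint_violation_is_surfaced()
```

The second test exercises a constraint that lives in the database, not in the application code, which is exactly the class of bug a unit test with a mocked repository cannot see.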
Consumer-driven contract test:
# Consumer (warehouse service) defines what it expects from orders service:
contract OrdersServiceContract:
    interaction GetOrder:
        request:
            method: GET
            path: /orders/123
        response:
            status: 200
            body:
                id: String
                status: String (one of: "pending", "confirmed", "shipped")
                items: List<{sku: String, quantity: Integer, unitPrice: Money}>
                total: Money

# Orders service runs this contract as a test:
function testGetOrderMatchesContract():
    order = createTestOrder(id="123")
    response = testClient.get("/orders/123")
    assert response.status == 200
    assert response.body.id == "123"
    assert response.body.status in ["pending", "confirmed", "shipped"]
    assert isValidMoney(response.body.total)
When the orders service team changes the response schema, they run the contract tests from all consuming services. If any fail, they must either preserve backward compatibility or coordinate a versioned update.
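To make the verification step concrete, here is a small Python sketch of a structural check a provider could run against a consumer's expectations. It is not Pact's API or file format: the `CONTRACT` encoding and the `violations` helper are invented for illustration, and `Money` is assumed to be an integer in minor units. Extra fields in the response are deliberately ignored, on the assumption that consumers tolerate additions.

```python
# Consumer expectations: a type, a set of allowed values, a one-element list
# describing every item, or a nested dict of required fields.
CONTRACT = {
    "id": str,
    "status": {"pending", "confirmed", "shipped"},
    "items": [{"sku": str, "quantity": int, "unitPrice": int}],
    "total": int,
}


def violations(body, expected, path="body"):
    """Return human-readable mismatches between a response body and a contract."""
    if isinstance(expected, type):
        return [] if isinstance(body, expected) else [f"{path}: expected {expected.__name__}"]
    if isinstance(expected, set):
        ok = isinstance(body, str) and body in expected
        return [] if ok else [f"{path}: {body!r} not in allowed values"]
    if isinstance(expected, list):
        if not isinstance(body, list):
            return [f"{path}: expected list"]
        found = []
        for index, item in enumerate(body):
            found += violations(item, expected[0], f"{path}[{index}]")
        return found
    # Otherwise the expectation is a dict of required fields.
    if not isinstance(body, dict):
        return [f"{path}: expected object"]
    found = []
    for key, sub_expected in expected.items():
        if key not in body:
            found.append(f"{path}.{key}: missing")
        else:
            found += violations(body[key], sub_expected, f"{path}.{key}")
    return found


response_body = {
    "id": "123",
    "status": "pending",
    "items": [{"sku": "A", "quantity": 1, "unitPrice": 100}],
    "total": 100,
}
assert violations(response_body, CONTRACT) == []

# Renaming a field the consumer reads is reported before deployment.
renamed = dict(response_body)
renamed["totalAmount"] = renamed.pop("total")
assert violations(renamed, CONTRACT) == ["body.total: missing"]
```

Because the check lists every mismatch with its path, a failing provider build shows which consumer expectation was broken instead of only that something failed.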
AT9 — Correctness/Performance: More tests at lower pyramid layers give better correctness assurance. A unit test that runs in one millisecond provides the same correctness signal for one function as an end-to-end test that runs in one minute — but the unit test enables immediate feedback. The performance (CI speed) of the test suite determines how often it is run, which determines how much it actually catches.
AT3 — Simplicity/Flexibility: Contract tests add process complexity: consumers must publish their contracts, producers must run consumer contracts, both parties must agree on versioning. For a system with one consumer and one producer, this overhead is not justified. For a system with ten consumers, contract tests prevent the coordination overhead of manually verifying every interface on every change.
FM8 — Schema/Contract Violation: The primary failure mode that testing strategy exists to prevent. Contract tests catch these before deployment. End-to-end tests catch them after deployment to a test environment. Without any contract enforcement, schema violations reach production.
FM11 — Observability Blindness: A test suite that tests only happy paths provides false confidence. When the production system encounters an error case (timeout, partial failure, invalid input), the untested code path runs for the first time in production. Testing strategy must include explicit coverage of error paths: what happens when the database is unavailable, what happens when the external API returns a 500, what happens when the input is malformed.
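Error paths can be covered at the unit layer by making a test double fail on purpose. The sketch below assumes a hypothetical `place_order` function and payment gateway interface; the names, status strings, and error-handling policy are illustrative choices, not something the chapter prescribes.

```python
class PaymentGatewayError(Exception):
    """Raised by the (hypothetical) gateway client when the upstream call fails."""


class FailingGateway:
    """Stub that simulates the external API answering with HTTP 500."""

    def charge(self, amount):
        raise PaymentGatewayError("upstream returned 500")


class RecordingGateway:
    """Fake that records every charge so tests can check what was attempted."""

    def __init__(self):
        self.charged = []

    def charge(self, amount):
        self.charged.append(amount)
        return "ch_1"


def place_order(amount, gateway):
    # Malformed input is rejected before any external call is made.
    if not isinstance(amount, int) or amount <= 0:
        return {"status": "rejected", "reason": "invalid amount"}
    try:
        charge_id = gateway.charge(amount)
    except PaymentGatewayError:
        return {"status": "payment_failed", "reason": "gateway unavailable"}
    return {"status": "confirmed", "charge_id": charge_id}


def test_happy_path():
    assert place_order(5000, RecordingGateway()) == {"status": "confirmed", "charge_id": "ch_1"}


def test_gateway_failure_is_reported_not_raised():
    result = place_order(5000, FailingGateway())
    assert result == {"status": "payment_failed", "reason": "gateway unavailable"}


def test_malformed_amount_never_reaches_gateway():
    gateway = RecordingGateway()
    assert place_order(-1, gateway)["status"] == "rejected"
    assert place_order("ten", gateway)["status"] == "rejected"
    assert gateway.charged == []


test_happy_path()
test_gateway_failure_is_reported_not_raised()
test_malformed_amount_never_reaches_gateway()
```

Two of the three tests here exercise failure handling, so the first time that code runs is in CI, not in production.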
Google’s testing at scale: Google runs billions of tests per day across its codebase. The vast majority are unit tests. The testing infrastructure (TAP, the Test Automation Platform) prioritises running only the tests affected by a change, keeping per-change feedback time below five minutes even in a monorepo with billions of lines of code. The pyramid is enforced structurally: end-to-end tests require explicit justification.
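The idea of running only the tests affected by a change can be sketched as a walk over reverse dependencies. This is an illustration of the general technique only, not a description of how TAP is implemented; the targets, dependency edges, and test names below are invented.

```python
from collections import deque

# target -> targets that depend on it directly (reverse dependency edges)
REVERSE_DEPS = {
    "money": ["pricing", "orders"],
    "pricing": ["orders"],
    "orders": ["checkout_api"],
    "inventory": ["checkout_api"],
    "checkout_api": [],
}

# target -> tests that exercise it directly
TESTS = {
    "money": ["money_test"],
    "pricing": ["pricing_test"],
    "orders": ["orders_test"],
    "inventory": ["inventory_test"],
    "checkout_api": ["checkout_api_test"],
}


def affected_tests(changed):
    """Return tests for the changed targets and everything that transitively depends on them."""
    seen = set(changed)
    queue = deque(changed)
    while queue:
        target = queue.popleft()
        for dependant in REVERSE_DEPS.get(target, []):
            if dependant not in seen:
                seen.add(dependant)
                queue.append(dependant)
    return sorted(test for target in seen for test in TESTS.get(target, []))


# A change to pricing skips money_test and inventory_test entirely.
assert affected_tests(["pricing"]) == ["checkout_api_test", "orders_test", "pricing_test"]
assert affected_tests(["inventory"]) == ["checkout_api_test", "inventory_test"]
```

The savings depend on the shape of the graph: a change to a widely shared target such as `money` still selects nearly every test, which is one reason small, fast tests at the base of the pyramid matter.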
Pact (consumer-driven contract testing): Pact is an open-source framework for consumer-driven contract testing. Consumers define their expectations as Pact files, and providers verify their behaviour against those files. A Pact Broker stores the contracts: consumers publish to it and providers retrieve from it. The system makes contract verification part of each service’s CI, not a manual cross-team coordination step.
Netflix’s chaos engineering: Netflix tests at the top of the pyramid in production with Chaos Monkey — deliberately terminating random instances to verify that the system tolerates failures. This is an extreme form of end-to-end testing: test the failure modes, not just the success paths, in the actual production environment.
Stripe’s API testing: Stripe provides a test mode that closely mirrors production. API calls in test mode are designed to behave like their production counterparts, except that no real money moves. This is infrastructure-level support for integration testing: the integration test environment is officially maintained and behaves predictably.
Concept: Testing Strategy
Thread: T9 (Consensus) ← Book 4, Ch 9 (Agreement protocols) → Ch 16 (Refactoring Techniques)
Core Idea: The test pyramid — many unit tests, fewer integration tests, fewer contract tests, very few end-to-end tests — optimises for fast feedback and high reliability; the inverted pyramid (more end-to-end than unit tests) produces slow, flaky suites that developers avoid.
Tradeoff: AT9 — Correctness/Performance: lower-layer tests give the same correctness signal as upper-layer tests in a fraction of the time; the total feedback loop speed determines how often tests are run and how much they actually prevent.
Failure Mode: FM8 — Schema/Contract Violation: service interface changes that are not caught by contract tests reach production as compatibility failures; without contract tests, schema changes require manual cross-team coordination.
Signal: When the test suite takes more than 10 minutes to run and developers routinely skip running tests locally, the test pyramid is probably inverted, with end-to-end tests dominating the suite.
Maps to: Book 0, Framework 9 (Review Questions); P6 (Reproducibility), P13 (Fail Fast)
1. Describe the test pyramid: what are the four layers, and what is the intended proportion of tests at each layer?
2. What is the difference between a unit test and an integration test? What kind of bug can an integration test catch that a unit test cannot?
3. What are consumer-driven contract tests, and which failure mode do they exist to prevent before deployment?
Classify each of the following tests into the correct pyramid
layer and justify: (a) a test that creates a Money object
and verifies that add(Money(10), Money(5)) returns
Money(15); (b) a test that calls
GET /api/orders/123 against a running service and checks
the HTTP status is 200; (c) a test that inserts a row into a test
database through a UserRepository and queries it back; (d)
a test that opens a browser, logs in, adds a product to a cart, and
checks out.
A team’s CI pipeline takes 38 minutes. Analyse this breakdown: unit tests 200, runtime 2 minutes; integration tests 150, runtime 18 minutes; end-to-end tests 80, runtime 18 minutes. Propose three changes to bring the total under 10 minutes without removing coverage. Justify each change using the pyramid.
An orders service changes its GET /orders/:id
response to rename total to totalAmount. There
are no contract tests. List all the downstream systems that might break,
describe when each would discover the breakage, and calculate the total
time-to-detection if deployment to production takes 2 hours.
A complete answer will:
(1) Define behaviour coverage per layer. Unit tests cover validation logic (invalid card numbers, missing fields, amount limits) using no test doubles, because these are pure functions. Integration tests cover the database interaction (PostgreSQL writes and reads) using a real test database, not a mock. Integration tests for the card network use a stub server that returns configurable responses. Integration tests for Kafka use an embedded Kafka or a real Kafka container.
(2) Specify which test doubles are used where: a stub for the card network (controls the response to test decline, timeout, and approval scenarios without calling a real network); a spy or recording mock for Kafka (verifies that the correct event payload is published after a successful charge); no mocks for PostgreSQL (a real database gives higher confidence and avoids mock-object drift).
(3) Design contract tests using Pact. The payment service is the provider; its consumers (order service, notification service) define consumer-driven contracts specifying which fields they read from the PaymentProcessed event schema. The payment service’s CI runs Pact verification to confirm that the published event matches all consumer contracts before deployment.
(4) Scope end-to-end tests narrowly: cover only the two highest-risk scenarios (successful payment and declined payment) using a real staging environment. Estimated test counts: ~50 unit tests (fast, < 1 s total), ~20 integration tests (2–5 minutes), 2 end-to-end tests (5–10 minutes); total CI runtime of about 15 minutes at most.
A complete answer will:
(1) Make the argument for Pact. In a six-service system, API contract breaks are the primary source of integration failures: a provider team that changes an event schema without notifying consumers causes silent runtime failures (FM8, Schema/Contract Violation). Pact makes contracts explicit and catches breaks in CI before deployment, eliminating a class of production incidents.
(2) Make the argument against. Pact requires ongoing maintenance: every consumer must update its contract when it legitimately starts depending on a new field, and every provider must run Pact verification in its CI pipeline. For a team of 6 engineers (one per service), the overhead of maintaining up to 6 × 5 = 30 consumer-provider contracts may exceed the value of the protection.
(3) State the break-even conditions. Pact pays off when services are owned by separate teams (different release cycles, no shared planning), the API surface is large (many fields that consumers depend on), and integration incidents are frequent enough to justify the maintenance burden. Below 3 services, or in a monorepo with coordinated releases, the break-even point is rarely reached.
(4) Propose an alternative. Schema validation using a shared schema registry (e.g., Apache Avro with a Confluent Schema Registry, or JSON Schema validation on both the publish and consume sides) protects against schema breaks with less process overhead than Pact. It does not require consumer teams to define explicit contracts, but it does require both sides to validate against a common schema, and with compatibility checks enabled it rejects incompatible schema changes automatically.