Testing Strategy

Introduction

A team has 3,000 end-to-end tests and 200 unit tests. The test suite takes 45 minutes to run. Flakiness is high: 15% of runs fail for reasons unrelated to code changes. Developers run tests locally only before pushing, and even then only sometimes. The test suite catches regressions rarely — not because the tests are wrong, but because they are so expensive to run that engineers avoid them.

A different team has 8,000 unit tests, 400 integration tests, and 50 end-to-end tests. Their suite runs in 4 minutes. Flakiness is below 1%. Developers run the full suite before every push. The first team ships with more regressions despite having sixty times as many end-to-end tests, the kind that exercise the whole system. Testing strategy, not test count, determines effectiveness.

Thread Activation

You have already seen consensus (T9) in Book 4, where distributed nodes agree on a value through a defined protocol. Tests are consensus at the code level: they express agreement between the specification (what the code should do) and the implementation (what the code does). A test that passes is a vote from the implementation that it matches the specification. You have also seen feedback loops (T11) throughout this book: the test suite is the feedback loop that makes regressions visible before they reach production. The speed of the feedback loop determines its effectiveness.

The Concept

The test pyramid defines the proportion and purpose of each test layer.

Unit tests are the base of the pyramid: many tests, fast execution, testing one unit of behaviour in isolation. “Isolation” means dependencies are replaced with test doubles (stubs, fakes, mocks). A unit test for calculateDiscount provides an Order object and a DiscountCode and asserts on the returned Money — no database, no network. Unit tests are fast (milliseconds each), reliable (no external state), and targeted (a failure points to a specific function). They cover pure logic and business rules.
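To make “isolation” concrete, here is a minimal Python sketch of a unit test that replaces a dependency with a stub. The names ExchangeRateProvider, StubRates, and convert_price are hypothetical and are not part of the chapter's example codebase; they only illustrate how a test double removes the network from the test.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Protocol


class ExchangeRateProvider(Protocol):
    """Dependency that would call a remote service in production (hypothetical)."""

    def rate(self, source: str, target: str) -> Decimal: ...


@dataclass
class StubRates:
    """Test double: returns a fixed rate, so no network is involved."""

    fixed_rate: Decimal

    def rate(self, source: str, target: str) -> Decimal:
        return self.fixed_rate


def convert_price(amount: Decimal, source: str, target: str,
                  rates: ExchangeRateProvider) -> Decimal:
    """Unit under test: pure logic plus one injected dependency."""
    if amount < 0:
        raise ValueError("amount must be non-negative")
    return (amount * rates.rate(source, target)).quantize(Decimal("0.01"))


def test_convert_price_uses_provided_rate() -> None:
    result = convert_price(Decimal("100"), "USD", "EUR", StubRates(Decimal("0.5")))
    assert result == Decimal("50.00")


def test_convert_price_rejects_negative_amount() -> None:
    try:
        convert_price(Decimal("-1"), "USD", "EUR", StubRates(Decimal("1")))
    except ValueError:
        return
    raise AssertionError("expected ValueError for a negative amount")
```

Because the stub is passed in as an argument, a failure in either test points at convert_price itself rather than at a remote service.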

Integration tests test the behaviour of two or more components together, or the behaviour of one component against real infrastructure (a database, a message queue). An integration test for PostgresOrderRepository.save creates a test database, calls save, queries the database directly, and asserts on the stored values. Integration tests are slower than unit tests (100 ms to 1 s each), but they catch bugs that unit tests cannot: ORM mapping errors, SQL constraint violations, transaction isolation issues.

Contract tests verify that a service’s published interface matches what consumers expect. Consumer-driven contract tests are defined by the consumer: “I expect the orders service to respond with {id, status, items, total} when I call GET /orders/:id.” The orders service runs these tests as part of its own test suite, so it learns that a change would break a consumer without having to inspect each consumer’s code. Contract tests catch FM8 (Schema/Contract Violation) before deployment, not after.

End-to-end tests exercise the full system through its user-facing interface. They catch failures that only appear when all components are running together: configuration mismatches, environment-specific bugs, workflow failures. They are slow (seconds to minutes each), flaky (network, timing, external service state), and expensive to maintain. Their value is high for critical user journeys; their cost makes them unsuitable as the primary test layer.

The test pyramid implies: write many unit tests, some integration tests, a few contract tests, very few end-to-end tests. The inverted pyramid, often called the ice cream cone, has more end-to-end tests than unit tests and produces the situation described in the introduction: slow, flaky tests that developers avoid.
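The link between per-test flakiness and suite-level flakiness is simple arithmetic. The sketch below assumes that each test fails spuriously with the same probability and independently of the others, which is a simplification; the function names are illustrative.

```python
def suite_flake_rate(per_test_rate: float, test_count: int) -> float:
    """Probability that at least one of test_count independent tests flakes."""
    return 1.0 - (1.0 - per_test_rate) ** test_count


def per_test_flake_rate(suite_rate: float, test_count: int) -> float:
    """Inverse of suite_flake_rate: the per-test rate implied by a suite-level rate."""
    return 1.0 - (1.0 - suite_rate) ** (1.0 / test_count)


# The introduction's first team: 15% of runs fail spuriously across 3,000 end-to-end tests.
implied = per_test_flake_rate(0.15, 3000)
print(f"implied per-test flake rate: {implied:.6%}")
```

Under these assumptions, a 15% suite-level failure rate across 3,000 tests corresponds to a per-test flake rate of roughly 0.005%, about one spurious failure in every 18,000 executions of a given test. Individual tests can look reliable while the suite as a whole does not, which is why the number of flaky-prone tests matters as much as how flaky each one is.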

When to use each layer:

- Unit tests: all pure functions, all business rules, all edge cases and error conditions
- Integration tests: all database operations, all queue interactions, all external API clients
- Contract tests: all service-to-service interfaces, all public API endpoints
- End-to-end tests: critical user journeys (sign up, purchase, payment), smoke tests for deployment verification

How It Works

Unit test for pure logic:

function testDiscountApplied():
    order = buildTestOrder(items=[{sku: "A", price: 100, qty: 1}])
    discountCode = DiscountCode(percentage=10, minOrderValue=50)

    result = calculateDiscount(order, discountCode)

    assert result.discountAmount == 10
    assert result.finalTotal == 90

function testDiscountNotAppliedBelowMinimum():
    order = buildTestOrder(items=[{sku: "A", price: 40, qty: 1}])
    discountCode = DiscountCode(percentage=10, minOrderValue=50)

    result = calculateDiscount(order, discountCode)

    assert result.discountAmount == 0
    assert result.finalTotal == 40

No infrastructure. These run in under one millisecond.
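The chapter does not show calculateDiscount itself. The following is one possible Python implementation that makes the two tests above executable. The types, the integer amounts, the rounding rule, and the inclusive minimum are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    sku: str
    price: int  # whole currency units, matching the chapter's example values
    qty: int


@dataclass(frozen=True)
class DiscountCode:
    percentage: int
    min_order_value: int


@dataclass(frozen=True)
class DiscountResult:
    discount_amount: int
    final_total: int


def calculate_discount(items: list[Item], code: DiscountCode) -> DiscountResult:
    """Apply a percentage discount only when the order meets the minimum value."""
    subtotal = sum(item.price * item.qty for item in items)
    if subtotal < code.min_order_value:
        return DiscountResult(discount_amount=0, final_total=subtotal)
    discount = subtotal * code.percentage // 100  # assumed rule: round down
    return DiscountResult(discount_amount=discount, final_total=subtotal - discount)


def test_discount_applied() -> None:
    result = calculate_discount([Item("A", 100, 1)], DiscountCode(10, 50))
    assert result.discount_amount == 10
    assert result.final_total == 90


def test_discount_not_applied_below_minimum() -> None:
    result = calculate_discount([Item("A", 40, 1)], DiscountCode(10, 50))
    assert result.discount_amount == 0
    assert result.final_total == 40


def test_discount_applied_at_exact_minimum() -> None:
    # Edge case: this sketch treats the minimum as inclusive.
    result = calculate_discount([Item("A", 50, 1)], DiscountCode(10, 50))
    assert result.discount_amount == 5
    assert result.final_total == 45
```

The third test covers the boundary value, the kind of edge case that is cheap to pin down at the unit layer and expensive to reach through an end-to-end test.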

Integration test for database interaction:

function testOrderRepositorySave():
    db = createTestDatabase()  # spins up a test DB or uses a transaction that rolls back
    repo = PostgresOrderRepository(db)
    order = Order.create(testCart, testCustomer)

    try:
        repo.save(order)

        storedOrder = db.queryOne("SELECT * FROM orders WHERE id = ?", order.id)
        assert storedOrder.status == "pending"
        assert storedOrder.total == order.totalPrice()
        assert storedOrder.customer_id == testCustomer.id
    finally:
        db.rollback()  # clean up even when an assertion fails
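A runnable version of the same idea can be sketched with Python's built-in sqlite3 module. An in-memory SQLite database stands in for PostgreSQL here purely so the example is self-contained; it will not reproduce PostgreSQL-specific behaviour, so a real suite would point at a PostgreSQL test instance. The OrderRepository class and the table layout are hypothetical.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Order:
    id: str
    customer_id: str
    total: int
    status: str = "pending"


class OrderRepository:
    """Hypothetical repository; SQLite stands in for PostgreSQL in this sketch."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn

    def save(self, order: Order) -> None:
        self.conn.execute(
            "INSERT INTO orders (id, customer_id, total, status) VALUES (?, ?, ?, ?)",
            (order.id, order.customer_id, order.total, order.status),
        )


def create_test_database() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row  # allows access to columns by name
    conn.execute(
        "CREATE TABLE orders ("
        " id TEXT PRIMARY KEY,"
        " customer_id TEXT NOT NULL,"
        " total INTEGER NOT NULL CHECK (total >= 0),"
        " status TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def test_order_repository_save() -> None:
    conn = create_test_database()
    try:
        order = Order(id="o-1", customer_id="c-1", total=90)
        OrderRepository(conn).save(order)
        row = conn.execute("SELECT * FROM orders WHERE id = ?", (order.id,)).fetchone()
        assert row["status"] == "pending"
        assert row["total"] == 90
        assert row["customer_id"] == "c-1"
    finally:
        conn.rollback()  # runs even if an assertion above fails
        conn.close()


def test_negative_total_violates_constraint() -> None:
    conn = create_test_database()
    try:
        try:
            OrderRepository(conn).save(Order(id="o-2", customer_id="c-1", total=-5))
        except sqlite3.IntegrityError:
            return  # the database, not the application code, rejected the row
        raise AssertionError("expected the CHECK constraint to reject a negative total")
    finally:
        conn.close()
```

The second test shows the kind of failure only this layer can surface: the constraint lives in the schema, so a unit test with a mocked repository would never exercise it.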

Consumer-driven contract test:

# Consumer (warehouse service) defines what it expects from orders service:
contract OrdersServiceContract:
    interaction GetOrder:
        request:
            method: GET
            path: /orders/123
        response:
            status: 200
            body:
                id: String
                status: String (one of: "pending", "confirmed", "shipped")
                items: List<{sku: String, quantity: Integer, unitPrice: Money}>
                total: Money

# Orders service runs this contract as a test:
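function testGetOrderMatchesContract():
    createTestOrder(id="123")  # seed the order that the contract's request refers to
    response = testClient.get("/orders/123")

    assert response.status == 200
    assert response.body.id == "123"
    assert response.body.status in ["pending", "confirmed", "shipped"]
    for item in response.body.items:  # the contract also specifies the shape of each item
        assert isString(item.sku)
        assert isInteger(item.quantity)
        assert isValidMoney(item.unitPrice)
    assert isValidMoney(response.body.total)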

When the orders service team changes the response schema, they run the contract tests from all consuming services. If any fail, they must either preserve backward compatibility or coordinate a versioned update.
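The provider-side check described above can be sketched in a few lines of Python. This is not Pact's API; CONSUMER_CONTRACTS, verify_contract, and broken_consumers are hypothetical names, and the contracts check only field presence and type.

```python
from typing import Any

# Each consumer lists only the fields it reads and the type it expects (hypothetical).
CONSUMER_CONTRACTS: dict[str, dict[str, type]] = {
    "warehouse": {"id": str, "status": str, "items": list},
    "billing": {"id": str, "total": dict},
}


def verify_contract(body: dict[str, Any], contract: dict[str, type]) -> list[str]:
    """Return a list of violations; an empty list means the contract holds."""
    violations = []
    for field, expected_type in contract.items():
        if field not in body:
            violations.append(f"missing field: {field}")
        elif not isinstance(body[field], expected_type):
            violations.append(f"wrong type for {field}: expected {expected_type.__name__}")
    return violations


def broken_consumers(body: dict[str, Any]) -> dict[str, list[str]]:
    """Run every consumer's contract against one provider response."""
    results = {}
    for consumer, contract in CONSUMER_CONTRACTS.items():
        violations = verify_contract(body, contract)
        if violations:
            results[consumer] = violations
    return results


current = {"id": "123", "status": "pending", "items": [],
           "total": {"amount": 90, "currency": "USD"}}
renamed = {"id": "123", "status": "pending", "items": [],
           "totalAmount": {"amount": 90, "currency": "USD"}}

print(broken_consumers(current))
print(broken_consumers(renamed))
```

With these sample contracts, the current response satisfies both consumers, while renaming total breaks only the billing consumer. The warehouse consumer never reads that field, so its contract stays green, which is the point of letting each consumer declare only what it uses.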

Tradeoffs

AT9 — Correctness/Performance: More tests at lower pyramid layers give better correctness assurance. A unit test that runs in one millisecond provides the same correctness signal for one function as an end-to-end test that runs in one minute — but the unit test enables immediate feedback. The performance (CI speed) of the test suite determines how often it is run, which determines how much it actually catches.
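A rough estimate shows how strongly the layer mix drives suite runtime. The per-test durations and the worker count below are assumptions chosen for illustration; the 9-second end-to-end figure was picked so that the first suite reproduces the introduction's 45 minutes, and the model ignores setup and scheduling overhead.

```python
def estimate_runtime_seconds(counts: dict[str, int],
                             seconds_per_test: dict[str, float],
                             workers: int = 1) -> float:
    """Total test time divided evenly across workers (no overhead modelled)."""
    total = sum(count * seconds_per_test[layer] for layer, count in counts.items())
    return total / workers


# Assumed average duration per test, in seconds.
SECONDS_PER_TEST = {"unit": 0.001, "integration": 0.5, "e2e": 9.0}

first_team = {"unit": 200, "integration": 0, "e2e": 3000}
second_team = {"unit": 8000, "integration": 400, "e2e": 50}

for name, counts in (("first", first_team), ("second", second_team)):
    minutes = estimate_runtime_seconds(counts, SECONDS_PER_TEST, workers=10) / 60
    print(f"{name} team: {minutes:.1f} minutes")
```

Under these assumptions the first suite takes about 45 minutes and the second about one minute, even though the second has more than twice as many tests. The exact figures depend on the assumed durations; the ratio is what matters, and it is dominated almost entirely by the end-to-end count.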

AT3 — Simplicity/Flexibility: Contract tests add process complexity: consumers must publish their contracts, producers must run consumer contracts, both parties must agree on versioning. For a system with one consumer and one producer, this overhead is not justified. For a system with ten consumers, contract tests prevent the coordination overhead of manually verifying every interface on every change.

Where It Fails

FM8 — Schema/Contract Violation: The primary failure mode that testing strategy exists to prevent. Contract tests catch these before deployment. End-to-end tests catch them after deployment to a test environment. Without any contract enforcement, schema violations reach production.

FM11 — Observability Blindness: A test suite that tests only happy paths provides false confidence. When the production system encounters an error case (timeout, partial failure, invalid input), the untested code path runs for the first time in production. Testing strategy must include explicit coverage of error paths: what happens when the database is unavailable, what happens when the external API returns a 500, what happens when the input is malformed.
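Error paths are usually easiest to cover at the unit layer, where a stub can force the failure on demand. The sketch below is a hypothetical example in Python: place_order, PaymentGatewayTimeout, and the two gateway stubs are invented for illustration, and the chosen behaviour on timeout (mark the order for retry) is an assumption rather than a recommendation.

```python
class PaymentGatewayTimeout(Exception):
    """Hypothetical error raised when the external API does not answer in time."""


class ApprovingGateway:
    """Stub for the happy path."""

    def charge(self, amount_cents: int) -> str:
        return "approved"


class TimingOutGateway:
    """Stub that always times out, so the error path runs in every test execution."""

    def charge(self, amount_cents: int) -> str:
        raise PaymentGatewayTimeout("no response from gateway")


def place_order(amount_cents: int, gateway) -> dict:
    """Unit under test: handles malformed input and a gateway timeout explicitly."""
    if amount_cents <= 0:
        return {"status": "rejected", "reason": "invalid_amount"}
    try:
        gateway.charge(amount_cents)
    except PaymentGatewayTimeout:
        return {"status": "pending_retry", "reason": "gateway_timeout"}
    return {"status": "confirmed", "reason": None}


def test_happy_path() -> None:
    assert place_order(1000, ApprovingGateway())["status"] == "confirmed"


def test_gateway_timeout_marks_order_for_retry() -> None:
    result = place_order(1000, TimingOutGateway())
    assert result == {"status": "pending_retry", "reason": "gateway_timeout"}


def test_malformed_amount_is_rejected_before_charging() -> None:
    # TimingOutGateway would raise if charge were called, so this also
    # checks that validation happens first.
    result = place_order(0, TimingOutGateway())
    assert result == {"status": "rejected", "reason": "invalid_amount"}
```

Two of the three tests exercise failure handling, so those branches run on every test execution instead of running for the first time in production.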

Real Systems

Google’s testing at scale: Google runs billions of tests per day across its codebase. The vast majority are unit tests. The testing infrastructure (TAP, the Test Automation Platform) prioritises running only the tests affected by a change, keeping per-change feedback time below five minutes even in a monorepo with billions of lines of code. The pyramid is enforced structurally: end-to-end tests require explicit justification.
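The idea of running only the tests affected by a change can be sketched as a walk over a reverse-dependency graph. This is a generic illustration, not a description of how TAP is implemented; the graph, the target names, and the test names are all hypothetical.

```python
from collections import deque

# Hypothetical reverse-dependency graph: target -> targets that depend on it.
REVERSE_DEPS = {
    "money": ["pricing", "orders"],
    "pricing": ["orders"],
    "orders": ["checkout_api"],
    "checkout_api": [],
    "search": [],
}

# Hypothetical mapping from each target to the tests that cover it directly.
TESTS_BY_TARGET = {
    "money": ["money_test"],
    "pricing": ["pricing_test"],
    "orders": ["orders_test", "orders_repo_integration_test"],
    "checkout_api": ["checkout_contract_test"],
    "search": ["search_test"],
}


def affected_tests(changed_targets: list[str]) -> list[str]:
    """Breadth-first walk from the changed targets to everything that depends on them."""
    seen = set(changed_targets)
    queue = deque(changed_targets)
    while queue:
        target = queue.popleft()
        for dependant in REVERSE_DEPS.get(target, []):
            if dependant not in seen:
                seen.add(dependant)
                queue.append(dependant)
    return sorted(test for target in seen for test in TESTS_BY_TARGET.get(target, []))


print(affected_tests(["pricing"]))
print(affected_tests(["search"]))
```

In this toy graph, a change to pricing selects the tests for pricing, orders, and checkout_api but skips money_test and search_test, while a change to search selects only search_test. The saving grows with the size of the codebase, because most changes touch a small part of the graph.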

Pact (consumer-driven contract testing): Pact is an open-source framework for consumer-driven contract testing. Consumers record their expectations as pact files. Providers (the services this chapter calls producers) verify themselves against those files. A Pact Broker stores the contracts: consumers publish to it and providers retrieve from it. The system makes contract verification part of each service’s CI, not a manual cross-team coordination step.

Netflix’s chaos engineering: Netflix tests at the top of the pyramid in production with Chaos Monkey — deliberately terminating random instances to verify that the system tolerates failures. This is an extreme form of end-to-end testing: test the failure modes, not just the success paths, in the actual production environment.

Stripe’s API testing: Stripe provides a test mode that closely mirrors production. API calls in test mode are designed to behave like their production counterparts, except that no real money moves. This is infrastructure-level support for integration testing: the integration test environment is officially maintained and behaves deterministically.

Concept: Testing Strategy

Thread: T9 (Consensus) ← Book 4, Ch 9 (Agreement protocols) → Ch 16 (Refactoring Techniques)

Core Idea: The test pyramid — many unit tests, fewer integration tests, fewer contract tests, very few end-to-end tests — optimises for fast feedback and high reliability; the inverted pyramid (more end-to-end than unit tests) produces slow, flaky suites that developers avoid.

Tradeoff: AT9 — Correctness/Performance: lower-layer tests give the same correctness signal as upper-layer tests in a fraction of the time; the total feedback loop speed determines how often tests are run and how much they actually prevent.

Failure Mode: FM8 — Schema/Contract Violation: service interface changes that are not caught by contract tests reach production as compatibility failures; without contract tests, schema changes require manual cross-team coordination.

Signal: When the test suite takes more than 10 minutes to run and developers routinely skip running tests locally, the test pyramid is probably inverted, with end-to-end tests dominating the runtime.

Maps to: Book 0, Framework 9 (Review Questions); P6 (Reproducibility), P13 (Fail Fast)

Exercises

Level 1 — Understand

1. Describe the test pyramid: what are the four layers, and what is the intended proportion of tests at each layer?

2. What is the difference between a unit test and an integration test? What kind of bug can an integration test catch that a unit test cannot?

3. What are consumer-driven contract tests, and which failure mode do they exist to prevent before deployment?

Level 2 — Apply

1. Classify each of the following tests into the correct pyramid layer and justify: (a) a test that creates a Money object and verifies that add(Money(10), Money(5)) returns Money(15); (b) a test that calls GET /api/orders/123 against a running service and checks the HTTP status is 200; (c) a test that inserts a row into a test database through a UserRepository and queries it back; (d) a test that opens a browser, logs in, adds a product to a cart, and checks out.

2. A team’s CI pipeline takes 38 minutes. Analyse this breakdown: unit tests 200, runtime 2 minutes; integration tests 150, runtime 18 minutes; end-to-end tests 80, runtime 18 minutes. Propose three changes to bring the total under 10 minutes without removing coverage. Justify each change using the pyramid.

3. An orders service changes its GET /orders/:id response to rename total to totalAmount. There are no contract tests. List all the downstream systems that might break, describe when each would discover the breakage, and calculate the total time-to-detection if deployment to production takes 2 hours.

Level 3 — Design

Design a complete testing strategy for a payment processing service. The service receives payment requests over HTTP, validates them, calls an external card network (Visa/Mastercard), stores results in PostgreSQL, and publishes events to Kafka. Define: which behaviours are covered at each pyramid layer, what test doubles are used and where, how contract tests are structured between this service and its consumers, and what end-to-end tests cover. Estimate test counts and CI runtime.

A complete answer will:

(1) Define behaviour coverage per layer. Unit tests cover validation logic (invalid card numbers, missing fields, amount limits) with no test doubles, because these are pure functions. Integration tests cover the database interaction (PostgreSQL writes and reads) using a real test database, not a mock. Integration tests for the card network use a stub server that returns configurable responses. Integration tests for Kafka use an embedded Kafka or a real Kafka container.

(2) Specify which test doubles are used where: a stub for the card network (controls the response to test decline, timeout, and approval scenarios without calling a real network); a spy or recording mock for Kafka (verifies that the correct event payload is published after a successful charge); no mocks for PostgreSQL (a real database gives higher confidence and avoids mock-object drift).

(3) Design contract tests using Pact. The payment service is the provider; its consumers (order service, notification service) define consumer-driven contracts specifying which fields they read from the PaymentProcessed event schema. The payment service’s CI runs Pact verification to confirm that the published event matches all consumer contracts before deployment.

(4) Scope end-to-end tests narrowly: cover only the two highest-risk scenarios (successful payment and declined payment) using a real staging environment. Estimated test counts: about 50 unit tests (fast, under 1 s in total), about 20 integration tests (2 to 5 minutes), and 2 end-to-end tests (5 to 10 minutes), for a total CI runtime of roughly 7 to 15 minutes.

A team is building a distributed system with six services. They have no contract tests and are considering implementing Pact. Argue for and against implementing Pact: what is the ongoing maintenance cost, what schemas are protected, under what team size and coordination cost does Pact start paying off, and what alternative (if any) provides similar protection with less process overhead?

A complete answer will:

(1) Make the argument for Pact. In a six-service system, API contract breaks are the primary source of integration failures: a provider team that changes an event schema without notifying consumers causes silent runtime failures (FM8, Schema/Contract Violation). Pact makes contracts explicit and catches breaks in CI before deployment, eliminating a class of production incidents.

(2) Make the argument against. Pact requires ongoing maintenance: every consumer must update its contract when it legitimately adds a new field, and every provider must run Pact verification in its CI pipeline. For a team of 6 engineers (one per service), the overhead of maintaining 6 × 5 = 30 potential consumer-provider contracts may exceed the value of the protection.

(3) State the break-even conditions. Pact pays off when services are owned by separate teams (different release cycles, no shared planning), the API surface is large (many fields that consumers depend on), and integration incidents are frequent enough to justify the maintenance burden. Below 3 services, or for a monorepo with coordinated releases, the break-even rarely occurs.

(4) Propose an alternative. Schema validation using a shared schema registry (e.g., Apache Avro with a Confluent Schema Registry, or JSON Schema validation on both publish and consume sides) protects against schema breaks with less process overhead than Pact. It does not require consumer teams to define explicit contracts, but it does require both sides to validate against a common schema, and it catches incompatible schema changes automatically.
