Before API gateways were common, every client — mobile app, web browser, third-party developer — spoke directly to backend services. Each service implemented its own authentication, its own rate limiting, its own logging. A mobile client needing one screen made seven requests to seven services, each with a different authentication scheme. When the security team added a new requirement, they updated seven codebases.
The API gateway emerged from a simple observation: certain concerns belong at the boundary between clients and services, not inside each service. Authentication, rate limiting, request routing, protocol translation, observability — these are cross-cutting concerns. Centralising them at the gateway simplifies every backend service and gives the platform team a single place to enforce policy.
That is the gain. There is also a loss.
The Seven Responsibilities an API Gateway Absorbs
A typical production gateway handles seven things, in roughly this order:
- TLS termination — decrypts incoming requests so backend services receive plain HTTP.
- Authentication — validates JWT, API key, or OAuth token before the request reaches a service.
- Authorisation — checks the token's scopes against the endpoint's required permissions.
- Rate limiting — applies per-client and per-endpoint quotas.
- Request routing — matches the URL and method against a routing table; proxies to the target service.
- Request/response transformation — protocol bridging (REST → gRPC), header injection, sometimes schema reshaping.
- Observability — emits a structured log entry, latency metric, and distributed trace span per request.
Each backend service is freed from implementing these. The gateway becomes the single enforcement point.
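The ordering above can be pictured as a chain of stages, any of which may stop the request with an error status. The following is a minimal sketch, not any real gateway's API: the `Request` type, the stage functions, and the in-memory `API_KEYS`, `REQUIRED_SCOPE`, and `ROUTES` tables are all hypothetical, and TLS termination, rate limiting, transformation, and observability are left out for brevity.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class Request:
    method: str
    path: str
    headers: dict
    context: dict = field(default_factory=dict)  # written by earlier stages

# Hypothetical in-memory tables standing in for real configuration.
API_KEYS = {"key-123": {"client": "mobile", "scopes": {"orders:read"}}}
REQUIRED_SCOPE = {("GET", "/orders"): "orders:read",
                  ("POST", "/orders"): "orders:write"}
ROUTES = {("GET", "/orders"): "orders-svc",
          ("POST", "/orders"): "orders-svc"}

# A stage returns None to continue or an HTTP status code to stop.
Stage = Callable[[Request], Optional[int]]

def authenticate(req: Request) -> Optional[int]:
    identity = API_KEYS.get(req.headers.get("x-api-key", ""))
    if identity is None:
        return 401
    req.context["identity"] = identity
    return None

def authorise(req: Request) -> Optional[int]:
    required = REQUIRED_SCOPE.get((req.method, req.path))
    if required is not None and required not in req.context["identity"]["scopes"]:
        return 403
    return None

def route(req: Request) -> Optional[int]:
    service = ROUTES.get((req.method, req.path))
    if service is None:
        return 404
    req.context["service"] = service
    return None

PIPELINE: List[Stage] = [authenticate, authorise, route]

def handle(req: Request, stages: List[Stage] = PIPELINE) -> int:
    for stage in stages:
        status = stage(req)
        if status is not None:
            return status  # short-circuit: later stages never run
    return 200  # stand-in for proxying to req.context["service"]
```

Because each stage only reads what earlier stages wrote into `context`, the order of the list is the policy: an unauthenticated request never reaches the routing table.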
Authentication and the Caching Tradeoff
The gateway validates credentials on every request before routing. Common patterns: JWT validation (the gateway checks the signature and expiry of a signed token), API key lookup (the gateway queries an auth database), OAuth2 token introspection (the gateway calls the auth service to validate the token).
The auth service call adds latency. To avoid paying it on every request, the gateway caches validation results for a short time. Two bounds apply. A cached result must never outlive the token's own expiry, or the gateway will accept an expired token. And the cache TTL sets the revocation window: a token revoked at the auth service is still accepted until its cached entry expires. This is the precise boundary where a security property (revocation takes effect immediately) is traded against a performance property (auth latency near zero).
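A minimal sketch of such a cache follows, under stated assumptions: `introspect` is a hypothetical callable standing in for the auth service call, returning the token's expiry as epoch seconds or `None` when the token is invalid or revoked, and `max_ttl` is a hypothetical setting for the longest revocation window the platform is willing to accept.

```python
import time
from typing import Callable, Dict, Optional

class IntrospectionCache:
    """Caches positive token-validation results for a bounded time."""

    def __init__(self,
                 introspect: Callable[[str], Optional[float]],
                 max_ttl: float,
                 clock: Callable[[], float] = time.time):
        self._introspect = introspect   # hypothetical auth-service call
        self._max_ttl = max_ttl         # upper bound on the revocation window
        self._clock = clock
        self._trusted_until: Dict[str, float] = {}

    def is_valid(self, token: str) -> bool:
        now = self._clock()
        trusted_until = self._trusted_until.get(token)
        if trusted_until is not None and now < trusted_until:
            return True                 # cache hit: no auth-service hop
        self._trusted_until.pop(token, None)
        expires_at = self._introspect(token)  # the slow network call
        if expires_at is None or expires_at <= now:
            return False                # invalid, revoked, or expired: not cached
        # Never trust the entry past the token's expiry or past max_ttl.
        self._trusted_until[token] = min(expires_at, now + self._max_ttl)
        return True
```

With `max_ttl=30`, a token revoked at the auth service keeps passing for at most 30 seconds. Lowering the value shrinks that window and raises the number of calls to the auth service.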
Rate Limiting: Token Bucket
Token bucket is the standard algorithm. Each client has a bucket with a maximum capacity. Tokens accumulate at a fixed rate (e.g., 100/second). Each request costs one token. When the bucket is empty, requests are rejected with 429 Too Many Requests.
tokens = min(capacity, tokens + rate × elapsed_seconds)
if tokens >= 1:
    tokens -= 1
    allow request
else:
    reject with 429
The state must live in a shared store, usually Redis, so that multiple gateway instances apply consistent limits. With a per-process counter on each gateway node, a client can reach up to N× the intended rate by spreading requests across N nodes. A centralised Redis solves this but becomes a new single point of failure (FM1).
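The refill rule and the multi-node problem can be sketched in a few lines. This is an in-process illustration only, assuming a single thread and an injectable clock; the `TokenBucket` class and the `admitted` helper are hypothetical names. In a shared store the refill-and-take step would also need to run atomically.

```python
import time

class TokenBucket:
    def __init__(self, capacity: float, rate: float, clock=time.monotonic):
        self.capacity = capacity      # maximum burst size
        self.rate = rate              # tokens added per second
        self._clock = clock
        self._tokens = capacity       # start full
        self._last = clock()

    def allow(self) -> bool:
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill for the time that has passed, capped at capacity.
        self._tokens = min(self.capacity, self._tokens + self.rate * elapsed)
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False                  # caller responds 429 Too Many Requests


def admitted(buckets, requests: int) -> int:
    """Spread requests round-robin over buckets; count how many pass."""
    return sum(buckets[i % len(buckets)].allow() for i in range(requests))


frozen = lambda: 0.0  # no time passes, so no refill happens

# One shared bucket: a 30-request burst is capped at the capacity of 10.
shared = admitted([TokenBucket(10, 5, clock=frozen)], 30)

# Three nodes with private buckets: the same burst gets 3 x 10 through.
per_node = admitted([TokenBucket(10, 5, clock=frozen) for _ in range(3)], 30)
```

Here `shared` is 10 and `per_node` is 30, which is the N× burst that a shared store is meant to prevent.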
Backend for Frontend: Multi-Client Reality
A single generic gateway serves all clients, which creates a tension: mobile clients need compact responses, web clients need richer data, third-party APIs need stable contracts. The Backend for Frontend (BFF) pattern creates a separate gateway per client type — one for mobile, one for web, one for third-party. Each BFF aggregates multiple backend calls into a single response tailored to its client.
The decision is AT8 (Coupling vs Cohesion). A single gateway couples all three client contracts to one deployment, so the mobile and third-party APIs cannot evolve independently. Separate BFFs allow each client's API shape to change without coordinating with the others. The cost is operating three deployments instead of one.
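As an illustration of the aggregation step, the sketch below shows two BFF handlers calling the same backends and shaping different responses. The backend functions, their payloads, and the handler names are hypothetical stand-ins; real calls would be HTTP or gRPC requests.

```python
import asyncio

# Hypothetical backend calls; asyncio.sleep(0) stands in for network I/O.
async def fetch_profile(user_id: str) -> dict:
    await asyncio.sleep(0)
    return {"id": user_id, "name": "Ada", "email": "ada@example.com"}

async def fetch_orders(user_id: str) -> list:
    await asyncio.sleep(0)
    return [{"id": "o1", "total": 42.0, "items": ["a", "b"]},
            {"id": "o2", "total": 8.5, "items": ["c"]}]

async def mobile_home(user_id: str) -> dict:
    """Mobile BFF: one round trip for the client, compact payload."""
    profile, orders = await asyncio.gather(fetch_profile(user_id),
                                           fetch_orders(user_id))
    return {"name": profile["name"], "order_count": len(orders)}

async def web_home(user_id: str) -> dict:
    """Web BFF: same backends, richer payload."""
    profile, orders = await asyncio.gather(fetch_profile(user_id),
                                           fetch_orders(user_id))
    return {"profile": profile, "orders": orders}

mobile_view = asyncio.run(mobile_home("u1"))
web_view = asyncio.run(web_home("u1"))
```

The two handlers can change shape independently, which is the decoupling the pattern buys. The duplicated fan-out code is a small-scale picture of the operating cost.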
What You Lose by Centralising
This is the part most engineers underplay when they introduce a gateway. The gateway is a single point of failure (SPOF, FM1) and a latency hop (FM5) by design.
AT5 (Centralisation vs Distribution) is the explicit tradeoff. Centralising auth, rate limiting, and routing reduces duplication in services. The cost is a single point every request passes through: a bottleneck, and a SPOF if not replicated. Horizontal scaling behind a load balancer and stateless gateway design (rate-limit state in Redis, no local state) mitigate both, but the load balancer itself must be redundant. Now you have three components on the critical path (load balancer, gateway tier, Redis) instead of one.
FM5 (Latency Amplification). A request through the gateway crosses two network hops (client → gateway, gateway → service) instead of one, and pays for the gateway's own processing time. If the auth cache is cold, a third hop to the auth service is added. Slow gateway processing (regex-heavy routing, synchronous auth calls without caching, serial request transformation) amplifies latency for all requests. The math is unforgiving: if gateway P99 is 50ms and backend P99 is 100ms, end-to-end P99 can approach 150ms, not 100ms. Percentiles do not add exactly, but when the slow cases coincide the sum is what the client sees.
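The 50ms + 100ms arithmetic can be checked on a small synthetic sample. The figures below are invented for illustration: 100 requests, each tier fast on 98 of them and slow on 2. The sketch compares the case where gateway and backend are slow on the same requests with the case where they are slow on different ones, using a nearest-rank percentile.

```python
def p99(samples):
    """Nearest-rank 99th percentile, using integer arithmetic for the rank."""
    ordered = sorted(samples)
    rank = (99 * len(ordered) + 99) // 100   # ceil(0.99 * n)
    return ordered[rank - 1]

# Hypothetical per-request latencies in milliseconds.
gateway = [5] * 98 + [50] * 2
backend_same = [20] * 98 + [100] * 2     # slow on the same two requests
backend_other = [100] * 2 + [20] * 98    # slow on two different requests

coincident = [g + b for g, b in zip(gateway, backend_same)]
independent = [g + b for g, b in zip(gateway, backend_other)]

gateway_p99 = p99(gateway)           # 50
backend_p99 = p99(backend_same)      # 100
coincident_p99 = p99(coincident)     # 150
independent_p99 = p99(independent)   # 105
```

In both cases the end-to-end P99 is worse than the backend's 100ms. The full 150ms appears when slow gateway processing and slow backend responses land on the same requests, which is plausible under load, since both tiers tend to slow down together.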
FM3 (Unbounded Resource Consumption). Rate limiting is the defence against clients sending unbounded request volumes. Without it, a misbehaving client — or a DoS attack — exhausts gateway and backend resources, degrading service for all clients. The gateway must rate-limit itself before it can protect the services behind it.
When Not to Use a Gateway
There is a small but real set of cases:
- Internal-only services with no external clients. Most of the edge concerns the gateway centralises do not apply here; an internal service mesh (Envoy as sidecar) often fits better.
- Single-client systems. If you have one frontend and a small number of backend services, a thin reverse proxy is enough. The full gateway costs more than it saves.
- Ultra-low-latency paths where every millisecond matters and the centralised hop is unacceptable. These paths bypass the gateway entirely.
The default for multi-client, multi-service systems is still: use a gateway. The cases above are the exceptions.
The Four Real Systems Engineers Pick From
Kong: open-source, built on Nginx. Its plugin model composes auth, rate limiting, and logging at the route level. Widely used in Kubernetes deployments.
AWS API Gateway: managed, with native Lambda integration and automatic scaling. Higher per-request cost than self-managed gateways; harder to customise for complex routing.
Envoy: high-performance proxy used as a service mesh sidecar or gateway. Its xDS discovery service APIs support dynamic configuration without restart. Data plane for Istio and AWS App Mesh.
Nginx: the long-established reverse proxy. Fast and well-understood; requires manual configuration management. Changing routes without a configuration reload needs Nginx Plus or community modules.
How It Scales
At 10× current load: the gateway tier scales horizontally. Rate-limit state moves to a dedicated Redis cluster. Configuration management — routing rules, rate-limit policies — moves to a control plane with a central configuration store and dynamic push to gateway instances. No deployment is required to change a routing rule.
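One way to picture the "dynamic push" is a routing table that each gateway instance holds as an immutable snapshot and replaces in one step when the control plane sends a new version. This is a sketch under assumptions, not how any particular gateway implements it: the `RoutingTable` class, the version numbering, and the route format are hypothetical.

```python
import threading
from types import MappingProxyType

class RoutingTable:
    """Holds a read-only snapshot of routes that can be swapped at runtime."""

    def __init__(self, routes: dict):
        self._lock = threading.Lock()
        self._version = 0
        self._routes = MappingProxyType(dict(routes))

    def resolve(self, method: str, path: str):
        # Request threads read whichever snapshot is current; lookups take no lock.
        return self._routes.get((method, path))

    def apply(self, version: int, routes: dict) -> bool:
        """Install a pushed configuration; ignore stale or duplicate versions."""
        with self._lock:
            if version <= self._version:
                return False
            # Build the new snapshot first, then swap the reference in one step.
            self._routes = MappingProxyType(dict(routes))
            self._version = version
            return True


table = RoutingTable({("GET", "/orders"): "orders-v1"})
before = table.resolve("GET", "/orders")
accepted = table.apply(2, {("GET", "/orders"): "orders-v2"})
after = table.resolve("GET", "/orders")
stale = table.apply(1, {("GET", "/orders"): "orders-v0"})
```

The version check matters once pushes can arrive out of order: a delayed older configuration must not overwrite a newer one.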
At 100×: a global gateway deployment routes traffic to the geographically nearest data centre. TLS termination happens at the regional edge. The gateway tier is now a distributed system in its own right, with its own consistency, replication, and failure modes.
This article extracts the core of Book 4, Chapter 6 — API Gateway. The chapter includes the full gateway pipeline diagram (TLS → Auth → Authz → Rate Limit → Routing → Transform → Observability), the auth-cache TTL calculation, the BFF architecture diagram, the rate-limiter math problem (50-request burst against a 100-token bucket), and the worked design problem for a fintech gateway serving three client types with shared Redis rate-limit state.