Designing a Rate Limiter: Token Bucket vs Sliding

A fixed window counter set to 1,000 requests per hour is wrong by design. A client sends 1,000 requests at 11:59 PM. The window resets at midnight. The client sends 1,000 more at 12:01 AM. Two thousand requests in two minutes. The rate limiter approved every one.

This is not an edge case. It is a consequence of how the simplest algorithm works. Every rate limiter algorithm is an answer to a specific version of this problem — and each answer introduces a different cost.

What a Rate Limiter Actually Protects

Rate limiting is not about blocking bad actors. It is about keeping the system alive for good ones.

A single misbehaving client — batch job, retry storm, misconfigured integration — can consume capacity meant for thousands of normal users. A rate limiter caps each client's share. The rest of the system stays healthy.

There are four algorithms in common use. Each solves a different version of the problem.

Algorithm 1: Token Bucket

A bucket holds tokens up to a fixed capacity. Tokens refill at a constant rate. Each request consumes one token. If the bucket is empty, the request is rejected.

Capacity: 10 tokens
Refill:   2 tokens/second

t=0s  [##########]  10 tokens — full
      request burst: 10 requests consume all tokens
t=0s  [          ]  0 tokens — empty, next requests rejected
t=1s  [##        ]  2 tokens added
t=2s  [####      ]  2 more added

Token bucket allows short bursts up to the bucket capacity. A client can spend saved tokens quickly, then must wait for refill. This matches how most real APIs behave — occasional spikes are fine, sustained overload is not.

Kafka, AWS API Gateway, and most commercial API gateways use token bucket or a close variant.

Algorithm 2: Leaky Bucket

Requests enter a queue. The queue drains at a fixed rate. If the queue is full, new requests are dropped.

Requests --> [queue: max 5] --> [drain: 2/sec] --> handler
             ^                                        
         full? drop

Leaky bucket smooths traffic into a constant output rate. No bursting. The bucket leaks at exactly one rate regardless of how fast requests arrive. This suits systems where downstream can only handle a fixed throughput — a payment processor, a third-party SMS gateway.

The cost: bursty legitimate traffic gets queued or dropped, even when the system could handle it.

Algorithm 3: Fixed Window Counter

Divide time into windows (1 minute, 1 hour). Count requests per window. Reject requests once the count exceeds the limit.

Window: 60 seconds, limit: 100 requests

[00:00 - 01:00]  counter: 100  <-- limit hit at 00:55
[01:00 - 02:00]  counter: 0    <-- resets at window boundary

Simple to implement. Cheap to store (one counter per client per window). The problem: a client can send 2× the limit by straddling a boundary.

Limit: 100/minute

00:30 - 01:00  --> 100 requests  (last 30s of window 1)
01:00 - 01:30  --> 100 requests  (first 30s of window 2)

In 60 seconds: 200 requests, both windows report "OK"

That boundary problem is real. It surfaces in production as quota exhaustion with no single window showing a violation.

Algorithm 4: Sliding Window Log

Store a timestamp for every request. On each new request, count timestamps in the past window (e.g., last 60 seconds). Reject if count exceeds the limit.

Now = 01:30, window = 60s, limit = 100

Timestamps in [00:30 - 01:30]: count = 87 --> allow
Timestamps in [00:30 - 01:30]: count = 100 --> reject

Sliding window log fixes the boundary problem. The window moves continuously, not in jumps. The cost: storing timestamps for every request. At high traffic, the log grows large and lookups get expensive.

A sliding window counter is a compromise — store counts per sub-window (e.g., 6×10s buckets for a 60s limit) and compute an approximation. Less accurate, much cheaper.

Where State Lives

The choice of algorithm only matters if you can store and read state fast enough.

In-memory: fastest, not distributed. Works for single-server deployments. Fails when you add a second server — each server has a separate counter.

Redis: distributed and fast. Use INCR with EXPIRE for fixed windows. Use sorted sets for sliding window logs. Redis is the standard answer for distributed rate limiting.

Approximate structures: HyperLogLog and Count-Min Sketch trade accuracy for memory. Useful for very high-cardinality rate limiting (millions of unique client IDs). Accept some error in exchange for predictable memory use.

Single server:   [Client] --> [App + counter in RAM]
                                      |
                               fast, not shared

Distributed:     [Client] --> [App 1] -->|
                 [Client] --> [App 2] --> [Redis counter] --> shared, consistent
                 [Client] --> [App 3] -->|

The 429 Response and Why Retry-After Matters

When you reject a request, return 429 Too Many Requests. Include a Retry-After header.

HTTP/1.1 429 Too Many Requests
Retry-After: 30
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1683000030

Without Retry-After, clients retry immediately. Immediate retries generate more 429s. More 429s trigger more retries. This is a thundering herd — the rate limiter itself becomes the source of overload.

With Retry-After, well-behaved clients back off for the specified interval. Traffic smooths. The system recovers. Most HTTP client libraries respect this header automatically — but only if you send it.

Choosing an Algorithm

Algorithm	Burst allowed	Boundary safe	Storage cost	Use when
Token bucket	Yes	Yes	Low	General-purpose APIs
Leaky bucket	No	Yes	Low	Fixed-rate downstream
Fixed window	Yes	No	Very low	Simple, approximate limiting
Sliding window log	Yes	Yes	High	Strict accuracy required

Token bucket is the right default. It allows normal bursting, prevents sustained overload, and is cheap to implement with a counter and a timestamp.

The Line Worth Repeating

Rate limiting is not about blocking bad actors. It is about keeping the system alive for good ones.

Design the limiter to protect capacity, return honest error responses with retry guidance, and distribute state if you run multiple servers. The batch job that consumed the shared quota at 2 AM was not malicious. It just needed a limit. A well-designed rate limiter would have given it one — and left the quota intact for everyone else.