The email service went down for four hours. Nobody noticed. The operations team found thousands of messages waiting in a queue, intact and undelivered. The service came back up. The emails went out. No data was lost.
This is not luck. It is the one property that makes message queues worth their complexity: the producer and consumer do not need to be alive at the same time.
What a Queue Actually Does
A queue sits between two components: a producer and a consumer. The producer writes messages into the queue. The consumer reads them out. Those two things happen independently.
Producer         Queue                Consumers
--------         -----                ---------
[Service A] -->  [msg][msg][msg] -->  [Worker 1]
                 [msg][msg][msg] -->  [Worker 2]
                 [msg][msg]      -->  [Worker 3]
The producer does not wait for the consumer to be ready. The consumer does not need the producer to be alive. This is called temporal decoupling. It is the property that saved those emails during the four-hour outage.
Without a queue, both sides must be running at the same time. With a queue, they only need to agree on a message format.
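Temporal decoupling can be sketched in a few lines. This is a toy, not a broker: an in-memory `collections.deque` stands in for the queue, and the names `produce` and `drain` are illustrative assumptions.

```python
from collections import deque

# In-memory stand-in for a broker; a real queue would persist messages.
pending = deque()

def produce(message):
    """Append a message; the producer never checks whether a consumer exists."""
    pending.append(message)

def drain():
    """Consume everything currently waiting, oldest first."""
    delivered = []
    while pending:
        delivered.append(pending.popleft())
    return delivered

# The consumer is "down": the producer keeps writing anyway.
for i in range(3):
    produce(f"email-{i}")

# Later the consumer comes back and works through the backlog.
delivered = drain()
```

The producer and consumer never call each other. The only thing they share is the message format.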
The Three Things Queues Guarantee (and the One They Don't)
Queues offer three properties, depending on configuration.
Durability. Messages survive restarts. Kafka writes to disk. SQS replicates across availability zones. A durable queue holds messages even if the broker crashes.
Delivery guarantees. There are three modes, and they cost different amounts.
- At-most-once: the message is sent once. If the consumer fails, it is dropped. Fast, but lossy.
- At-least-once: the message is retried until acknowledged. No data loss, but the consumer may process it more than once.
- Exactly-once: each message is processed precisely one time. Correct, but expensive — it requires coordination between the broker and the consumer's storage layer.
Most production systems run at-least-once and make their consumers idempotent. That costs less than exactly-once and, unlike at-most-once, loses no messages.
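A rough simulation of at-least-once delivery shows where the duplicates come from. The `AtLeastOnceQueue` class is a hypothetical toy, not any broker's API: a message stays at the head of the queue until it is acknowledged.

```python
from collections import deque

class AtLeastOnceQueue:
    """Toy broker: a message stays queued until the consumer acknowledges it."""

    def __init__(self, messages):
        self._pending = deque(messages)

    def receive(self):
        # Peek, do not remove: an unacknowledged message is delivered again.
        return self._pending[0] if self._pending else None

    def ack(self):
        # Only an explicit ack removes the message.
        self._pending.popleft()

deliveries = []        # every delivery attempt, duplicates included
crashed_once = False

broker = AtLeastOnceQueue(["order-1", "order-2"])
while (msg := broker.receive()) is not None:
    deliveries.append(msg)
    if msg == "order-1" and not crashed_once:
        crashed_once = True   # simulate a crash after processing, before the ack
        continue
    broker.ack()
```

The consumer sees `order-1` twice because the first attempt never acknowledged it. Nothing is lost, but the consumer has to tolerate the repeat.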
Fan-out. One message can reach multiple consumers. A single order event triggers an inventory update, an email confirmation, and a fraud check, all from one write to the queue. In Kafka, this is not a broadcast: each consumer group reads the entire topic independently at its own offset. The email service and the analytics service each read the same messages, at different speeds, without interfering with each other. A single SQS queue does not support this: once a consumer processes and deletes a message, it is gone for every other reader. The usual workaround is to publish to an SNS topic that copies each message into one SQS queue per subscriber.
This is the structural difference that matters: traditional queues delete messages after consumption; Kafka retains them for a configurable period (default 7 days). Multiple independent consumer groups can each replay the entire topic from any offset. If the analytics service falls behind, it catches up from where it left off — it does not lose messages because the notification service already consumed them.
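The retained-log model can be sketched as an append-only list plus one offset per consumer group. `RetainedLog` and its methods are illustrative names, not Kafka's client API, and retention expiry is left out.

```python
from typing import Dict, List

class RetainedLog:
    """Toy append-only log: reading does not delete anything."""

    def __init__(self) -> None:
        self.records: List[str] = []
        self.offsets: Dict[str, int] = {}   # committed offset per consumer group

    def append(self, record: str) -> None:
        self.records.append(record)

    def poll(self, group: str, max_records: int) -> List[str]:
        """Return the next records for a group and advance only that group's offset."""
        start = self.offsets.get(group, 0)
        batch = self.records[start:start + max_records]
        self.offsets[group] = start + len(batch)
        return batch

    def lag(self, group: str) -> int:
        """Records appended but not yet read by this group."""
        return len(self.records) - self.offsets.get(group, 0)

log = RetainedLog()
for i in range(5):
    log.append(f"order-{i}")

email_batch = log.poll("email", max_records=5)          # fast group reads everything
analytics_batch = log.poll("analytics", max_records=2)  # slow group reads two
```

The slow group is three records behind, and those records are still there. The fast group's reads did not remove them.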
The one thing queues do not reliably guarantee: ordering. Within a single Kafka partition, order holds. Across partitions, it does not. In SQS standard queues, it does not. If your business logic requires strict ordering per key (all events for a given user in sequence), use the key as the partition key — Kafka routes all messages with the same key to the same partition.
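Key-based routing is a stable hash of the key, taken modulo the partition count. The sketch below uses CRC32 as a stand-in; Kafka's default partitioner uses a murmur2 hash, so the actual partition numbers would differ. The partition count and event data are assumptions for the example.

```python
import zlib

NUM_PARTITIONS = 4  # assumed partition count for the example

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition with a stable hash (CRC32 as a stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

events = [
    ("user-1", "login"),
    ("user-2", "login"),
    ("user-1", "purchase"),
    ("user-1", "logout"),
]

# Each partition is an ordered list; order holds only within one partition.
partitions = {p: [] for p in range(NUM_PARTITIONS)}
for key, event in events:
    partitions[partition_for(key)].append((key, event))

# All of user-1's events sit in one partition, in the order they were written.
user1_events = [e for k, e in partitions[partition_for("user-1")] if k == "user-1"]
```

One caveat: the mapping depends on the partition count. Adding partitions later changes where new messages for an existing key land.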
Back-Pressure: What Happens When Consumers Fall Behind
Producers are often faster than consumers. An API receiving requests writes to a queue in microseconds. A consumer calling an external service takes 200 milliseconds per message. The queue fills.
The queue absorbs the speed difference, for a while. Strictly, back-pressure is the signal that tells producers to slow down once that buffer runs out. If consumers stay slow, one of three things happens depending on your configuration:
- The queue grows unbounded. You run out of memory or disk.
- The queue hits its size limit. Producers block and wait.
- The queue drops new messages silently.
The third outcome is the worst. You lose data without knowing it. Always configure an explicit overflow policy and alert on queue depth. Back-pressure is a signal, not a failure. A filling queue tells you consumers need more capacity. A queue dropping messages silently is a bug.
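An explicit overflow policy might look like the sketch below, using Python's standard `queue.Queue` with a size limit. The `offer` helper and the `rejected` list are hypothetical names; in a real system the rejection would feed a metric or an alert.

```python
import queue

buffer = queue.Queue(maxsize=3)   # explicit size limit instead of unbounded growth
rejected = []                     # overflow is recorded, never silently dropped

def offer(message):
    """Try to enqueue without blocking; report overflow to the caller."""
    try:
        buffer.put_nowait(message)
        return True
    except queue.Full:
        rejected.append(message)
        return False

results = [offer(f"msg-{i}") for i in range(5)]
depth = buffer.qsize()            # the number to alert on
```

Swapping `put_nowait` for a blocking `put` with a timeout gives the second outcome from the list: producers wait instead of failing. Either way, the choice is written down rather than left to a default.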
Dead Letter Queues
Some messages always fail. The consumer throws an exception. The retry limit exhausts. The message has nowhere to go.
Without a dead letter queue (DLQ), failed messages either block the queue or vanish. With a DLQ, they move to a separate queue for inspection.
Queue --> [Consumer] --> success --> ack
                     --> failure
                            |
                        retry (x3)
                            |
                    [Dead Letter Queue]
                            |
                     inspect / replay
A DLQ is a debugging instrument. It shows you which messages your system could not handle and why. Every production queue needs one. Without it, silent failures stay invisible until a customer calls.
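The retry-then-park flow can be sketched as follows. The handler, the message fields, and the reading of "x3" as three attempts in total are all assumptions for illustration. Managed brokers usually provide this as configuration rather than code.

```python
from collections import deque

MAX_ATTEMPTS = 3   # assumed: three attempts in total before giving up

def handle(message):
    """Hypothetical handler: rejects messages that lack a recipient."""
    if "to" not in message:
        raise ValueError("missing recipient")

def run(messages):
    main = deque((m, 0) for m in messages)   # (message, attempts so far)
    processed, dead_letters = [], []
    while main:
        message, attempts = main.popleft()
        try:
            handle(message)
            processed.append(message)
        except Exception as exc:
            attempts += 1
            if attempts >= MAX_ATTEMPTS:
                # Keep the error with the message so it can be inspected later.
                dead_letters.append(
                    {"message": message, "error": str(exc), "attempts": attempts}
                )
            else:
                main.append((message, attempts))   # requeue for another try
    return processed, dead_letters

processed, dead_letters = run([{"id": 1, "to": "a@example.com"}, {"id": 2}])
```

The failing message does not block the good one, and it does not vanish. It ends up in `dead_letters` with the reason attached.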
When Not to Use a Queue
A queue works when the producer does not need an answer right now. Order processing, email delivery, image resizing, audit logging — all appropriate.
A queue is wrong when you need a synchronous response. A user submits a login form. Your server must check credentials and return a session token within 200 milliseconds. You cannot queue that work and come back later. The user is waiting.
Request/response patterns belong to direct HTTP calls or RPCs. Fire-and-forget patterns belong to queues. Mixing them creates either broken UX or unnecessary complexity.
Delivery Semantics in Practice
At-least-once is the default for most teams. It requires one discipline: make your consumer idempotent.
An idempotent consumer produces the same result whether it processes a message once or ten times. Store the message ID. Check it before acting. If you have already processed message order-8823, skip it on the second delivery.
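A minimal sketch of that check, assuming each message carries a unique `id` field. The in-memory set stands in for durable storage, and the `consume` function and `emails_sent` list are illustrative names.

```python
processed_ids = set()   # in production this would live in durable storage
emails_sent = []

def consume(message):
    """Process a message once; return False when it is a duplicate delivery."""
    if message["id"] in processed_ids:
        return False                      # already handled, skip the side effect
    emails_sent.append(message["body"])   # the side effect we must not repeat
    processed_ids.add(message["id"])
    return True

first = consume({"id": "order-8823", "body": "Your order shipped"})
second = consume({"id": "order-8823", "body": "Your order shipped"})
```

The second delivery is recognised and skipped. In a real consumer the side effect and the ID record should be committed together, for example in one database transaction. Otherwise a crash between the two steps reopens the duplicate window.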
Exactly-once looks appealing. Kafka supports it within a single cluster. But across a database write and a queue acknowledgment, you need distributed transactions. Those are expensive and failure-prone. Most teams avoid exactly-once and invest in idempotency instead.
Two Failure Modes Worth Naming
Consumer lag (FM3 — Unbounded Resource Consumption). A queue can grow without bound when consumers process messages more slowly than producers write them. Kafka does not apply back-pressure to producers — it accepts messages at full speed regardless of how far behind consumers are. A notification service processing at 8,000/sec receiving 10,000/sec will be 7.2 million messages behind after an hour. Monitor consumer lag (the offset gap between latest produced and latest committed message) and alert before it becomes a multi-hour backlog.
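The arithmetic behind that figure, plus the recovery time it implies, as a small worked example. Steady rates are assumed, and the 12,000/sec scale-up is a hypothetical number chosen for illustration.

```python
def lag_after(produce_rate, consume_rate, seconds):
    """Backlog accumulated when producers outpace consumers at steady rates."""
    return max(0, produce_rate - consume_rate) * seconds

def catch_up_seconds(backlog, produce_rate, consume_rate):
    """Time to clear a backlog; infinite unless consumers outpace producers."""
    surplus = consume_rate - produce_rate
    return backlog / surplus if surplus > 0 else float("inf")

# 10,000/sec in, 8,000/sec out, for one hour.
backlog = lag_after(produce_rate=10_000, consume_rate=8_000, seconds=3_600)

# Hypothetical scale-up to 12,000/sec while producers keep writing at 10,000/sec.
recovery = catch_up_seconds(backlog, produce_rate=10_000, consume_rate=12_000)
```

Matching the producer rate only stops the lag from growing. Clearing it takes surplus capacity, and with a 2,000/sec surplus an hour of backlog takes another hour to drain.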
Schema evolution (FM8 — Contract Violation). Kafka retains messages for days or weeks. A consumer running on code from last week may receive messages produced by code from today. When the message schema changes — a field added, renamed, removed — the consumer fails to deserialise newer messages, or silently processes them incorrectly. The failure is often delayed: everything works until the first message produced after the schema change arrives. Use a schema registry with compatibility enforcement (backward or forward compatibility) before any schema changes reach production.
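On the consumer side, a compatible change often comes down to optional fields with defaults. The sketch below uses plain JSON and hypothetical field names (`order_id`, `amount`, `currency`, `coupon`). A schema registry enforces the same rule at publish time instead of trusting each reader.

```python
import json

def parse_order(raw):
    """Read old and new messages: the newer schema added an optional 'currency'."""
    data = json.loads(raw)
    return {
        "order_id": data["order_id"],              # required in every version
        "amount": data["amount"],                  # required in every version
        "currency": data.get("currency", "USD"),   # default covers old messages
    }

old_message = '{"order_id": "o-1", "amount": 20}'
new_message = '{"order_id": "o-2", "amount": 35, "currency": "EUR", "coupon": "X"}'

old_order = parse_order(old_message)   # written before the field existed
new_order = parse_order(new_message)   # carries the new field and an unknown one
```

The reader fills in a default for the missing field and ignores the unknown one. Renaming or removing a required field has no such escape hatch, which is why those changes need compatibility checks before they ship.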
The Queue Is Not a Fix
A queue does not make slow consumers faster. It does not prevent data loss if the broker crashes without durability configured. It does not simplify ordering.
What it does — and does reliably — is absorb the difference between when work arrives and when work gets done.
A queue is a promise that the work will happen — not that it will happen now.
That distinction determines when to use one. If the promise is enough, use a queue. If the caller needs an answer immediately, make a direct call.