Durable Work in Postgres: Part 0
Implement a durable work queue in Postgres
How to implement common queue and messaging patterns in Postgres: competing consumers, transactional outbox, idempotent handlers, leases, and dead letters.
You may not need Kafka, Redis, SQS, or a workflow engine for durable work inside one service. If your app already depends on Postgres, store work as rows, claim it with SKIP LOCKED, recover crashed workers with leases, and inspect the backlog with SQL.
Build it in this order
| Step | Ship when | What you add | Stop if |
|---|---|---|---|
| Durable row + claim loop | You need work to survive deploys | inbox, pending, processing, SKIP LOCKED | One worker is enough |
| Recovery + idempotency | A worker can die mid-handler | Leases, lease cleanup, retry, dead letter, handler dedupe | You can tolerate at-least-once delivery |
| Multiple workers | One process cannot keep up | Competing consumers, per-key ordering guard when needed | Worker count is stable |
| Stable ownership | Autoscale or deploy churn moves too much work | Worker heartbeats, hash ring, bucket-filtered claims | Rebalances are rare |
| Production controls | Long handlers, deploy drains, hot keys, or idle polling show up | Renewal, fencing, drain, housekeeping, key sharding, notify wakeups | The queue has runbooks and tests |
| Boundaries and transport | Work leaves this database | Transactional outbox, broker/webhook transport, receiver inbox, contracts | You have measurable eventual consistency |
| Durable workflows | One business process has multiple durable steps | Workflow instances, step rows, durable sleeps, signals, cancellation | Workflow orchestration becomes platform infrastructure |
What problem this solves
Checkout for order:9182 succeeds. The API returns 200. The receipt email is supposed to be sent after the order commits, but the worker restarts during a deploy. If the work lived only in memory, it is gone: there is no row to claim, retry, inspect, or dead-letter.
The durable version is standard queue design implemented in Postgres: commit the business change and the work row together, let workers claim rows, recover expired leases, and make side effects safe to run more than once.
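As a minimal sketch of committing the business change and the work row together, the transaction below writes both for order:9182. The `orders` table, the `payload`, `attempts`, and `created_at` columns, and the key format are illustrative assumptions, not a schema this series prescribes.

```sql
-- Hypothetical schema: table and column names are assumptions for illustration.
CREATE TABLE IF NOT EXISTS orders (
    id     bigint PRIMARY KEY,
    status text NOT NULL
);

CREATE TABLE IF NOT EXISTS inbox (
    id               bigserial PRIMARY KEY,
    idempotency_key  text NOT NULL UNIQUE,
    partition_key    text,
    payload          jsonb NOT NULL,
    status           text NOT NULL DEFAULT 'pending'
        CHECK (status IN ('pending', 'processing', 'completed',
                          'failed', 'dead_letter')),
    attempts         integer NOT NULL DEFAULT 0,
    claimed_by       text,
    lease_expires_at timestamptz,
    created_at       timestamptz NOT NULL DEFAULT now()
);

-- The order and its receipt job commit or roll back together.
BEGIN;

INSERT INTO orders (id, status)
VALUES (9182, 'paid');

INSERT INTO inbox (idempotency_key, partition_key, payload)
VALUES (
    'receipt-email:order:9182',
    'order:9182',
    '{"type": "send_receipt_email", "order_id": 9182}'
)
ON CONFLICT (idempotency_key) DO NOTHING;  -- a repeated enqueue adds no second job

COMMIT;
```

If the transaction rolls back, neither the order nor the job exists, so a worker can never pick up a job for an order that was not committed.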
Use this when
- Your service already uses Postgres and the producer can write the work row in the same transaction as the business change.
- You need durable background work with clear operational state: pending, processing, completed, failed, dead letter.
- You want SQL visibility into backlog depth, stuck rows, retry counts, and oldest pending age.
- Your throughput is in the hundreds to low thousands of jobs per second, or your jobs are slow enough that handler I/O dominates claim overhead.
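The SQL visibility mentioned in the list above might look like the two queries below. They are a sketch that assumes an `inbox` table with `status`, `attempts`, `created_at`, `claimed_by`, and `lease_expires_at` columns. The `attempts` and `created_at` names are assumptions.

```sql
-- Backlog summary per status: row count, highest retry count, oldest row age.
-- Assumes hypothetical columns inbox.status, inbox.attempts, inbox.created_at.
SELECT status,
       count(*)                AS row_count,
       max(attempts)           AS max_attempts,
       now() - min(created_at) AS oldest_age
FROM inbox
GROUP BY status
ORDER BY status;

-- Stuck rows: still marked processing although the lease has expired.
SELECT id, claimed_by, attempts, lease_expires_at
FROM inbox
WHERE status = 'processing'
  AND lease_expires_at < now()
ORDER BY lease_expires_at;
```

The `oldest_age` value in the `pending` row of the first result is the oldest pending age, which is usually a more useful alerting signal than raw backlog depth.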
Where Postgres stops being enough
- Use Kafka, Pulsar, or a managed stream when you need a shared event log, long retention, many independent replay consumers, or platform-level fan-out.
- Use a queue service when cross-service transport is the main problem and same-transaction enqueue is not required.
- Use a workflow engine when workflow history, timers, signals, replay, and cross-service orchestration become shared platform infrastructure rather than service-local control flow.
- Use a Postgres-backed job library such as pg-boss, Graphile Worker, or River when it already covers the durable work behavior you need.
- Use an outbox plus transport when work leaves your database boundary.
The core implementation
| Need | Established pattern | Postgres implementation |
|---|---|---|
| Multiple workers process one backlog | Competing Consumers | inbox rows claimed with SKIP LOCKED |
| Retry without duplicate logical jobs | Idempotent Receiver | idempotency_key UNIQUE plus handler dedupe |
| Recover crashed workers | Lease / timeout ownership | claimed_by, lease_expires_at, lease cleanup SQL |
| Keep related work ordered | Partitioned consumption | partition_key, optional ordering guard, hash ring when needed |
| Publish after a domain write | Transactional Outbox | outbox row in the same transaction as the business update |
| Handle permanent failure | Dead Letter Channel | dead_letter status and support queries |
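The first, third, and last rows of the table can be sketched as two statements: a claim and a lease cleanup. This is an illustration under assumptions rather than the series' final SQL. It assumes an `inbox` table with `id`, `payload`, `status`, `attempts`, `claimed_by`, `lease_expires_at`, and `created_at` columns, and the batch size of 10, the 30-second lease, the worker name, and the limit of 5 attempts are arbitrary choices.

```sql
-- Claim: take up to 10 pending rows for one worker. Rows locked by another
-- worker's in-flight claim are skipped instead of blocking this one.
WITH next_jobs AS (
    SELECT id
    FROM inbox
    WHERE status = 'pending'
    ORDER BY created_at
    LIMIT 10
    FOR UPDATE SKIP LOCKED
)
UPDATE inbox AS i
SET status           = 'processing',
    claimed_by       = 'worker-1',                      -- hypothetical worker id
    lease_expires_at = now() + interval '30 seconds',   -- assumed lease length
    attempts         = i.attempts + 1
FROM next_jobs
WHERE i.id = next_jobs.id
RETURNING i.id, i.payload;

-- Lease cleanup: rows whose worker died mid-handler go back to pending,
-- or to dead_letter once they have used the assumed budget of 5 attempts.
UPDATE inbox
SET status           = CASE WHEN attempts >= 5 THEN 'dead_letter'
                            ELSE 'pending' END,
    claimed_by       = NULL,
    lease_expires_at = NULL
WHERE status = 'processing'
  AND lease_expires_at < now();
```

Because an expired lease returns a row to `pending` even if the original worker is only slow rather than dead, the handler can run more than once, which is why the table pairs leases with idempotent handling.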
How to read the series
Start with the claim loop. Stop when the guarantees match your system. Extend the design only when you hit the matching constraint: crashes need lease cleanup, retries need idempotency, related work needs partition keys, cross-service delivery needs an outbox, and high volume needs claim-path tuning.