Implement a durable work queue in Postgres

You may not need Kafka, Redis, SQS, or a workflow engine for durable work inside one service. If your app already depends on Postgres, store work as rows, claim them with `FOR UPDATE SKIP LOCKED`, recover from crashed workers with leases, and inspect the backlog with SQL.

Build it in this order

Step	Ship when	What you add	Stop if
Durable row + claim loop	You need work to survive deploys	`inbox`, `pending`, `processing`, `SKIP LOCKED`	One worker is enough
Recovery + idempotency	A worker can die mid-handler	Leases, lease cleanup, retry, dead letter, handler dedupe	You can tolerate at-least-once delivery
Multiple workers	One process cannot keep up	Competing consumers, per-key ordering guard when needed	Worker count is stable
Stable ownership	Autoscale or deploy churn moves too much work	Worker heartbeats, hash ring, bucket-filtered claims	Rebalances are rare
Production controls	Long handlers, deploy drains, hot keys, or idle polling show up	Renewal, fencing, drain, housekeeping, key sharding, notify wakeups	The queue has runbooks and tests
Boundaries and transport	Work leaves this database	Transactional outbox, broker/webhook transport, receiver inbox, contracts	You have measurable eventual consistency
Durable workflows	One business process has multiple durable steps	Workflow instances, step rows, durable sleeps, signals, cancellation	Workflow orchestration becomes platform infrastructure

What problem this solves

Checkout for order:9182 succeeds. The API returns 200. The receipt email is supposed to be sent after the order commits, but the worker restarts during a deploy. If the work lived only in memory, it is gone: there is no row to claim, retry, inspect, or dead-letter.

The durable version is standard queue design implemented in Postgres: commit the business change and the work row together, let workers claim rows, recover expired leases, and make side effects safe to run more than once.
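A minimal sketch of the "commit the business change and the work row together" step. It uses Python's built-in `sqlite3` as a stand-in for Postgres so it runs without a server. The `orders` table, the `inbox` columns, the key format, and the helper names are assumptions made for illustration, not the series' actual schema.

```python
import sqlite3

# Hypothetical schema. sqlite3 stands in for Postgres here; the same
# INSERT ... ON CONFLICT DO NOTHING statement is also valid Postgres.
SCHEMA = """
CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT NOT NULL);
CREATE TABLE inbox (
    id INTEGER PRIMARY KEY,
    idempotency_key TEXT NOT NULL UNIQUE,
    payload TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'pending'
);
"""

def open_db():
    # isolation_level=None: the sketch issues BEGIN/COMMIT/ROLLBACK itself.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.executescript(SCHEMA)
    return conn

def enqueue(conn, key, payload):
    # A second enqueue with the same key is a no-op, not a duplicate job.
    conn.execute(
        "INSERT INTO inbox (idempotency_key, payload) VALUES (?, ?) "
        "ON CONFLICT (idempotency_key) DO NOTHING",
        (key, payload),
    )

def checkout(conn, order_id, fail_before_commit=False):
    """Write the order and its receipt job in one transaction."""
    conn.execute("BEGIN")
    try:
        conn.execute(
            "INSERT INTO orders (id, status) VALUES (?, 'paid')", (order_id,)
        )
        enqueue(conn, f"receipt:order:{order_id}", f"send receipt for order:{order_id}")
        if fail_before_commit:
            raise RuntimeError("simulated crash before commit")
        conn.execute("COMMIT")
    except Exception:
        # Both the order and the work row disappear together.
        conn.execute("ROLLBACK")
        raise

def count(conn, table):
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
```

Because both inserts share one transaction, there is never a committed order without its receipt job, and never a receipt job for an order that rolled back.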

Use this when

Your service already uses Postgres and the producer can write the work row in the same transaction as the business change.
You need durable background work with clear operational state: pending, processing, completed, failed, dead letter.
You want SQL visibility into backlog depth, stuck rows, retry counts, and oldest pending age.
Your throughput is in the hundreds to low thousands of jobs per second, or your jobs are slow enough that handler I/O dominates claim overhead.
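As an illustration of the SQL visibility mentioned in the list above, the sketch below groups a small `inbox` table by status and reports depth, oldest row, and highest retry count. The `attempts` and `created_at` columns and the sample rows are assumptions; `sqlite3` again stands in for Postgres, where the same `GROUP BY` query should work against a timestamp column.

```python
import sqlite3

# Hypothetical inbox shape with a few sample rows for the demo.
DEMO_SQL = """
CREATE TABLE inbox (
    id INTEGER PRIMARY KEY,
    status TEXT NOT NULL,
    attempts INTEGER NOT NULL DEFAULT 0,
    created_at TEXT NOT NULL
);
INSERT INTO inbox (status, attempts, created_at) VALUES
    ('pending', 0, '2024-01-01 10:00:00'),
    ('pending', 2, '2024-01-01 10:05:00'),
    ('processing', 1, '2024-01-01 09:59:00'),
    ('dead_letter', 5, '2024-01-01 09:00:00');
"""

# One row per status: backlog depth, oldest row, highest retry count.
BACKLOG_SQL = """
SELECT status,
       COUNT(*)        AS depth,
       MIN(created_at) AS oldest,
       MAX(attempts)   AS max_attempts
FROM inbox
GROUP BY status
ORDER BY status
"""

def make_demo_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript(DEMO_SQL)
    return conn

def backlog(conn):
    return conn.execute(BACKLOG_SQL).fetchall()
```

Calling `backlog(make_demo_db())` returns one tuple per status, which is enough to answer "how deep is the queue and how old is the oldest pending row" without extra tooling.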

Where Postgres stops being enough

Use Kafka, Pulsar, or a managed stream when you need a shared event log, long retention, many independent replay consumers, or platform-level fan-out.
Use a queue service when cross-service transport is the main problem and same-transaction enqueue is not required.
Use a workflow engine when workflow history, timers, signals, replay, and cross-service orchestration become shared platform infrastructure rather than service-local control flow.
Use pg-boss, Graphile Worker, or River when a library covers the durable work behavior you need.
Use an outbox plus transport when work leaves your database boundary.

The core implementation

Need	Established pattern	Postgres implementation
Multiple workers process one backlog	Competing Consumers	`inbox` rows claimed with `SKIP LOCKED`
Retry without duplicate logical jobs	Idempotent Receiver	`idempotency_key UNIQUE` plus handler dedupe
Recover crashed workers	Lease / timeout ownership	`claimed_by`, `lease_expires_at`, lease cleanup SQL
Keep related work ordered	Partitioned consumption	`partition_key`, optional ordering guard, hash ring when needed
Publish after a domain write	Transactional Outbox	`outbox` row in the same transaction as the business update
Handle permanent failure	Dead Letter Channel	`dead_letter` status and support queries
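The claim and lease rows in the table above can be modeled in a few lines. The sketch below is an in-memory model of the ownership rules only, not a Postgres client: `Job`, `claim`, and `release_expired` are hypothetical names, and the 30-second lease is an arbitrary choice. `CLAIM_SQL` shows one common shape of the matching Postgres statement for reference; it is not executed here, and the `%(worker)s` placeholder assumes a psycopg-style driver.

```python
from dataclasses import dataclass
from typing import Optional

# Reference only, not executed in this sketch. SKIP LOCKED lets concurrent
# workers pass over rows another transaction has already locked.
CLAIM_SQL = """
UPDATE inbox
SET status = 'processing',
    claimed_by = %(worker)s,
    lease_expires_at = now() + interval '30 seconds'
WHERE id IN (
    SELECT id FROM inbox
    WHERE status = 'pending'
    ORDER BY id
    LIMIT 10
    FOR UPDATE SKIP LOCKED
)
RETURNING id
"""

@dataclass
class Job:
    id: int
    status: str = "pending"
    claimed_by: Optional[str] = None
    lease_expires_at: Optional[float] = None

def claim(jobs, worker, now, lease_seconds=30.0, limit=10):
    """Mark up to `limit` pending jobs as processing, owned by `worker`."""
    claimed = []
    for job in jobs:
        if len(claimed) == limit:
            break
        if job.status == "pending":
            job.status = "processing"
            job.claimed_by = worker
            job.lease_expires_at = now + lease_seconds
            claimed.append(job.id)
    return claimed

def release_expired(jobs, now):
    """Lease cleanup: return jobs whose lease has lapsed to pending."""
    released = []
    for job in jobs:
        if job.status == "processing" and job.lease_expires_at <= now:
            job.status = "pending"
            job.claimed_by = None
            job.lease_expires_at = None
            released.append(job.id)
    return released
```

If worker `w1` claims jobs and then dies, nothing happens until the lease lapses; after that, `release_expired` makes the jobs claimable again, which is why handlers need to tolerate running more than once.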

How to read the series

Start with the claim loop. Stop when the guarantees match your system. Extend the design only when you hit the matching constraint: crashes need lease cleanup, retries need idempotency, related work needs partition keys, cross-service delivery needs an outbox, and high volume needs claim-path tuning.