Durable Work in Postgres, Part 0

Implement a durable work queue in Postgres

How to implement common queue and messaging patterns in Postgres: competing consumers, transactional outbox, idempotent handlers, leases, and dead letters.

You may not need Kafka, Redis, SQS, or a workflow engine for durable work inside one service. If your app already depends on Postgres, store work as rows, claim it with SKIP LOCKED, recover crashed workers with leases, and inspect the backlog with SQL.
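
To make "store work as rows" concrete, here is a minimal schema sketch. The names inbox, idempotency_key, partition_key, claimed_by, and lease_expires_at come from the pattern table later in this article; payload, attempts, created_at, the index name, and the exact status list are assumptions made for illustration, not a prescribed layout.

```sql
-- Hypothetical work table. Column names beyond those used in the article
-- (payload, attempts, created_at) are illustrative assumptions.
CREATE TABLE inbox (
    id               bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    idempotency_key  text        NOT NULL UNIQUE,  -- one row per logical job
    partition_key    text,                         -- e.g. 'order:9182'
    payload          jsonb       NOT NULL,
    status           text        NOT NULL DEFAULT 'pending'
        CHECK (status IN ('pending', 'processing', 'completed',
                          'failed', 'dead_letter')),
    attempts         integer     NOT NULL DEFAULT 0,
    claimed_by       text,                         -- worker that holds the lease
    lease_expires_at timestamptz,                  -- NULL while unclaimed
    created_at       timestamptz NOT NULL DEFAULT now()
);

-- Partial index so claim queries only look at claimable rows.
CREATE INDEX inbox_pending_idx ON inbox (id) WHERE status = 'pending';
```

The partial index is a design choice rather than a requirement: it keeps the claim path small even when the table holds many completed rows.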

Build it in this order

| Step | Ship when | What you add | Stop if |
| --- | --- | --- | --- |
| Durable row + claim loop | You need work to survive deploys | inbox, pending, processing, SKIP LOCKED | One worker is enough |
| Recovery + idempotency | A worker can die mid-handler | Leases, lease cleanup, retry, dead letter, handler dedupe | You can tolerate at-least-once delivery |
| Multiple workers | One process cannot keep up | Competing consumers, per-key ordering guard when needed | Worker count is stable |
| Stable ownership | Autoscale or deploy churn moves too much work | Worker heartbeats, hash ring, bucket-filtered claims | Rebalances are rare |
| Production controls | Long handlers, deploy drains, hot keys, idle polling show up | Renewal, fencing, drain, housekeeping, key sharding, notify wakeups | The queue has runbooks and tests |
| Boundaries and transport | Work leaves this database | Transactional outbox, broker/webhook transport, receiver inbox, contracts | You have measurable eventual consistency |
| Durable workflows | One business process has multiple durable steps | Workflow instances, step rows, durable sleeps, signals, cancellation | Workflow orchestration becomes platform infrastructure |

What problem this solves

Checkout for order:9182 succeeds. The API returns 200. The receipt email is supposed to go out after the order commits, but the worker restarts during a deploy. If the work lived only in memory, the email is lost: there is no row to claim, retry, inspect, or dead-letter.

The durable version is standard queue design implemented in Postgres: commit the business change and the work row together, let workers claim rows, recover expired leases, and make side effects safe to run more than once.
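
As a sketch of "commit the business change and the work row together", the transaction below updates an order and enqueues the receipt email atomically. The orders table, its status column, the payload shape, and the idempotency key format are hypothetical; the inbox table is assumed to have idempotency_key (UNIQUE), partition_key, and a jsonb payload column.

```sql
BEGIN;

-- Business change (hypothetical orders table and column).
UPDATE orders
SET status = 'paid'
WHERE id = 9182;

-- Work row written in the same transaction: either both commit or neither does.
-- ON CONFLICT makes a retried request a no-op instead of a duplicate job.
INSERT INTO inbox (idempotency_key, partition_key, payload)
VALUES ('receipt-email:order:9182',
        'order:9182',
        '{"type": "send_receipt", "order_id": 9182}'::jsonb)
ON CONFLICT (idempotency_key) DO NOTHING;

COMMIT;
```

If the transaction rolls back, no work row exists, so a worker can never send a receipt for an order that was not committed.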

Use this when

  • Your service already uses Postgres and the producer can write the work row in the same transaction as the business change.
  • You need durable background work with clear operational state: pending, processing, completed, failed, dead letter.
  • You want SQL visibility into backlog depth, stuck rows, retry counts, and oldest pending age.
  • Your throughput is in the hundreds to low thousands of jobs per second, or your jobs are slow enough that handler I/O dominates claim overhead.
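
The SQL visibility mentioned above might look like the following queries. They assume a hypothetical inbox table with status, attempts, created_at, claimed_by, and lease_expires_at columns; the attempts and created_at names are assumptions, not names taken from this article.

```sql
-- Backlog depth, worst retry count, and oldest row age per status.
SELECT status,
       count(*)                AS jobs,
       max(attempts)           AS max_attempts,
       now() - min(created_at) AS oldest_age
FROM inbox
GROUP BY status
ORDER BY status;

-- Stuck rows: still marked processing although the lease has expired.
SELECT id, claimed_by, attempts, lease_expires_at
FROM inbox
WHERE status = 'processing'
  AND lease_expires_at < now()
ORDER BY lease_expires_at;
```

The oldest_age value for the pending row is usually the most useful alerting signal, because it grows whenever workers fall behind regardless of how deep the backlog is.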

Where Postgres stops being enough

  • Use Kafka, Pulsar, or a managed stream when you need a shared event log, long retention, many independent replay consumers, or platform-level fan-out.
  • Use a queue service when cross-service transport is the main problem and same-transaction enqueue is not required.
  • Use a workflow engine when workflow history, timers, signals, replay, and cross-service orchestration become shared platform infrastructure rather than service-local control flow.
  • Use pg-boss, Graphile Worker, or River when an existing Postgres-backed library already covers the durable work behavior you need.
  • Use an outbox plus transport when work leaves your database boundary.

The core implementation

| Need | Established pattern | Postgres implementation |
| --- | --- | --- |
| Multiple workers process one backlog | Competing Consumers | inbox rows claimed with SKIP LOCKED |
| Retry without duplicate logical jobs | Idempotent Receiver | idempotency_key UNIQUE plus handler dedupe |
| Recover crashed workers | Lease / timeout ownership | claimed_by, lease_expires_at, lease cleanup SQL |
| Keep related work ordered | Partitioned consumption | partition_key, optional ordering guard, hash ring when needed |
| Publish after a domain write | Transactional Outbox | outbox row in the same transaction as the business update |
| Handle permanent failure | Dead Letter Channel | dead_letter status and support queries |
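
The lease cleanup and dead-letter rows above can be combined in one periodic statement. This is a sketch under assumptions: the attempts column and the limit of 5 attempts are hypothetical, and the statement assumes attempts is incremented when a row is claimed.

```sql
-- Run periodically (for example from a housekeeping worker).
-- Rows whose lease expired go back to pending so another worker can retry them;
-- rows that have used up their attempts (5 is an assumed limit) are dead-lettered.
UPDATE inbox
SET status = CASE WHEN attempts >= 5 THEN 'dead_letter'
                  ELSE 'pending'
             END,
    claimed_by = NULL,
    lease_expires_at = NULL
WHERE status = 'processing'
  AND lease_expires_at < now()
RETURNING id, status, attempts;
```

Because an expired lease does not prove the original worker stopped, the same job can run twice. That is why the table pairs lease recovery with handler dedupe.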

How to read the series

Start with the claim loop. Stop when the guarantees match your system. Extend the design only when you hit the matching constraint: crashes need lease cleanup, retries need idempotency, related work needs partition keys, cross-service delivery needs an outbox, and high volume needs claim-path tuning.
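
A minimal sketch of that claim loop, written as prepared statements so it can be tried in psql. It assumes a hypothetical inbox table with id, payload, status, attempts, claimed_by, and lease_expires_at columns; the statement names, the 30-second lease, and the worker name 'worker-1' are illustrative.

```sql
-- Claim one pending row. SKIP LOCKED lets concurrent workers pass over rows
-- another transaction has already locked instead of waiting on them.
PREPARE claim_one(text) AS
WITH next_job AS (
    SELECT id
    FROM inbox
    WHERE status = 'pending'
    ORDER BY id
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
UPDATE inbox
SET status = 'processing',
    claimed_by = $1,
    lease_expires_at = now() + interval '30 seconds',  -- assumed lease length
    attempts = attempts + 1
FROM next_job
WHERE inbox.id = next_job.id
RETURNING inbox.id, inbox.payload;

-- Mark a job completed only if this worker still owns it.
PREPARE complete_job(bigint, text) AS
UPDATE inbox
SET status = 'completed',
    claimed_by = NULL,
    lease_expires_at = NULL
WHERE id = $1
  AND claimed_by = $2
  AND status = 'processing';

-- Usage: claim as 'worker-1', run the handler, then complete the returned id.
EXECUTE claim_one('worker-1');
-- EXECUTE complete_job(<id returned above>, 'worker-1');
```

When complete_job updates zero rows, the lease was lost while the handler ran, and the worker should treat its result as possibly duplicated rather than assume success.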