Durable Work in PostgresPart 1

The claim loop

Minimal Postgres claim loop: enqueue, SKIP LOCKED claim, complete, and handler retry with an interactive simulation.

A customer pays for order:9182. Your API updates the order row and must send a receipt. That second step cannot live only in a setTimeout or a fire-and-forget HTTP call. If the process dies, the work vanishes.

The durable version is to write a row that means “send receipt for order:9182.” A background worker claims it, runs your handler, and marks it done. Postgres holds the backlog, so crashes and deploys become recoverable events instead of silent data loss.

Why a table beats a thread pool

This is a durable work queue. It starts like a standard background job queue: persist a unit of work, let competing consumers claim it, deliver at least once, and make handlers safe to run twice. The same core loop also supports outbox relays, projections, webhook processing, and saga steps.

Here, Postgres is the queue and coordination layer. The enqueue happens in the same ACID commit as your business write. Claims use row locks instead of a broker dispatcher. The backlog is inspectable with SQL when work appears stalled.

Postgres implementationEstablished nameTypical source
Queue table (inbox, outbox)Message ChannelEnterprise Integration Patterns
Many workers, SKIP LOCKED claimCompeting ConsumersEnterprise Integration Patterns
partition_key plus hash ringPartitioned consumptionEIP plus stream systems
Lease plus claimed_byExclusive work ownershipDistributed systems
idempotency_key, safe retriesIdempotent ReceiverEnterprise Integration Patterns
dead_letterDead Letter ChannelEnterprise Integration Patterns
Outbox in the same transaction as domain writeTransactional OutboxMicroservices patterns

What production asks of any queue

Before choosing a queue, write down the failures you must survive: a deploy loses jobs, a customer gets two emails, or a row sits in processing for an hour. The implementation needs an answer for each case.

  • Accepted work must run: if the API returned success, the email, webhook, or ledger entry cannot disappear because a worker restarted mid-deploy.
  • Related work stays together: everything for order:9182 should land on one worker so you can process that stream in order when ordering matters.
  • Scaling must not reshuffle everything: adding a worker should move roughly 1/N of open rows, not every in-flight item.
  • Retries must not duplicate side effects: workers crash, leases expire, and webhooks retry. Handlers must tolerate running twice.
  • Stuck and poison rows are handled: a crashed worker cannot hold rows in processing forever.

Minimal claim loop

Start with one table, one claim query, and one worker process in a loop. For order:9182, the row means send receipt after checkout.

Minimal inbox table
CREATE TABLE inbox (
  id               UUID PRIMARY KEY DEFAULT uuidv7(),
  partition_key    TEXT NOT NULL,
  partition_bucket INT NOT NULL
                   CHECK (partition_bucket >= 0 AND partition_bucket < 1024),
  payload          JSONB NOT NULL,
  status           TEXT NOT NULL DEFAULT 'pending',
  idempotency_key  TEXT,
  claimed_by       TEXT,
  lease_expires_at TIMESTAMPTZ,
  lease_generation BIGINT NOT NULL DEFAULT 0,
  attempts         INT NOT NULL DEFAULT 0,
  max_attempts     INT NOT NULL DEFAULT 5,
  available_at     TIMESTAMPTZ NOT NULL DEFAULT now(),
  created_at       TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE UNIQUE INDEX idx_inbox_idempotency
  ON inbox (idempotency_key) WHERE idempotency_key IS NOT NULL;

Enqueue: producer writes pending

Your API inserts a row in the same transaction as the business write when you can. For order:9182, the payload tells the worker what to run later.

INSERT INTO inbox (partition_key, partition_bucket, payload, idempotency_key)
VALUES (
  'order:9182',
  417,
  '{"type":"send_receipt","order_id":9182}'::jsonb,
  'receipt-9182-v1'
);

Claim: worker takes ownership

Claiming sets status = 'processing', stamps claimed_by, and sets lease_expires_at for crash recovery. Each claim bumps lease_generation so only the current owner can call complete.

WITH picked AS (
  SELECT id FROM inbox
  WHERE status = 'pending' AND available_at <= now()
  ORDER BY created_at, id
  LIMIT 25
  FOR UPDATE SKIP LOCKED
)
UPDATE inbox i
SET status = 'processing',
    claimed_by = $1,
    lease_expires_at = now() + interval '90 seconds',
    lease_generation = lease_generation + 1,
    attempts = attempts + 1
FROM picked WHERE i.id = picked.id
RETURNING i.*;

Complete or retry on handler outcome

UPDATE inbox
SET status = 'completed'
WHERE id = $1
  AND claimed_by = $2
  AND lease_generation = $3
  AND lease_expires_at > now()
  AND status = 'processing';

UPDATE inbox
SET
  status = 'pending',
  claimed_by = NULL,
  lease_expires_at = NULL,
  available_at = now() + (LEAST(POWER(2, attempts)::int, 3600) * interval '1 second')
WHERE id = $1
  AND claimed_by = $2
  AND lease_generation = $3
  AND status = 'processing';

The worker id plus lease_generation are a fence token. If rowCount is zero, this worker lost ownership and must not assume the side effect is complete.

Wire the loop in your language

db is a small inbox repository: each method runs one of the SQL snippets above against a Postgres connection pool.

const WORKER_ID = `${hostname}-${process.pid}`;

async function run() {
  while (true) {
    const batch = await db.claim(WORKER_ID);
    if (batch.length === 0) {
      await sleep(500);
      continue;
    }

    for (const row of batch) {
      try {
        await handle(row);
        await db.complete(row.id, WORKER_ID, row.lease_generation);
      } catch (err) {
        await db.failWithRetry(row.id, WORKER_ID, row.lease_generation, err);
      }
    }
  }
}

Simulation: one worker, one receipt

Where this breaks down

  • Hot polling: a 50ms idle loop on an empty queue will waste CPU and connection time. Back off or use LISTEN.
  • Process death without lease cleanup: rows stuck in processing need leases and a sweeper that returns expired claims to pending.
  • Exactly-once claims: this is at-least-once with guards. Do not promise finance-grade exactly-once without idempotent downstream stores.