Durable Work in PostgresPart 2

Crash recovery

Leases, lease cleanup, at-least-once delivery, and handler idempotency for Postgres inbox workers.

During a deploy, the worker processing send_receipt for order:9182 gets SIGKILL mid-SMTP. The row sits in processing until cleanup logic notices the expired lease and returns it to the queue.

Lease cleanup is that recovery step: a small SQL update that finds expired leases and moves rows back to pending. The same worker, or a replacement process, claims the row again. The handler may run twice, so idempotency is not optional.

What a lease buys you

When a worker claims a row, it sets status = 'processing', stamps claimed_by, and sets lease_expires_at in the future. That deadline is the crash-recovery window.

Each claim also bumps lease_generation. If a stale process resumes after lease cleanup reclaimed the row, its complete should affect zero rows. Downstream stores can reject stale fence tokens too.

Lease fields and handler idempotency
WITH picked AS (
  SELECT id
  FROM inbox
  WHERE status = 'pending'
    AND available_at <= now()
  ORDER BY created_at, id
  LIMIT 25
  FOR UPDATE SKIP LOCKED
)
UPDATE inbox i
SET
  status = 'processing',
  claimed_by = $1,
  lease_expires_at = now() + interval '90 seconds',
  lease_generation = i.lease_generation + 1,
  attempts = i.attempts + 1
FROM picked
WHERE i.id = picked.id
RETURNING i.*;

When the worker dies: lease cleanup

If a process crashes mid-handler, the row stays processing with a stale claimed_by. Without lease cleanup, that receipt never sends. Manual SQL during an incident is a fallback, not the recovery design.

The cleanup query runs at the top of every worker loop. It finds rows whose lease has expired and puts them back on the queue:

  • status back to pending
  • claimed_by and lease_expires_at cleared
  • attempts left as-is because it already incremented at claim time

Another claim cycle picks the row up. Delivery is at-least-once: the handler may run twice, so send_receipt must check “already sent” before emailing.

UPDATE inbox
SET
  status = 'pending',
  claimed_by = NULL,
  lease_expires_at = NULL,
  available_at = now()
WHERE status = 'processing'
  AND lease_expires_at < now()
  AND attempts < max_attempts;

UPDATE inbox
SET status = 'dead_letter'
WHERE status = 'processing'
  AND lease_expires_at < now()
  AND attempts >= max_attempts;

With one worker, the same while (true) loop calls sweepExpiredLeases() before each claim. With many workers, each process may attempt housekeeping, but only one should run cleanup per interval. Use a Postgres advisory lock so workers elect one housekeeper without a separate scheduler.

async function run() {
  while (true) {
    await db.sweepExpiredLeases();
    const batch = await db.claim(WORKER_ID);

    for (const row of batch) {
      await handleRow(row);
    }

    if (batch.length === 0) {
      await sleep(500);
    }
  }
}

Two kinds of idempotency

Producer idempotency and handler idempotency solve different failures.

LayerGuardFailure it handles
Enqueueidempotency_key UNIQUE on insertThe API retries after a timeout and would otherwise create duplicate rows
HandlerDurable “already sent” markerLease cleanup reclaims a row and the same side effect could run twice

Simulation: crash and dedupe

Where this breaks down

  • Missing handler idempotency: lease cleanup can duplicate side effects without an “already sent” guard.
  • Long handlers without lease renewal: a 10-minute job on a 90-second lease loses ownership mid-flight unless you renew.
  • Exactly-once claims: this is at-least-once with guards, not finance-grade exactly-once.