Crash recovery

During a deploy, the worker processing send_receipt for order:9182 gets SIGKILL mid-SMTP. The row sits in processing until cleanup logic notices the expired lease and returns it to the queue.

Lease cleanup is that recovery step: a small SQL update that finds expired leases and moves rows back to pending. The same worker, or a replacement process, claims the row again. The handler may run twice, so idempotency is not optional.

What a lease buys you

When a worker claims a row, it sets status = 'processing', stamps claimed_by, and sets lease_expires_at in the future. That deadline is the crash-recovery window.

Each claim also bumps lease_generation. If a stale process resumes after lease cleanup reclaimed the row, its complete should affect zero rows. Downstream stores can reject stale fence tokens too.

WITH picked AS (
  SELECT id
  FROM inbox
  WHERE status = 'pending'
    AND available_at <= now()
  ORDER BY created_at, id
  LIMIT 25
  FOR UPDATE SKIP LOCKED
)
UPDATE inbox i
SET
  status = 'processing',
  claimed_by = $1,
  lease_expires_at = now() + interval '90 seconds',
  lease_generation = i.lease_generation + 1,
  attempts = i.attempts + 1
FROM picked
WHERE i.id = picked.id
RETURNING i.*;

When the worker dies: lease cleanup

If a process crashes mid-handler, the row stays processing with a stale claimed_by. Without lease cleanup, that receipt never sends. Manual SQL during an incident is a fallback, not the recovery design.

The cleanup query runs at the top of every worker loop. It finds rows whose lease has expired and puts them back on the queue:

status back to pending
claimed_by and lease_expires_at cleared
attempts left as-is because it already incremented at claim time

Another claim cycle picks the row up. Delivery is at-least-once: the handler may run twice, so send_receipt must check “already sent” before emailing.

UPDATE inbox
SET
  status = 'pending',
  claimed_by = NULL,
  lease_expires_at = NULL,
  available_at = now()
WHERE status = 'processing'
  AND lease_expires_at < now()
  AND attempts < max_attempts;

UPDATE inbox
SET status = 'dead_letter'
WHERE status = 'processing'
  AND lease_expires_at < now()
  AND attempts >= max_attempts;

With one worker, the same while (true) loop calls sweepExpiredLeases() before each claim. With many workers, each process may attempt housekeeping, but only one should run cleanup per interval. Use a Postgres advisory lock so workers elect one housekeeper without a separate scheduler.

async function run() {
  while (true) {
    await db.sweepExpiredLeases();
    const batch = await db.claim(WORKER_ID);

    for (const row of batch) {
      await handleRow(row);
    }

    if (batch.length === 0) {
      await sleep(500);
    }
  }
}

Two kinds of idempotency

Producer idempotency and handler idempotency solve different failures.

Layer	Guard	Failure it handles
Enqueue	`idempotency_key UNIQUE` on insert	The API retries after a timeout and would otherwise create duplicate rows
Handler	Durable “already sent” marker	Lease cleanup reclaims a row and the same side effect could run twice

Simulation: crash and dedupe

Where this breaks down

Missing handler idempotency: lease cleanup can duplicate side effects without an “already sent” guard.
Long handlers without lease renewal: a 10-minute job on a 90-second lease loses ownership mid-flight unless you renew.
Exactly-once claims: this is at-least-once with guards, not finance-grade exactly-once.

Source

Use the article for explanation, then use these files when you want the complete SQL and TypeScript in one place.