Durable Work in PostgresPart 2
Crash recovery
Leases, lease cleanup, at-least-once delivery, and handler idempotency for Postgres inbox workers.
During a deploy, the worker processing send_receipt for order:9182 gets SIGKILL mid-SMTP. The row sits in processing until cleanup logic notices the expired lease and returns it to the queue.
Lease cleanup is that recovery step: a small SQL update that finds expired leases and moves rows back to pending. The same worker, or a replacement process, claims the row again. The handler may run twice, so idempotency is not optional.
What a lease buys you
When a worker claims a row, it sets status = 'processing', stamps claimed_by, and sets lease_expires_at in the future. That deadline is the crash-recovery window.
Each claim also bumps lease_generation. If a stale process resumes after lease cleanup reclaimed the row, its complete should affect zero rows. Downstream stores can reject stale fence tokens too.
WITH picked AS (
SELECT id
FROM inbox
WHERE status = 'pending'
AND available_at <= now()
ORDER BY created_at, id
LIMIT 25
FOR UPDATE SKIP LOCKED
)
UPDATE inbox i
SET
status = 'processing',
claimed_by = $1,
lease_expires_at = now() + interval '90 seconds',
lease_generation = i.lease_generation + 1,
attempts = i.attempts + 1
FROM picked
WHERE i.id = picked.id
RETURNING i.*;
When the worker dies: lease cleanup
If a process crashes mid-handler, the row stays processing with a stale claimed_by. Without lease cleanup, that receipt never sends. Manual SQL during an incident is a fallback, not the recovery design.
The cleanup query runs at the top of every worker loop. It finds rows whose lease has expired and puts them back on the queue:
statusback topendingclaimed_byandlease_expires_atclearedattemptsleft as-is because it already incremented at claim time
Another claim cycle picks the row up. Delivery is at-least-once: the handler may run twice, so send_receipt must check “already sent” before emailing.
UPDATE inbox
SET
status = 'pending',
claimed_by = NULL,
lease_expires_at = NULL,
available_at = now()
WHERE status = 'processing'
AND lease_expires_at < now()
AND attempts < max_attempts;
UPDATE inbox
SET status = 'dead_letter'
WHERE status = 'processing'
AND lease_expires_at < now()
AND attempts >= max_attempts;
With one worker, the same while (true) loop calls sweepExpiredLeases() before each claim. With many workers, each process may attempt housekeeping, but only one should run cleanup per interval. Use a Postgres advisory lock so workers elect one housekeeper without a separate scheduler.
async function run() {
while (true) {
await db.sweepExpiredLeases();
const batch = await db.claim(WORKER_ID);
for (const row of batch) {
await handleRow(row);
}
if (batch.length === 0) {
await sleep(500);
}
}
}
Two kinds of idempotency
Producer idempotency and handler idempotency solve different failures.
| Layer | Guard | Failure it handles |
|---|---|---|
| Enqueue | idempotency_key UNIQUE on insert | The API retries after a timeout and would otherwise create duplicate rows |
| Handler | Durable “already sent” marker | Lease cleanup reclaims a row and the same side effect could run twice |
Where this breaks down
- Missing handler idempotency: lease cleanup can duplicate side effects without an “already sent” guard.
- Long handlers without lease renewal: a 10-minute job on a 90-second lease loses ownership mid-flight unless you renew.
- Exactly-once claims: this is at-least-once with guards, not finance-grade exactly-once.
Source
Use the article for explanation, then use these files when you want the complete SQL and TypeScript in one place.