Production controls

Production breaks the assumptions from the first four parts: handlers run longer, deploys interrupt workers, and one partition can dominate the queue.

This chapter covers the controls you add when real traffic exposes specific failures: renew leases for long jobs, fence stale workers, drain cleanly on deploy, run housekeeping, shard hot keys, and wake idle workers without relying only on polling.

Add controls when the failure appears

The claim loop remains the core. Do not add every control on day one. Add the one that matches the failure you have or the failure you are about to create with deploys, longer handlers, more workers, or higher traffic.

Production symptom	Add this control	Why it works
Handlers can run longer than the lease	Lease renewal	The current owner extends its claim while work is still active.
A stale worker might keep writing after it lost ownership	Fencing token	Each claim gets a newer generation, and downstream writes reject old generations.
Deploys create a wave of expired leases	Graceful drain	Workers stop claiming new rows before the process exits.
Expired leases and dead workers need cleanup	Worker-coordinated housekeeping	One worker wins an advisory lock and runs the same recovery SQL.
One tenant or entity dominates the queue	Shard that partition key	Split one hot stream into explicit substreams when global ordering is not required.
Idle workers poll too often	`LISTEN/NOTIFY` plus polling fallback	Notifications wake workers quickly; polling remains the reliable baseline.

Keep long handlers from losing ownership

The lease duration is a failure-detection window, not a maximum job runtime. If a handler can outlive the lease, renew ownership while it runs. Otherwise lease cleanup can return the row to pending while the original worker is still doing the side effect.

-- Call every 30s for long-running jobs (must be < lease duration)
UPDATE inbox
SET lease_expires_at = now() + interval '90 seconds'
WHERE id = $1
  AND claimed_by = $2
  AND status = 'processing'
  AND lease_expires_at > now();  -- only if we still hold valid lease

Make row ownership explicit

While a row is processing, only the current owner should renew or complete it. Postgres row locks and guarded UPDATE … WHERE clauses enforce that rule. The only risky period is the overlap after a lease expires and another worker claims the row.

Actor	When	Can it touch the row?	Guard
Worker claiming	`status = pending`	One winner per row	`FOR UPDATE SKIP LOCKED` in claim CTE
Second worker claiming	Already `processing`	No. Not in candidate set	`WHERE status = 'pending'`
Owner renewing lease	Still holds valid lease	Yes, extends `lease_expires_at`	`claimed_by = $self AND status = 'processing' AND lease_expires_at > now()`
Owner completing	Finished work	Yes	`id = $1 AND claimed_by = $self`
Lease cleanup	Lease expired	Yes, resets to `pending`	`lease_expires_at < now()`
Another worker after cleanup	Row back to `pending`	Yes, new claim, new `lease_generation`	Normal claim path

Renew before the lease dies

If processing takes longer than the lease, for example a 90-second lease around a two-minute SMTP call, you must renew periodically. A practical default is every 30 seconds for a 90-second lease, or roughly half the lease duration.

async function processWithLease(row, workerId) {
  const renewEveryMs = 30_000;
  const timer = setInterval(async () => {
    const { rowCount } = await db.renewLease(row.id, workerId);
    if (rowCount === 0) {
      clearInterval(timer);
      throw new LeaseLostError("Another worker owns this row: abort");
    }
  }, renewEveryMs);

  try {
    await processPayload(row.payload, row.lease_generation);
    await db.complete(row.id, workerId);
  } finally {
    clearInterval(timer);
  }
}

Stop stale workers from committing

ALTER TABLE inbox ADD COLUMN lease_generation BIGINT NOT NULL DEFAULT 0;

-- On claim: bump generation
WITH picked AS (
  SELECT id
  FROM inbox
  WHERE status = 'pending'
    AND available_at <= now()
  ORDER BY created_at, id
  LIMIT 25
  FOR UPDATE SKIP LOCKED
)
UPDATE inbox i
SET
  status = 'processing',
  claimed_by = $worker_id,
  lease_expires_at = now() + interval '90 seconds',
  lease_generation = i.lease_generation + 1,
  attempts = i.attempts + 1
FROM picked
WHERE i.id = picked.id
RETURNING i.*, i.lease_generation AS fence_token;

-- Downstream write must include fence_token; reject if stored token > incoming

Drain workers during deploys

A deploy should follow a different path than a crash. On SIGTERM, the worker should leave the live ring, stop claiming new rows, and finish or intentionally release the rows it already owns. That keeps a rolling deploy from creating a wave of expired leases.

UPDATE workers SET status = 'draining' WHERE id = $worker_id;

-- Worker loop: skip claiming when draining; exit when in-flight = 0
SELECT count(*) FROM inbox
WHERE claimed_by = $worker_id AND status = 'processing';

Run housekeeping without a separate scheduler

You do not need pg_cron, a sidecar, or a separate scheduler. Every worker already polls Postgres, so piggyback two cheap maintenance tasks on that loop and use an advisory lock so only one worker runs them per cycle.

Lease cleanup: Reset processing rows past lease_expires_at back to pending (or dead_letter if max attempts exceeded).

Stale worker sweep: Mark workers dead when last_seen_at exceeds TTL so the ring rebuilds without them.

const HOUSEKEEPING_LOCK = 847291;
let lastHousekeeping = 0;
const HOUSEKEEPING_MS = 30_000;

async function maybeRunHousekeeping(db) {
  if (Date.now() - lastHousekeeping < HOUSEKEEPING_MS) return;
  lastHousekeeping = Date.now();

  try {
    await db.query("BEGIN");
    const { rows } = await db.query("SELECT pg_try_advisory_xact_lock($1) AS acquired", [
      HOUSEKEEPING_LOCK,
    ]);
    if (!rows[0].acquired) {
      await db.query("COMMIT");
      return; // another worker is housekeeping
    }
    await db.query(`UPDATE workers SET status = 'dead'
      WHERE status IN ('alive','draining')
        AND last_seen_at < now() - interval '30 seconds'`);
    await db.query(`UPDATE inbox
      SET status = 'pending',
          claimed_by = NULL,
          claimed_at = NULL,
          lease_expires_at = NULL,
          available_at = now() + (LEAST(POWER(2, attempts)::int, 3600) * interval '1 second')
      WHERE status = 'processing'
        AND lease_expires_at < now()
        AND attempts < max_attempts`);
    await db.query(`UPDATE inbox
      SET status = 'dead_letter',
          last_error = COALESCE(last_error, 'max attempts during lease cleanup')
      WHERE status = 'processing'
        AND lease_expires_at < now()
        AND attempts >= max_attempts`);
    await db.query("COMMIT");
  } catch (e) {
    await db.query("ROLLBACK");
    throw e;
  }
}

Wake idle workers without trusting notifications for durability

Polling is the reliable baseline. LISTEN/NOTIFY is a latency optimization: it wakes idle workers quickly after enqueue, but workers should still poll as a fallback because notifications are not durable storage.

CREATE OR REPLACE FUNCTION notify_inbox_insert() RETURNS TRIGGER AS $$
BEGIN
  PERFORM pg_notify('inbox_new', NEW.partition_key);
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER inbox_notify
  AFTER INSERT ON inbox
  FOR EACH ROW EXECUTE FUNCTION notify_inbox_insert();

Shard hot keys when one stream dominates

When one partition_key dominates volume, for example a single large tenant, shard that key explicitly: tenant:123#shard-0 … tenant:123#shard-7. Producers choose the shard. Ordering still holds within each shard, but not across all shards for that tenant.

Source

Use the article for explanation, then use these files when you want the complete SQL and TypeScript in one place.