Retries & Dead-Letter Queue

Job failures are inevitable. Spooled provides automatic retries with exponential backoff and a dead-letter queue for jobs that can't be processed.

How Retries Work

When a job fails, Spooled automatically schedules a retry with exponential backoff. This prevents overwhelming downstream services and gives transient failures time to resolve.

stateDiagram-v2
    [*] --> pending
    pending --> claimed: Worker claims
    claimed --> completed: Success
    claimed --> failed: Error
    failed --> pending: Retry (attempts left)
    failed --> dlq: Max retries exceeded
    dlq --> pending: Manual replay
    completed --> [*]

Retry Flow

  1. Worker claims a job and attempts to process it
  2. Processing fails (exception, timeout, or explicit failure)
  3. Spooled checks if retry attempts remain
  4. If retries remain: job is scheduled for later with backoff delay
  5. If max retries exceeded: job moves to dead-letter queue (DLQ)
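The branch in steps 3–5 can be sketched as a small routing function (the names here are illustrative, not the Spooled API — the service performs this check server-side):

```shell
# Decide what happens to a failed job (sketch of steps 3-5 above).
route_failed_job() {
  local attempts=$1 max_retries=$2
  if (( attempts < max_retries )); then
    echo "retry"   # retries remain: reschedule with backoff delay
  else
    echo "dlq"     # max retries exceeded: move to dead-letter queue
  fi
}

route_failed_job 1 3   # prints "retry"
route_failed_job 3 3   # prints "dlq"
```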

Backoff Strategies

By default, Spooled uses exponential backoff with jitter. This spreads out retries and prevents thundering herd problems.

Default Exponential Backoff

With default settings (base=1s, max=1h):

| Attempt | Delay | Cumulative Time |
|---------|-------|-----------------|
| 1       | 1s    | 1s              |
| 2       | 2s    | 3s              |
| 3       | 4s    | 7s              |
| 4       | 8s    | 15s             |
| 5       | 16s   | 31s             |
| 6       | 32s   | ~1 min          |
| 7       | 64s   | ~2 min          |
| 8       | 128s  | ~4 min          |
| 9       | 256s  | ~8 min          |
| 10      | 512s  | ~17 min         |
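The schedule above can be reproduced locally. This sketch assumes base=1s, cap=3600s, and "full jitter" (a uniformly random delay between 0 and the computed value) — the exact jitter formula Spooled applies may differ:

```shell
# Compute the exponential backoff delay for a given attempt (base=1s, cap=3600s).
backoff_delay() {
  local attempt=$1 base=1 cap=3600
  local delay=$(( base * (1 << (attempt - 1)) ))
  (( delay > cap )) && delay=$cap
  echo "$delay"
}

cumulative=0
for attempt in 1 2 3 4 5; do
  delay=$(backoff_delay "$attempt")
  jittered=$(( RANDOM % (delay + 1) ))   # full jitter: uniform in [0, delay]
  cumulative=$(( cumulative + delay ))
  echo "attempt ${attempt}: delay=${delay}s (jittered: ${jittered}s) cumulative=${cumulative}s"
done
```

The cap matters: without it, attempt 13 would already wait over an hour (4096s), so long-running retry sequences converge to the 1h maximum.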

Available Strategies

  • Exponential (default) — Delay doubles each attempt: 1s, 2s, 4s, 8s...
  • Linear — Constant delay increase: 1s, 2s, 3s, 4s...
  • Fixed — Same delay every time: 5s, 5s, 5s, 5s...
  • Custom — Provide your own delay function
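Side by side, the strategies map an attempt number to a delay in seconds. A minimal sketch (the custom strategy is whatever function you supply, so a quadratic one stands in as an example):

```shell
# Delay (in seconds) produced by each strategy for a given attempt number.
exponential_delay() { echo $(( 1 << ($1 - 1) )); }  # 1, 2, 4, 8, ...
linear_delay()      { echo "$1"; }                  # 1, 2, 3, 4, ...
fixed_delay()       { echo 5; }                     # 5, 5, 5, 5, ...
custom_delay()      { echo $(( $1 * $1 )); }        # e.g. quadratic: 1, 4, 9, 16, ...

for attempt in 1 2 3 4; do
  echo "attempt ${attempt}: exp=$(exponential_delay $attempt)s linear=$(linear_delay $attempt)s fixed=$(fixed_delay $attempt)s custom=$(custom_delay $attempt)s"
done
```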

Retry Configuration

Configure retry behavior when creating jobs:

Job with retry configuration
curl -X POST https://api.spooled.cloud/api/v1/jobs \
  -H "Authorization: Bearer sp_live_YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "queue_name": "my-queue",
    "payload": {
      "event": "user.created",
      "user_id": "usr_123",
      "email": "alice@example.com"
    },
    "max_retries": 5,
    "timeout_seconds": 120,
    "idempotency_key": "user-created-usr_123"
  }'

Configuration Options

| Option          | Type   | Default | Description                                 |
|-----------------|--------|---------|---------------------------------------------|
| max_retries     | number | 3       | Maximum retry attempts before moving to DLQ |
| timeout_seconds | number | 300     | Job timeout (seconds)                       |
| priority        | number | 0       | Job priority (-100 to 100)                  |

Dead-Letter Queue (DLQ)

Jobs that exhaust all retry attempts land in the dead-letter queue. The DLQ preserves the full job payload and failure history for debugging and replay.

Dashboard Tip

📍 Dashboard → Jobs → Dead Letter Queue

What to look for:

  • Job payload and metadata
  • Last error message
  • Retry count and failure history
  • Original queue name

Actions:

  • Retry individual jobs
  • Bulk retry all DLQ jobs
  • Purge old failed jobs

List DLQ Jobs

# List jobs in dead-letter queue
curl -X GET "https://api.spooled.cloud/api/v1/jobs/dlq?queue_name=my-queue&limit=100" \
  -H "Authorization: Bearer sp_live_YOUR_API_KEY"

Retry DLQ Jobs

# Retry jobs from DLQ
curl -X POST https://api.spooled.cloud/api/v1/jobs/dlq/retry \
  -H "Authorization: Bearer sp_live_YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "queue_name": "my-queue",
    "limit": 50
  }'

Purge DLQ

# Purge DLQ jobs (requires confirm: true)
curl -X POST https://api.spooled.cloud/api/v1/jobs/dlq/purge \
  -H "Authorization: Bearer sp_live_YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "queue_name": "my-queue",
    "confirm": true
  }'

DLQ Best Practices

  • Set up alerts when jobs enter the DLQ
  • Review DLQ jobs regularly (daily for critical queues)
  • Fix the root cause before replaying jobs
  • Consider archiving old DLQ jobs to cold storage
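The first practice — alerting — boils down to comparing the DLQ size against a threshold. In this sketch, dlq_count stands in for the count returned by the DLQ list endpoint above (fetching and parsing the response is omitted, and the threshold is illustrative):

```shell
# Alert when the dead-letter queue is non-empty.
check_dlq() {
  local dlq_count=$1 threshold=0
  if (( dlq_count > threshold )); then
    echo "ALERT: ${dlq_count} jobs in DLQ"
  else
    echo "OK"
  fi
}

check_dlq 7   # prints "ALERT: 7 jobs in DLQ"
check_dlq 0   # prints "OK"
```

In practice you would run this on a schedule (cron, or your monitoring system's HTTP check) and route the alert to your on-call channel.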

Handling Failures in Workers

Workers should explicitly mark jobs as failed with a reason. This helps with debugging and determines retry behavior.

Failing a job
# Fail a job (will retry if retries remaining)
curl -X POST https://api.spooled.cloud/api/v1/jobs/job_xyz123/fail \
  -H "Authorization: Bearer sp_live_YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "worker_id": "worker-1",
    "error": "Connection timeout"
  }'
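On the worker side, the useful pattern is capturing the actual error output so it becomes the "error" field of the request above. A sketch (process is a placeholder for your real job handler; only the JSON body is built here, no request is sent):

```shell
# Placeholder job handler that fails with a message on stderr.
process() { echo "Connection timeout" >&2; return 1; }

# Capture the handler's error output and build the /fail request body from it.
build_fail_body() {
  local err
  if ! err=$(process 2>&1); then
    printf '{"worker_id": "worker-1", "error": "%s"}' "$err"
  fi
}

build_fail_body
# prints: {"worker_id": "worker-1", "error": "Connection timeout"}
```

Sending the captured message rather than a generic "job failed" string is what makes the last_error field in the dashboard actionable.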

Debug Failed Jobs

📍 Dashboard → Jobs → Failed

What to look for:

  • Error message in last_error field
  • Retry count vs max_retries
  • Job payload for invalid data
  • Timestamps to correlate with logs

Actions:

  • Check worker logs for stack traces
  • Verify external service availability
  • Test payload manually

Next Steps