Retries & Dead-Letter Queue
Job failures are inevitable. Spooled provides automatic retries with exponential backoff and a dead-letter queue for jobs that can't be processed.
How Retries Work
When a job fails, Spooled automatically schedules a retry with exponential backoff. This prevents overwhelming downstream services and gives transient failures time to resolve.
%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#fef3c7', 'primaryTextColor': '#92400e', 'primaryBorderColor': '#f59e0b', 'lineColor': '#6b7280'}}}%%
stateDiagram-v2
[*] --> pending
pending --> claimed: Worker claims
claimed --> completed: Success
claimed --> failed: Error
failed --> pending: Retry (attempts left)
failed --> dlq: Max retries exceeded
dlq --> pending: Manual replay
completed --> [*]
Retry Flow
- Worker claims a job and attempts to process it
- Processing fails (exception, timeout, or explicit failure)
- Spooled checks if retry attempts remain
- If retries remain: job is scheduled for later with backoff delay
- If max retries exceeded: job moves to dead-letter queue (DLQ)
Backoff Strategies
By default, Spooled uses exponential backoff with jitter. This spreads out retries and prevents thundering herd problems.
Default Exponential Backoff
With default settings (base=1s, max=1h), the base delay doubles on each attempt; jitter then adjusts the actual wait (see the sketch after the table):
| Attempt | Delay | Cumulative Time |
|---|---|---|
| 1 | 1s | 1s |
| 2 | 2s | 3s |
| 3 | 4s | 7s |
| 4 | 8s | 15s |
| 5 | 16s | 31s |
| 6 | 32s | ~1 min |
| 7 | 64s | ~2 min |
| 8 | 128s | ~4 min |
| 9 | 256s | ~8 min |
| 10 | 512s | ~17 min |
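The exact jitter formula isn't documented here, but as a minimal sketch of the idea, assuming a "full jitter" variant (a uniformly random wait between zero and the capped exponential delay) and the defaults above:

# Sketch only: exponential backoff with full jitter.
# base=1s and max=3600s come from the defaults above; the jitter formula itself is an assumption.
base=1
max=3600
attempt=5
delay=$(( base * 2 ** (attempt - 1) ))   # 1s, 2s, 4s, 8s, 16s, ...
(( delay > max )) && delay=$max          # cap at the configured maximum
sleep_for=$(( RANDOM % (delay + 1) ))    # full jitter: uniform pick from [0, delay]
echo "attempt $attempt: retry in ${sleep_for}s (pre-jitter delay ${delay}s)"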
Available Strategies
- Exponential (default) — Delay doubles each attempt: 1s, 2s, 4s, 8s...
- Linear — Constant delay increase: 1s, 2s, 3s, 4s...
- Fixed — Same delay every time: 5s, 5s, 5s, 5s...
- Custom — Provide your own delay function
Retry Configuration
Configure retry behavior when creating jobs:
curl -X POST https://api.spooled.cloud/api/v1/jobs \
-H "Authorization: Bearer sp_live_YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"queue_name": "my-queue",
"payload": {
"event": "user.created",
"user_id": "usr_123",
"email": "alice@example.com"
},
"idempotency_key": "user-created-usr_123"
}'Configuration Options
| Option | Type | Default | Description |
|---|---|---|---|
| max_retries | number | 3 | Maximum retry attempts before moving to DLQ |
| timeout_seconds | number | 300 | Job timeout (seconds) |
| priority | number | 0 | Job priority (-100 to 100) |
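As a quick worked example tying max_retries to the backoff table above: with the default max_retries of 3 and base=1s, a job spends roughly 1 + 2 + 4 = 7 seconds in backoff (ignoring jitter and processing time) before landing in the DLQ. A small sketch:

# Sketch: worst-case time spent in backoff before a job reaches the DLQ,
# using the default exponential strategy (base=1s) and ignoring jitter.
max_retries=3
total=0
for (( attempt = 1; attempt <= max_retries; attempt++ )); do
  total=$(( total + 2 ** (attempt - 1) ))
done
echo "max_retries=$max_retries -> ~${total}s of backoff before the DLQ"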
Dead-Letter Queue (DLQ)
Jobs that exhaust all retry attempts land in the dead-letter queue. The DLQ preserves the full job payload and failure history for debugging and replay.
Dashboard Tip
Use the dashboard to inspect DLQ jobs before replaying them.
What to look for:
- → Job payload and metadata
- → Last error message
- → Retry count and failure history
- → Original queue name
Actions:
- ✓ Retry individual jobs
- ✓ Bulk retry all DLQ jobs
- ✓ Purge old failed jobs
List DLQ Jobs
# List jobs in dead-letter queue
curl -X GET "https://api.spooled.cloud/api/v1/jobs/dlq?queue_name=my-queue&limit=100" \
-H "Authorization: Bearer sp_live_YOUR_API_KEY"Retry DLQ Jobs
# Retry jobs from DLQ
curl -X POST https://api.spooled.cloud/api/v1/jobs/dlq/retry \
-H "Authorization: Bearer sp_live_YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"queue_name": "my-queue",
"limit": 50
}'
Purge DLQ
# Purge DLQ jobs (requires confirm: true)
curl -X POST https://api.spooled.cloud/api/v1/jobs/dlq/purge \
-H "Authorization: Bearer sp_live_YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"queue_name": "my-queue",
"confirm": true
}'
DLQ Best Practices
- Set up alerts when jobs enter the DLQ (see the sketch below)
- Review DLQ jobs regularly (daily for critical queues)
- Fix the root cause before replaying jobs
- Consider archiving old DLQ jobs to cold storage
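For the first point, a minimal alerting sketch: poll the DLQ list endpoint shown above on a schedule and notify a webhook whenever anything turns up. The "jobs" field in the response and the webhook URL are assumptions, not documented behavior.

# Sketch: cron-able DLQ alert. Assumes the list response has a "jobs" array;
# ALERT_WEBHOOK_URL is a placeholder for your own notification endpoint.
count=$(curl -s "https://api.spooled.cloud/api/v1/jobs/dlq?queue_name=my-queue&limit=1" \
  -H "Authorization: Bearer sp_live_YOUR_API_KEY" | jq '.jobs | length')
if [ "${count:-0}" -gt 0 ]; then
  curl -s -X POST "$ALERT_WEBHOOK_URL" \
    -H "Content-Type: application/json" \
    -d '{"text": "Spooled: my-queue has jobs in the dead-letter queue"}'
fi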
Handling Failures in Workers
Workers should explicitly mark jobs as failed with a reason. This helps with debugging and determines retry behavior.
# Fail a job (will retry if retries remaining)
curl -X POST https://api.spooled.cloud/api/v1/jobs/job_xyz123/fail \
-H "Authorization: Bearer sp_live_YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"worker_id": "worker-1",
"error": "Connection timeout"
}'
Debug Failed Jobs
What to look for:
- → Error message in last_error field (see the sketch below)
- → Retry count vs max_retries
- → Job payload for invalid data
- → Timestamps to correlate with logs
Actions:
- ✓ Check worker logs for stack traces
- ✓ Verify external service availability
- ✓ Test payload manually
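One way to pull those fields for a whole queue at once, using the DLQ list endpoint from earlier. last_error and max_retries come from the checklist above; the top-level "jobs" array and the "id" and "retry_count" field names are assumptions.

# Sketch: summarize failed jobs for triage. Field names other than last_error
# and max_retries (e.g. "jobs", "id", "retry_count") are assumptions.
curl -s "https://api.spooled.cloud/api/v1/jobs/dlq?queue_name=my-queue&limit=100" \
  -H "Authorization: Bearer sp_live_YOUR_API_KEY" \
  | jq -r '.jobs[] | "\(.id)  \(.retry_count)/\(.max_retries)  \(.last_error)"'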
Next Steps
- Jobs & queues — Understand the job lifecycle
- Building workers — Production worker patterns
- Real-time updates — Monitor job status in real-time