Architecture
Spooled is built for reliability, performance, and multi-tenant security. This guide explains the system architecture and design decisions.
System Overview
Spooled is a distributed job queue system built in Rust for performance and memory safety. The architecture consists of several key layers:
```mermaid
%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#ecfdf5', 'primaryTextColor': '#065f46', 'primaryBorderColor': '#10b981', 'lineColor': '#6b7280'}}}%%
flowchart TB
    subgraph clients["Client Layer"]
        SDK["SDKs<br/>Node / Python / Go / PHP"]
        HTTP["REST API<br/>OpenAPI 3.0"]
        GRPC["gRPC API"]
    end
    subgraph core["Core Services"]
        API["API Gateway<br/>Axum + Tower"]
        AUTH["Auth Service"]
        QUEUE["Queue Engine"]
        SCHED["Scheduler"]
        STREAM["Event Stream"]
    end
    subgraph storage["Data Layer"]
        PG[("PostgreSQL<br/>Jobs + RLS")]
        RD[("Redis<br/>Pub/Sub + Cache")]
    end
    subgraph workers["Worker Layer"]
        W1["Worker Pool 1"]
        W2["Worker Pool 2"]
        W3["Worker Pool N"]
    end
    subgraph observability["Observability"]
        PROM["Prometheus"]
        GRAF["Grafana"]
        LOGS["Structured Logs"]
    end
    SDK --> API
    HTTP --> API
    GRPC --> API
    API --> AUTH
    AUTH --> QUEUE
    QUEUE --> PG
    QUEUE --> RD
    SCHED --> QUEUE
    STREAM --> RD
    W1 --> API
    W2 --> API
    W3 --> API
    API --> PROM
    PROM --> GRAF
```

Core Components
API Gateway (Axum + Tower)
The API gateway handles all incoming requests. Built with Axum, it provides:
- REST API — OpenAPI 3.0 compliant, JSON over HTTPS
- gRPC API — HTTP/2 + Protobuf with streaming for high-throughput workers
- WebSocket/SSE — Real-time job status streaming
- Rate limiting — Per-organization request throttling via Redis
- Request validation — Schema validation for all inputs
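As a sketch of what goes through the gateway, here is how a client might assemble an enqueue request. The endpoint path `/v1/jobs`, the `Authorization` header format, and the payload fields are illustrative assumptions, not the documented contract — see the API reference for the real schema.

```python
import json

# Hypothetical base URL for illustration only.
API_BASE = "https://api.spooled.example"

def build_enqueue_request(api_key: str, queue: str, payload: dict,
                          priority: int = 0, max_retries: int = 3) -> dict:
    """Assemble the HTTP request an SDK might send to enqueue a job.

    Field names (queue, payload, priority, max_retries) are assumptions
    mirroring the features described in this guide.
    """
    return {
        "method": "POST",
        "url": f"{API_BASE}/v1/jobs",
        "headers": {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        "body": json.dumps({
            "queue": queue,
            "payload": payload,
            "priority": priority,
            "max_retries": max_retries,
        }),
    }

req = build_enqueue_request("sk_test_123", "emails", {"to": "user@example.com"})
```

A real client would hand this request to an HTTP library (or simply use one of the SDKs); schema validation on the gateway would reject any payload that does not match the expected shape.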
Queue Engine
The core job processing engine manages job lifecycle, retries, and scheduling:
- `FOR UPDATE SKIP LOCKED` — Safe concurrent job claiming without conflicts
- Configurable exponential backoff with jitter
- Priority queue support (higher priority = processed first)
- Scheduled job support with second-level precision
- Lease-based job ownership with automatic recovery
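A claim using `FOR UPDATE SKIP LOCKED` typically looks like the following sketch. The table and column names are illustrative, not Spooled's actual schema; the point is that competing workers skip rows another transaction has already locked instead of blocking on them.

```sql
-- Atomically claim the highest-priority pending job and take a lease on it.
-- Rows locked by other workers are skipped, so claims never conflict.
WITH next_job AS (
    SELECT id
    FROM jobs
    WHERE status = 'pending'
      AND scheduled_at <= now()
    ORDER BY priority DESC, created_at
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
UPDATE jobs
SET status = 'processing',
    lease_expires_at = now() + interval '30 seconds'
FROM next_job
WHERE jobs.id = next_job.id
RETURNING jobs.*;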
Background Scheduler
The scheduler runs background tasks for system maintenance:
| Task | Interval | Purpose |
|---|---|---|
| Activate scheduled jobs | 5s | Move scheduled→pending |
| Process cron schedules | 10s | Create jobs from cron |
| Recover expired leases | 30s | Handle worker failures |
| Update job dependencies | 10s | Unblock child jobs |
| Cleanup stale workers | 5m | Mark offline workers |
| Data retention | 5m | Delete old data |
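The table above can be read as an interval-driven maintenance loop. Here is a minimal sketch of the dispatch logic — the task names and intervals mirror the table, but the mechanics are an assumption, not Spooled's implementation:

```python
# Interval (seconds) per maintenance task, taken from the table above.
TASKS = {
    "activate_scheduled_jobs": 5,
    "process_cron_schedules": 10,
    "recover_expired_leases": 30,
    "update_job_dependencies": 10,
    "cleanup_stale_workers": 300,
    "data_retention": 300,
}

def due_tasks(last_run: dict, now: float) -> list:
    """Return names of tasks whose interval has elapsed since their last run."""
    return sorted(
        name for name, interval in TASKS.items()
        if now - last_run.get(name, 0.0) >= interval
    )

last_run = {name: 0.0 for name in TASKS}
# At t=10s the 5s and 10s tasks are due, but not the 30s or 5m ones.
print(due_tasks(last_run, 10.0))
```

A real scheduler would run each due task, record its completion time in `last_run`, and sleep until the next deadline.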
PostgreSQL (Data Plane)
PostgreSQL stores all job data with Row-Level Security (RLS) for multi-tenant isolation:
- Durability — ACID transactions ensure no job loss
- RLS — Organizations can only see their own data
- Indexing — Optimized for queue operations and time-based queries
- Partitioning — Time-based partitioning for large deployments
Redis (Real-time Layer)
Redis handles real-time features and caching:
- Pub/Sub — Real-time job status notifications
- Rate limiting — Token bucket counters per organization
- Caching — API key validation and org metadata
- Cluster support — Horizontal scaling for high availability
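To make the rate-limiting role concrete, here is a minimal token-bucket limiter of the kind those Redis counters implement per organization. In production the bucket state lives in Redis so all backends share it; holding it in a Python object here is purely for illustration.

```python
import time

class TokenBucket:
    """Token bucket: capacity tokens, refilled at a steady rate."""

    def __init__(self, capacity: int, refill_per_sec: float):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)
        self.updated = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(
            self.capacity,
            self.tokens + (now - self.updated) * self.refill_per_sec,
        )
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# A burst of 6 requests against a 5-token bucket: the 6th is throttled.
bucket = TokenBucket(capacity=5, refill_per_sec=1.0)
results = [bucket.allow() for _ in range(6)]
```

The capacity bounds burst size while the refill rate bounds sustained throughput, which is why the pattern suits per-organization throttling.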
gRPC Service
High-performance worker communication using HTTP/2 + Protobuf via tonic:
- Port: `50051` (HTTP/2)
- Health Service: `grpc.health.v1.Health` for load balancer probes
- Reflection: Enabled for debugging with grpcurl/Postman
- Streaming: `StreamJobs` for push-based delivery, `ProcessJobs` for bidirectional
Job Lifecycle
```mermaid
%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#f3e8ff', 'primaryTextColor': '#6b21a8', 'primaryBorderColor': '#a855f7', 'lineColor': '#6b7280'}}}%%
stateDiagram-v2
    [*] --> Pending: Enqueue
    Pending --> Processing: Worker claims
    Processing --> Completed: Success
    Processing --> Failed: Error
    Failed --> Pending: Retry
    Failed --> DeadLetter: Max retries
    Completed --> [*]
    DeadLetter --> Pending: Manual retry
```

Job States
| State | Description |
|---|---|
| `pending` | Ready to be claimed by a worker |
| `processing` | Claimed by a worker, in progress |
| `completed` | Successfully processed |
| `failed` | Failed, will retry if retries remaining |
| `dead_letter` | Failed after exhausting all retries |
| `cancelled` | Manually cancelled |
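The lifecycle can be encoded as a small state machine. The transition map below is read off the state diagram; the diagram omits cancellation edges, so which states are cancellable here is an assumption, and Spooled's internal representation may differ.

```python
from enum import Enum

class JobState(Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    COMPLETED = "completed"
    FAILED = "failed"
    DEAD_LETTER = "dead_letter"
    CANCELLED = "cancelled"

# Legal transitions per the lifecycle diagram. Cancellation edges
# (pending/processing -> cancelled) are an assumption.
TRANSITIONS = {
    JobState.PENDING: {JobState.PROCESSING, JobState.CANCELLED},
    JobState.PROCESSING: {JobState.COMPLETED, JobState.FAILED, JobState.CANCELLED},
    JobState.FAILED: {JobState.PENDING, JobState.DEAD_LETTER},
    JobState.DEAD_LETTER: {JobState.PENDING},  # manual retry
    JobState.COMPLETED: set(),                 # terminal
    JobState.CANCELLED: set(),                 # terminal
}

def can_transition(src: JobState, dst: JobState) -> bool:
    """Check whether a state change is legal under the lifecycle."""
    return dst in TRANSITIONS[src]
```

Enforcing transitions this way (rather than freely setting a status column) is what keeps retries, dead-lettering, and manual re-enqueues consistent.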
Multi-Tenant Security
Every API request is scoped to a single organization using PostgreSQL Row-Level Security. This provides defense-in-depth isolation at the database level.
```mermaid
%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#eff6ff', 'primaryTextColor': '#1e40af', 'primaryBorderColor': '#3b82f6', 'lineColor': '#6b7280'}}}%%
flowchart LR
    subgraph request["Incoming Request"]
        TOKEN["API Key"]
    end
    subgraph auth["Authentication"]
        VALIDATE["Validate Key"]
        EXTRACT["Extract org_id"]
    end
    subgraph pg["PostgreSQL RLS"]
        SET["SET app.current_org"]
        POLICY["RLS Policy Check"]
        DATA["Org's Data Only"]
    end
    TOKEN --> VALIDATE
    VALIDATE --> EXTRACT
    EXTRACT --> SET
    SET --> POLICY
    POLICY --> DATA
```

How RLS Works
- API key is validated and organization ID extracted
- Connection sets the `app.current_org` session variable
- All queries automatically filter by `org_id = current_setting('app.current_org')`
- Even raw SQL access cannot cross tenant boundaries
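In PostgreSQL terms, the setup looks roughly like this sketch — the table name, policy name, and column types are illustrative, not Spooled's actual schema:

```sql
-- Illustrative RLS setup; names and types are assumptions.
ALTER TABLE jobs ENABLE ROW LEVEL SECURITY;

CREATE POLICY org_isolation ON jobs
    USING (org_id = current_setting('app.current_org')::uuid);

-- Per request, after the API key resolves to an organization
-- (third argument true = setting is local to the current transaction):
SELECT set_config('app.current_org',
                  '00000000-0000-0000-0000-000000000000', true);
```

Because the policy is evaluated by the database itself, a bug in application code cannot widen a query past the current organization's rows.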
Security Layers
Retry Mechanism
Failed jobs automatically retry with configurable exponential backoff. The retry system ensures reliable delivery while preventing thundering herd problems.
```mermaid
%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#fef3c7', 'primaryTextColor': '#92400e', 'primaryBorderColor': '#f59e0b', 'lineColor': '#6b7280'}}}%%
sequenceDiagram
    participant W as Worker
    participant S as Spooled
    participant DLQ as Dead Letter Queue
    W->>S: Claim job
    S-->>W: Job data
    W->>W: Process (fails)
    W->>S: Fail job
    S->>S: Check retry count
    alt retries remaining
        S->>S: Schedule retry (backoff)
        Note over S: Wait 2^n seconds
        S-->>W: Job available again
    else max retries exceeded
        S->>DLQ: Move to DLQ
        Note over DLQ: Manual review
    end
```

Backoff Formula
Default backoff uses exponential delay with jitter:
```
delay = min(base_delay * 2^attempt + random_jitter, max_delay)
```

Where:
- `base_delay` = 1 second
- `max_delay` = 1 hour
- `random_jitter` = 0–500ms
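A direct implementation of this formula, as a sketch (Spooled's internals may differ, but the behavior matches the stated defaults):

```python
import random

BASE_DELAY = 1.0    # seconds
MAX_DELAY = 3600.0  # 1 hour
MAX_JITTER = 0.5    # 0-500 ms

def retry_delay(attempt: int) -> float:
    """delay = min(base_delay * 2^attempt + random_jitter, max_delay)."""
    jitter = random.uniform(0.0, MAX_JITTER)
    return min(BASE_DELAY * 2 ** attempt + jitter, MAX_DELAY)

# Attempts 0..5 back off roughly 1s, 2s, 4s, 8s, 16s, 32s (plus jitter);
# from attempt 12 onward the 1-hour cap applies (2^12 = 4096 > 3600).
```

The jitter term is what prevents the thundering-herd problem: jobs that failed together do not all retry at the same instant.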
Scalability
```mermaid
%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#fce7f3', 'primaryTextColor': '#9d174d', 'primaryBorderColor': '#ec4899', 'lineColor': '#6b7280'}}}%%
flowchart TB
    LB["Load Balancer"]
    subgraph backends["Backend Pods"]
        B1["Backend 1"]
        B2["Backend 2"]
        B3["Backend N"]
    end
    subgraph data["Data Layer"]
        PGB["PgBouncer<br/>(Connection Pool)"]
        PG[("PostgreSQL<br/>Primary")]
        RD[("Redis<br/>Cluster")]
    end
    LB --> B1
    LB --> B2
    LB --> B3
    B1 --> PGB
    B2 --> PGB
    B3 --> PGB
    B1 --> RD
    B2 --> RD
    B3 --> RD
    PGB --> PG
```

Connection Pooling
- PgBouncer in transaction mode
- Each backend: 25 connections to bouncer
- Bouncer: 100 connections to PostgreSQL
- Result: 10 backends × 25 = 250 client connections, multiplexed over the 100 real PostgreSQL connections
Caching Strategy
| Data | Cache | TTL |
|---|---|---|
| API Keys | Redis | 1 hour |
| Queue Config | Redis | 5 minutes |
| Job Status (hot) | Redis | 30 seconds |
| Rate Limit Counters | Redis | Per window |
Performance Characteristics
Benchmarks
| Operation | P50 | P95 | P99 |
|---|---|---|---|
| Health Check | 1ms | 5ms | 10ms |
| Job Enqueue | 5ms | 20ms | 50ms |
| Job Dequeue | 10ms | 30ms | 100ms |
| Job Complete | 5ms | 15ms | 30ms |
| List Jobs (100) | 20ms | 50ms | 100ms |
Capacity Planning
| Component | 1K req/s | 10K req/s | 100K req/s |
|---|---|---|---|
| Backend Pods | 2 | 10 | 50 |
| PostgreSQL | 1 primary | 1 primary + 2 read replicas | Cluster |
| Redis | 1 node | Sentinel | Cluster |
| PgBouncer | 1 | 2 | 4 |
Request Limits
| Resource | Limit |
|---|---|
| Request body | 5MB |
| Job payload | 1MB |
| gRPC payload | 1MB |
| Webhook payload | 5MB |
| List page size | 100 |
| Bulk enqueue | 100 jobs |
Deployment Options
| Option | Best For | Maintenance |
|---|---|---|
| Managed Cloud | Most teams | Zero maintenance |
| Docker Compose | Development, small deployments | Basic ops required |
| Kubernetes/Helm | Large scale, air-gapped | Full ops team |
Next Steps
- Deployment guide — Self-hosting instructions
- API reference — Complete endpoint documentation
- Real-time API — WebSocket, SSE, and gRPC streaming
- Open source — Contributing and licensing