Architecture
Spooled is built for reliability, performance, and multi-tenant security. This guide explains the system architecture and design decisions.
System Overview
Spooled is a distributed job queue system built in Rust for performance and memory safety. The architecture consists of several key layers:
```mermaid
%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#ecfdf5', 'primaryTextColor': '#065f46', 'primaryBorderColor': '#10b981', 'lineColor': '#6b7280'}}}%%
flowchart TB
    subgraph clients["Client Layer"]
        SDK["SDKs<br/>Node / Python / Go / PHP"]
        HTTP["REST API<br/>OpenAPI 3.0"]
        GRPC["gRPC API"]
    end
    subgraph core["Core Services"]
        API["API Gateway<br/>Axum + Tower"]
        AUTH["Auth Service"]
        QUEUE["Queue Engine"]
        SCHED["Scheduler"]
        STREAM["Event Stream"]
    end
    subgraph storage["Data Layer"]
        PG[("PostgreSQL<br/>Jobs + RLS")]
        RD[("Redis<br/>Pub/Sub + Cache")]
    end
    subgraph workers["Worker Layer"]
        W1["Worker Pool 1"]
        W2["Worker Pool 2"]
        W3["Worker Pool N"]
    end
    subgraph observability["Observability"]
        PROM["Prometheus"]
        GRAF["Grafana"]
        LOGS["Structured Logs"]
    end
    SDK --> API
    HTTP --> API
    GRPC --> API
    API --> AUTH
    AUTH --> QUEUE
    QUEUE --> PG
    QUEUE --> RD
    SCHED --> QUEUE
    STREAM --> RD
    W1 --> API
    W2 --> API
    W3 --> API
    API --> PROM
    PROM --> GRAF
```

Core Components
API Gateway (Axum + Tower)
The API gateway handles all incoming requests. Built with Axum, it provides:
- REST API — OpenAPI 3.0 compliant, JSON over HTTPS
- gRPC API — HTTP/2 + Protobuf with streaming for high-throughput workers
- WebSocket/SSE — Real-time job status streaming
- Rate limiting — Per-organization request throttling via Redis
- Request validation — Schema validation for all inputs
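As a sketch of what goes through the gateway, here is how a client might assemble an enqueue request. The endpoint path `/v1/jobs`, the `Authorization` header format, and the payload fields are illustrative assumptions, not the documented contract — see the API reference for the real schema.

```python
import json

# Hypothetical base URL for illustration only.
API_BASE = "https://api.spooled.example"

def build_enqueue_request(api_key: str, queue: str, payload: dict,
                          priority: int = 0, max_retries: int = 3) -> dict:
    """Assemble the HTTP request an SDK might send to enqueue a job.

    Field names (queue, payload, priority, max_retries) are assumptions
    mirroring the features described in this guide.
    """
    return {
        "method": "POST",
        "url": f"{API_BASE}/v1/jobs",
        "headers": {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        "body": json.dumps({
            "queue": queue,
            "payload": payload,
            "priority": priority,
            "max_retries": max_retries,
        }),
    }

req = build_enqueue_request("sk_test_123", "emails", {"to": "user@example.com"})
```

A real client would hand this request to an HTTP library (or simply use one of the SDKs); schema validation on the gateway would reject any payload that does not match the expected shape.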
Queue Engine
The core job processing engine manages job lifecycle, retries, and scheduling:
- `FOR UPDATE SKIP LOCKED` — Safe concurrent job claiming without conflicts
- Configurable exponential backoff with jitter
- Priority queue support (higher priority = processed first)
- Scheduled job support with second-level precision
- Lease-based job ownership with automatic recovery
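A claim using `FOR UPDATE SKIP LOCKED` typically looks like the following sketch. The table and column names are illustrative, not Spooled's actual schema; the point is that competing workers skip rows another transaction has already locked instead of blocking on them.

```sql
-- Atomically claim the highest-priority pending job and take a lease on it.
-- Rows locked by other workers are skipped, so claims never conflict.
WITH next_job AS (
    SELECT id
    FROM jobs
    WHERE status = 'pending'
      AND scheduled_at <= now()
    ORDER BY priority DESC, created_at
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
UPDATE jobs
SET status = 'processing',
    lease_expires_at = now() + interval '30 seconds'
FROM next_job
WHERE jobs.id = next_job.id
RETURNING jobs.*;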
Background Scheduler
The scheduler runs background tasks for system maintenance:
| Task | Interval | Purpose |
|---|---|---|
| Activate scheduled jobs | 5s | Move scheduled→pending |
| Process cron schedules | 10s | Create jobs from cron |
| Recover expired leases | 30s | Handle worker failures |
| Update job dependencies | 10s | Unblock child jobs |
| Cleanup stale workers | 5m | Mark offline workers |
| Data retention | 5m | Delete old data |
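The table above can be read as an interval-driven maintenance loop. Here is a minimal sketch of the dispatch logic — the task names and intervals mirror the table, but the mechanics are an assumption, not Spooled's implementation:

```python
# Interval (seconds) per maintenance task, taken from the table above.
TASKS = {
    "activate_scheduled_jobs": 5,
    "process_cron_schedules": 10,
    "recover_expired_leases": 30,
    "update_job_dependencies": 10,
    "cleanup_stale_workers": 300,
    "data_retention": 300,
}

def due_tasks(last_run: dict, now: float) -> list:
    """Return names of tasks whose interval has elapsed since their last run."""
    return sorted(
        name for name, interval in TASKS.items()
        if now - last_run.get(name, 0.0) >= interval
    )

last_run = {name: 0.0 for name in TASKS}
# At t=10s the 5s and 10s tasks are due, but not the 30s or 5m ones.
print(due_tasks(last_run, 10.0))
```

A real scheduler would run each due task, record its completion time in `last_run`, and sleep until the next deadline.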
PostgreSQL (Data Plane)
PostgreSQL stores all job data with Row-Level Security (RLS) for multi-tenant isolation:
- Durability — ACID transactions ensure no job loss
- RLS — Organizations can only see their own data
- Indexing — Optimized for queue operations and time-based queries
- Partitioning — Time-based partitioning for large deployments
Redis (Real-time Layer)
Redis handles real-time features and caching:
- Pub/Sub — Real-time job status notifications
- Rate limiting — Token bucket counters per organization
- Caching — API key validation and org metadata
- Cluster support — Horizontal scaling for high availability
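To make the rate-limiting role concrete, here is a minimal token-bucket limiter of the kind those Redis counters implement per organization. In production the bucket state lives in Redis so all backends share it; holding it in a Python object here is purely for illustration.

```python
import time

class TokenBucket:
    """Token bucket: capacity tokens, refilled at a steady rate."""

    def __init__(self, capacity: int, refill_per_sec: float):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)
        self.updated = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(
            self.capacity,
            self.tokens + (now - self.updated) * self.refill_per_sec,
        )
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# A burst of 6 requests against a 5-token bucket: the 6th is throttled.
bucket = TokenBucket(capacity=5, refill_per_sec=1.0)
results = [bucket.allow() for _ in range(6)]
```

The capacity bounds burst size while the refill rate bounds sustained throughput, which is why the pattern suits per-organization throttling.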
gRPC Service
High-performance worker communication using HTTP/2 + Protobuf via tonic:
- Port: `50051` (HTTP/2)
- Health Service: `grpc.health.v1.Health` for load balancer probes
- Reflection: Enabled for debugging with grpcurl/Postman
- Streaming: `StreamJobs` for push-based delivery, `ProcessJobs` for bidirectional
Job Lifecycle
```mermaid
%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#f3e8ff', 'primaryTextColor': '#6b21a8', 'primaryBorderColor': '#a855f7', 'lineColor': '#6b7280'}}}%%
stateDiagram-v2
    [*] --> Pending: Enqueue
    Pending --> Processing: Worker claims
    Processing --> Completed: Success
    Processing --> Failed: Error
    Failed --> Pending: Retry
    Failed --> DeadLetter: Max retries
    Completed --> [*]
    DeadLetter --> Pending: Manual retry
```

Job States
| State | Description |
|---|---|
| `pending` | Ready to be claimed by a worker |
| `processing` | Claimed by a worker, in progress |
| `completed` | Successfully processed |
| `failed` | Failed, will retry if retries remaining |
| `dead_letter` | Failed after exhausting all retries |
| `cancelled` | Manually cancelled |
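The lifecycle can be encoded as a small state machine. The transition map below is read off the state diagram; the diagram omits cancellation edges, so which states are cancellable here is an assumption, and Spooled's internal representation may differ.

```python
from enum import Enum

class JobState(Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    COMPLETED = "completed"
    FAILED = "failed"
    DEAD_LETTER = "dead_letter"
    CANCELLED = "cancelled"

# Legal transitions per the lifecycle diagram. Cancellation edges
# (pending/processing -> cancelled) are an assumption.
TRANSITIONS = {
    JobState.PENDING: {JobState.PROCESSING, JobState.CANCELLED},
    JobState.PROCESSING: {JobState.COMPLETED, JobState.FAILED, JobState.CANCELLED},
    JobState.FAILED: {JobState.PENDING, JobState.DEAD_LETTER},
    JobState.DEAD_LETTER: {JobState.PENDING},  # manual retry
    JobState.COMPLETED: set(),                 # terminal
    JobState.CANCELLED: set(),                 # terminal
}

def can_transition(src: JobState, dst: JobState) -> bool:
    """Check whether a state change is legal under the lifecycle."""
    return dst in TRANSITIONS[src]
```

Enforcing transitions this way (rather than freely setting a status column) is what keeps retries, dead-lettering, and manual re-enqueues consistent.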
Multi-Tenant Security
Every API request is scoped to a single organization using PostgreSQL Row-Level Security. This provides defense-in-depth isolation at the database level.
```mermaid
%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#eff6ff', 'primaryTextColor': '#1e40af', 'primaryBorderColor': '#3b82f6', 'lineColor': '#6b7280'}}}%%
flowchart LR
    subgraph request["Incoming Request"]
        TOKEN["API Key"]
    end
    subgraph auth["Authentication"]
        VALIDATE["Validate Key"]
        EXTRACT["Extract org_id"]
    end
    subgraph pg["PostgreSQL RLS"]
        SET["SET app.current_org"]
        POLICY["RLS Policy Check"]
        DATA["Org's Data Only"]
    end
    TOKEN --> VALIDATE
    VALIDATE --> EXTRACT
    EXTRACT --> SET
    SET --> POLICY
    POLICY --> DATA
```

How RLS Works
- API key is validated and organization ID extracted
- Connection sets the `app.current_org` session variable
- All queries automatically filter by `org_id = current_setting('app.current_org')`
- Even raw SQL access cannot cross tenant boundaries
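In PostgreSQL terms, the setup looks roughly like this sketch — the table name, policy name, and column types are illustrative, not Spooled's actual schema:

```sql
-- Illustrative RLS setup; names and types are assumptions.
ALTER TABLE jobs ENABLE ROW LEVEL SECURITY;

CREATE POLICY org_isolation ON jobs
    USING (org_id = current_setting('app.current_org')::uuid);

-- Per request, after the API key resolves to an organization
-- (third argument true = setting is local to the current transaction):
SELECT set_config('app.current_org',
                  '00000000-0000-0000-0000-000000000000', true);
```

Because the policy is evaluated by the database itself, a bug in application code cannot widen a query past the current organization's rows.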
Security Layers
Retry Mechanism
Failed jobs automatically retry with configurable exponential backoff. The retry system ensures reliable delivery while preventing thundering herd problems.
```mermaid
%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#fef3c7', 'primaryTextColor': '#92400e', 'primaryBorderColor': '#f59e0b', 'lineColor': '#6b7280'}}}%%
sequenceDiagram
    participant W as Worker
    participant S as Spooled
    participant DLQ as Dead Letter Queue
    W->>S: Claim job
    S-->>W: Job data
    W->>W: Process (fails)
    W->>S: Fail job
    S->>S: Check retry count
    alt retries remaining
        S->>S: Schedule retry (backoff)
        Note over S: Wait 2^n seconds
        S-->>W: Job available again
    else max retries exceeded
        S->>DLQ: Move to DLQ
        Note over DLQ: Manual review
    end
```

Backoff Formula
Default backoff uses exponential delay with jitter:
```
delay = min(base_delay * 2^attempt + random_jitter, max_delay)
```

Where:
- `base_delay` = 1 second
- `max_delay` = 1 hour
- `random_jitter` = 0–500ms
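A direct implementation of this formula, as a sketch (Spooled's internals may differ, but the behavior matches the stated defaults):

```python
import random

BASE_DELAY = 1.0    # seconds
MAX_DELAY = 3600.0  # 1 hour
MAX_JITTER = 0.5    # 0-500 ms

def retry_delay(attempt: int) -> float:
    """delay = min(base_delay * 2^attempt + random_jitter, max_delay)."""
    jitter = random.uniform(0.0, MAX_JITTER)
    return min(BASE_DELAY * 2 ** attempt + jitter, MAX_DELAY)

# Attempts 0..5 back off roughly 1s, 2s, 4s, 8s, 16s, 32s (plus jitter);
# from attempt 12 onward the 1-hour cap applies (2^12 = 4096 > 3600).
```

The jitter term is what prevents the thundering-herd problem: jobs that failed together do not all retry at the same instant.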
Scalability
```mermaid
%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#fce7f3', 'primaryTextColor': '#9d174d', 'primaryBorderColor': '#ec4899', 'lineColor': '#6b7280'}}}%%
flowchart TB
    LB["Load Balancer"]
    subgraph backends["Backend Pods"]
        B1["Backend 1"]
        B2["Backend 2"]
        B3["Backend N"]
    end
    subgraph data["Data Layer"]
        PGB["PgBouncer<br/>(Connection Pool)"]
        PG[("PostgreSQL<br/>Primary")]
        RD[("Redis<br/>Cluster")]
    end
    LB --> B1
    LB --> B2
    LB --> B3
    B1 --> PGB
    B2 --> PGB
    B3 --> PGB
    B1 --> RD
    B2 --> RD
    B3 --> RD
    PGB --> PG
```

Connection Pooling
- PgBouncer in transaction mode
- Each backend: 25 connections to bouncer
- Bouncer: 100 connections to PostgreSQL
- Result: 10 backends × 25 = 250 client connections, multiplexed over the 100 real PostgreSQL connections
Caching Strategy
| Data | Cache | TTL |
|---|---|---|
| API Keys | Redis | 1 hour |
| Queue Config | Redis | 5 minutes |
| Job Status (hot) | Redis | 30 seconds |
| Rate Limit Counters | Redis | Per window |
Performance Characteristics
Benchmarks
| Operation | P50 | P95 | P99 |
|---|---|---|---|
| Health Check | 1ms | 5ms | 10ms |
| Job Enqueue | 5ms | 20ms | 50ms |
| Job Dequeue | 10ms | 30ms | 100ms |
| Job Complete | 5ms | 15ms | 30ms |
| List Jobs (100) | 20ms | 50ms | 100ms |
Capacity Planning
| Component | 1K req/s | 10K req/s | 100K req/s |
|---|---|---|---|
| Backend Pods | 2 | 10 | 50 |
| PostgreSQL | 1 primary | 1 primary + 2 read replicas | Cluster |
| Redis | 1 node | Sentinel | Cluster |
| PgBouncer | 1 | 2 | 4 |
Request Limits
| Resource | Limit |
|---|---|
| Request body | 5MB |
| Job payload | 1MB |
| gRPC payload | 1MB |
| Webhook payload | 5MB |
| List page size | 100 |
| Bulk enqueue | 100 jobs |
Deployment Options
| Option | Best For | Maintenance |
|---|---|---|
| Managed Cloud | Most teams | Zero maintenance |
| Docker Compose | Development, small deployments | Basic ops required |
| Kubernetes/Helm | Large scale, air-gapped | Full ops team |
Next Steps
- Deployment guide — Self-hosting instructions
- API reference — Complete endpoint documentation
- Real-time API — WebSocket, SSE, and gRPC streaming
- Open source — Contributing and licensing