
Architecture

Spooled is built for reliability, performance, and multi-tenant security. This guide explains the system architecture and design decisions.

System Overview

Spooled is a distributed job queue system built with Rust for maximum performance and safety. The architecture consists of several key layers:

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#ecfdf5', 'primaryTextColor': '#065f46', 'primaryBorderColor': '#10b981', 'lineColor': '#6b7280'}}}%%
flowchart TB
    subgraph clients["Client Layer"]
        SDK["SDKs<br/>Node / Python / Go / PHP"]
        HTTP["REST API<br/>OpenAPI 3.0"]
        GRPC["gRPC API"]
    end

    subgraph core["Core Services"]
        API["API Gateway<br/>Axum + Tower"]
        AUTH["Auth Service"]
        QUEUE["Queue Engine"]
        SCHED["Scheduler"]
        STREAM["Event Stream"]
    end

    subgraph storage["Data Layer"]
        PG[("PostgreSQL<br/>Jobs + RLS")]
        RD[("Redis<br/>Pub/Sub + Cache")]
    end

    subgraph workers["Worker Layer"]
        W1["Worker Pool 1"]
        W2["Worker Pool 2"]
        W3["Worker Pool N"]
    end

    subgraph observability["Observability"]
        PROM["Prometheus"]
        GRAF["Grafana"]
        LOGS["Structured Logs"]
    end

    SDK --> API
    HTTP --> API
    GRPC --> API
    API --> AUTH
    AUTH --> QUEUE
    QUEUE --> PG
    QUEUE --> RD
    SCHED --> QUEUE
    STREAM --> RD
    
    W1 --> API
    W2 --> API
    W3 --> API
    
    API --> PROM
    PROM --> GRAF

Core Components

API Gateway (Axum + Tower)

The API gateway handles all incoming requests. Built with Axum, it provides:

  • REST API — OpenAPI 3.0 compliant, JSON over HTTPS
  • gRPC API — HTTP/2 + Protobuf with streaming for high-throughput workers
  • WebSocket/SSE — Real-time job status streaming
  • Rate limiting — Per-organization request throttling via Redis
  • Request validation — Schema validation for all inputs
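
As a usage sketch, enqueuing a job over the REST API from Rust might look like the following. The endpoint path, auth scheme, and payload fields here are illustrative assumptions, not the documented API surface:

```rust
// Hypothetical enqueue call against the REST API using reqwest (with the
// "json" feature enabled). Endpoint, header, and body fields are assumptions.
use serde_json::json;

#[tokio::main]
async fn main() -> Result<(), reqwest::Error> {
    let client = reqwest::Client::new();
    let resp = client
        .post("https://api.spooled.example/v1/jobs") // assumed endpoint
        .bearer_auth("YOUR_API_KEY")                 // assumed auth scheme
        .json(&json!({
            "queue": "emails",
            "payload": { "to": "user@example.com" },
            "priority": 5
        }))
        .send()
        .await?;
    println!("enqueue status: {}", resp.status());
    Ok(())
}
```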

Queue Engine

The core job processing engine manages job lifecycle, retries, and scheduling:

  • FOR UPDATE SKIP LOCKED — Safe concurrent job claiming without conflicts
  • Configurable exponential backoff with jitter
  • Priority queue support (higher priority = processed first)
  • Scheduled job support with second-level precision
  • Lease-based job ownership with automatic recovery
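
A minimal sketch of the FOR UPDATE SKIP LOCKED claim pattern with sqlx; the jobs table and its columns are assumed for illustration, not Spooled's actual schema:

```rust
// Illustrative claim query: concurrent workers can run this without ever
// picking up the same row, because rows locked by other workers are skipped.
use sqlx::PgPool;

async fn claim_job(pool: &PgPool, worker_id: &str) -> sqlx::Result<Option<i64>> {
    let claimed: Option<(i64,)> = sqlx::query_as(
        r#"
        UPDATE jobs
        SET status = 'processing', locked_by = $1, locked_at = now()
        WHERE id = (
            SELECT id FROM jobs
            WHERE status = 'pending' AND run_at <= now()
            ORDER BY priority DESC, run_at
            FOR UPDATE SKIP LOCKED
            LIMIT 1
        )
        RETURNING id
        "#,
    )
    .bind(worker_id)
    .fetch_optional(pool)
    .await?;
    Ok(claimed.map(|(id,)| id))
}
```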

Background Scheduler

The scheduler runs background tasks for system maintenance:

| Task | Interval | Purpose |
| --- | --- | --- |
| Activate scheduled jobs | 5s | Move scheduled jobs to pending |
| Process cron schedules | 10s | Create jobs from cron schedules |
| Recover expired leases | 30s | Handle worker failures |
| Update job dependencies | 10s | Unblock child jobs |
| Cleanup stale workers | 5m | Mark offline workers |
| Data retention | 5m | Delete old data |
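
Internally this maps naturally onto periodic async tasks. A minimal sketch, assuming a Tokio runtime and placeholder task functions (the real scheduler is more involved):

```rust
// Illustrative interval-driven maintenance loop (placeholder functions,
// not Spooled's actual scheduler code).
use std::time::Duration;
use tokio::time;

async fn run_scheduler() {
    let mut activate = time::interval(Duration::from_secs(5)); // scheduled -> pending
    let mut recover = time::interval(Duration::from_secs(30)); // expired leases
    loop {
        tokio::select! {
            _ = activate.tick() => activate_scheduled_jobs().await,
            _ = recover.tick() => recover_expired_leases().await,
        }
    }
}

async fn activate_scheduled_jobs() { /* move scheduled jobs to pending */ }
async fn recover_expired_leases() { /* requeue jobs whose worker lease expired */ }
```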

PostgreSQL (Data Plane)

PostgreSQL stores all job data with Row-Level Security (RLS) for multi-tenant isolation:

  • Durability — ACID transactions ensure no job loss
  • RLS — Organizations can only see their own data
  • Indexing — Optimized for queue operations and time-based queries
  • Partitioning — Time-based partitioning for large deployments

Redis (Real-time Layer)

Redis handles real-time features and caching:

  • Pub/Sub — Real-time job status notifications
  • Rate limiting — Token bucket counters per organization
  • Caching — API key validation and org metadata
  • Cluster support — Horizontal scaling for high availability
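
For instance, the queue engine could publish a status change to Pub/Sub subscribers roughly like this sketch (channel naming and message shape are assumptions):

```rust
// Hedged sketch: publish a job status change over Redis Pub/Sub.
use redis::Commands;

fn notify_status(
    con: &mut redis::Connection,
    job_id: &str,
    status: &str,
) -> redis::RedisResult<()> {
    let channel = format!("jobs:{job_id}:status"); // assumed channel layout
    let message = format!(r#"{{"job_id":"{job_id}","status":"{status}"}}"#);
    let _subscribers: u64 = con.publish(channel, message)?;
    Ok(())
}
```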

gRPC Service

High-performance worker communication using HTTP/2 + Protobuf via tonic:

  • Port: 50051 (HTTP/2)
  • Health Service: grpc.health.v1.Health for load balancer probes
  • Reflection: Enabled for debugging with grpcurl/Postman
  • Streaming: StreamJobs for push-based delivery, ProcessJobs for bidirectional

Job Lifecycle

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#f3e8ff', 'primaryTextColor': '#6b21a8', 'primaryBorderColor': '#a855f7', 'lineColor': '#6b7280'}}}%%
stateDiagram-v2
    [*] --> Pending: Enqueue
    Pending --> Processing: Worker claims
    Processing --> Completed: Success
    Processing --> Failed: Error
    Failed --> Pending: Retry
    Failed --> DeadLetter: Max retries
    Completed --> [*]
    DeadLetter --> Pending: Manual retry

Job States

| State | Description |
| --- | --- |
| pending | Ready to be claimed by a worker |
| processing | Claimed by a worker, in progress |
| completed | Successfully processed |
| failed | Failed, will retry if retries remaining |
| dead_letter | Failed after exhausting all retries |
| cancelled | Manually cancelled |
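
Modeled in Rust, the same states might look like this (illustrative only, not Spooled's internal type):

```rust
// Job states from the table above, as a Rust enum for illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum JobState {
    Pending,    // ready to be claimed by a worker
    Processing, // claimed by a worker, in progress
    Completed,  // successfully processed
    Failed,     // will retry if retries remain
    DeadLetter, // exhausted all retries
    Cancelled,  // manually cancelled
}
```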

Multi-Tenant Security

Every API request is scoped to a single organization using PostgreSQL Row-Level Security. This provides defense-in-depth isolation at the database level.

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#eff6ff', 'primaryTextColor': '#1e40af', 'primaryBorderColor': '#3b82f6', 'lineColor': '#6b7280'}}}%%
flowchart LR
    subgraph request["Incoming Request"]
        TOKEN["API Key"]
    end

    subgraph auth["Authentication"]
        VALIDATE["Validate Key"]
        EXTRACT["Extract org_id"]
    end

    subgraph pg["PostgreSQL RLS"]
        SET["SET app.current_org"]
        POLICY["RLS Policy Check"]
        DATA["Org's Data Only"]
    end

    TOKEN --> VALIDATE
    VALIDATE --> EXTRACT
    EXTRACT --> SET
    SET --> POLICY
    POLICY --> DATA

How RLS Works

  1. API key is validated and organization ID extracted
  2. Connection sets app.current_org session variable
  3. All queries automatically filter by org_id = current_setting('app.current_org')
  4. Even raw SQL access cannot cross tenant boundaries
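
A sketch of steps 2 and 3 in application code, assuming sqlx and a simplified jobs schema (the session variable name follows the docs; everything else is illustrative):

```rust
// Hedged sketch: pin the connection to the caller's organization, then query
// without any explicit org filter. The RLS policy compares each row's org_id
// against current_setting('app.current_org').
use sqlx::PgPool;

async fn list_jobs_for_org(pool: &PgPool, org_id: &str) -> sqlx::Result<Vec<(i64, String)>> {
    let mut conn = pool.acquire().await?;

    // Step 2: set the session variable used by the RLS policies.
    sqlx::query("SELECT set_config('app.current_org', $1, false)")
        .bind(org_id)
        .execute(&mut *conn)
        .await?;

    // Step 3: rows outside this organization are invisible to the query.
    sqlx::query_as("SELECT id, status FROM jobs ORDER BY created_at DESC LIMIT 100")
        .fetch_all(&mut *conn)
        .await
}
```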

Security Layers

  1. Network Security: TLS, firewall, network policies
  2. Rate Limiting: per-IP, per-API-key, per-queue
  3. Authentication: API keys, JWT, bcrypt hashing
  4. Authorization: Row-Level Security, queue ACLs
  5. Input Validation: schema validation, sanitization
  6. Security Headers: CSP, HSTS, X-Frame-Options
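
Layers like rate limiting and security headers live in the API gateway's Tower middleware stack. A minimal sketch of attaching security headers to an Axum router with tower-http (the header values and exact stack are assumptions, not the production configuration):

```rust
// Hedged sketch: security-header middleware on an Axum router via tower-http.
use axum::http::{header, HeaderValue};
use axum::{routing::get, Router};
use tower_http::set_header::SetResponseHeaderLayer;

fn app() -> Router {
    Router::new()
        .route("/health", get(|| async { "ok" }))
        .layer(SetResponseHeaderLayer::if_not_present(
            header::STRICT_TRANSPORT_SECURITY,
            HeaderValue::from_static("max-age=31536000; includeSubDomains"),
        ))
        .layer(SetResponseHeaderLayer::if_not_present(
            header::X_FRAME_OPTIONS,
            HeaderValue::from_static("DENY"),
        ))
}
```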

Retry Mechanism

Failed jobs automatically retry with configurable exponential backoff. The retry system ensures reliable delivery while preventing thundering herd problems.

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#fef3c7', 'primaryTextColor': '#92400e', 'primaryBorderColor': '#f59e0b', 'lineColor': '#6b7280'}}}%%
sequenceDiagram
    participant W as Worker
    participant S as Spooled
    participant DLQ as Dead Letter Queue

    W->>S: Claim job
    S-->>W: Job data
    W->>W: Process (fails)
    W->>S: Fail job
    S->>S: Check retry count
    alt retries remaining
        S->>S: Schedule retry (backoff)
        Note over S: Wait 2^n seconds
        S-->>W: Job available again
    else max retries exceeded
        S->>DLQ: Move to DLQ
        Note over DLQ: Manual review
    end

Backoff Formula

Default backoff uses exponential delay with jitter:

delay = min(base_delay * 2^attempt + random_jitter, max_delay)

Where:

  • base_delay = 1 second
  • max_delay = 1 hour
  • random_jitter = 0-500ms
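
A direct translation of that formula into Rust, using the default parameters above (Spooled's internal implementation may differ in detail):

```rust
// Exponential backoff with jitter, mirroring the formula above.
use rand::Rng;
use std::time::Duration;

fn retry_delay(attempt: u32) -> Duration {
    let base = Duration::from_secs(1);      // base_delay
    let max = Duration::from_secs(60 * 60); // max_delay = 1 hour
    let jitter = Duration::from_millis(rand::thread_rng().gen_range(0..=500));
    let exponential = base.saturating_mul(2u32.saturating_pow(attempt));
    (exponential + jitter).min(max)
}
```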

Scalability

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#fce7f3', 'primaryTextColor': '#9d174d', 'primaryBorderColor': '#ec4899', 'lineColor': '#6b7280'}}}%%
flowchart TB
    LB["Load Balancer"]
    
    subgraph backends["Backend Pods"]
        B1["Backend 1"]
        B2["Backend 2"]
        B3["Backend N"]
    end
    
    subgraph data["Data Layer"]
        PGB["PgBouncer<br/>(Connection Pool)"]
        PG[("PostgreSQL<br/>Primary")]
        RD[("Redis<br/>Cluster")]
    end
    
    LB --> B1
    LB --> B2
    LB --> B3
    
    B1 --> PGB
    B2 --> PGB
    B3 --> PGB
    B1 --> RD
    B2 --> RD
    B3 --> RD
    
    PGB --> PG

Connection Pooling

  • PgBouncer in transaction mode
  • Each backend: 25 connections to bouncer
  • Bouncer: 100 connections to PostgreSQL
  • Result: 10 backends × 25 = 250 client connections, multiplexed over 100 server connections to PostgreSQL
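
On the application side, each backend might cap its own pool like this sketch with sqlx (the connection string is a placeholder; note it points at PgBouncer rather than PostgreSQL directly):

```rust
// Hedged sketch: per-backend connection pool capped at 25, routed through
// PgBouncer.
use sqlx::postgres::PgPoolOptions;
use sqlx::PgPool;

async fn connect_pool() -> sqlx::Result<PgPool> {
    PgPoolOptions::new()
        .max_connections(25) // per-backend cap from the list above
        .connect("postgres://spooled@pgbouncer:6432/spooled") // placeholder DSN
        .await
}
```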

Caching Strategy

| Data | Cache | TTL |
| --- | --- | --- |
| API Keys | Redis | 1 hour |
| Queue Config | Redis | 5 minutes |
| Job Status (hot) | Redis | 30 seconds |
| Rate Limit Counters | Redis | Per window |
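
For example, a validated API key lookup could be cached with a one-hour TTL roughly like this (key layout and stored value are assumptions):

```rust
// Hedged sketch: caching a validated API key -> org_id mapping for 1 hour.
use redis::Commands;

fn cache_api_key(
    con: &mut redis::Connection,
    key_hash: &str,
    org_id: &str,
) -> redis::RedisResult<()> {
    let cache_key = format!("apikey:{key_hash}"); // assumed key layout
    con.set_ex(cache_key, org_id, 3600) // TTL from the table above
}
```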

Performance Characteristics

  • 10,000+ jobs per second per node
  • <50ms P99 enqueue latency
  • 99.99% uptime SLA (Pro plan)
  • Built in Rust: memory-safe, zero-cost abstractions

Benchmarks

| Operation | P50 | P95 | P99 |
| --- | --- | --- | --- |
| Health Check | 1ms | 5ms | 10ms |
| Job Enqueue | 5ms | 20ms | 50ms |
| Job Dequeue | 10ms | 30ms | 100ms |
| Job Complete | 5ms | 15ms | 30ms |
| List Jobs (100) | 20ms | 50ms | 100ms |

Capacity Planning

| Component | 1K req/s | 10K req/s | 100K req/s |
| --- | --- | --- | --- |
| Backend Pods | 2 | 10 | 50 |
| PostgreSQL | 1 primary | 1 primary + 2 read replicas | Cluster |
| Redis | 1 node | Sentinel | Cluster |
| PgBouncer | 1 | 2 | 4 |

Request Limits

| Resource | Limit |
| --- | --- |
| Request body | 5 MB |
| Job payload | 1 MB |
| gRPC payload | 1 MB |
| Webhook payload | 5 MB |
| List page size | 100 |
| Bulk enqueue | 100 jobs |

Deployment Options

| Option | Best For | Maintenance |
| --- | --- | --- |
| Managed Cloud | Most teams | Zero maintenance |
| Docker Compose | Development, small deployments | Basic ops required |
| Kubernetes/Helm | Large scale, air-gapped | Full ops team |

Next Steps