
Architecture

z4j is built as one brain plus many agents:

  • Brain - one per organization. FastAPI + TanStack Start v1 dashboard. PostgreSQL persistence. Serves the UI, the REST API, and the agent WebSocket endpoint.
  • Agent - one per application process. Thin pip package. Opens an outbound WebSocket to the brain. Captures events from the local queue engine; executes actions the brain sends back.

This separation buys:

  1. Persistence independence - history lives in Postgres, not in the broker. Broker crashes don’t erase history.
  2. Engine abstraction - the brain speaks one wire protocol; agents translate to/from the engine’s API.
  3. Network safety - agents initiate all connections. The brain never needs to reach into your application VPC.
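Because the brain speaks a single wire protocol, each agent reduces to translating its engine's events into one frame envelope. A minimal sketch of such an envelope (field names are illustrative, not z4j's actual schema):

```python
import json
from dataclasses import dataclass, field
from typing import Any


# Hypothetical single wire envelope exchanged between brain and agents.
# The brain never sees Celery/RQ/... specifics, only frames like this.
@dataclass
class Frame:
    type: str                                   # "event_batch", "command", "command_result", ...
    payload: dict[str, Any] = field(default_factory=dict)

    def to_json(self) -> str:
        return json.dumps({"type": self.type, **self.payload})

    @classmethod
    def from_json(cls, raw: str) -> "Frame":
        data = json.loads(raw)
        return cls(type=data.pop("type"), payload=data)


frame = Frame("event_batch", {"events": [{"task_id": "abc", "state": "task_sent"}]})
wire = frame.to_json()
assert Frame.from_json(wire).payload["events"][0]["state"] == "task_sent"
```

An engine adapter only has to produce and consume this envelope; everything engine-specific stays on the agent side of the socket.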
Event capture flow:

  your code: celery_app.send_task("email.send", ...)
    → Celery broker (Redis / RabbitMQ / SQS)
    → a Celery worker picks it up
    → z4j-celery patch captures task_sent / task_received / task_success / task_retry / task_failure
    → z4j-bare dispatcher: redacts secrets, buffers, flushes on batch size or time window
    → WebSocket frame: {type: "event_batch", events: [...]}
    → brain: validates, persists to the `events` table, fans out to connected dashboards via SSE
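The dispatcher step - redact secrets, buffer, flush on batch size or time window - could be sketched like this (the key list, defaults, and class name are assumptions, not z4j's real implementation):

```python
import time

# Illustrative list of secret-looking keys; z4j's real redaction rules may differ.
REDACT_KEYS = {"password", "token", "secret", "authorization"}


def redact(event: dict) -> dict:
    """Replace secret values before anything leaves the process."""
    return {k: ("[REDACTED]" if k.lower() in REDACT_KEYS else v) for k, v in event.items()}


class Dispatcher:
    """Sketch of the batching rule: flush when the buffer reaches
    max_batch events OR max_wait seconds after the first buffered event."""

    def __init__(self, send, max_batch=100, max_wait=2.0, clock=time.monotonic):
        self.send, self.max_batch, self.max_wait = send, max_batch, max_wait
        self.clock = clock
        self.buffer: list[dict] = []
        self.first_at: float | None = None

    def capture(self, event: dict) -> None:
        if not self.buffer:
            self.first_at = self.clock()        # start the time window
        self.buffer.append(redact(event))
        if len(self.buffer) >= self.max_batch or self.clock() - self.first_at >= self.max_wait:
            self.flush()

    def flush(self) -> None:
        if self.buffer:
            self.send({"type": "event_batch", "events": self.buffer})
            self.buffer, self.first_at = [], None
```

Flushing on whichever limit trips first keeps latency bounded under light traffic and frame count bounded under heavy traffic.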
Retry command flow:

  dashboard: user clicks "Retry"
    → REST: POST /api/v1/tasks/{id}/retry
    → brain: authorize → insert audit log → look up target agent → send WebSocket frame
    → WebSocket frame: {type: "command", cmd: "retry_task", ...}
    → agent: dispatches to the engine adapter
    → adapter: if the engine has a native retry, call it; otherwise polyfill: re-enqueue the original payload and mark the original as cancelled
    → response frame: {type: "command_result", ok: true}
    → brain: closes the audit log entry, returns 200 to the dashboard
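The adapter's retry dispatch - native retry where the engine supports it, the polyfill otherwise - might look like the following (the interface names are hypothetical, not z4j's real classes):

```python
# Hypothetical adapter shape for the retry command path.
class EngineAdapter:
    def native_retry(self, task_id: str) -> bool:
        """Return True if the engine retried the task itself."""
        return False                       # default: no native retry support

    def enqueue(self, task_id: str, payload: dict) -> None: ...
    def mark_cancelled(self, task_id: str) -> None: ...


def handle_command(adapter: EngineAdapter, frame: dict) -> dict:
    if frame["cmd"] != "retry_task":
        return {"type": "command_result", "ok": False, "error": "unknown command"}
    if adapter.native_retry(frame["task_id"]):
        return {"type": "command_result", "ok": True}
    # Polyfill path: replay the original payload as a new task,
    # then mark the original cancelled so it is not counted twice.
    adapter.enqueue(frame["task_id"], frame["payload"])
    adapter.mark_cancelled(frame["task_id"])
    return {"type": "command_result", "ok": True}
```

Either path answers with the same `command_result` frame, so the brain never needs to know which engine it is talking to.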
Adapters are organized along three axes:

  Axis       Examples                                                                                  What it adapts
  Framework  django / flask / fastapi / bare                                                           Process boot, settings parsing, ASGI/WSGI teardown
  Engine     celery / rq / dramatiq / huey / arq / taskiq                                              Task enqueue, event capture, retry/cancel semantics
  Scheduler  celery-beat / rq-scheduler / apscheduler / huey-periodic / arq-cron / taskiq-scheduler    Periodic task CRUD

They compose freely. A Django + Celery + Beat app uses three adapters; a Flask + RQ + rq-scheduler app uses three different ones. Any combination is supported.

  Table      Purpose                                           Retention
  projects   Tenants                                           Unlimited
  agents     Registered agents                                 Unlimited
  tasks      Task identity (one row per task_id)               90 days default, configurable
  events     Per-state event stream (sent/started/…/finished)  Same as tasks
  schedules  Scheduler entries                                 Unlimited
  audit_log  Admin actions, auth events; HMAC-chained          Unlimited
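"HMAC-chained" means each audit row's MAC covers the row body plus the previous row's MAC, so tampering with or deleting any row invalidates every MAC after it. A sketch of one plausible construction (not necessarily z4j's exact scheme):

```python
import hashlib
import hmac
import json


def chain_mac(key: bytes, prev_mac: str, row: dict) -> str:
    """MAC over the canonicalized row body plus the previous row's MAC."""
    body = json.dumps(row, sort_keys=True).encode()
    return hmac.new(key, prev_mac.encode() + body, hashlib.sha256).hexdigest()


def verify_chain(key: bytes, rows: list[dict]) -> bool:
    """Walk the log from the start; any edited or reordered row breaks the chain."""
    prev = "genesis"                         # fixed sentinel for the first row
    for row in rows:
        body = {k: v for k, v in row.items() if k != "mac"}
        if not hmac.compare_digest(chain_mac(key, prev, body), row["mac"]):
            return False
        prev = row["mac"]
    return True
```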

See database schema for full field docs.

Why WebSocket (not polling / not gRPC / not message queue)

  • Agent-initiated outbound only - no inbound firewall holes. This is a security win.
  • Bidirectional with low overhead - one socket carries events and commands. HTTP long-poll was the fallback path, not the primary.
  • gRPC - would add a heavy dependency (grpcio) to every agent, while WebSocket needs only the lightweight websockets package.
  • Message queue - adding Redis/RabbitMQ as a dependency for the control plane (when the thing we’re observing often is those brokers) creates a circular operational dep.
  Failure                          Behavior
  Agent → brain network partition  Agent buffers events (bounded queue, spilling to disk at the cap), reconnects with exponential backoff, and flushes the buffer on reconnect.
  Brain crash                      Postgres still has the data; agents reconnect. Events are not acknowledged until persisted - at-least-once delivery.
  Postgres crash                   Brain returns 503 until the database recovers; agents buffer as above.
  Worker / queue backlog           Doesn’t affect z4j - we observe the queue, we don’t participate in it.
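The partition-handling row combines two small mechanisms: a bounded buffer that spills at its cap and an exponential reconnect backoff. A self-contained sketch (the cap, base delay, and list-backed "disk" are illustrative stand-ins):

```python
import itertools


def backoff_delays(base=1.0, cap=60.0):
    """Exponential reconnect schedule, capped (numbers are illustrative)."""
    for attempt in itertools.count():
        yield min(cap, base * (2 ** attempt))


class EventBuffer:
    """Bounded in-memory buffer that spills to a secondary store at cap.
    A plain list stands in for the disk spill to keep the sketch runnable."""

    def __init__(self, cap=3):
        self.cap = cap
        self.memory: list[dict] = []
        self.disk: list[dict] = []

    def push(self, event: dict) -> None:
        if len(self.memory) >= self.cap:
            self.disk.extend(self.memory)    # spill the full window, oldest first
            self.memory = []
        self.memory.append(event)

    def drain(self) -> list[dict]:
        """On reconnect: flush spilled events first, then the in-memory tail."""
        events, self.memory, self.disk = self.disk + self.memory, [], []
        return events
```

Draining disk before memory preserves event order across a partition, which matters for the per-state event stream the brain persists.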

See reconciliation for how we detect “task said it started, never said it finished.”