Skip to content

Architecture

z4j is built as brain + many agents:

  • Brain - one per organization. FastAPI + TanStack Start v1 dashboard. PostgreSQL persistence. Serves the UI, the REST API, and the agent WebSocket endpoint.
  • Agent - one per application process. Thin pip package. Opens an outbound WebSocket to z4j. Captures events from the local queue engine; executes actions z4j sends back.

This separation buys:

  1. Persistence independence - history lives in Postgres, not in the broker. Broker crashes don’t erase history.
  2. Engine abstraction - z4j speaks one wire protocol; agents translate to/from the engine’s API.
  3. Network safety - agents initiate all connections. z4j never needs to reach into your application VPC.
your code: celery_app.send_task("email.send", ...)
Celery broker (Redis / RabbitMQ / SQS)
Celery worker picks up
z4j-celery patch captures task_sent / task_received / task_success / task_retry / task_failure
z4j-bare dispatcher: redacts secrets, buffers, flushes on batch/time window
WebSocket frame: {type: "event_batch", payload: {events: [...]}}
brain: validates, persists to `events` table, fans out to connected dashboards over their /ws/dashboard WebSocket
dashboard: user clicks "Retry"
REST: POST /api/v1/projects/{slug}/commands/retry-task
brain: authorize → mint signed command row → look up target agent → push WebSocket frame
WebSocket frame: {type: "command", payload: {verb: "retry_task", ...}}
agent: dispatch to engine adapter
adapter: if engine native retry exists, call it;
else brain-polyfilled: re-enqueue with the original payload + mark original as cancelled
response frame: {type: "command_result", payload: {ok: true, ...}}
brain: write audit log entry, return the command row to the dashboard
AxisExamplesWhat it adapts
Frameworkdjango / flask / fastapi / bareProcess boot, settings parsing, ASGI/WSGI teardown
Enginecelery / rq / dramatiq / huey / arq / taskiqTask enqueue, event capture, retry/cancel semantics
Schedulercelery-beat / rq-scheduler / apscheduler / huey-periodic / arq-cron / taskiq-schedulerPeriodic task CRUD

They compose freely. A Django + Celery + Beat app uses three adapters; a Flask + RQ + rq-scheduler app uses three different ones. Any combination is supported.

TablePurposeRetention
projectsTenantsUnlimited
agentsRegistered agentsUnlimited
tasksTask identity (one row per task_id)Per-project retention_days, default 30
eventsPer-state event stream (sent/started/…/finished)Z4J_EVENT_RETENTION_DAYS, default 30
schedulesScheduler entriesUnlimited
audit_logAdmin actions, auth events. HMAC-chainedZ4J_AUDIT_RETENTION_DAYS, default 90

See database schema for full field docs.

Why WebSocket (not polling / not gRPC / not message queue)

Section titled “Why WebSocket (not polling / not gRPC / not message queue)”
  • Agent-initiated outbound only - no inbound firewall holes. This is a security win.
  • Bidirectional with low overhead - one socket carries events and commands. An HTTP longpoll fallback (POST /api/v1/agent/events, GET /api/v1/agent/commands) is available for networks that block WebSockets.
  • gRPC - would add a heavy dep (grpcio) to every agent. WebSocket is in stdlib-adjacent space via websockets.
  • Message queue - adding Redis/RabbitMQ as a dependency for the control plane (when the thing we’re observing often is those brokers) creates a circular operational dep.
FailureBehavior
Agent → brain network partitionAgent buffers events (bounded queue, spills to disk at cap). Reconnects with exponential backoff. On reconnect, flushes buffer.
Brain crashPostgres has the data. Agents reconnect. No events are acknowledged until persisted - at-least-once delivery.
Postgres crashBrain returns 503 until DB recovers. Agents buffer as above.
Worker / queue backlogDoesn’t affect z4j - we observe it, we don’t participate in it.

See reconciliation for how we detect “task said it started, never said it finished.”