# Architecture
## Two processes, one control plane

z4j is built as brain + many agents:
- Brain - one per organization. FastAPI + TanStack Start v1 dashboard. PostgreSQL persistence. Serves the UI, the REST API, and the agent WebSocket endpoint.
- Agent - one per application process. Thin pip package. Opens an outbound WebSocket to the brain. Captures events from the local queue engine; executes actions the brain sends back.
This separation buys:
- Persistence independence - history lives in Postgres, not in the broker. Broker crashes don’t erase history.
- Engine abstraction - the brain speaks one wire protocol; agents translate to/from the engine’s API.
- Network safety - agents initiate all connections. The brain never needs to reach into your application VPC.
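The agent ships captured events in batches rather than one frame per event. A minimal sketch of that buffering policy - flush when the batch is full or the oldest buffered event exceeds a time window. Class and parameter names here are hypothetical, not the real z4j agent API:

```python
import time

class BatchingDispatcher:
    """Buffers captured events; flushes when the batch is full or the
    oldest buffered event exceeds the time window.
    (Hypothetical sketch -- not the real z4j dispatcher API.)"""

    def __init__(self, send, max_batch=100, max_wait_s=2.0, clock=time.monotonic):
        self.send = send            # callable that ships one event_batch frame
        self.max_batch = max_batch
        self.max_wait_s = max_wait_s
        self.clock = clock
        self.buffer = []
        self.first_at = None        # timestamp of oldest buffered event

    def capture(self, event):
        # Secret redaction would happen here, before buffering (omitted).
        if not self.buffer:
            self.first_at = self.clock()
        self.buffer.append(event)
        self._maybe_flush()

    def _maybe_flush(self):
        full = len(self.buffer) >= self.max_batch
        stale = self.first_at is not None and self.clock() - self.first_at >= self.max_wait_s
        if full or stale:
            self.flush()

    def flush(self):
        if self.buffer:
            self.send({"type": "event_batch", "events": self.buffer})
            self.buffer = []
            self.first_at = None
```

In a real agent a timer would also call `flush()` so a quiet queue still drains the buffer; this sketch only re-checks the window when a new event arrives.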
## Request lifecycle: event capture

```
your code: celery_app.send_task("email.send", ...)
  │
  ▼
Celery broker (Redis / RabbitMQ / SQS)
  │
  ▼
Celery worker picks up
  │
  ▼
z4j-celery patch captures task_sent / task_received / task_success / task_retry / task_failure
  │
  ▼
z4j-bare dispatcher: redacts secrets, buffers, flushes on batch/time window
  │
  ▼
WebSocket frame: {type: "event_batch", events: [...]}
  │
  ▼
brain: validates, persists to `events` table, fans out to connected dashboards via SSE
```

## Request lifecycle: unified action
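On the agent side, a `command` frame is routed to the engine adapter and answered with a `command_result` frame. A minimal sketch of that dispatch step - only the `type`, `cmd`, and `ok` fields come from the frames in this section; the `args` field and the routing-by-attribute scheme are assumptions:

```python
def handle_frame(frame, adapter):
    """Dispatch one command frame from the brain to an engine adapter.
    Hypothetical sketch: the real agent's routing and error shape may differ."""
    if frame.get("type") != "command":
        return None  # event acks, pings, etc. are handled elsewhere
    handler = getattr(adapter, frame["cmd"], None)
    if handler is None:
        return {"type": "command_result", "ok": False,
                "error": f"unsupported command: {frame['cmd']}"}
    try:
        handler(**frame.get("args", {}))
        return {"type": "command_result", "ok": True}
    except Exception as exc:  # report failures back instead of killing the socket
        return {"type": "command_result", "ok": False, "error": str(exc)}
```

Reporting failures as `ok: false` frames (rather than raising) keeps the socket alive, which matters because that one socket also carries the event stream.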
```
dashboard: user clicks "Retry"
  │
  ▼
REST: POST /api/v1/tasks/{id}/retry
  │
  ▼
brain: authorize → insert audit log → look up target agent → send WebSocket frame
  │
  ▼
WebSocket frame: {type: "command", cmd: "retry_task", ...}
  │
  ▼
agent: dispatch to engine adapter
  │
  ▼
adapter: if engine-native retry exists, call it; else brain-polyfilled: re-enqueue with the original payload + mark original as cancelled
  │
  ▼
response frame: {type: "command_result", ok: true}
  │
  ▼
brain: close audit log, return 200 to dashboard
```

## Three adapter axes
Section titled “Three adapter axes”| Axis | Examples | What it adapts |
|---|---|---|
| Framework | django / flask / fastapi / bare | Process boot, settings parsing, ASGI/WSGI teardown |
| Engine | celery / rq / dramatiq / huey / arq / taskiq | Task enqueue, event capture, retry/cancel semantics |
| Scheduler | celery-beat / rq-scheduler / apscheduler / huey-periodic / arq-cron / taskiq-scheduler | Periodic task CRUD |
They compose freely. A Django + Celery + Beat app uses three adapters; a Flask + RQ + rq-scheduler app uses three different ones. Any combination is supported.
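The retry/cancel row above hides the fallback spelled out in the unified-action flow: if the engine has no native retry, the adapter re-enqueues the original payload and cancels the original task. A sketch of how an engine adapter might encode that fallback - a hypothetical interface, not the real z4j adapter base class:

```python
from abc import ABC, abstractmethod

class EngineAdapter(ABC):
    """Hypothetical engine-adapter interface; real z4j adapters are not shown."""

    #: set True by adapters whose engine can natively re-run a task
    supports_native_retry = False

    @abstractmethod
    def enqueue(self, task_name, payload):
        """Re-enqueue a task; returns the new task id."""

    @abstractmethod
    def cancel(self, task_id):
        """Mark a task as cancelled."""

    def native_retry(self, task_id):
        raise NotImplementedError

    def retry_task(self, task_id, task_name, payload):
        if self.supports_native_retry:
            return self.native_retry(task_id)
        # Brain-polyfilled retry: re-enqueue the original payload,
        # then mark the original task as cancelled.
        new_id = self.enqueue(task_name, payload)
        self.cancel(task_id)
        return new_id
```

Putting the polyfill in the shared base class means every engine gets a working "Retry" button for free; engines with a real retry primitive just flip `supports_native_retry` and implement `native_retry`.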
## Persistence model

| Table | Purpose | Retention |
|---|---|---|
| `projects` | Tenants | Unlimited |
| `agents` | Registered agents | Unlimited |
| `tasks` | Task identity (one row per `task_id`) | 90 days default, configurable |
| `events` | Per-state event stream (sent/started/…/finished) | Same as `tasks` |
| `schedules` | Scheduler entries | Unlimited |
| `audit_log` | Admin actions, auth events; HMAC-chained | Unlimited |
See database schema for full field docs.
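HMAC chaining makes the audit log tamper-evident: each row's MAC covers the previous row's MAC, so editing or deleting any row invalidates every MAC after it. A minimal illustration of the idea - the field layout and key handling here are illustrative, not z4j's actual scheme:

```python
import hashlib
import hmac
import json

GENESIS = "0" * 64  # MAC "before" the first row (illustrative convention)

def chain_mac(key: bytes, prev_mac: str, entry: dict) -> str:
    """MAC over the previous row's MAC plus this entry's canonical JSON."""
    msg = prev_mac.encode() + json.dumps(entry, sort_keys=True).encode()
    return hmac.new(key, msg, hashlib.sha256).hexdigest()

def verify_chain(key: bytes, rows: list) -> bool:
    """Recompute every MAC in order; any edit or deletion breaks the chain."""
    prev = GENESIS
    for row in rows:
        body = {k: v for k, v in row.items() if k != "mac"}
        if chain_mac(key, prev, body) != row["mac"]:
            return False
        prev = row["mac"]
    return True
```

With this scheme an attacker who can rewrite one row still can't recompute the rest of the chain without the key.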
## Why WebSocket (not polling / not gRPC / not message queue)

- Agent-initiated, outbound only - no inbound firewall holes. This is a security win.
- Bidirectional with low overhead - one socket carries both events and commands. HTTP long-poll was the fallback path, not the primary.
- gRPC - would add a heavy dependency (`grpcio`) to every agent. WebSocket lives in stdlib-adjacent space via the `websockets` package.
- Message queue - adding Redis/RabbitMQ as a dependency for the control plane (when the thing we're observing often *is* those brokers) creates a circular operational dependency.
## Failure modes

| Failure | Behavior |
|---|---|
| Agent → brain network partition | Agent buffers events (bounded queue, spills to disk at cap). Reconnects with exponential backoff. On reconnect, flushes buffer. |
| Brain crash | Postgres has the data. Agents reconnect. No events are acknowledged until persisted - at-least-once delivery. |
| Postgres crash | Brain returns 503 until DB recovers. Agents buffer as above. |
| Worker / queue backlog | Doesn’t affect z4j - we observe it, we don’t participate in it. |
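The buffering and backoff behavior in the first row can be sketched as follows; the delays, cap, and spill hook here are illustrative values, not z4j's actual defaults:

```python
def backoff_delays(base=1.0, cap=60.0, attempts=6):
    """Exponential reconnect schedule: base * 2^n, clamped to cap.
    (Illustrative values, not z4j's actual defaults.)"""
    return [min(cap, base * (2 ** n)) for n in range(attempts)]

class BoundedBuffer:
    """Keeps at most `cap` events in memory while disconnected; overflow is
    handed to `spill` (the agent spills to disk at cap). On reconnect,
    drain() returns the buffered events for flushing."""

    def __init__(self, cap, spill):
        self.cap, self.spill, self.items = cap, spill, []

    def push(self, event):
        if len(self.items) >= self.cap:
            self.spill(event)     # disk spill stands in for data loss
        else:
            self.items.append(event)

    def drain(self):
        items, self.items = self.items, []
        return items
```

A production loop would also add jitter to the delays so a fleet of agents doesn't reconnect in lockstep after a brain restart.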
See reconciliation for how we detect “task said it started, never said it finished.”