Architecture
Two processes, one control plane
z4j is built as one brain plus many agents:
- Brain - one per organization. FastAPI + TanStack Start v1 dashboard. PostgreSQL persistence. Serves the UI, the REST API, and the agent WebSocket endpoint.
- Agent - one per application process. Thin pip package. Opens an outbound WebSocket to the brain. Captures events from the local queue engine and executes the actions the brain sends back.
This separation buys:
- Persistence independence - history lives in Postgres, not in the broker. Broker crashes don’t erase history.
- Engine abstraction - z4j speaks one wire protocol; agents translate to/from the engine’s API.
- Network safety - agents initiate all connections. z4j never needs to reach into your application VPC.
Request lifecycle: event capture
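As a rough illustration of the dispatcher step in this lifecycle (redact secrets, buffer, flush on a batch or time window), the sketch below batches events into an `event_batch` frame. Only the frame shape comes from this page; the `EventBatcher` class, the `SECRET_KEYS` set, and the thresholds are hypothetical and may not match what the agent actually does.

```python
import json
import time
from typing import Any, Callable, Optional

# Hypothetical key names treated as secrets; the real redaction rules are not documented here.
SECRET_KEYS = {"password", "token", "api_key", "secret"}


def redact(value: Any) -> Any:
    """Recursively replace values stored under secret-looking keys."""
    if isinstance(value, dict):
        return {
            k: "[REDACTED]" if isinstance(k, str) and k.lower() in SECRET_KEYS else redact(v)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [redact(v) for v in value]
    return value


class EventBatcher:
    """Buffers redacted events and flushes on a size or age threshold."""

    def __init__(self, send: Callable[[str], None], max_batch: int = 100,
                 max_age_s: float = 1.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._send = send            # e.g. a WebSocket send function
        self._max_batch = max_batch
        self._max_age_s = max_age_s
        self._clock = clock
        self._buffer = []
        self._oldest: Optional[float] = None

    def add(self, event: dict) -> None:
        if not self._buffer:
            self._oldest = self._clock()
        self._buffer.append(redact(event))
        if len(self._buffer) >= self._max_batch:
            self.flush()

    def tick(self) -> None:
        """Call periodically; flushes once the oldest buffered event is too old."""
        if self._buffer and self._clock() - self._oldest >= self._max_age_s:
            self.flush()

    def flush(self) -> None:
        if not self._buffer:
            return
        frame = {"type": "event_batch", "payload": {"events": self._buffer}}
        self._buffer = []
        self._oldest = None
        self._send(json.dumps(frame))
```

Injecting `send` and `clock` keeps the batching logic independent of the transport and makes the time window easy to exercise without sleeping.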
    your code: celery_app.send_task("email.send", ...)
        │
        ▼
    Celery broker (Redis / RabbitMQ / SQS)
        │
        ▼
    Celery worker picks up
        │
        ▼
    z4j-celery patch captures task_sent / task_received / task_success / task_retry / task_failure
        │
        ▼
    z4j-bare dispatcher: redacts secrets, buffers, flushes on batch/time window
        │
        ▼
    WebSocket frame: {type: "event_batch", payload: {events: [...]}}
        │
        ▼
    brain: validates, persists to `events` table, fans out to connected dashboards over their /ws/dashboard WebSocket
Request lifecycle: unified action
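For the adapter step of this lifecycle, a minimal sketch of the retry fallback might look like the following. It assumes an engine without native retry and models it in memory. `InMemoryEngine`, `retry_task`, `handle_command`, and the `task_id`, `new_task_id`, and `error` payload fields are illustrative names only; the frame `type` and `verb` values are the ones shown on this page.

```python
import itertools


class InMemoryEngine:
    """Toy engine without native retry, standing in for a real engine adapter."""

    def __init__(self):
        self.tasks = {}
        self._ids = itertools.count(1)

    def enqueue(self, name, payload):
        task_id = f"t{next(self._ids)}"
        self.tasks[task_id] = {"name": name, "payload": payload, "state": "pending"}
        return task_id

    def mark_cancelled(self, task_id):
        self.tasks[task_id]["state"] = "cancelled"


def retry_task(engine, task_id):
    """Use the engine's native retry if it exposes one, else re-enqueue and cancel."""
    native = getattr(engine, "native_retry", None)
    if callable(native):
        return native(task_id)
    original = engine.tasks[task_id]
    new_id = engine.enqueue(original["name"], original["payload"])
    engine.mark_cancelled(task_id)
    return new_id


def handle_command(engine, frame):
    """Dispatch a command frame and build the matching command_result frame."""
    payload = frame["payload"]
    try:
        if payload["verb"] != "retry_task":
            raise ValueError(f"unsupported verb: {payload['verb']}")
        new_id = retry_task(engine, payload["task_id"])
        return {"type": "command_result", "payload": {"ok": True, "new_task_id": new_id}}
    except (KeyError, ValueError) as exc:
        # Unknown verb, missing field, or unknown task: report failure to the brain.
        return {"type": "command_result", "payload": {"ok": False, "error": str(exc)}}
```

Checking for a native retry first means the polyfill only runs for engines that lack one, which is the branch order described in the adapter step.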
    dashboard: user clicks "Retry"
        │
        ▼
    REST: POST /api/v1/projects/{slug}/commands/retry-task
        │
        ▼
    brain: authorize → mint signed command row → look up target agent → push WebSocket frame
        │
        ▼
    WebSocket frame: {type: "command", payload: {verb: "retry_task", ...}}
        │
        ▼
    agent: dispatch to engine adapter
        │
        ▼
    adapter: if engine native retry exists, call it; else brain-polyfilled:
             re-enqueue with the original payload + mark original as cancelled
        │
        ▼
    response frame: {type: "command_result", payload: {ok: true, ...}}
        │
        ▼
    brain: write audit log entry, return the command row to the dashboard
Three adapter axes
| Axis | Examples | What it adapts |
|---|---|---|
| Framework | django / flask / fastapi / bare | Process boot, settings parsing, ASGI/WSGI teardown |
| Engine | celery / rq / dramatiq / huey / arq / taskiq | Task enqueue, event capture, retry/cancel semantics |
| Scheduler | celery-beat / rq-scheduler / apscheduler / huey-periodic / arq-cron / taskiq-scheduler | Periodic task CRUD |
They compose freely. A Django + Celery + Beat app uses three adapters; a Flask + RQ + rq-scheduler app uses three different ones. Any combination is supported.
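One way to picture this composition is a value object that picks one adapter per axis. The sketch below is purely illustrative: the adapter names are copied from the table, but `AdapterStack` and the `KNOWN` registry are hypothetical and do not describe how the agent is actually configured.

```python
from dataclasses import dataclass

# Adapter names copied from the table above; the registry itself is hypothetical.
KNOWN = {
    "framework": {"django", "flask", "fastapi", "bare"},
    "engine": {"celery", "rq", "dramatiq", "huey", "arq", "taskiq"},
    "scheduler": {"celery-beat", "rq-scheduler", "apscheduler",
                  "huey-periodic", "arq-cron", "taskiq-scheduler"},
}


@dataclass(frozen=True)
class AdapterStack:
    """One adapter per axis; axes are validated independently of each other."""
    framework: str
    engine: str
    scheduler: str

    def __post_init__(self):
        for axis, known in KNOWN.items():
            value = getattr(self, axis)
            if value not in known:
                raise ValueError(f"unknown {axis} adapter: {value!r}")


django_stack = AdapterStack("django", "celery", "celery-beat")
flask_stack = AdapterStack("flask", "rq", "rq-scheduler")
```

Because each axis is checked on its own, no combination is special-cased, which mirrors the "compose freely" claim.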
Persistence model
| Table | Purpose | Retention |
|---|---|---|
| projects | Tenants | Unlimited |
| agents | Registered agents | Unlimited |
| tasks | Task identity (one row per task_id) | Per-project retention_days, default 30 |
| events | Per-state event stream (sent/started/…/finished) | Z4J_EVENT_RETENTION_DAYS, default 30 |
| schedules | Scheduler entries | Unlimited |
| audit_log | Admin actions, auth events. HMAC-chained | Z4J_AUDIT_RETENTION_DAYS, default 90 |
See database schema for full field docs.
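The table describes `audit_log` as HMAC-chained without giving details. A common way to build such a chain, shown here only as a sketch under that assumption, is to have each row's MAC cover the previous row's MAC plus the row's own content, so editing or deleting a row invalidates every later MAC. The function names, the JSON canonicalization, and the all-zero `GENESIS` value are illustrative choices, not z4j's schema.

```python
import hashlib
import hmac
import json

GENESIS = b"\x00" * 32  # hypothetical "previous MAC" for the first row


def entry_mac(key: bytes, prev_mac: bytes, entry: dict) -> bytes:
    """MAC over the previous row's MAC plus a canonical encoding of this entry."""
    body = json.dumps(entry, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(key, prev_mac + body, hashlib.sha256).digest()


def append(log: list, key: bytes, entry: dict) -> None:
    prev = log[-1]["mac"] if log else GENESIS
    log.append({"entry": entry, "mac": entry_mac(key, prev, entry)})


def verify(log: list, key: bytes) -> bool:
    """Recompute the chain from the start; any edit or deletion breaks it."""
    prev = GENESIS
    for row in log:
        expected = entry_mac(key, prev, row["entry"])
        if not hmac.compare_digest(expected, row["mac"]):
            return False
        prev = row["mac"]
    return True
```

Verification walks the chain from the first row, so it detects tampering anywhere before the last row it checks; truncating the tail would need a separately stored head MAC, which this sketch does not cover.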
Why WebSocket (not polling / not gRPC / not message queue)
- Agent-initiated outbound only - no inbound firewall holes. This is a security win.
- Bidirectional with low overhead - one socket carries events and commands. An HTTP long-poll fallback (`POST /api/v1/agent/events`, `GET /api/v1/agent/commands`) is available for networks that block WebSockets.
- gRPC - would add a heavy dependency (`grpcio`) to every agent. WebSocket support stays in stdlib-adjacent territory via the `websockets` package.
- Message queue - adding Redis/RabbitMQ as a dependency for the control plane (when the thing we're observing is often those very brokers) creates a circular operational dependency.
Failure modes
| Failure | Behavior |
|---|---|
| Agent → brain network partition | Agent buffers events (bounded queue, spills to disk at cap). Reconnects with exponential backoff. On reconnect, flushes buffer. |
| Brain crash | Postgres has the data. Agents reconnect. The brain does not acknowledge an event until it is persisted, so delivery is at-least-once and agents may resend unacknowledged events. |
| Postgres crash | Brain returns 503 until DB recovers. Agents buffer as above. |
| Worker / queue backlog | Doesn’t affect z4j - we observe it, we don’t participate in it. |
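The reconnect and at-least-once behavior in the table can be sketched as follows. This is a simplified illustration, not the agent's implementation: the backoff parameters, the jitter, the `UnackedBatches` class, and the idea of acknowledging by batch id are all assumptions, and the spill-to-disk step is left out.

```python
import random
from collections import OrderedDict


def backoff_delays(base=1.0, cap=60.0, factor=2.0, jitter=0.1, rng=random.random):
    """Yield reconnect delays that grow exponentially up to `cap`, with +/- jitter."""
    delay = base
    while True:
        spread = delay * jitter
        yield delay + spread * (2 * rng() - 1)
        delay = min(cap, delay * factor)


class UnackedBatches:
    """Keeps each batch until the brain acknowledges it (at-least-once delivery)."""

    def __init__(self):
        self._pending = OrderedDict()
        self._next_id = 0

    def track(self, events):
        """Register a batch before sending it; returns its hypothetical batch id."""
        batch_id = self._next_id
        self._next_id += 1
        self._pending[batch_id] = events
        return batch_id

    def ack(self, batch_id):
        """Drop a batch once the brain confirms it was persisted."""
        self._pending.pop(batch_id, None)

    def to_resend(self):
        """Batches to flush again after a reconnect, oldest first."""
        return list(self._pending.items())
```

Because a batch is only dropped on acknowledgement, a brain crash between persisting and acknowledging leads to a resend, so the receiving side has to tolerate duplicates.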
See reconciliation for how we detect “task said it started, never said it finished.”