# Architecture
## Two processes, one control plane

z4j is built as brain + many agents:
- Brain - one per organization. FastAPI + TanStack Start v1 dashboard. PostgreSQL persistence. Serves the UI, the REST API, and the agent WebSocket endpoint.
- Agent - one per application process. Thin pip package. Opens an outbound WebSocket to the brain. Captures events from the local queue engine; executes actions the brain sends back.
This separation buys:
- Persistence independence - history lives in Postgres, not in the broker. Broker crashes don’t erase history.
- Engine abstraction - the brain speaks one wire protocol; agents translate to/from the engine’s API.
- Network safety - agents initiate all connections. The brain never needs to reach into your application VPC.
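The agent ships captured events in batches rather than one frame per event. A minimal sketch of that buffering policy - flush when the batch is full or the oldest buffered event exceeds a time window. Class and parameter names here are hypothetical, not the real z4j agent API:

```python
import time

class BatchingDispatcher:
    """Buffers captured events; flushes when the batch is full or the
    oldest buffered event exceeds the time window.
    (Hypothetical sketch -- not the real z4j dispatcher API.)"""

    def __init__(self, send, max_batch=100, max_wait_s=2.0, clock=time.monotonic):
        self.send = send            # callable that ships one event_batch frame
        self.max_batch = max_batch
        self.max_wait_s = max_wait_s
        self.clock = clock
        self.buffer = []
        self.first_at = None        # timestamp of oldest buffered event

    def capture(self, event):
        # Secret redaction would happen here, before buffering (omitted).
        if not self.buffer:
            self.first_at = self.clock()
        self.buffer.append(event)
        self._maybe_flush()

    def _maybe_flush(self):
        full = len(self.buffer) >= self.max_batch
        stale = self.first_at is not None and self.clock() - self.first_at >= self.max_wait_s
        if full or stale:
            self.flush()

    def flush(self):
        if self.buffer:
            self.send({"type": "event_batch", "events": self.buffer})
            self.buffer = []
            self.first_at = None
```

In a real agent a timer would also call `flush()` so a quiet queue still drains the buffer; this sketch only re-checks the window when a new event arrives.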
## Request lifecycle: event capture

```
your code: celery_app.send_task("email.send", ...)
  │
  ▼
Celery broker (Redis / RabbitMQ / SQS)
  │
  ▼
Celery worker picks up
  │
  ▼
z4j-celery patch captures task_sent / task_received / task_success / task_retry / task_failure
  │
  ▼
z4j-bare dispatcher: redacts secrets, buffers, flushes on batch/time window
  │
  ▼
WebSocket frame: {type: "event_batch", events: [...]}
  │
  ▼
brain: validates, persists to `events` table, fans out to connected dashboards via SSE
```

## Request lifecycle: unified action
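On the agent side, a `command` frame is routed to the engine adapter and answered with a `command_result` frame. A minimal sketch of that dispatch step - only the `type`, `cmd`, and `ok` fields come from the frames in this section; the `args` field and the routing-by-attribute scheme are assumptions:

```python
def handle_frame(frame, adapter):
    """Dispatch one command frame from the brain to an engine adapter.
    Hypothetical sketch: the real agent's routing and error shape may differ."""
    if frame.get("type") != "command":
        return None  # event acks, pings, etc. are handled elsewhere
    handler = getattr(adapter, frame["cmd"], None)
    if handler is None:
        return {"type": "command_result", "ok": False,
                "error": f"unsupported command: {frame['cmd']}"}
    try:
        handler(**frame.get("args", {}))
        return {"type": "command_result", "ok": True}
    except Exception as exc:  # report failures back instead of killing the socket
        return {"type": "command_result", "ok": False, "error": str(exc)}
```

Reporting failures as `ok: false` frames (rather than raising) keeps the socket alive, which matters because that one socket also carries the event stream.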
```
dashboard: user clicks "Retry"
  │
  ▼
REST: POST /api/v1/tasks/{id}/retry
  │
  ▼
brain: authorize → insert audit log → look up target agent → send WebSocket frame
  │
  ▼
WebSocket frame: {type: "command", cmd: "retry_task", ...}
  │
  ▼
agent: dispatch to engine adapter
  │
  ▼
adapter: if engine-native retry exists, call it; else brain-polyfilled: re-enqueue with the original payload + mark original as cancelled
  │
  ▼
response frame: {type: "command_result", ok: true}
  │
  ▼
brain: close audit log, return 200 to dashboard
```

## Three adapter axes
Section titled “Three adapter axes”| Axis | Examples | What it adapts |
|---|---|---|
| Framework | django / flask / fastapi / bare | Process boot, settings parsing, ASGI/WSGI teardown |
| Engine | celery / rq / dramatiq / huey / arq / taskiq | Task enqueue, event capture, retry/cancel semantics |
| Scheduler | celery-beat / rq-scheduler / apscheduler / huey-periodic / arq-cron / taskiq-scheduler | Periodic task CRUD |
They compose freely. A Django + Celery + Beat app uses three adapters; a Flask + RQ + rq-scheduler app uses three different ones. Any combination is supported.
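The retry/cancel row above hides the fallback spelled out in the unified-action flow: if the engine has no native retry, the adapter re-enqueues the original payload and cancels the original task. A sketch of how an engine adapter might encode that fallback - a hypothetical interface, not the real z4j adapter base class:

```python
from abc import ABC, abstractmethod

class EngineAdapter(ABC):
    """Hypothetical engine-adapter interface; real z4j adapters are not shown."""

    #: set True by adapters whose engine can natively re-run a task
    supports_native_retry = False

    @abstractmethod
    def enqueue(self, task_name, payload):
        """Re-enqueue a task; returns the new task id."""

    @abstractmethod
    def cancel(self, task_id):
        """Mark a task as cancelled."""

    def native_retry(self, task_id):
        raise NotImplementedError

    def retry_task(self, task_id, task_name, payload):
        if self.supports_native_retry:
            return self.native_retry(task_id)
        # Brain-polyfilled retry: re-enqueue the original payload,
        # then mark the original task as cancelled.
        new_id = self.enqueue(task_name, payload)
        self.cancel(task_id)
        return new_id
```

Putting the polyfill in the shared base class means every engine gets a working "Retry" button for free; engines with a real retry primitive just flip `supports_native_retry` and implement `native_retry`.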
## Persistence model

| Table | Purpose | Retention |
|---|---|---|
| `projects` | Tenants | Unlimited |
| `agents` | Registered agents | Unlimited |
| `tasks` | Task identity (one row per `task_id`) | 90 days default, configurable |
| `events` | Per-state event stream (sent/started/…/finished) | Same as `tasks` |
| `schedules` | Scheduler entries | Unlimited |
| `audit_log` | Admin actions, auth events; HMAC-chained | Unlimited |
See database schema for full field docs.
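HMAC chaining makes the audit log tamper-evident: each row's MAC covers the previous row's MAC, so editing or deleting any row invalidates every MAC after it. A minimal illustration of the idea - the field layout and key handling here are illustrative, not z4j's actual scheme:

```python
import hashlib
import hmac
import json

GENESIS = "0" * 64  # MAC "before" the first row (illustrative convention)

def chain_mac(key: bytes, prev_mac: str, entry: dict) -> str:
    """MAC over the previous row's MAC plus this entry's canonical JSON."""
    msg = prev_mac.encode() + json.dumps(entry, sort_keys=True).encode()
    return hmac.new(key, msg, hashlib.sha256).hexdigest()

def verify_chain(key: bytes, rows: list) -> bool:
    """Recompute every MAC in order; any edit or deletion breaks the chain."""
    prev = GENESIS
    for row in rows:
        body = {k: v for k, v in row.items() if k != "mac"}
        if chain_mac(key, prev, body) != row["mac"]:
            return False
        prev = row["mac"]
    return True
```

With this scheme an attacker who can rewrite one row still can't recompute the rest of the chain without the key.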
## Why WebSocket (not polling / not gRPC / not message queue)

- Agent-initiated, outbound only - no inbound firewall holes. This is a security win.
- Bidirectional with low overhead - one socket carries both events and commands. HTTP long-poll was the fallback path, not the primary.
- gRPC - would add a heavy dependency (`grpcio`) to every agent. WebSocket lives in stdlib-adjacent space via the `websockets` package.
- Message queue - adding Redis/RabbitMQ as a dependency for the control plane (when the thing we're observing often *is* those brokers) creates a circular operational dependency.
## Failure modes

| Failure | Behavior |
|---|---|
| Agent → brain network partition | Agent buffers events (bounded queue, spills to disk at cap). Reconnects with exponential backoff. On reconnect, flushes buffer. |
| Brain crash | Postgres has the data. Agents reconnect. No events are acknowledged until persisted - at-least-once delivery. |
| Postgres crash | Brain returns 503 until DB recovers. Agents buffer as above. |
| Worker / queue backlog | Doesn’t affect z4j - we observe it, we don’t participate in it. |
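The buffering and backoff behavior in the first row can be sketched as follows; the delays, cap, and spill hook here are illustrative values, not z4j's actual defaults:

```python
def backoff_delays(base=1.0, cap=60.0, attempts=6):
    """Exponential reconnect schedule: base * 2^n, clamped to cap.
    (Illustrative values, not z4j's actual defaults.)"""
    return [min(cap, base * (2 ** n)) for n in range(attempts)]

class BoundedBuffer:
    """Keeps at most `cap` events in memory while disconnected; overflow is
    handed to `spill` (the agent spills to disk at cap). On reconnect,
    drain() returns the buffered events for flushing."""

    def __init__(self, cap, spill):
        self.cap, self.spill, self.items = cap, spill, []

    def push(self, event):
        if len(self.items) >= self.cap:
            self.spill(event)     # disk spill stands in for data loss
        else:
            self.items.append(event)

    def drain(self):
        items, self.items = self.items, []
        return items
```

A production loop would also add jitter to the delays so a fleet of agents doesn't reconnect in lockstep after a brain restart.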
See reconciliation for how we detect “task said it started, never said it finished.”