Scaling

For most deployments, one brain replica is plenty. The potential bottlenecks are:

  1. Postgres: plenty of headroom; see the self-hosting guide for sizing.
  2. WebSocket connections — one socket per agent, ~10 KiB RAM steady-state. 1000 agents is roughly 10 MiB.
  3. Event persistence — batched into Postgres; the brain handles thousands of events per second on modest hardware.
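As a quick sanity check on the WebSocket figure above, the arithmetic can be sketched as follows. The ~10 KiB per-socket figure comes from the list above; the function name is purely illustrative and not part of the brain.

```python
KIB = 1024
MIB = 1024 * KIB


def websocket_ram_bytes(agents: int, per_socket_kib: float = 10.0) -> float:
    """Estimate steady-state RAM held by agent WebSockets, in bytes.

    per_socket_kib defaults to the ~10 KiB figure quoted in the text;
    measure your own deployment before relying on it.
    """
    return agents * per_socket_kib * KIB


estimate = websocket_ram_bytes(1000)
print(f"{estimate / MIB:.1f} MiB")  # prints "9.8 MiB"
```

In other words, 1000 agents come to just under 10 MiB, which is why socket memory is rarely the first limit you hit.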

Multiple brain replicas are supported on Postgres. The brain selects its registry and dashboard-fan-out backend from Z4J_REGISTRY_BACKEND:

  • postgres_notify (the default on Postgres) — agent commands and dashboard updates fan out across replicas via Postgres LISTEN/NOTIFY. Each agent’s WebSocket lives on whichever replica it happened to connect to; commands minted on any replica route to the right one through the registry. Dashboard subscribers connected to one replica still see events captured by another.
  • local (forced on SQLite, since SQLite has no LISTEN/NOTIFY) — single-process only.
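The selection rule described above can be sketched roughly as follows. This is not the brain's actual code: the function name, the URL-prefix check for SQLite, and the error handling are all assumptions made for illustration. Only the two backends named above are modeled.

```python
import os
from typing import Mapping, Optional

# Only the two backends named in the text; the real brain may accept others.
KNOWN_BACKENDS = {"postgres_notify", "local"}


def select_registry_backend(
    database_url: str,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Pick a registry backend (hypothetical helper, for illustration only)."""
    env = os.environ if env is None else env

    # Assumption: SQLite is detected from the URL scheme.
    if database_url.startswith("sqlite"):
        # SQLite has no LISTEN/NOTIFY, so "local" is forced
        # regardless of what Z4J_REGISTRY_BACKEND says.
        return "local"

    # On Postgres the default is postgres_notify.
    backend = env.get("Z4J_REGISTRY_BACKEND", "postgres_notify")
    if backend not in KNOWN_BACKENDS:
        raise ValueError(f"unknown Z4J_REGISTRY_BACKEND: {backend!r}")
    return backend
```

For example, `select_registry_backend("postgresql://db/z4j", {})` returns `"postgres_notify"`, while any `sqlite` URL returns `"local"` even if the variable is set to `postgres_notify`.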

What you still need to provide yourself:

  • Sticky session routing on /ws — each agent’s WebSocket must pin to one brain pod. Configure your load balancer’s session affinity (e.g. nginx-ingress’s nginx.ingress.kubernetes.io/affinity: cookie, an ALB target-group’s stickiness, or your service-mesh equivalent).
  • TLS termination in front of the brain. The brain itself speaks plaintext WebSocket on its bind port; production deployments put a reverse proxy in front.

The dashboard fan-out is over WebSocket (/ws/dashboard), not SSE; cross-replica delivery is the postgres_notify DashboardHub.

On the database side:

  • Read replicas help dashboards but not the hot event-persist path.
  • Native partitioning on events(received_at) is built in; partition retention drops the oldest partition once it ages past Z4J_EVENT_RETENTION_DAYS.
  • Set statement_timeout on the brain’s database role to prevent runaway queries (Z4J_DB_STATEMENT_TIMEOUT_MS).
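To make the retention rule concrete, here is a minimal sketch of how expired partitions could be identified. It assumes daily partitions named by date and represented as (name, exclusive upper bound) pairs. Those names, the granularity, and the helper itself are hypothetical, and the brain's built-in retention job may work differently.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Tuple


def expired_partitions(
    partitions: Iterable[Tuple[str, datetime]],
    retention_days: int,
    now: datetime,
) -> List[str]:
    """Return partition names whose whole received_at range is past retention.

    Each partition is (name, upper_bound), where upper_bound is the
    exclusive end of its received_at range. Oldest partitions come first.
    """
    cutoff = now - timedelta(days=retention_days)
    ordered = sorted(partitions, key=lambda p: p[1])
    return [name for name, upper in ordered if upper <= cutoff]


# Hypothetical daily partitions; retention_days stands in for
# Z4J_EVENT_RETENTION_DAYS.
parts = [
    ("events_2023_12_31", datetime(2024, 1, 1, tzinfo=timezone.utc)),
    ("events_2024_01_01", datetime(2024, 1, 2, tzinfo=timezone.utc)),
]
now = datetime(2024, 1, 31, tzinfo=timezone.utc)
print(expired_partitions(parts, retention_days=30, now=now))
```

Dropping a whole partition is a metadata operation in Postgres, which is generally much cheaper than deleting the same rows one by one.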

Agents scale with your app. Run one agent per app process; the worker-first protocol identifies each worker by (agent_id, worker_id), so multi-worker servers (gunicorn, uWSGI) coexist under a single agent identity. Agents need no coordination with one another; deploying more app replicas registers more workers automatically.
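The (agent_id, worker_id) keying can be illustrated with a small in-memory sketch. The class, its methods, and the PID-style worker IDs are assumptions for illustration; they are not the brain's registry implementation or wire format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

WorkerKey = Tuple[str, str]  # (agent_id, worker_id)


@dataclass
class WorkerRegistry:
    """Toy registry keyed by (agent_id, worker_id); hypothetical."""

    workers: Dict[WorkerKey, dict] = field(default_factory=dict)

    def register(self, agent_id: str, worker_id: str, **meta: str) -> None:
        # Re-registering the same pair overwrites, so reconnects are idempotent.
        self.workers[(agent_id, worker_id)] = dict(meta)

    def workers_for(self, agent_id: str) -> List[str]:
        """List worker IDs currently registered under one agent identity."""
        return sorted(w for (a, w) in self.workers if a == agent_id)


registry = WorkerRegistry()
# Two forked workers of the same app share one agent identity.
registry.register("billing-api", "pid-101", server="gunicorn")
registry.register("billing-api", "pid-102", server="gunicorn")
# A new app replica simply shows up as more workers.
registry.register("billing-api", "pid-205", server="gunicorn")
print(registry.workers_for("billing-api"))
```

Because the key is the pair rather than the agent alone, adding workers never collides with existing entries and requires no coordination between processes.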

If you are running 500+ agents or 100M+ events per day, file an issue. We want the feedback.