Grafana
z4j ships five Grafana dashboards in deploy/grafana/. They are plain JSON files, schema-compatible with Grafana 10.4+, and they work against any Prometheus that is scraping the brain’s /metrics endpoint.
Dashboards
| File | Title | Purpose |
|---|---|---|
| z4j-overview.json | z4j — Brain overview | First stop. Stat panels for agents online, brain RSS, DB pool utilisation, deadlock rate. Task throughput by final state. Task duration p50 / p95 / p99. Queue depth by project. Background-task error flags. Swallowed exceptions by module. |
| z4j-tasks.json | z4j — Tasks | Tail latency or failure rate climbing? Open this. Stacked task throughput by state, failure-rate ratio, full duration heatmap, top-10 failing / slow (p99) / by-volume / retried task names. |
| z4j-agents.json | z4j — Agents and commands | Agent + worker counts per project. Command dispatch flow by status and action. Late-result counter (tuning signal for Z4J_COMMAND_TIMEOUT_SECONDS). Live WebSocket connection count. In-memory state by subsystem. |
| z4j-notifications.json | z4j — Notifications | Send rate by channel type and status. Failure-rate table (per channel, red rows are channels currently failing). Cooldown skip rate per trigger. 24h channel mix donut. Blocked-by-SSRF / host-lock rate. |
| z4j-scheduler.json | z4j-scheduler | Only relevant when the scheduler companion is deployed. Leader status, fire throughput by terminal state, fire latency p50/p99 against the §23 SLI budget, tick drift, per-schedule top-N. |
Each brain dashboard exposes a $project template variable (multi-select, all by default) so multi-tenant deployments can scope panels to a single project without editing JSON.
Importing
Manual
- In Grafana, open Dashboards, then New, then Import.
- Upload the JSON.
- Pick your Prometheus datasource.
- Save.
Provisioned (production)
Mount deploy/grafana/ into the Grafana container and provision a file-based dashboard provider. A complete example sits in deploy/grafana/README.md. Grafana picks up changes within updateIntervalSeconds of a JSON edit, so the dashboards become a normal infra-as-code artefact.
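As a rough illustration of the file-based provider, the sketch below uses Grafana's standard dashboard provisioning format. The provider name, folder name, mount path, and 30-second interval are assumptions chosen for this example; the authoritative version is the one in deploy/grafana/README.md.

```yaml
# Hypothetical provisioning file, e.g. mounted at
# /etc/grafana/provisioning/dashboards/z4j.yaml inside the Grafana container.
apiVersion: 1
providers:
  - name: z4j                    # assumed provider name
    orgId: 1
    folder: z4j                  # assumed Grafana folder for the dashboards
    type: file
    disableDeletion: false
    allowUiUpdates: false        # keep the JSON files as the source of truth
    updateIntervalSeconds: 30    # assumed value; how often Grafana rescans the path
    options:
      # Assumed mount point for the repository's deploy/grafana/ directory.
      path: /var/lib/grafana/dashboards/z4j
```

Keeping allowUiUpdates off means edits made in the Grafana UI cannot be saved over the provisioned files, which fits the infra-as-code workflow described above.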
For the Kubernetes Grafana Helm chart, drop the JSONs into a ConfigMap and point dashboardsConfigMaps at it.
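A minimal values sketch for the community Grafana Helm chart might look like the following. The ConfigMap name z4j-grafana-dashboards and the provider name z4j are assumptions for illustration; the chart mounts each ConfigMap listed under dashboardsConfigMaps at /var/lib/grafana/dashboards/<provider name>, so the provider path has to match.

```yaml
# values.yaml fragment for the Grafana Helm chart (sketch, names are assumed).
dashboardProviders:
  dashboardproviders.yaml:
    apiVersion: 1
    providers:
      - name: z4j                # must match the key under dashboardsConfigMaps
        orgId: 1
        folder: z4j
        type: file
        disableDeletion: false
        editable: false
        options:
          path: /var/lib/grafana/dashboards/z4j

dashboardsConfigMaps:
  # provider name -> name of a ConfigMap holding the dashboard JSON files
  z4j: z4j-grafana-dashboards    # hypothetical ConfigMap name
```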
Prometheus scrape config
Brain (default port 7700, fail-secure metrics auth):

```yaml
scrape_configs:
  - job_name: z4j-brain
    metrics_path: /metrics
    static_configs:
      - targets: ["brain.internal:7700"]
    authorization:
      type: Bearer
      credentials: "<Z4J_METRICS_AUTH_TOKEN>"
```

Z4J_METRICS_AUTH_TOKEN is auto-minted on first boot and persisted to $Z4J_HOME/secret.env. On trusted-LAN deployments you can flip Z4J_METRICS_PUBLIC=true and drop the authorization block; the brain logs a loud WARNING at startup naming the risk.
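If you would rather not inline the token in prometheus.yml, Prometheus also accepts credentials_file in the authorization block. The sketch below assumes you have copied the token value on its own into a file readable by Prometheus; the path is hypothetical, and pointing credentials_file directly at secret.env is unlikely to work if that file holds KEY=VALUE lines rather than the bare token.

```yaml
scrape_configs:
  - job_name: z4j-brain
    metrics_path: /metrics
    static_configs:
      - targets: ["brain.internal:7700"]
    authorization:
      type: Bearer
      # Hypothetical path; the file must contain only the token value.
      credentials_file: /etc/prometheus/secrets/z4j-metrics-token
```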
Scheduler companion (only if z4j-scheduler is deployed):
```yaml
  # Append under scrape_configs:
  - job_name: z4j-scheduler
    metrics_path: /metrics
    static_configs:
      - targets: ["scheduler.internal:9100"]
```

Why five and not one
Five small dashboards over one mega-dashboard is deliberate:
- Each dashboard fits a single screen at 1080p without horizontal scroll.
- The split matches the operator’s natural drill path: overview, then per-area (tasks / agents / notifications), then per-row (the top-N panels link the operator to the exact task name or channel that is misbehaving).
- You can permission them independently in Grafana folders. The on-call team often does not need write access to the scheduler dashboard, for example.
Suggested alerts
A full table sits in deploy/grafana/README.md with starter thresholds for:
- zero agents online
- DB pool saturated
- sustained Postgres deadlocks
- a self-watch background task failing (audit retention, WAL checkpoint, etc.)
- task failure rate above 5%
- late command results (a sign the timeout sweeper is mistuned)
- notification channel failing above 10%
- swallowed-exception spike by module
- brain not reporting metrics at all (catch-all liveness)
The thresholds are starting points; tune to your fleet’s normal baseline.
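The catch-all liveness alert is the one item in the list that can be sketched without knowing any z4j metric names, because it relies only on Prometheus's built-in up series and the z4j-brain job name from the scrape config. The rule name, the 5-minute window, and the severity label below are assumptions for illustration, not the thresholds from deploy/grafana/README.md.

```yaml
# Prometheus alerting rules file (sketch).
groups:
  - name: z4j-liveness
    rules:
      - alert: Z4jBrainMetricsMissing      # hypothetical alert name
        # Fires when the scrape fails, or when the target has disappeared
        # from service discovery and no up series exists at all.
        expr: up{job="z4j-brain"} == 0 or absent(up{job="z4j-brain"})
        for: 5m                            # assumed window; tune to your fleet
        labels:
          severity: critical               # assumed label convention
        annotations:
          summary: "z4j brain is not reporting metrics"
```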
Legacy
A historical leak-investigation snapshot lives at docs/perf/grafana-dashboard.json (four panels covering RSS slope, deadlock rate, DB pool utilisation, and events context). It is the dashboard we used to validate the 1.5.1 connection-pool leak fix and is kept in the repo for reproducibility. The new deploy/grafana/z4j-overview.json covers the same signals plus much more; new deployments should use the deploy/grafana/ set.