Grafana

z4j ships five Grafana dashboards in deploy/grafana/. They are plain JSON files, schema-compatible with Grafana 10.4+, and they work against any Prometheus that is scraping the brain’s /metrics endpoint.

Dashboards

File	Title	Purpose
`z4j-overview.json`	z4j — Brain overview	First stop. Stat panels for agents online, brain RSS, DB pool utilisation, deadlock rate. Task throughput by final state. Task duration p50 / p95 / p99. Queue depth by project. Background-task error flags. Swallowed exceptions by module.
`z4j-tasks.json`	z4j — Tasks	Tail latency or failure rate climbing? Open this. Stacked task throughput by state, failure-rate ratio, full duration heatmap, top-10 failing / slow (p99) / by-volume / retried task names.
`z4j-agents.json`	z4j — Agents and commands	Agent + worker counts per project. Command dispatch flow by status and action. Late-result counter (tuning signal for `Z4J_COMMAND_TIMEOUT_SECONDS`). Live WebSocket connection count. In-memory state by subsystem.
`z4j-notifications.json`	z4j — Notifications	Send rate by channel type and status. Failure-rate table (per channel, red rows are channels currently failing). Cooldown skip rate per trigger. 24h channel mix donut. Blocked-by-SSRF / host-lock rate.
`z4j-scheduler.json`	z4j-scheduler	Only relevant when the scheduler companion is deployed. Leader status, fire throughput by terminal state, fire latency p50/p99 against the §23 SLI budget, tick drift, per-schedule top-N.

Each brain dashboard exposes a $project template variable (multi-select, all by default) so multi-tenant deployments can scope panels to a single project without editing JSON.

Importing

Manual

In Grafana, open Dashboards, then New, then Import.
Upload the JSON.
Pick your Prometheus datasource.
Save.

Provisioned (production)

Mount deploy/grafana/ into the Grafana container and provision a file-based dashboard provider. A complete example sits in deploy/grafana/README.md. Grafana picks up changes within updateIntervalSeconds of a JSON edit, so the dashboards become a normal infra-as-code artefact.
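
As a rough sketch of the file-based provider described above (the complete example remains the one in deploy/grafana/README.md), a provisioning file could look like the following. The provider name, folder name, 30-second interval, and container path /var/lib/grafana/dashboards/z4j are assumptions for illustration; point the path at wherever you mount deploy/grafana/.

```yaml
# Hypothetical file: /etc/grafana/provisioning/dashboards/z4j.yaml
apiVersion: 1
providers:
  - name: z4j                      # assumed provider name
    orgId: 1
    folder: z4j                    # assumed Grafana folder for the five dashboards
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30      # assumed; how often Grafana rescans the path
    allowUiUpdates: false          # keep the JSON in the repo as the source of truth
    options:
      path: /var/lib/grafana/dashboards/z4j   # assumed mount point of deploy/grafana/
```

Setting allowUiUpdates to false is one way to stop UI edits from drifting away from the JSON under version control; relax it if your team prefers to prototype in the UI first.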

For the Kubernetes Grafana Helm chart, drop the JSONs into a ConfigMap and point dashboardsConfigMaps at it.
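
A minimal sketch of the corresponding Helm values, assuming the community Grafana chart and a ConfigMap that you have already created from the JSON files. The ConfigMap name z4j-grafana-dashboards and the provider name z4j are hypothetical; the key under dashboardsConfigMaps has to match a provider name declared in dashboardProviders.

```yaml
# values.yaml fragment for the Grafana Helm chart (names are illustrative)
dashboardProviders:
  dashboardproviders.yaml:
    apiVersion: 1
    providers:
      - name: z4j
        orgId: 1
        folder: z4j
        type: file
        disableDeletion: false
        options:
          # The chart mounts each ConfigMap under /var/lib/grafana/dashboards/<provider name>
          path: /var/lib/grafana/dashboards/z4j

dashboardsConfigMaps:
  z4j: z4j-grafana-dashboards      # provider name: ConfigMap holding the dashboard JSONs
```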

Prometheus scrape config

Brain (default port 7700, fail-secure metrics auth):

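scrape_configs:
  - job_name: z4j-brain
    metrics_path: /metrics
    static_configs:
      - targets: ["brain.internal:7700"]
    # Bearer token the brain expects on /metrics. Replace the placeholder
    # with the value of Z4J_METRICS_AUTH_TOKEN.
    authorization:
      type: Bearer
      credentials: "<Z4J_METRICS_AUTH_TOKEN>"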

Z4J_METRICS_AUTH_TOKEN is auto-minted on first boot and persisted to $Z4J_HOME/secret.env. On trusted-LAN deployments you can flip Z4J_METRICS_PUBLIC=true and drop the authorization block; the brain logs a loud WARNING at startup naming the risk.
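
If you would rather not inline the token in the Prometheus config, Prometheus can read it from a file through credentials_file instead of credentials. The sketch below assumes you have copied the token value, and nothing else, into a file at a hypothetical path. This page does not say whether $Z4J_HOME/secret.env can be consumed directly, so the example does not point at it.

```yaml
scrape_configs:
  - job_name: z4j-brain
    metrics_path: /metrics
    static_configs:
      - targets: ["brain.internal:7700"]
    authorization:
      type: Bearer
      # Hypothetical path; the file must contain only the token value.
      credentials_file: /etc/prometheus/secrets/z4j-metrics-token
```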

Scheduler companion (only if z4j-scheduler is deployed):

  # Append to the same scrape_configs list as the brain job.
  - job_name: z4j-scheduler
    metrics_path: /metrics
    static_configs:
      - targets: ["scheduler.internal:9100"]

Why five and not one

Shipping five small dashboards rather than one mega-dashboard is deliberate:

Each dashboard fits a single screen at 1080p without horizontal scroll.
The split matches the operator’s natural drill path: overview, then per-area (tasks / agents / notifications), then per-row (the top-N panels link the operator to the exact task name or channel that is misbehaving).
You can set permissions on them independently through Grafana folders; for example, the on-call team often does not need write access to the scheduler dashboard.

Suggested alerts

A full table sits in deploy/grafana/README.md with starter thresholds for:

zero agents online
DB pool saturated
sustained Postgres deadlocks
a self-watch background task failing (audit retention, WAL checkpoint, etc.)
task failure rate above 5%
late command results (mis-tuned timeout sweeper)
notification channel failing above 10%
swallowed-exception spike by module
brain not reporting metrics at all (catch-all liveness)

The thresholds are starting points; tune to your fleet’s normal baseline.
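
As an illustration of the last item in the list, the catch-all liveness alert can be written without any z4j-specific metric names, because Prometheus generates the up series for every scrape target. This is a sketch only: the group name, alert name, 5-minute window, and severity label are assumptions, and the starter thresholds in deploy/grafana/README.md take precedence.

```yaml
groups:
  - name: z4j-liveness             # assumed group name
    rules:
      - alert: Z4JBrainMetricsMissing   # assumed alert name
        # Fires when the scrape fails (up == 0) or when the target has
        # disappeared from the scrape config entirely (absent).
        expr: up{job="z4j-brain"} == 0 or absent(up{job="z4j-brain"})
        for: 5m                    # assumed window; tune to your scrape interval
        labels:
          severity: critical
        annotations:
          summary: "z4j brain is not reporting metrics"
```

The job label matches the job_name used in the scrape config on this page; adjust it if you named the job differently.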

Legacy

A historical leak-investigation snapshot lives at docs/perf/grafana-dashboard.json (four panels covering RSS slope, deadlock rate, DB pool utilisation, and events context). It is the dashboard we used to validate the 1.5.1 connection-pool leak fix and is kept in the repo for reproducibility. The new deploy/grafana/z4j-overview.json covers the same signals plus much more; new deployments should use the deploy/grafana/ set.