Monitoring

Health endpoint

GET /api/v1/health

Returns:

{
  "status": "ok",
  "version": "<brain package version>",
  "uptime_seconds": 4281,
  "db": "ok",
  "audit_chain_ok": true,
  "agents_online": 42
}

status is one of ok, degraded, or down. A degraded status means a non-critical subsystem is broken (for example, no SMTP channel is reachable for invitation delivery). A down status means Postgres is unreachable.
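A monitoring script can reduce this payload to an exit code. The following is a minimal sketch, not part of z4j: the function name `health_exit_code` and the 0/1/2 exit-code mapping are assumptions made for illustration, while the field names and status values are taken from the response shown above.

```python
import json

# Hypothetical mapping from the documented status values to exit codes
# (0 = ok, 1 = warning, 2 = critical). The mapping is an assumption.
EXIT_CODES = {"ok": 0, "degraded": 1, "down": 2}


def health_exit_code(body: str) -> int:
    """Map a /api/v1/health response body to an exit code."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return 2  # an unparseable body is treated as critical
    if not isinstance(payload, dict):
        return 2
    status = payload.get("status")
    # Unknown or missing status values are treated as critical.
    code = EXIT_CODES.get(status, 2) if isinstance(status, str) else 2
    # Assumption: a failed audit chain check should not pass silently,
    # so escalate to at least a warning even when status is "ok".
    if payload.get("audit_chain_ok") is False:
        code = max(code, 1)
    return code
```

Feeding it the example body above gives 0; a body with status set to degraded gives 1. Fetching the body (for instance with `urllib.request.urlopen`) is left out because the host and port depend on your deployment.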

Readiness and detailed health

Two companions to the basic liveness endpoint:

GET /api/v1/health/ready
GET /api/v1/health/system

/health/ready returns 200 when the brain can serve traffic (DB pool initialised, migrations at head, registry connected). Pair it with a Kubernetes readinessProbe. /health/system (auth required) returns a deeper snapshot with per-subsystem flags, pool state, and worker presence, which is useful to operators working from the dashboard during an incident.
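Outside Kubernetes (for example in a deploy script or a CI job), the same readiness check can be polled by hand. The sketch below assumes that any status other than 200 means "not ready yet"; the helper names `http_status` and `wait_until_ready`, the retry count, and the delay are illustrative choices, not z4j behaviour.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 2.0) -> int:
    """Return the HTTP status code for a GET on url, or 0 if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # non-2xx answers still carry a status code
    except (urllib.error.URLError, OSError):
        return 0  # connection refused, DNS failure, timeout


def wait_until_ready(
    probe: Callable[[], int],
    attempts: int = 30,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe until it returns 200 or the attempts run out."""
    for attempt in range(attempts):
        if probe() == 200:
            return True
        if attempt < attempts - 1:
            sleep(delay)  # no need to wait after the final attempt
    return False
```

A call might look like `wait_until_ready(lambda: http_status("http://localhost:8000/api/v1/health/ready"))`, where the host and port are placeholders for your own deployment.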

Metrics

Scrape /metrics (Prometheus format, token-gated). Full metric list: metrics API.
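For a quick look at a scrape without a Prometheus server, the text format can be read with a few lines of Python. This is a simplified sketch: it handles plain `name{labels} value [timestamp]` lines and skips comments, and it is not a full parser for the exposition format. The function name `parse_samples` is hypothetical, and the fetch step is omitted because how the token is presented is not specified here.

```python
def parse_samples(text: str) -> dict[str, float]:
    """Parse Prometheus text-format lines into {series: value}.

    Simplified: keeps the series (name plus label set) as one string key
    and ignores HELP/TYPE comments and optional timestamps.
    """
    samples: dict[str, float] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if "}" in line:
            # Split after the closing brace so spaces inside label
            # values do not break the series name.
            idx = line.rindex("}") + 1
            series, rest = line[:idx], line[idx:]
        else:
            series, _, rest = line.partition(" ")
        fields = rest.split()
        if not fields:
            continue  # malformed line without a value
        samples[series] = float(fields[0])
    return samples
```

With a scrape body in `text`, `parse_samples(text).get("z4j_audit_chain_verified")` would return the gauge value, or None if the series is absent. For anything beyond ad hoc inspection, the `prometheus_client` package's `text_string_to_metric_families` parser is the more robust choice.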

Alerts to set

| Alert | Trigger |
| --- | --- |
| Brain down | `up{job="z4j"} == 0` for 2m |
| Audit chain broken | `z4j_audit_chain_verified == 0` |
| Agents dropping | `z4j_agents_online_total` falls by > 20% in 5m |
| Action error rate | `rate(z4j_actions_total{status="error"}[5m]) > 0.1` |
| Task backlog growing | engine-specific (Celery/RQ/…) |
| High HTTP 5xx | `rate(z4j_http_requests_total{status=~"5.."}[5m]) > 0.01` |
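The "Agents dropping" trigger is stated in words rather than as an expression, so the arithmetic is worth spelling out: compare the current agent count with the count five minutes earlier and fire when the relative drop exceeds 20%. The sketch below shows that calculation; the function names are hypothetical and the sample counts in the usage note are invented.

```python
def relative_drop(previous: float, current: float) -> float:
    """Fraction by which current has fallen below previous (0.0 if none)."""
    if previous <= 0:
        return 0.0  # nothing was online before, so nothing can drop
    return max(0.0, (previous - current) / previous)


def agents_dropping(previous: float, current: float, threshold: float = 0.20) -> bool:
    """True when the count fell by more than threshold (default 20%)."""
    return relative_drop(previous, current) > threshold
```

Going from 42 agents to 33 is a drop of 9/42, about 21%, so `agents_dropping(42, 33)` fires; going from 42 to 34 is about 19% and does not. In PromQL, one possible form (an assumption, not taken from z4j) is `(z4j_agents_online_total offset 5m - z4j_agents_online_total) / (z4j_agents_online_total offset 5m) > 0.2`. Note also that `rate()` yields a per-second value, so the thresholds of 0.1 and 0.01 in the table are events per second, not ratios.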

Logs

Logs are written to stdout as JSON. Fields:

ts, level, logger, msg
request_id (per HTTP request)
user_id (when authenticated)
project_id
agent_id (when relevant)

Ship with Fluent Bit / Vector / Loki / Datadog.
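Before a pipeline is in place, the same fields can be used to slice logs by hand, for example to pull every record for one request. The sketch below assumes one JSON object per line, which is not stated above; the function name `filter_logs` is hypothetical.

```python
import json
from typing import Iterable, Iterator


def filter_logs(lines: Iterable[str], **match: str) -> Iterator[dict]:
    """Yield log records whose fields equal every key=value in match."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines such as stray tracebacks
        if not isinstance(record, dict):
            continue
        # Records missing a requested field never match.
        if all(record.get(key) == value for key, value in match.items()):
            yield record
```

Piping the brain's stdout into a script that calls `filter_logs(sys.stdin, request_id="...")` would yield only the records for that request. Filtering on other fields from the list works the same way; the exact values used for level are not listed here, so check a real log line before matching on it.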

Error tracking / APM

z4j does not bundle a Sentry SDK or any APM client. There is no Z4J_SENTRY_DSN. Application logs go to stdout as JSON and are intended to be shipped by your log pipeline (Fluent Bit / Vector / Loki / Datadog / etc.); add Sentry, Datadog APM, or your tracer of choice at the log-pipeline layer if you want errors aggregated.