
Monitoring

GET /api/v1/health

Returns:

{
"status": "ok",
"version": "<brain package version>",
"uptime_seconds": 4281,
"db": "ok",
"audit_chain_ok": true,
"agents_online": 42
}

status is one of ok, degraded, or down. degraded means a non-critical subsystem is broken (for example, no SMTP channel is reachable for invitation delivery). down means Postgres is unreachable.
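As a minimal sketch of how a monitoring script might act on this payload, the Python below maps the status field to an exit code. The function names and the 0/1/2 exit-code mapping are assumptions made for illustration; they are not part of z4j.

```python
import json
import urllib.request

# Hypothetical mapping from the documented status values to exit codes
# in the common check-script style (0 = OK, 1 = warning, 2 = critical).
EXIT_CODES = {"ok": 0, "degraded": 1, "down": 2}


def classify_health(payload: dict) -> int:
    """Return an exit code for a parsed /api/v1/health response."""
    status = payload.get("status")
    # Treat a missing or unknown status as critical rather than guessing.
    return EXIT_CODES.get(status, 2)


def fetch_health(base_url: str, timeout: float = 5.0) -> dict:
    """GET <base_url>/api/v1/health and parse the JSON body.

    urlopen raises on connection failures and non-2xx responses;
    a caller would normally treat either as "down".
    """
    url = base_url.rstrip("/") + "/api/v1/health"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

A caller could then run `classify_health(fetch_health(base_url))` against its own brain URL and pass the result to `sys.exit`.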

Two companions to the basic liveness endpoint:

GET /api/v1/health/ready
GET /api/v1/health/system

/health/ready returns 200 when the brain can serve traffic (DB pool initialised, migrations at head, registry connected). Pair it with a Kubernetes readinessProbe. /health/system (auth required) returns a deeper snapshot with per-subsystem flags, pool state, and worker presence, which is useful for operators watching the dashboard during an incident.
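Outside Kubernetes (for example in a deploy script), the same readiness check can be polled by hand. The sketch below assumes only what is stated above, that /health/ready answers 200 when the brain is ready. The helper names, retry count, and delay are hypothetical.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def is_ready(base_url: str, timeout: float = 2.0) -> bool:
    """True when GET /api/v1/health/ready answers 200."""
    url = base_url.rstrip("/") + "/api/v1/health/ready"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Non-2xx responses, refused connections, and timeouts
        # are all treated as "not ready".
        return False


def wait_until_ready(probe: Callable[[], bool], attempts: int = 30,
                     delay: float = 1.0) -> bool:
    """Call probe up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

Passing the probe as a callable keeps the retry loop independent of HTTP, so it can be exercised without a running brain: `wait_until_ready(lambda: is_ready(base_url))`.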

Scrape /metrics (Prometheus format, token-gated). Full metric list: metrics API.
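For a quick manual check without a Prometheus server, the exposition text can be fetched and read directly. This is a rough sketch: sending the token as a Bearer header is an assumption (the page only says the endpoint is token-gated), and the small parser handles ordinary sample lines rather than the full exposition format.

```python
import urllib.request


def fetch_metrics(base_url: str, token: str, timeout: float = 5.0) -> str:
    """GET /metrics. The Bearer header is an assumption about how the token is sent."""
    req = urllib.request.Request(
        base_url.rstrip("/") + "/metrics",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def read_samples(text: str, name: str) -> list[tuple[str, float]]:
    """Return (series, value) pairs for one metric name from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        # Sample lines look like: name{labels} value [timestamp]
        if "{" in line:
            end = line.rfind("}") + 1
        else:
            end = line.find(" ")
        if end <= 0:
            continue  # malformed line
        series, rest = line[:end], line[end:].split()
        metric = series.split("{", 1)[0]
        if metric == name and rest:
            samples.append((series, float(rest[0])))
    return samples
```

For instance, `read_samples(text, "z4j_agents_online_total")` returns the current gauge value as a one-element list when the metric is exposed without labels.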

| Alert | Trigger |
| --- | --- |
| Brain down | up{job="z4j"} == 0 for 2m |
| Audit chain broken | z4j_audit_chain_verified == 0 |
| Agents dropping | z4j_agents_online_total falls by > 20% in 5m |
| Action error rate | rate(z4j_actions_total{status="error"}[5m]) > 0.1 |
| Task backlog growing | engine-specific (Celery/RQ/…) |
| High HTTP 5xx | rate(z4j_http_requests_total{status=~"5.."}[5m]) > 0.01 |
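The "Agents dropping" trigger is stated in words rather than as an expression, so the arithmetic may be worth spelling out. The sketch below assumes two samples of z4j_agents_online_total taken five minutes apart; the function names are hypothetical, and in practice the comparison would live in an alerting rule rather than in Python.

```python
def relative_drop(earlier: float, later: float) -> float:
    """Fraction by which a gauge fell between two samples (0.0 if it did not fall)."""
    if earlier <= 0:
        # No agents were online to begin with, so there is nothing to drop from.
        return 0.0
    return max(0.0, (earlier - later) / earlier)


def agents_dropping(earlier: float, later: float, threshold: float = 0.20) -> bool:
    """True when the gauge fell by more than `threshold` between the two samples."""
    return relative_drop(earlier, later) > threshold
```

With 42 agents online five minutes ago, a fall to 33 is a drop of 9/42 (about 21%) and would fire, while a fall to 34 is a drop of 8/42 (about 19%) and would not.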

Logs are written to stdout as JSON. Fields:

  • ts, level, logger, msg
  • request_id (per HTTP request)
  • user_id (when authenticated)
  • project_id
  • agent_id (when relevant)

Ship with Fluent Bit / Vector / Loki / Datadog.
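Before a pipeline is in place, the same fields can be used to slice captured output by hand. The sketch below assumes one JSON object per line, which the page does not state explicitly; `filter_logs` is a hypothetical helper, not a z4j tool.

```python
import json
from typing import Iterable, Iterator


def filter_logs(lines: Iterable[str], **fields: str) -> Iterator[dict]:
    """Yield parsed log records whose fields equal every given key/value."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not valid JSON
        if not isinstance(record, dict):
            continue  # valid JSON, but not a log object
        if all(record.get(key) == value for key, value in fields.items()):
            yield record
```

Reading from `sys.stdin` and calling `filter_logs(sys.stdin, request_id=some_id)` would follow a single HTTP request across loggers; adding `level="error"` narrows it to failures.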

z4j does not bundle a Sentry SDK or any APM client. There is no Z4J_SENTRY_DSN. Application logs go to stdout as JSON and are intended to be shipped by your log pipeline (Fluent Bit / Vector / Loki / Datadog / etc.). If you want errors aggregated, add Sentry, Datadog APM, or your tracer of choice at the log-pipeline layer.