Monitoring

GET /api/v1/health

Returns:

{
  "status": "ok",
  "version": "1.0.0",
  "uptime_seconds": 4281,
  "db": "ok",
  "audit_chain_ok": true,
  "agents_online": 42
}

status is one of ok, degraded, or down. degraded means a non-critical subsystem is broken (e.g. SMTP is unreachable, so invites can't be sent). down means Postgres is unreachable.
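A minimal probe sketch for turning the status field into an exit code, e.g. for a container health check or a cron job. The 0/1/2 mapping and the helper names exit_code_for and probe are assumptions of this example, not part of z4j. The sketch also assumes the endpoint is reachable without credentials, and it treats any transport error or non-JSON body as down.

```python
import json
import urllib.request

# Nagios-style exit codes. This mapping is an assumption of the sketch,
# not something z4j defines.
EXIT_CODES = {"ok": 0, "degraded": 1, "down": 2}


def exit_code_for(payload) -> int:
    """Map a parsed health payload to an exit code; anything unexpected counts as down."""
    status = payload.get("status") if isinstance(payload, dict) else None
    if not isinstance(status, str):
        return 2
    return EXIT_CODES.get(status, 2)


def probe(base_url: str, timeout: float = 5.0) -> int:
    """Fetch GET /api/v1/health from base_url and return an exit code."""
    url = base_url.rstrip("/") + "/api/v1/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return exit_code_for(json.load(resp))
    except (OSError, ValueError):
        # Connection refused, timeout, HTTP error status, or a non-JSON body.
        return 2
```

Call it as sys.exit(probe("http://localhost:8000")), where the base URL is a placeholder for wherever your instance listens.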

Scrape /metrics (Prometheus format, token-gated). Full metric list: metrics API.

Alert                  Trigger
Brain down             up{job="z4j"} == 0 for 2m
Audit chain broken     z4j_audit_chain_verified == 0
Agents dropping        z4j_agents_online_total falls by > 20% in 5m
Action error rate      rate(z4j_actions_total{status="error"}[5m]) > 0.1
Task backlog growing   engine-specific (Celery/RQ/…)
High HTTP 5xx          rate(z4j_http_requests_total{status=~"5.."}[5m]) > 0.01
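The "Agents dropping" trigger is given in words rather than as a query. As a worked example of the arithmetic only, the sketch below takes the condition to be a relative drop against the value five minutes earlier. The function name agents_dropping and the choice to stay silent when the earlier value is zero are assumptions of this example.

```python
def agents_dropping(online_5m_ago: float, online_now: float,
                    threshold: float = 0.20) -> bool:
    """True when the online-agent count fell by more than `threshold` (a fraction)."""
    if online_5m_ago <= 0:
        return False  # nothing to drop from, so no alert
    drop = (online_5m_ago - online_now) / online_5m_ago
    return drop > threshold
```

With 42 agents five minutes ago, falling to 33 is a 21.4% drop and fires. Falling to 34 is a 19.0% drop and does not.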

z4j logs JSON to stdout. Fields:

  • ts, level, logger, msg
  • request_id (per HTTP request)
  • user_id (when authenticated)
  • project_id
  • agent_id (when relevant)

Ship with Fluent Bit / Vector / Loki / Datadog.
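Because every HTTP request carries a request_id, one request can be traced across log lines with a few lines of code. A minimal sketch, assuming one JSON object per line. The helper name lines_for_request is hypothetical, and lines that aren't valid JSON are skipped rather than treated as errors.

```python
import json
from typing import Iterable, Iterator


def lines_for_request(lines: Iterable[str], request_id: str) -> Iterator[dict]:
    """Yield parsed log records whose request_id matches; skip non-JSON lines."""
    for line in lines:
        try:
            record = json.loads(line)
        except ValueError:
            continue  # blank line or other non-JSON output
        if isinstance(record, dict) and record.get("request_id") == request_id:
            yield record
```

To use it, pass sys.stdin (or an open log file) as lines and print rec["ts"], rec["level"], and rec["msg"] for each record it yields.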

z4j doesn’t ship with the Sentry SDK baked in. If you want errors aggregated, set Z4J_SENTRY_DSN. We use the stdlib logging → Sentry integration, nothing custom.