Monitoring
Health endpoint
GET /api/v1/health

Returns:

{ "status": "ok", "version": "1.0.0", "uptime_seconds": 4281, "db": "ok", "audit_chain_ok": true, "agents_online": 42 }

status is one of ok, degraded, or down. degraded means a non-critical subsystem is broken (e.g. SMTP unreachable for invites). down means Postgres is unreachable.
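A health probe for an external monitor could look roughly like the sketch below. It only relies on the status and audit_chain_ok fields shown above. The exit-code mapping, the decision to treat a false audit_chain_ok as a warning, and the function names are assumptions for illustration, not part of z4j.

```python
import json
import urllib.request

# ASSUMPTION: Nagios-style exit codes; z4j does not define this mapping.
EXIT_CODES = {"ok": 0, "degraded": 1, "down": 2}
UNKNOWN = 3


def classify_health(body: str) -> int:
    """Map a /api/v1/health JSON body to an exit code (UNKNOWN if unparseable)."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return UNKNOWN
    if not isinstance(payload, dict):
        return UNKNOWN
    status = payload.get("status")
    code = EXIT_CODES.get(status, UNKNOWN) if isinstance(status, str) else UNKNOWN
    # ASSUMPTION: surface a broken audit chain as a warning even when status is "ok".
    if code == 0 and payload.get("audit_chain_ok") is False:
        code = 1
    return code


def check(base_url: str, timeout: float = 5.0) -> int:
    """Fetch the health endpoint; network and HTTP errors are reported as down."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/v1/health", timeout=timeout) as resp:
            return classify_health(resp.read().decode("utf-8"))
    except OSError:  # URLError and HTTPError are subclasses of OSError
        return EXIT_CODES["down"]
```

Keeping classify_health separate from the network call makes the mapping easy to unit-test without a running instance.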
Metrics
Scrape /metrics (Prometheus format, token-gated). Full metric list: metrics API.
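For a quick manual check outside Prometheus, a scrape can be sketched in Python as below. How the token is presented is not stated here, so the Bearer header is an assumption; confirm it against the metrics API page. The parser handles only the common `name{labels} value [timestamp]` form of the Prometheus text format.

```python
import urllib.request


def build_scrape_request(base_url: str, token: str) -> urllib.request.Request:
    """Build a GET request for /metrics."""
    # ASSUMPTION: the token is sent as an Authorization: Bearer header.
    return urllib.request.Request(
        f"{base_url}/metrics",
        headers={"Authorization": f"Bearer {token}"},
    )


def parse_samples(text: str) -> dict:
    """Parse Prometheus text exposition into {series: value}.

    Comment (# HELP / # TYPE) and blank lines are skipped, as are lines
    whose value is not a number.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if "}" in line:
            # Series name ends at the last closing brace of the label set.
            series, _, rest = line.rpartition("}")
            series += "}"
        else:
            series, _, rest = line.partition(" ")
        fields = rest.split()
        if not fields:
            continue
        try:
            samples[series] = float(fields[0])
        except ValueError:
            continue
    return samples
```

To fetch live data, pass the request to urllib.request.urlopen and feed the decoded body to parse_samples.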
Alerts to set
| Alert | Trigger |
|---|---|
| Brain down | up{job="z4j"} == 0 for 2m |
| Audit chain broken | z4j_audit_chain_verified == 0 |
| Agents dropping | z4j_agents_online_total falls by > 20% in 5m |
| Action error rate | rate(z4j_actions_total{status="error"}[5m]) > 0.1 |
| Task backlog growing | engine-specific (Celery/RQ/…) |
| High HTTP 5xx | rate(z4j_http_requests_total{status=~"5.."}[5m]) > 0.01 |
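The "Agents dropping" trigger is stated in words rather than as an expression. The sketch below spells out the arithmetic it implies, comparing the gauge now with its value 5 minutes earlier. The function names are hypothetical, and the PromQL in the comment is one possible way to express the same condition, not a rule shipped with z4j.

```python
def fractional_drop(previous: float, current: float) -> float:
    """Fraction by which a gauge fell; 0.0 if it rose or previous was not positive."""
    if previous <= 0:
        return 0.0
    return max(0.0, (previous - current) / previous)


def agents_dropping(previous: float, current: float, threshold: float = 0.20) -> bool:
    """True when the gauge fell by strictly more than `threshold` (20% by default).

    One possible PromQL form of the same check (an assumption, adjust as needed):
        z4j_agents_online_total < 0.8 * (z4j_agents_online_total offset 5m)
    """
    return fractional_drop(previous, current) > threshold
```

For example, going from 42 agents to 33 is a drop of about 21% and would fire, while 42 to 34 (about 19%) would not.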
Logs are written as JSON to stdout. Fields:

- ts, level, logger, msg
- request_id (per HTTP request)
- user_id (when authenticated)
- project_id, agent_id (when relevant)
Ship with Fluent Bit / Vector / Loki / Datadog.
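Because every HTTP request carries a request_id, captured output can be regrouped per request when debugging without a log backend. A minimal sketch, assuming one JSON object per line; the function name and any sample records you feed it are hypothetical.

```python
import json
from collections import defaultdict


def group_by_request(lines):
    """Group JSON log records by request_id.

    Lines that are not JSON objects, or that have no string request_id
    (e.g. background jobs), are skipped.
    """
    grouped = defaultdict(list)
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(record, dict):
            continue
        request_id = record.get("request_id")
        if isinstance(request_id, str):
            grouped[request_id].append(record)
    return dict(grouped)
```

Reading from a file or pipe works the same way, since any iterable of lines can be passed in.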
Sentry / APM
z4j doesn’t ship with the Sentry SDK baked in. If you want errors aggregated, set Z4J_SENTRY_DSN. We use the stdlib logging → Sentry integration, nothing custom.