Monitoring
Health endpoint
GET /api/v1/health

Returns:

{
  "status": "ok",
  "version": "<brain package version>",
  "uptime_seconds": 4281,
  "db": "ok",
  "audit_chain_ok": true,
  "agents_online": 42
}

status is one of ok, degraded, or down. degraded means a non-critical subsystem is broken (e.g. no SMTP channel reachable for invitation delivery). down means Postgres is unreachable.
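A minimal sketch of a check script built on this payload, assuming a hypothetical base URL for your deployment. The Nagios-style exit codes, the function names, and the choice to escalate a broken audit chain to a warning are all illustrative assumptions and not part of z4j. How the endpoint signals down at the HTTP level is not stated here, so callers should also be prepared for `urllib.error.URLError`.

```python
import json
import urllib.request

# Assumption: Nagios-style exit codes. z4j itself does not define these.
EXIT_CODES = {"ok": 0, "degraded": 1, "down": 2}


def classify_health(payload: dict) -> int:
    """Map a /api/v1/health payload to an exit code.

    A missing or unknown status is treated as down (2).
    Assumption: a broken audit chain is raised to at least a warning (1).
    """
    code = EXIT_CODES.get(payload.get("status"), 2)
    if payload.get("audit_chain_ok") is False:
        code = max(code, 1)
    return code


def fetch_health(base_url: str, timeout: float = 5.0) -> dict:
    """Fetch the health document. base_url is a placeholder for your deployment."""
    with urllib.request.urlopen(f"{base_url}/api/v1/health", timeout=timeout) as resp:
        return json.load(resp)
```

A cron job or external monitor could call `classify_health(fetch_health(base_url))` and pass the result to `sys.exit`.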
Readiness and detailed health
Two companions to the basic liveness endpoint:

GET /api/v1/health/ready
GET /api/v1/health/system

/health/ready returns 200 when the brain can serve traffic (DB pool initialised, migrations at head, registry connected). Pair it with a Kubernetes readinessProbe. /health/system (auth required) returns a deeper snapshot with per-subsystem flags, pool state, and worker presence, which is useful for operators watching the dashboard during an incident.
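Outside Kubernetes, the same readiness contract can gate a deploy script. The sketch below polls /api/v1/health/ready until it answers 200. The function names, retry counts, and base URL are hypothetical, and any connection error or non-2xx response is simply treated as "not ready yet".

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def is_ready(base_url: str, timeout: float = 2.0) -> bool:
    """Return True when GET /api/v1/health/ready answers 200."""
    url = f"{base_url}/api/v1/health/ready"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Refused connections, timeouts, and 4xx/5xx responses count as not ready.
        return False


def wait_until_ready(probe: Callable[[], bool], attempts: int = 30, interval: float = 2.0) -> bool:
    """Call probe up to `attempts` times, sleeping `interval` seconds between tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            time.sleep(interval)
    return False
```

For example, `wait_until_ready(lambda: is_ready(base_url))` blocks for at most roughly a minute with the defaults shown. Passing the probe as a callable keeps the retry loop testable without a running brain.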
Metrics
Scrape /metrics (Prometheus format, token-gated). Full metric list: metrics API.
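For ad-hoc inspection without a Prometheus server, a short script can fetch and parse the endpoint. This is a sketch under two assumptions: that the token is sent as a Bearer `Authorization` header (the page does not say how the token is presented), and that a simplified parser is good enough. The parser skips comments, ignores optional timestamps, and does not handle label values that contain spaces.

```python
import urllib.request


def parse_metrics(text: str) -> dict:
    """Parse Prometheus text exposition into {series_with_labels: value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line, or a HELP / TYPE comment
        parts = line.split()
        if len(parts) < 2:
            continue  # not a sample line
        # parts[0] is the series (name plus labels), parts[1] the value;
        # an optional trailing timestamp is ignored.
        samples[parts[0]] = float(parts[1])
    return samples


def scrape(base_url: str, token: str, timeout: float = 5.0) -> dict:
    """Fetch /metrics and parse it. base_url and token are placeholders."""
    # Assumption: Bearer-token auth. Check how your deployment gates /metrics.
    req = urllib.request.Request(
        f"{base_url}/metrics",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return parse_metrics(resp.read().decode("utf-8"))
```

For anything beyond a quick look, a real Prometheus scrape job is the better fit, since it handles the full exposition format and keeps history.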
Alerts to set
| Alert | Trigger |
|---|---|
| Brain down | up{job="z4j"} == 0 for 2m |
| Audit chain broken | z4j_audit_chain_verified == 0 |
| Agents dropping | z4j_agents_online_total falls by > 20% in 5m |
| Action error rate | rate(z4j_actions_total{status="error"}[5m]) > 0.1 |
| Task backlog growing | engine-specific (Celery/RQ/…) |
| High HTTP 5xx | rate(z4j_http_requests_total{status=~"5.."}[5m]) > 0.01 |
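
The "Agents dropping" trigger is given in words rather than as a query. The sketch below shows only the arithmetic behind "falls by > 20% in 5m". The function name is hypothetical and the two inputs stand for the gauge value five minutes ago and now. In PromQL, one possible way to express the same comparison is `z4j_agents_online_total < 0.8 * (z4j_agents_online_total offset 5m)`, though the exact rule is left to your alerting setup.

```python
def agents_dropping(previous: float, current: float, threshold: float = 0.20) -> bool:
    """True when `current` is more than `threshold` (a fraction) below `previous`."""
    if previous <= 0:
        # No baseline to compare against, so do not report a drop.
        return False
    drop = (previous - current) / previous
    return drop > threshold
```

With 42 agents online five minutes ago, a fall to 33 is a drop of about 21% and would fire, while a fall to 34 (about 19%) would not.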

Logs
Logs are written as JSON to stdout. Fields:
- ts, level, logger, msg
- request_id (per HTTP request)
- user_id (when authenticated)
- project_id
- agent_id (when relevant)
Ship with Fluent Bit / Vector / Loki / Datadog.
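Because every HTTP request carries a request_id, log lines can be correlated after the fact. A minimal sketch, assuming one JSON object per line and a hypothetical helper name. Lines that are not JSON objects, or that have no request_id, are skipped.

```python
import json
from collections import defaultdict


def group_by_request(lines):
    """Group parsed JSON log records by their request_id field."""
    grouped = defaultdict(list)
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # not a JSON log line
        if isinstance(record, dict) and "request_id" in record:
            grouped[record["request_id"]].append(record)
    return dict(grouped)
```

Feeding it an open log file or `sys.stdin` returns a mapping from each request_id to its records in arrival order, which makes it easy to read one request's trail end to end.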
Error tracking / APM
z4j does not bundle a Sentry SDK or any APM client. There is no Z4J_SENTRY_DSN. Application logs go to stdout as JSON and are intended to be shipped by your log pipeline (Fluent Bit / Vector / Loki / Datadog / etc.); add Sentry, Datadog APM, or your tracer of choice at the log-pipeline layer if you want errors aggregated.
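As an illustration of what "at the log-pipeline layer" can mean, the sketch below selects error-level records from the JSON log stream so they can be handed to whatever aggregator you use. The level names in `ERROR_LEVELS` are an assumption based on common logging vocabulary, and the function names are hypothetical. Nothing here calls a Sentry or Datadog API.

```python
import json

# Assumption: levels use conventional names; adjust to match the real output.
ERROR_LEVELS = {"error", "critical"}


def error_records(lines):
    """Yield parsed log records whose level is error or worse."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON log line
        if not isinstance(record, dict):
            continue
        if str(record.get("level", "")).lower() in ERROR_LEVELS:
            yield record


def forward_errors(src, dst) -> int:
    """Copy error records from src to dst as JSON lines and return the count."""
    count = 0
    for record in error_records(src):
        dst.write(json.dumps(record) + "\n")
        count += 1
    return count
```

Calling `forward_errors(sys.stdin, sys.stdout)` turns this into a filter stage. In practice the same selection is usually expressed as a filter rule in Fluent Bit or Vector rather than as a separate process.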