Metrics API

GET /metrics
Authorization: Bearer <Z4J_METRICS_AUTH_TOKEN>

/metrics is not a versioned REST endpoint; it follows the Prometheus convention of an unversioned path. When Z4J_METRICS_AUTH_TOKEN is set, every request must present it as a bearer token. If it is unset, /metrics is open, and the brain emits a startup WARNING reminding operators to either set the token or block /metrics at the reverse proxy.
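As a rough illustration of the authentication behaviour described above, the sketch below builds a scrape request with Python's standard library. The helper names `build_metrics_request` and `fetch_metrics` are hypothetical and not part of z4j; the only detail taken from this page is that the token travels in an `Authorization: Bearer` header.

```python
import urllib.request
from typing import Optional


def build_metrics_request(base_url: str, token: Optional[str] = None) -> urllib.request.Request:
    """Build a GET request for /metrics.

    The bearer header is added only when a token is supplied, mirroring the
    "open when Z4J_METRICS_AUTH_TOKEN is unset" behaviour described above.
    """
    req = urllib.request.Request(base_url.rstrip("/") + "/metrics", method="GET")
    if token:
        req.add_header("Authorization", f"Bearer {token}")
    return req


def fetch_metrics(base_url: str, token: Optional[str] = None, timeout: float = 5.0) -> str:
    """Fetch the raw Prometheus text exposition body (hypothetical helper)."""
    request = build_metrics_request(base_url, token)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

A caller might run `fetch_metrics("http://z4j.internal:7700", os.environ.get("Z4J_METRICS_AUTH_TOKEN"))`, assuming the brain is reachable at that address and the operator has exported the token in the client's environment.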

  • z4j_events_ingested_total{engine,kind} (counter): wire-level event ingest count, labelled by engine adapter (celery, rq, …) and kind (started, succeeded, failed, retried).
  • z4j_tasks_total{project,engine,kind} (counter): per-project task lifecycle event count.
  • z4j_task_duration_seconds{project,engine} (histogram): task wall-clock duration buckets.
  • z4j_commands_total{project,engine,verb,status} (counter): commands the brain has issued to agents.
  • z4j_command_late_results_total{project,engine} (counter): commands whose result frame arrived after the command’s deadline.
  • z4j_agents_online{project} (gauge): currently connected agent count.
  • z4j_workers_online{project} (gauge): currently online worker count.
  • z4j_queue_depth{project,queue,engine} (gauge): pending messages in a queue.
  • z4j_ws_connections (gauge): live WebSocket connections held by this worker.
  • z4j_db_pool_size (gauge): configured SQLAlchemy pool size, populated at scrape time. Changes only when the pool_size setting changes, so it is useful as a baseline.
  • z4j_db_pool_checked_out (gauge): pool connections currently checked out (in active use). The level it settles at under burst load is a key contention indicator; it should stay well below pool_size.
  • z4j_brain_rss_bytes (gauge): brain process RSS in bytes, sampled at scrape time from /proc/self/status. 0 on non-Linux. The slope under sustained load is the headline leak signal; flat or slow growth is healthy, rapid growth indicates either an unbounded cache (tune Z4J_DATABASE_STATEMENT_CACHE_SIZE) or a new retention path. See Brain memory tuning for the playbook.
  • z4j_postgres_deadlocks_total (counter): total Postgres DeadlockDetectedError instances observed via the asyncpg/SQLAlchemy handle_error event listener. Should hover at or near zero in steady state; sustained non-zero rates on INSERT INTO workers / UPDATE agents / UPDATE queues point at lock-order contention.
  • z4j_notifications_sent_total{project,channel_type,status} (counter): notification deliveries attempted.
  • z4j_notifications_cooldown_skipped_total{project,trigger} (counter): dispatches skipped because the cooldown window had not elapsed.
  • z4j_inmemory_state_items{subsystem} (gauge): per-subsystem item count in the brain’s in-process caches (long-poll signer registry, throttle entries, dashboard subscriptions, …). Sampled at every scrape; lets operators see brain-restart drops at a glance.
  • z4j_swallowed_exceptions_total{module,site} (counter): intentional exception swallows at I/O boundaries (metric updates, WebSocket close during shutdown, asyncpg teardown). A sustained non-zero rate signals a subsystem in trouble even when no error-level log fires.
  • z4j_background_task_error_active{task} (gauge): 1 if the named background task’s most recent pass failed, 0 otherwise.
  • z4j_audit_retention_pruned_total (gauge): cumulative count of audit rows pruned by the retention worker.
  • z4j_audit_retention_last_run_timestamp (gauge): Unix timestamp of the most recent retention pass.
  • z4j_audit_retention_last_deleted (gauge): rows deleted in the most recent pass.
  • z4j_wal_checkpoint_last_run_timestamp (gauge): Unix timestamp of the most recent SQLite WAL checkpoint (0 on Postgres).
  • z4j_wal_checkpoint_pages_last (gauge): pages checkpointed in the most recent SQLite WAL pass.
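To make the pool metrics above concrete, here is a minimal sketch that reads the two unlabelled pool gauges out of a scraped body and reports the fraction of the pool in use. It assumes the standard Prometheus text exposition format; the function names are hypothetical, and the parser deliberately ignores labelled series, so it is not a general-purpose exposition parser.

```python
import re
from typing import Dict, Optional

# Matches unlabelled samples such as "z4j_db_pool_size 20".
# Labelled series (name{...} value) do not match and are skipped.
_SAMPLE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)\s+(\S+)")


def parse_unlabelled_samples(body: str) -> Dict[str, float]:
    """Return {metric_name: value} for every unlabelled sample line."""
    samples: Dict[str, float] = {}
    for raw in body.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = _SAMPLE.match(line)
        if match is None:
            continue
        try:
            samples[match.group(1)] = float(match.group(2))
        except ValueError:
            continue  # not a numeric sample; ignore it
    return samples


def pool_utilisation(samples: Dict[str, float]) -> Optional[float]:
    """Fraction of the pool checked out, or None if it cannot be computed."""
    size = samples.get("z4j_db_pool_size")
    used = samples.get("z4j_db_pool_checked_out")
    if not size or used is None:
        return None
    return used / size
```

A value that stays close to 1.0 during bursts would suggest the contention described for z4j_db_pool_checked_out; any alerting threshold is left to the operator.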

A shareable Grafana dashboard ships in the repo at docs/perf/grafana-dashboard.json. To import it, open Dashboards > New > Import in Grafana and upload the JSON file. The dashboard includes four panels:

  1. Brain process RSS plus 5-min slope (the headline leak signal).
  2. Postgres deadlock rate over 5-min windows.
  3. DB pool: configured size vs current checked-out count.
  4. Event ingest rate and total agents online (for correlation against the leak signal).
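The "slope" in the first panel can be approximated outside Grafana as well. The sketch below fits a least-squares line through (timestamp, RSS) samples collected over a window and returns the growth rate in bytes per second. This is a standalone approximation, similar in spirit to PromQL's `deriv()`; it is not the query the shipped dashboard uses, and the function name is hypothetical.

```python
from typing import Sequence, Tuple


def slope_bytes_per_second(samples: Sequence[Tuple[float, float]]) -> float:
    """Least-squares slope of (unix_timestamp, rss_bytes) samples.

    Raises ValueError when fewer than two samples are given or when all
    samples share the same timestamp, since no slope is defined then.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("all samples share one timestamp")
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    return cov / var_t
```

Feeding it one z4j_brain_rss_bytes reading per scrape over a five-minute window gives a number comparable to the panel's slope: near zero when memory is flat, persistently positive when it is growing.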

Configure Prometheus to scrape the brain’s /metrics endpoint, presenting Z4J_METRICS_AUTH_TOKEN as the bearer credential:

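prometheus.yml
scrape_configs:
  - job_name: z4j
    metrics_path: /metrics
    authorization:
      type: Bearer
      # Prometheus does not expand environment variables in this file;
      # substitute the actual token value here, or use credentials_file.
      credentials: $Z4J_METRICS_AUTH_TOKEN
    static_configs:
      - targets: ["z4j.internal:7700"]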