Metrics API
Endpoint
    GET /metrics
    Authorization: Bearer <Z4J_METRICS_AUTH_TOKEN>

Not a REST-versioned endpoint (Prometheus convention). When Z4J_METRICS_AUTH_TOKEN is set, the request must present it; if unset, /metrics is open (the brain emits a startup WARNING reminding operators to either set the token or block /metrics at the reverse proxy).
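As a rough illustration of the request described above, the following sketch fetches /metrics using only the Python standard library. The helper names `build_metrics_request` and `fetch_metrics` are hypothetical and not part of z4j; the token is sent only when one is supplied, mirroring the "open if unset" behaviour.

```python
import urllib.request
from typing import Optional


def build_metrics_request(base_url: str, token: Optional[str] = None) -> urllib.request.Request:
    """Build a GET request for /metrics, adding the bearer token when one is given."""
    request = urllib.request.Request(base_url.rstrip("/") + "/metrics", method="GET")
    if token:
        # Hypothetical client side: the value should match Z4J_METRICS_AUTH_TOKEN on the brain.
        request.add_header("Authorization", f"Bearer {token}")
    return request


def fetch_metrics(base_url: str, token: Optional[str] = None, timeout: float = 5.0) -> str:
    """Return the raw Prometheus text exposition body served by /metrics."""
    with urllib.request.urlopen(build_metrics_request(base_url, token), timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

A call might look like `fetch_metrics("http://z4j.internal:7700", os.environ.get("Z4J_METRICS_AUTH_TOKEN"))`, where the host and port are placeholders for your own deployment.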
Catalog
Ingest and lifecycle
- z4j_events_ingested_total{engine,kind} (counter): wire-level event ingest count, labelled by engine adapter (celery, rq, …) and kind (started, succeeded, failed, retried).
- z4j_tasks_total{project,engine,kind} (counter): per-project task lifecycle event count.
- z4j_task_duration_seconds{project,engine} (histogram): task wall-clock duration buckets.
- z4j_commands_total{project,engine,verb,status} (counter): commands the brain has issued to agents.
- z4j_command_late_results_total{project,engine} (counter): commands whose result frame arrived after the command’s deadline.
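To show how the labelled counters above can be consumed outside Prometheus, here is a simplified sketch that sums one metric by a single label from a scraped text body. The function `sum_by_label` is hypothetical, the sample values are made up, and the parser is deliberately minimal: it assumes label values contain no escaped quotes or `}` characters.

```python
import re
from collections import defaultdict
from typing import Dict

# Matches one labelled sample line such as: name{a="x",b="y"} 12
_SAMPLE = re.compile(r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)')
_LABEL = re.compile(r'(\w+)="([^"]*)"')


def sum_by_label(text: str, metric: str, label: str) -> Dict[str, float]:
    """Sum the samples of `metric`, grouped by the value of one label."""
    totals: Dict[str, float] = defaultdict(float)
    for line in text.splitlines():
        match = _SAMPLE.match(line.strip())
        if not match or match.group("name") != metric:
            continue  # skips comments, blank lines, and other metrics
        labels = dict(_LABEL.findall(match.group("labels")))
        if label in labels:
            totals[labels[label]] += float(match.group("value"))
    return dict(totals)


# Made-up scrape output, for illustration only.
SAMPLE_SCRAPE = """
# TYPE z4j_events_ingested_total counter
z4j_events_ingested_total{engine="celery",kind="succeeded"} 90
z4j_events_ingested_total{engine="celery",kind="failed"} 10
z4j_events_ingested_total{engine="rq",kind="failed"} 5
"""

by_kind = sum_by_label(SAMPLE_SCRAPE, "z4j_events_ingested_total", "kind")
```

Because these are counters, the absolute sums are less informative than their change between two scrapes; a real dashboard would more likely rely on a rate over a time window.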
Agents, workers, queues
- z4j_agents_online{project} (gauge): currently connected agent count.
- z4j_workers_online{project} (gauge): currently online worker count.
- z4j_queue_depth{project,queue,engine} (gauge): pending messages in a queue.
- z4j_ws_connections (gauge): live WebSocket connections held by this worker.
Database
- z4j_db_pool_size (gauge): configured SQLAlchemy pool size. Populated at scrape time. Changes only when the pool_size setting changes; useful as a baseline.
- z4j_db_pool_checked_out (gauge): pool connections currently checked out (in active use). Its steady-state value under burst load is a key contention indicator; it should stay well below pool_size.
- z4j_brain_rss_bytes (gauge): brain process RSS in bytes, sampled at scrape time from /proc/self/status; 0 on non-Linux. The slope under sustained load is the headline leak signal: flat or slow growth is healthy, while rapid growth indicates either an unbounded cache (tune Z4J_DATABASE_STATEMENT_CACHE_SIZE) or a new retention path. See Brain memory tuning for the playbook.
- z4j_postgres_deadlocks_total (counter): total Postgres DeadlockDetectedError instances observed via the asyncpg/SQLAlchemy handle_error event listener. Should hover at or near zero in steady state; sustained non-zero rates on INSERT INTO workers / UPDATE agents / UPDATE queues point at lock-order contention.
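The two derived signals mentioned above, RSS slope and pool utilization, can be sketched as follows. This assumes you already hold (Unix time, value) samples of z4j_brain_rss_bytes from successive scrapes; the function names are hypothetical, and the least-squares fit is one reasonable choice rather than the method the dashboard necessarily uses.

```python
from typing import Sequence, Tuple


def slope_per_second(samples: Sequence[Tuple[float, float]]) -> float:
    """Least-squares slope of (unix_time, value) samples, in value units per second."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    denom = sum((t - mean_t) ** 2 for t, _ in samples)
    if denom == 0:
        raise ValueError("samples must span more than one timestamp")
    numer = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    return numer / denom


def pool_utilization(checked_out: float, pool_size: float) -> float:
    """Fraction of the configured pool that is currently checked out."""
    if pool_size <= 0:
        raise ValueError("pool_size must be positive")
    return checked_out / pool_size
```

For example, `slope_per_second([(0, 100.0), (60, 160.0), (120, 220.0)])` yields 1.0 byte per second. What counts as "rapid" growth or "too high" utilization depends on the deployment, so any alert threshold is left to the operator.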
Notifications
- z4j_notifications_sent_total{project,channel_type,status} (counter): notification deliveries attempted.
- z4j_notifications_cooldown_skipped_total{project,trigger} (counter): dispatches skipped because the cooldown window had not elapsed.
In-memory state
- z4j_inmemory_state_items{subsystem} (gauge): per-subsystem item count in the brain’s in-process caches (long-poll signer registry, throttle entries, dashboard subscriptions, …). Sampled at every scrape; lets operators see brain-restart drops at a glance.
Reliability and self-watch
- z4j_swallowed_exceptions_total{module,site} (counter): intentional exception swallows at I/O boundaries (metric updates, WebSocket close during shutdown, asyncpg teardown). A sustained non-zero rate signals a subsystem in trouble even when no error-level log fires.
- z4j_background_task_error_active{task} (gauge): 1 if the named background task’s most recent pass failed, 0 otherwise.
Audit retention and WAL
- z4j_audit_retention_pruned_total (gauge): cumulative count of audit rows pruned by the retention worker.
- z4j_audit_retention_last_run_timestamp (gauge): Unix timestamp of the most recent retention pass.
- z4j_audit_retention_last_deleted (gauge): rows deleted in the most recent pass.
- z4j_wal_checkpoint_last_run_timestamp (gauge): Unix timestamp of the most recent SQLite WAL checkpoint (0 on Postgres).
- z4j_wal_checkpoint_pages_last (gauge): pages checkpointed in the most recent SQLite WAL pass.
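The "last run" timestamp gauges above lend themselves to a staleness check. The sketch below is a hypothetical helper, not part of z4j. It treats a value of 0 as "no run recorded or not applicable", which matches the documented Postgres behaviour for the WAL gauge but is an assumption for the retention gauge; `max_age_seconds` is an operator-chosen threshold.

```python
import time
from typing import Optional


def seconds_since_last_run(last_run_timestamp: float, now: Optional[float] = None) -> Optional[float]:
    """Age of the most recent pass in seconds, or None when the gauge reads 0."""
    if last_run_timestamp <= 0:
        return None  # assumption: 0 means no run recorded or not applicable
    current = time.time() if now is None else now
    return max(0.0, current - last_run_timestamp)


def is_stale(last_run_timestamp: float, max_age_seconds: float, now: Optional[float] = None) -> bool:
    """True when a run was recorded but is older than the allowed age."""
    age = seconds_since_last_run(last_run_timestamp, now)
    return age is not None and age > max_age_seconds
```

Passing `now` explicitly keeps the check deterministic in tests; in normal use it defaults to the current wall-clock time.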
Grafana dashboard
A shareable Grafana dashboard ships in the repo at docs/perf/grafana-dashboard.json. Import it in Grafana via Dashboards, New, Import, then upload the JSON. The dashboard includes four panels:
- Brain process RSS plus 5-min slope (the headline leak signal).
- Postgres deadlock rate over 5-min windows.
- DB pool: configured size vs current checked-out count.
- Event ingest rate and total agents online (for correlation against the leak signal).
Configure Prometheus to scrape the brain’s /metrics, presenting Z4J_METRICS_AUTH_TOKEN as the bearer token:
    scrape_configs:
      - job_name: z4j
        metrics_path: /metrics
        authorization:
          type: Bearer
          credentials: $Z4J_METRICS_AUTH_TOKEN
        static_configs:
          - targets: ["z4j.internal:7700"]