Metrics API

Endpoint

GET /metrics
Authorization: Bearer <Z4J_METRICS_AUTH_TOKEN>

This is not a REST-versioned endpoint; the bare /metrics path follows Prometheus convention. When Z4J_METRICS_AUTH_TOKEN is set, every request must present it as a bearer token. When it is unset, /metrics is open, and the brain emits a startup WARNING reminding operators to either set the token or block /metrics at the reverse proxy.
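As a rough illustration of the authentication rule above, the following Python sketch builds a request for /metrics and attaches the bearer header only when a token is configured. The helper names (build_metrics_request, fetch_metrics) are hypothetical and not part of z4j, and the base URL is whatever address your brain listens on.

```python
import urllib.request
from typing import Optional


def build_metrics_request(base_url: str, token: Optional[str]) -> urllib.request.Request:
    """Build a GET request for <base_url>/metrics.

    The Authorization header is added only when a token is supplied,
    mirroring the "open when unset" behaviour described above.
    """
    req = urllib.request.Request(base_url.rstrip("/") + "/metrics", method="GET")
    if token:
        req.add_header("Authorization", f"Bearer {token}")
    return req


def fetch_metrics(base_url: str, token: Optional[str], timeout: float = 5.0) -> str:
    """Fetch the exposition text; raises urllib.error.HTTPError on a rejected token."""
    with urllib.request.urlopen(build_metrics_request(base_url, token), timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

A caller would typically read the token from the environment, for example fetch_metrics("http://z4j.internal:7700", os.environ.get("Z4J_METRICS_AUTH_TOKEN")).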

Catalog

Ingest and lifecycle

z4j_events_ingested_total{engine,kind} (counter): wire-level event ingest count, labelled by engine adapter (celery, rq, …) and kind (started, succeeded, failed, retried).
z4j_tasks_total{project,engine,kind} (counter): per-project task lifecycle event count.
z4j_task_duration_seconds{project,engine} (histogram): task wall-clock duration buckets.
z4j_commands_total{project,engine,verb,status} (counter): commands the brain has issued to agents.
z4j_command_late_results_total{project,engine} (counter): commands whose result frame arrived after the command’s deadline.
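To show how the labels on z4j_events_ingested_total can be used, here is a minimal Python sketch that parses Prometheus exposition text and sums the counter by kind across engines. The sample payload and its values are invented for illustration, the parser handles only simple sample lines, and the helper names are hypothetical.

```python
import re
from collections import defaultdict

# Matches `name{labels} value` or `name value`; a trailing timestamp is ignored.
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')

# Hypothetical scrape output; real values will differ.
SAMPLE = """\
# TYPE z4j_events_ingested_total counter
z4j_events_ingested_total{engine="celery",kind="started"} 120
z4j_events_ingested_total{engine="celery",kind="succeeded"} 110
z4j_events_ingested_total{engine="celery",kind="failed"} 6
z4j_events_ingested_total{engine="rq",kind="succeeded"} 30
z4j_events_ingested_total{engine="rq",kind="failed"} 4
"""


def parse_samples(text):
    """Yield (name, labels, value) for each sample line, skipping comments."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        yield match.group("name"), labels, float(match.group("value"))


def events_by_kind(text):
    """Sum z4j_events_ingested_total over all engines, grouped by kind."""
    totals = defaultdict(float)
    for name, labels, value in parse_samples(text):
        if name == "z4j_events_ingested_total":
            totals[labels.get("kind", "")] += value
    return dict(totals)


def failure_ratio(totals):
    """Share of finished events that failed; 0.0 when nothing has finished."""
    failed = totals.get("failed", 0.0)
    finished = failed + totals.get("succeeded", 0.0)
    return failed / finished if finished else 0.0
```

Because these are counters, the totals are cumulative since the brain started; for alerting you would normally compare two scrapes rather than read a single one.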

Agents, workers, queues

z4j_agents_online{project} (gauge): currently connected agent count.
z4j_workers_online{project} (gauge): currently online worker count.
z4j_queue_depth{project,queue,engine} (gauge): pending messages in a queue.
z4j_ws_connections (gauge): live WebSocket connections held by this worker.

Database

z4j_db_pool_size (gauge): configured SQLAlchemy pool size, populated at scrape time. It changes only when the pool_size setting changes, so it serves as a baseline for z4j_db_pool_checked_out.
z4j_db_pool_checked_out (gauge): pool connections currently checked out (in active use). The level it settles at under burst load is a key contention indicator; it should stay well below pool_size.
z4j_brain_rss_bytes (gauge): brain process RSS in bytes, sampled at scrape time from /proc/self/status; reports 0 on non-Linux hosts. The slope under sustained load is the headline leak signal. Flat or slow growth is healthy; rapid growth indicates either an unbounded cache (tune Z4J_DATABASE_STATEMENT_CACHE_SIZE) or a new retention path. See Brain memory tuning for the playbook.
z4j_postgres_deadlocks_total (counter): total Postgres DeadlockDetectedError instances observed via the asyncpg/SQLAlchemy handle_error event listener. Should hover at or near zero in steady state; sustained non-zero rates on INSERT INTO workers / UPDATE agents / UPDATE queues point at lock-order contention.
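The two derived signals mentioned above, pool utilisation and RSS slope, can be computed from scraped values roughly as follows. This is a sketch under assumptions: the function names are hypothetical, the inputs are values you have already collected from successive scrapes, and zero RSS readings are dropped because the gauge reports 0 on non-Linux hosts.

```python
def pool_utilisation(checked_out: float, pool_size: float) -> float:
    """Fraction of the configured pool currently checked out (0.0 if size is unknown)."""
    if pool_size <= 0:
        return 0.0
    return checked_out / pool_size


def rss_slope_bytes_per_second(samples):
    """Least-squares slope of (unix_timestamp, rss_bytes) samples.

    Zero readings are skipped, since z4j_brain_rss_bytes is 0 on non-Linux.
    Returns 0.0 when fewer than two usable samples remain.
    """
    usable = [(t, v) for t, v in samples if v > 0]
    n = len(usable)
    if n < 2:
        return 0.0
    mean_t = sum(t for t, _ in usable) / n
    mean_v = sum(v for _, v in usable) / n
    denom = sum((t - mean_t) ** 2 for t, _ in usable)
    if denom == 0:
        return 0.0
    return sum((t - mean_t) * (v - mean_v) for t, v in usable) / denom
```

A slope near zero over a long window matches the "flat or slow growth" case; what counts as rapid growth depends on your workload and is not defined here.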

Notifications

z4j_notifications_sent_total{project,channel_type,status} (counter): notification deliveries attempted.
z4j_notifications_cooldown_skipped_total{project,trigger} (counter): dispatches skipped because the cooldown window had not elapsed.

In-memory state

z4j_inmemory_state_items{subsystem} (gauge): per-subsystem item count in the brain’s in-process caches (long-poll signer registry, throttle entries, dashboard subscriptions, …). Sampled at every scrape; lets operators see brain-restart drops at a glance.

Reliability and self-watch

z4j_swallowed_exceptions_total{module,site} (counter): intentional exception swallows at I/O boundaries (metric updates, WebSocket close during shutdown, asyncpg teardown). A sustained non-zero rate signals a subsystem in trouble even when no error-level log fires.
z4j_background_task_error_active{task} (gauge): 1 if the named background task’s most recent pass failed, 0 otherwise.
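One way to turn these two metrics into signals is sketched below in Python: a per-second rate for a counter such as z4j_swallowed_exceptions_total that tolerates a counter reset (for example after a brain restart), and a filter that lists tasks whose gauge reads 1. The function names are hypothetical, and the reset handling follows the usual Prometheus convention of treating a drop as a restart from zero.

```python
def counter_increase(previous: float, current: float) -> float:
    """Increase between two scrapes of a counter.

    If the value went down, assume the process restarted and the counter
    began again from zero, so the increase is the current value.
    """
    if current < previous:
        return current
    return current - previous


def per_second_rate(previous: float, current: float, interval_seconds: float) -> float:
    """Average per-second rate over the scrape interval."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return counter_increase(previous, current) / interval_seconds


def failing_tasks(gauge_by_task):
    """Names of background tasks whose most recent pass failed (gauge == 1)."""
    return sorted(task for task, value in gauge_by_task.items() if value == 1.0)
```

For example, failing_tasks({"task_a": 0.0, "task_b": 1.0}) returns ["task_b"]; the task names here are placeholders, not actual z4j task labels.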

Audit retention and WAL

z4j_audit_retention_pruned_total (gauge): cumulative count of audit rows pruned by the retention worker.
z4j_audit_retention_last_run_timestamp (gauge): Unix timestamp of the most recent retention pass.
z4j_audit_retention_last_deleted (gauge): rows deleted in the most recent pass.
z4j_wal_checkpoint_last_run_timestamp (gauge): Unix timestamp of the most recent SQLite WAL checkpoint (0 on Postgres).
z4j_wal_checkpoint_pages_last (gauge): pages checkpointed in the most recent SQLite WAL pass.
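The two last-run timestamps lend themselves to a staleness check. The Python sketch below is illustrative only: the function name and the threshold are hypothetical, and it treats a timestamp of 0 as "not applicable" (as documented for the WAL checkpoint on Postgres; applying the same reading to the retention timestamp is an assumption).

```python
import time
from typing import Optional


def is_stale(last_run_timestamp: float, max_age_seconds: float,
             now: Optional[float] = None) -> Optional[bool]:
    """Return True if the last run is older than max_age_seconds.

    Returns None when the timestamp is 0, which this sketch interprets as
    "no run recorded or not applicable" rather than "very old".
    """
    if last_run_timestamp <= 0:
        return None
    current = time.time() if now is None else now
    return (current - last_run_timestamp) > max_age_seconds
```

Pick max_age_seconds from your own retention and checkpoint schedule; this document does not specify one.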

Grafana dashboard

A shareable Grafana dashboard ships in the repo at docs/perf/grafana-dashboard.json. To import it in Grafana, go to Dashboards > New > Import and upload the JSON file. The dashboard includes four panels:

Brain process RSS plus 5-min slope (the headline leak signal).
Postgres deadlock rate over 5-min windows.
DB pool: configured size vs current checked-out count.
Event ingest rate and total agents online (for correlation against the leak signal).

Configure Prometheus to scrape the brain’s /metrics endpoint, presenting the Z4J_METRICS_AUTH_TOKEN value as the bearer credential:

scrape_configs:
  - job_name: z4j
    metrics_path: /metrics
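    authorization:
      type: Bearer
      # Prometheus does not expand environment variables in its config file.
      # Put the literal token value here, or use credentials_file instead.
      credentials: <Z4J_METRICS_AUTH_TOKEN>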
    static_configs:
      - targets: ["z4j.internal:7700"]