Brain memory tuning

Under sustained burst load, asyncpg’s per-connection prepared-statement cache can hold tens of thousands of compiled statements in C memory across the pool while Python heap stays flat. The settings below cap that cache and rotate connections often enough that brain RSS stays bounded. This page covers the knobs and the metrics to watch.

Defaults

Setting	Default	What it caps
`Z4J_DATABASE_STATEMENT_CACHE_SIZE`	`50`	Prepared statements retained per asyncpg connection. `0` disables caching entirely.
`Z4J_DATABASE_MAX_INACTIVE_CONNECTION_LIFETIME_SECONDS`	`60`	Seconds an idle connection lives in the SQLAlchemy pool before it is recycled. Recycling drops the per-connection cache as a side effect.

These defaults target a sustained mixed-query brain on Postgres with the stock 20+10 pool. They hold RSS slope well below the asyncpg-default cache (statement_cache_size=100 per connection, no LRU cap) while preserving the prepare-once speedup on hot queries.

When to lower the cache size

Drop Z4J_DATABASE_STATEMENT_CACHE_SIZE toward 0 when:

The brain is colocated on a memory-constrained host (under 1 GiB available).
You scrape z4j_brain_rss_bytes and see slow growth that does not flatten between traffic peaks.
Your workload is “many distinct queries, low repeat rate” rather than “few hot queries hit constantly”. The cache only helps when the same prepared statement gets reused.

Cost: each query pays the prepare round-trip again, and brain WebSocket throughput drops measurably. Treat 0 as opt-in for memory-constrained deploys only; the default of 50 is the right starting point for most operators.

When to shorten `pool_recycle`

Shorten Z4J_DATABASE_MAX_INACTIVE_CONNECTION_LIFETIME_SECONDS below 60 when:

You run with Z4J_DATABASE_STATEMENT_CACHE_SIZE at its default and still want a faster cache eviction floor.
Your traffic is bursty enough that connections sit idle for long stretches between bursts, accumulating stale prepared plans.

Cost: more connect/reconnect churn against Postgres. Below about 30s you start to see meaningful overhead from the asyncpg/Postgres handshake.

What to watch

Scrape /metrics and chart these three signals together:

z4j_brain_rss_bytes (gauge, sampled from /proc/self/status at scrape time). The slope is the headline leak signal. Flat or slow growth is healthy; rapid growth points at an unbounded cache or a new retention path. 0 on non-Linux.
z4j_db_pool_checked_out vs z4j_db_pool_size (gauges). Steady-state checked-out should be well below pool size. If it pins to the max, the cache playbook will not help; you need more pool capacity or fewer concurrent writers.
z4j_postgres_deadlocks_total (counter, wired via the SQLAlchemy handle_error listener). Should hover at or near zero in steady state. Sustained non-zero rates on INSERT INTO workers, UPDATE agents, or UPDATE queues indicate lock-order contention; check that any new write site sorts its keys before INSERT/UPDATE.

The shipped Grafana dashboard at docs/perf/grafana-dashboard.json renders all three plus the event-ingest rate for correlation. Import it via Dashboards, New, Import.

Pool sizing

pool_size=20 and max_overflow=10 are hard-coded at engine creation and not currently exposed as settings. Contention symptoms (sustained high z4j_db_pool_checked_out) therefore require a code change rather than a config change. Most deployments stay comfortably below saturation; if yours does not, file an issue with your /metrics output.