Brain memory tuning
Under sustained burst load, asyncpg’s per-connection prepared-statement cache can hold tens of thousands of compiled statements in C memory across the pool while Python heap stays flat. The settings below cap that cache and rotate connections often enough that brain RSS stays bounded. This page covers the knobs and the metrics to watch.
Defaults
| Setting | Default | What it caps |
|---|---|---|
| Z4J_DATABASE_STATEMENT_CACHE_SIZE | 50 | Prepared statements retained per asyncpg connection. 0 disables caching entirely. |
| Z4J_DATABASE_MAX_INACTIVE_CONNECTION_LIFETIME_SECONDS | 60 | Seconds an idle connection lives in the SQLAlchemy pool before it is recycled. Recycling drops the per-connection cache as a side effect. |
These defaults target a sustained mixed-query brain on Postgres with the stock 20+10 pool. They hold RSS slope well below the asyncpg-default cache (statement_cache_size=100 per connection, no LRU cap) while preserving the prepare-once speedup on hot queries.
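As a rough sanity check, the worst-case number of cached statements is the connection ceiling multiplied by the per-connection cache size. The sketch below reads the cache-size setting from the environment and computes that bound. The environment variable name, its default, and the 20+10 pool shape come from this page; the helper names `read_int_setting` and `max_cached_statements` are hypothetical, and this is not how the brain itself loads its settings.

```python
import os

# Pool shape stated on this page: pool_size=20, max_overflow=10.
POOL_SIZE = 20
MAX_OVERFLOW = 10


def read_int_setting(name, default, env=None):
    """Read a non-negative integer setting, falling back to the default.

    Hypothetical helper for illustration only.
    """
    env = os.environ if env is None else env
    raw = env.get(name)
    if raw is None or raw.strip() == "":
        return default
    value = int(raw)
    if value < 0:
        raise ValueError(f"{name} must be >= 0, got {value}")
    return value


def max_cached_statements(cache_size):
    """Upper bound on prepared statements held across a fully grown pool."""
    return (POOL_SIZE + MAX_OVERFLOW) * cache_size


if __name__ == "__main__":
    cache_size = read_int_setting("Z4J_DATABASE_STATEMENT_CACHE_SIZE", 50)
    print(max_cached_statements(cache_size))
```

With the default of 50 and all 30 connections open, the bound is 1500 cached statements. Setting the cache size to 0 brings the bound to 0.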
When to lower the cache size
Drop Z4J_DATABASE_STATEMENT_CACHE_SIZE toward 0 when:
- The brain is colocated on a memory-constrained host (under 1 GiB available).
- You scrape z4j_brain_rss_bytes and see slow growth that does not flatten between traffic peaks.
- Your workload is “many distinct queries, low repeat rate” rather than “few hot queries hit constantly”. The cache only helps when the same prepared statement gets reused.
Cost: each query pays the prepare round-trip again, and brain WebSocket throughput drops measurably. Treat 0 as opt-in for memory-constrained deploys only; the default of 50 is the right starting point for most operators.
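To judge whether a workload is "few hot queries" or "many distinct queries", one option is to replay a sample of SQL strings through a small cache model and look at the hit rate. The sketch below assumes a least-recently-used cache keyed on the exact SQL text. That is a simplification for illustration, not a description of asyncpg internals, and `lru_hit_rate` is a hypothetical helper.

```python
from collections import OrderedDict


def lru_hit_rate(queries, cache_size):
    """Fraction of queries that would be served from an LRU cache.

    Simplified model: one cache, keyed on the exact SQL text.
    """
    if not queries or cache_size <= 0:
        return 0.0  # nothing to replay, or caching disabled
    cache = OrderedDict()
    hits = 0
    for sql in queries:
        if sql in cache:
            hits += 1
            cache.move_to_end(sql)  # mark as most recently used
        else:
            cache[sql] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict the least recently used
    return hits / len(queries)


if __name__ == "__main__":
    hot = ["SELECT 1", "SELECT 2"] * 50
    distinct = [f"SELECT {i}" for i in range(100)]
    print(lru_hit_rate(hot, 50))
    print(lru_hit_rate(distinct, 50))
```

A hit rate near zero at the current cache size suggests the cache is costing memory without saving prepare round-trips, which is the case where lowering the setting is cheapest.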
When to shorten pool_recycle
Shorten Z4J_DATABASE_MAX_INACTIVE_CONNECTION_LIFETIME_SECONDS below 60 when:
- You run with Z4J_DATABASE_STATEMENT_CACHE_SIZE at its default and still want a faster cache eviction floor.
- Your traffic is bursty enough that connections sit idle for long stretches between bursts, accumulating stale prepared plans.
Cost: more connect/reconnect churn against Postgres. Below about 30s you start to see meaningful overhead from the asyncpg/Postgres handshake.
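For a feel of the churn involved, a crude upper bound is that each connection is recycled at most once per lifetime window. The sketch below computes that bound. It assumes recycling is the only cause of reconnects, and `max_reconnects_per_second` is a hypothetical helper, not part of the brain.

```python
def max_reconnects_per_second(connections, recycle_seconds):
    """Upper bound on reconnects per second if every connection is
    recycled once per lifetime window (assumed worst case)."""
    if recycle_seconds <= 0:
        raise ValueError("recycle_seconds must be positive")
    return connections / recycle_seconds


if __name__ == "__main__":
    # 30 connections = pool_size 20 + max_overflow 10, as stated on this page.
    print(max_reconnects_per_second(30, 60))
    print(max_reconnects_per_second(30, 30))
```

Halving the lifetime doubles the bound, so the handshake cost scales inversely with the setting.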
What to watch
Scrape /metrics and chart these three signals together:
- z4j_brain_rss_bytes (gauge, sampled from /proc/self/status at scrape time). The slope is the headline leak signal. Flat or slow growth is healthy; rapid growth points at an unbounded cache or a new retention path. 0 on non-Linux.
- z4j_db_pool_checked_out vs z4j_db_pool_size (gauges). Steady-state checked-out should be well below pool size. If it pins to the max, the cache playbook will not help; you need more pool capacity or fewer concurrent writers.
- z4j_postgres_deadlocks_total (counter, wired via the SQLAlchemy handle_error listener). Should hover at or near zero in steady state. Sustained non-zero rates on INSERT INTO workers, UPDATE agents, or UPDATE queues indicate lock-order contention; check that any new write site sorts its keys before INSERT/UPDATE.
The shipped Grafana dashboard at docs/perf/grafana-dashboard.json renders all three plus the event-ingest rate for correlation. Import it via Dashboards, New, Import.
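If you want to check these signals without Grafana, a minimal sketch follows. It parses samples out of a /metrics response body, then computes pool utilization and a least-squares RSS slope over several scrapes. It assumes the metrics are exposed as unlabelled samples in the Prometheus text format; labelled samples are skipped. The function names are hypothetical and the sample values in the usage section are made up for illustration.

```python
def parse_metrics(text):
    """Parse unlabelled samples from Prometheus text exposition format."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "{" in line:
            continue  # skip blanks, HELP/TYPE comments, labelled samples
        parts = line.split()
        if len(parts) < 2:
            continue
        try:
            samples[parts[0]] = float(parts[1])
        except ValueError:
            continue  # ignore values this sketch cannot parse
    return samples


def pool_utilization(samples):
    """Checked-out connections as a fraction of the reported pool size."""
    return samples["z4j_db_pool_checked_out"] / samples["z4j_db_pool_size"]


def slope_per_second(points):
    """Least-squares slope of (timestamp_seconds, value) pairs."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    mean_t = sum(t for t, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in points)
    den = sum((t - mean_t) ** 2 for t, _ in points)
    if den == 0:
        raise ValueError("timestamps must not all be equal")
    return num / den


if __name__ == "__main__":
    body = (
        "# TYPE z4j_brain_rss_bytes gauge\n"
        "z4j_brain_rss_bytes 1000\n"
        "z4j_db_pool_checked_out 5\n"
        "z4j_db_pool_size 20\n"
    )
    samples = parse_metrics(body)
    print(pool_utilization(samples))
    # (scrape time in seconds, z4j_brain_rss_bytes) from three scrapes
    print(slope_per_second([(0, 100), (10, 200), (20, 300)]))
```

A slope that stays positive across several traffic troughs is the pattern described above as a leak signal; a single short window is usually too noisy to act on.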
Pool sizing
pool_size=20 and max_overflow=10 are hard-coded at engine creation and not currently exposed as settings. Contention symptoms (sustained high z4j_db_pool_checked_out) therefore require a code change rather than a config change. Most deployments stay comfortably below saturation; if yours does not, file an issue with your /metrics output.