OpenTelemetry
z4j brain ships an optional OpenTelemetry hook. With an OTLP endpoint configured, FastAPI HTTP server requests, SQLAlchemy queries, and outbound httpx calls are traced and exported to a collector of your choosing. The integration is off by default, opt-in via a single env var, and the SDK is loaded lazily so a misconfiguration cannot prevent boot.
What gets traced
| Source | Span kind | Default sample rate |
|---|---|---|
| FastAPI HTTP requests | server | 0 (set Z4J_OTEL_TRACES_SAMPLER_ARG=0.05 to collect 5%) |
| SQLAlchemy queries against the brain’s primary engine | client (DB) | Inherits the parent span’s sampling decision |
| Outbound httpx calls (notification dispatchers, version check) | client (HTTP) | Inherits the parent span’s sampling decision |
| WebSocket dispatch, command issuance, task ingestion | not yet | Deferred; the wire protocol needs a trace-context header for cross-process spans to be useful. Candidate for a later minor. |
/health* and /metrics are excluded by default. They carry too much background traffic for any sampling budget to be meaningful; tracing them swamps the operator’s collector with noise. Flip Z4J_OTEL_INCLUDE_HEALTH=true to re-enable.
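
As a rough illustration of the exclusion rule, the sketch below treats the default exclude list (given in the Settings table) and any Z4J_OTEL_EXCLUDED_URL_PATTERNS entries as plain substrings of the request path. The function names are hypothetical, and two behaviours are assumptions rather than documented facts: that matching is a simple substring test, and that Z4J_OTEL_INCLUDE_HEALTH=true lifts only the health and metrics entries while /api/v1/auth stays excluded.

```python
from typing import List

# Default exclude list as documented in the Settings table, split into two
# groups. Assumption: only the first group is lifted by include_health.
HEALTH_PATTERNS = ["/health", "/api/v1/health", "/api/v1/health/", "/metrics"]
AUTH_PATTERNS = ["/api/v1/auth"]


def build_excluded_patterns(include_health: bool = False, extra: str = "") -> List[str]:
    """Combine the defaults with comma-separated operator patterns (hypothetical helper)."""
    patterns = list(AUTH_PATTERNS)
    if not include_health:
        patterns = HEALTH_PATTERNS + patterns
    # Operator-supplied substrings are layered on top of the defaults.
    patterns += [p.strip() for p in extra.split(",") if p.strip()]
    return patterns


def is_excluded(path: str, patterns: List[str]) -> bool:
    """Return True when any pattern occurs as a substring of the request path."""
    return any(pattern in path for pattern in patterns)
```

With the defaults, is_excluded("/health/ready", build_excluded_patterns()) is true, while a path such as /api/v1/tasks is still traced.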
Enabling
Install the optional dependency, then set the endpoint:

    pip install 'z4j[otel]'

    # In your env file or systemd unit
    Z4J_OTEL_EXPORTER_OTLP_ENDPOINT=https://api.honeycomb.io/v1/traces
    Z4J_OTEL_EXPORTER_OTLP_HEADERS=x-honeycomb-team=<your-api-key>
    Z4J_OTEL_TRACES_SAMPLER_ARG=0.05

Restart the brain. On boot you should see:

    INFO z4j.brain.observability.otel: OpenTelemetry initialised (endpoint=https://api.honeycomb.io/v1/traces, protocol=http/protobuf, sampler_arg=0.050, include_health=False)

If the SDK is not installed when the endpoint is set, the brain logs a single WARNING explaining what to install and continues running without OTel.
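
The lazy-load behaviour described above can be sketched as follows. This is a minimal illustration, not the brain's actual initialiser: the init_otel function, its sdk_module parameter, and the warning text are hypothetical. The point is only that the SDK import happens inside the function, so a missing package degrades to a warning instead of an import error at boot.

```python
import importlib
import logging
from typing import Optional

logger = logging.getLogger("z4j.brain.observability.otel")


def init_otel(endpoint: Optional[str], sdk_module: str = "opentelemetry.sdk.trace") -> bool:
    """Return True if tracing would be set up, False if OTel stays disabled.

    Hypothetical sketch: the real exporter and instrumentation wiring is omitted.
    """
    if not endpoint:
        # Unset or empty endpoint: complete no-op, the SDK is never imported.
        return False
    try:
        # Imported lazily so a missing optional dependency cannot break boot.
        importlib.import_module(sdk_module)
    except ImportError:
        logger.warning(
            "OTLP endpoint is set but the OpenTelemetry SDK is not installed; "
            "run pip install 'z4j[otel]'. Continuing without OTel."
        )
        return False
    # Exporter, sampler, and instrumentation setup would go here.
    return True
```

The sdk_module parameter exists only so the fallback path can be exercised in a test by pointing it at a module name that does not exist.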
Settings
All settings are prefixed Z4J_ and read from your env file or the process environment.
| Variable | Default | Notes |
|---|---|---|
| Z4J_OTEL_EXPORTER_OTLP_ENDPOINT | unset | When unset OR empty, every other knob below is ignored. SecretStr at the Pydantic layer so a path-embedded API key never lands in startup logs. |
| Z4J_OTEL_PROTOCOL | http/protobuf | One of http/protobuf, http (alias), grpc. gRPC additionally needs pip install opentelemetry-exporter-otlp-proto-grpc. |
| Z4J_OTEL_EXPORTER_OTLP_HEADERS | unset | Comma-separated key=value pairs forwarded as the OTLP exporter’s headers. The standard place to set x-honeycomb-team, authorization, etc. SecretStr. |
| Z4J_OTEL_SERVICE_NAME | z4j-brain | Resource attribute service.name. Multi-brain deployments set this to distinguish them in the collector UI. |
| Z4J_OTEL_SERVICE_NAMESPACE | z4j | Resource attribute service.namespace. Groups every z4j service together in the collector. |
| Z4J_OTEL_ENVIRONMENT | unset | Resource attribute deployment.environment. When unset, falls back to Z4J_ENVIRONMENT (production / staging / dev). |
| Z4J_OTEL_TRACES_SAMPLER_ARG | 0.0 | TraceIdRatioBased sampler argument, 0.0..1.0. Default 0 = no traces sampled. Wrapped in ParentBased(remote_parent_sampled=ALWAYS_OFF, remote_parent_not_sampled=ALWAYS_OFF) so a spoofed inbound traceparent cannot force-sample requests; only this brain’s own ratio sampler decides whether to record. Cross-process trace context propagation is deferred to a later minor. |
| Z4J_OTEL_INCLUDE_HEALTH | false | When true, /health* and /metrics are traced like any other endpoint. The full default exclude list is /health, /api/v1/health, /api/v1/health/, /metrics, /api/v1/auth (auth routes carry credentials in POST bodies, so they are excluded by default). |
| Z4J_OTEL_EXCLUDED_URL_PATTERNS | empty | Comma-separated URL substrings to additionally exclude. Layered on top of the health-exclusion default. |
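
The comma-separated key=value format used by Z4J_OTEL_EXPORTER_OTLP_HEADERS in the table above can be parsed roughly as shown here. This is a sketch under assumptions: parse_otlp_headers is a hypothetical name, and the choice to split each pair on its first = (so values such as base64 strings may contain further = characters) is an illustration rather than the brain's confirmed behaviour.

```python
from typing import Dict


def parse_otlp_headers(raw: str) -> Dict[str, str]:
    """Parse 'k1=v1,k2=v2' into a dict (hypothetical helper).

    Error messages name only the pair's position, never its content,
    because header values are usually secrets.
    """
    headers: Dict[str, str] = {}
    for index, pair in enumerate(raw.split(",")):
        pair = pair.strip()
        if not pair:
            continue  # tolerate empty input and trailing commas
        key, sep, value = pair.partition("=")  # split on the first '=' only
        if not sep or not key.strip():
            raise ValueError(f"malformed header pair at position {index}: expected key=value")
        headers[key.strip()] = value.strip()
    return headers
```

For example, parse_otlp_headers("x-honeycomb-team=abc123") yields a single-entry mapping with the key x-honeycomb-team.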
Out-of-range sampler args fail validation at startup. A typo like Z4J_OTEL_TRACES_SAMPLER_ARG=1.5 raises a Pydantic ValidationError before the FastAPI app is built.
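
To make the range check and the ratio concrete, the sketch below validates the argument and then applies a ratio decision to the low 64 bits of a trace ID, which is broadly how trace-ID-ratio samplers decide. It is a plain-Python stand-in: the brain validates through Pydantic and relies on the OpenTelemetry SDK's sampler, and the function names here are hypothetical.

```python
TRACE_ID_LIMIT = (1 << 64) - 1  # only the low 64 bits of the trace ID are compared


def validate_sampler_arg(raw: str) -> float:
    """Parse the sampler argument and reject anything outside 0.0..1.0."""
    value = float(raw)
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"sampler arg must be within 0.0..1.0, got {value}")
    return value


def should_sample(trace_id: int, ratio: float) -> bool:
    """Deterministic ratio decision: sample when the trace ID falls below the bound."""
    bound = round(ratio * (TRACE_ID_LIMIT + 1))
    return (trace_id & TRACE_ID_LIMIT) < bound
```

Because the decision depends only on the trace ID, every span in one trace gets the same answer; a ratio of 0.0 produces a bound of zero, so nothing is sampled.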
Resource attributes
Every span carries:

    service.name = z4j-brain (or your override)
    service.namespace = z4j (or your override)
    service.version = <z4j package version> (omitted on editable installs without metadata)
    deployment.environment = <Z4J_ENVIRONMENT> (or otel_environment override)

The build_resource_attributes helper is exposed and unit-tested so the attribute set is pinned at the source rather than the collector side.
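
A helper of this kind could look roughly like the sketch below. The name build_resource_attributes_sketch, its parameters, and the use of z4j as the distribution name for the version lookup are all assumptions for illustration; the real build_resource_attributes signature is not shown in this page.

```python
from importlib import metadata
from typing import Dict, Optional


def build_resource_attributes_sketch(
    service_name: str = "z4j-brain",
    service_namespace: str = "z4j",
    environment: Optional[str] = None,       # stands in for Z4J_ENVIRONMENT
    otel_environment: Optional[str] = None,  # stands in for the OTel-specific override
    package: str = "z4j",                    # assumed distribution name
) -> Dict[str, str]:
    """Build the resource attribute mapping described above (hypothetical sketch)."""
    attrs: Dict[str, str] = {
        "service.name": service_name,
        "service.namespace": service_namespace,
    }
    try:
        attrs["service.version"] = metadata.version(package)
    except metadata.PackageNotFoundError:
        # No installed metadata (for example some editable installs): omit the key.
        pass
    env = otel_environment or environment
    if env:
        attrs["deployment.environment"] = env
    return attrs
```

Keeping this as a pure function returning a plain mapping is what makes the attribute set easy to pin in a unit test.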
Collectors
Tested OTLP endpoints:
- Honeycomb: https://api.honeycomb.io/v1/traces + x-honeycomb-team header. HTTP only.
- Grafana Tempo: https://tempo-prod-04-prod-us-east-0.grafana.net/tempo + basic-auth header. HTTP or gRPC.
- Local Jaeger (jaegertracing/all-in-one:latest): http://localhost:4318/v1/traces (HTTP) or localhost:4317 (gRPC).
- Local OpenTelemetry Collector (otel/opentelemetry-collector-contrib): http://localhost:4318/v1/traces.
The brain only ships traces over OTLP. Metrics export over OTLP is not enabled in this release: the Prometheus /metrics endpoint is the canonical metric surface and dual-exporting would just create reconciliation work.
Threat model
The OTLP exporter ships span attributes (URLs, DB statement fragments, response codes, header names if the instrumentation captures them) to a collector outside the brain. Pre-1.6 deployments expect no outbound traffic from the brain to anywhere except configured notification destinations; enabling OTel changes that contract. Review what the auto-instrumentations attach BEFORE pointing at a multi-tenant collector. In particular:
- The FastAPI instrumentation attaches the request path and method. Path parameters become span attributes; if you embed an opaque token in the URL it will appear in spans. Move it to a header.
- The SQLAlchemy instrumentation runs with enable_commenter=False so it does not embed SQL comments in your queries. The statement text itself is captured; sensitive queries (the brain has none by default, but operator-added schema might) leak by name.
- The httpx instrumentation captures outbound URLs. Webhook dispatch to Slack / Discord / Teams sends to URLs that embed credentials in the path; without scrubbing, the OTLP exporter ships those URLs to whatever collector you configure. The brain installs a request_hook + response_hook pair on the httpx instrumentation that overwrites http.url, http.target, url.full, url.path, url.query with /[redacted by z4j] when the destination host suffix-matches the credential-bearing set: outlook.office.com, *.webhook.office.com, *.logic.azure.com, hooks.slack.com, discord.com, discordapp.com, *.slack.com, *.discordapp.com, *.discord.com, *.pagerduty.com. The hook is fail-closed: if the scrubber itself raises (a future SDK upgrade renames request.url etc.) the URL is overwritten to a generic https://[unknown]/[redacted by z4j] marker and a WARNING is logged. The unscrubbed URL never reaches the OTLP exporter.
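
The fail-closed scrubbing idea can be sketched in isolation from the httpx hooks. Everything below is illustrative: host_is_credential_bearing and scrub_url are hypothetical names, the wildcard handling is one plausible reading of "suffix-matches", and rendering the scrubbed full URL as scheme, host, and the redaction marker is an assumption. Only the host list and the two marker strings come from the text above.

```python
import logging
from urllib.parse import urlsplit

logger = logging.getLogger(__name__)

CREDENTIAL_HOSTS = (
    "outlook.office.com", "*.webhook.office.com", "*.logic.azure.com",
    "hooks.slack.com", "discord.com", "discordapp.com",
    "*.slack.com", "*.discordapp.com", "*.discord.com", "*.pagerduty.com",
)
REDACTED_PATH = "/[redacted by z4j]"
FALLBACK_URL = "https://[unknown]/[redacted by z4j]"


def host_is_credential_bearing(host: str) -> bool:
    """'*.example.com' matches any subdomain; a bare entry matches that host exactly."""
    host = host.lower().rstrip(".")
    for pattern in CREDENTIAL_HOSTS:
        if pattern.startswith("*."):
            if host.endswith(pattern[1:]):  # e.g. ".slack.com"
                return True
        elif host == pattern:
            return True
    return False


def scrub_url(url: object) -> str:
    """Return a URL that is safe to attach to a span.

    Fail-closed: any exception inside the scrubber yields the generic marker,
    so the original URL is never returned on an error path.
    """
    try:
        text = str(url)
        parts = urlsplit(text)
        if parts.hostname and host_is_credential_bearing(parts.hostname):
            return f"{parts.scheme}://{parts.hostname}{REDACTED_PATH}"
        return text
    except Exception:
        logger.warning("URL scrubber failed; emitting generic redaction marker")
        return FALLBACK_URL
```

The broad except is deliberate here: for a redaction path, over-scrubbing on an unexpected error is preferable to letting a credential-bearing URL through.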
Disabling
Unset Z4J_OTEL_EXPORTER_OTLP_ENDPOINT and restart. The SDK does not need to be uninstalled; an unset endpoint is a complete no-op.
Workers and the scheduler
Adapter-side workers (z4j-celery, z4j-django, etc.) and the scheduler companion currently do NOT initialise their own OTel SDK; only the brain process does. Cross-process trace context propagation (so a span starting on an agent connects to a span on the brain) requires a wire-protocol header that is a candidate for a later minor.