Skip to content

WebSocket protocol

Protocol version: 2. v2 adds a per-frame HMAC envelope plus replay protection; v1 is not accepted on the wire. See the wire protocol concept for the narrative; this page is the schema reference. The canonical definitions live in z4j_core.transport.frames.

wss://<brain>/ws
Authorization: Bearer <agent-token>

No subprotocol. Plain WebSocket carrying JSON frames.

Every frame:

{
v: 2, // protocol version
type: "<string>",
id: string, // 1..64 chars, agent-generated
ts: string | null, // RFC 3339, optional
// signed frames also carry:
nonce: string, // up to 32 chars
seq: number, // monotonic per agent
hmac: string, // base64; computed by FrameSigner
payload: { ... } // type-specific
}

Stateful frames (event_batch, event_batch_ack, heartbeat, command, command_ack, command_result, registry_delta, error, agent_status) are signed. The handshake pair (hello / hello_ack) is unsigned because the agent and brain are still negotiating which key to use.

First frame the agent sends. The brain validates protocol_version and accepts only "2".

{
type: "hello",
payload: {
protocol_version: "2",
agent_version: string,
framework: string, // "django" | "flask" | "fastapi" | "bare"
engines: string[], // up to 64
schedulers: string[], // up to 64
capabilities: Record<string, string[]>,
host: Record<string, any>,
// optional, worker-first protocol (one connection per worker):
worker_id?: string,
worker_role?: "web" | "task" | "scheduler" | "beat" | "other",
worker_pid?: number,
worker_started_at?: string
}
}

The hot path. events is capped at 5000 entries; the agent’s batcher caps itself at 500.

{
type: "event_batch",
payload: {
events: Record<string, any>[]
}
}
{ type: "heartbeat", payload: {} }

Default cadence is 10 seconds; the brain returns its preferred heartbeat_interval_seconds in hello_ack.

{
type: "command_result",
payload: {
command_id: string,
ok: boolean,
error?: { code: string, message: string },
result?: any
}
}

Schedule / engine registry updates. The brain treats it as additive state.

Brain’s response to a successful hello.

{
type: "hello_ack",
payload: {
protocol_version: "2",
brain_version: string,
agent_id: string,
project_id: string,
session_id: string,
heartbeat_interval_seconds: 10, // default
max_frame_size_bytes: 1048576 // default 1 MiB
}
}

Round-trip ack so the agent knows which buffered batch it can drop.

{
type: "event_batch_ack",
payload: {
acked_id: string, // matches the original event_batch.id
received: number,
accepted: number,
rejected: number
}
}

Dispatched in response to a REST call against /api/v1/projects/{slug}/commands/.... The set of valid verb values matches the routes there: retry_task, cancel_task, bulk_retry, purge_queue, restart_worker, pool_resize, add_consumer, cancel_consumer, rate_limit.

{
type: "command",
payload: {
command_id: string,
verb: string,
args: Record<string, any>
}
}

Brain confirms receipt of a command_result before the agent drops its in-flight record.

Fatal protocol error; the brain will close the socket immediately after.

Brain pushes status changes (e.g. another worker joined / left under the same agent_id).

Observed on the brain side:

CodeMeaning
1000Normal closure.
1011Internal error.
4400Malformed handshake or non-hello first frame.
4401Bearer rejected.
4408Idle timeout reached without a heartbeat.
4426Protocol version not in SUPPORTED_PROTOCOLS.
4429Per-agent connection cap exceeded.

Dashboard-side (/ws/dashboard) uses a separate set: 4400 (bad request), 4401 (no session), 4402 (session-bound origin mismatch), 4403 (insufficient role), 4408 (idle).

Reconnect with exponential backoff on 1006, 1011, 4408, and similar transient codes. On 4401, 4426, or 4429, stop and surface the error — these are configuration problems.