Unified action surface
The problem
Section titled “The problem”Every engine implements “retry” and “cancel” differently:
- Celery -
task.retry()from inside the task, or re-apply withapp.send_task. - RQ -
Queue.requeue(job_id), orjob.requeue(). - Dramatiq - no first-class retry API; middleware-based.
- Huey - retries are configured per-task; no ad-hoc retry surface.
- arq - no retry API; re-enqueue manually.
- taskiq - per-task retry decorators.
Surfacing these differences to users ruins the value proposition. A dashboard where “Retry” does different things on different engines is not a control plane.
The z4j answer
Section titled “The z4j answer”Unified verbs. Four actions:
| Verb | Semantics |
|---|---|
retry | Re-enqueue one task with its original payload. Mark the old record as retried_as=<new_id>. |
cancel | Best-effort cancel if not yet running; mark as cancelled if already running (worker decides). |
bulk_retry | Apply retry to a filter set (state + name + time window), capped at 10k per call. |
purge_queue | Drop all pending messages on a named queue (dangerous; requires operator+ role). |
Native vs. polyfilled
Section titled “Native vs. polyfilled”Each engine adapter advertises its capability map in the hello frame:
"celery": { "native_retry": true, "native_cancel": true, "bulk_retry": false, // polyfilled "purge": true}When the brain dispatches a verb:
- Look up the target agent’s capability map.
- If native → send
commandframe with verb; agent calls engine API. - If polyfilled → brain executes the verb by:
- Reading the original payload from
tasks+events. - Sending
command: enqueue_taskwith the original payload. - Sending
command: cancel_taskon the old task (if applicable). - Linking old → new via
retried_as.
- Reading the original payload from
Users see one button. The capability difference is invisible except in the agent info drawer (which engines/features the agent supports).
Audit semantics
Section titled “Audit semantics”Every action writes a single audit_log row with action=task.retry / task.cancel / queue.bulk_retry / queue.purge. Polyfilled actions still log as one entry (not three) - the user asked for “retry”, we show “retry” in audit.
See API § tasks for exact endpoints.
Failure behavior
Section titled “Failure behavior”- If the agent is offline, the action queues in
pending_commands(up to 10 per agent) and flushes on reconnect. UI shows “pending - agent offline”. - If the command times out (60s), the action records
error: timeoutand the audit log captures the failure. The task state is not modified. - Polyfilled retries that partially succeed (new task enqueued, old cancel failed) are marked
retried_as+ a warning; the user can manually cancel the old one.
What we don’t do
Section titled “What we don’t do”- No automatic retries - z4j is a control plane, not a retry orchestrator. If you want auto-retry policies, use the engine’s native retry configuration.
- No side-effect-safety guarantees - retrying a task that already half-ran is the user’s call. z4j does not introspect idempotency.