Skip to content

Unified action surface

Every engine implements “retry” and “cancel” differently:

  • Celery - task.retry() from inside the task, or re-apply with app.send_task.
  • RQ - Queue.requeue(job_id), or job.requeue().
  • Dramatiq - no first-class retry API; middleware-based.
  • Huey - retries are configured per-task; no ad-hoc retry surface.
  • arq - no retry API; re-enqueue manually.
  • taskiq - per-task retry decorators.

Surfacing these differences to users ruins the value proposition. A dashboard where “Retry” does different things on different engines is not a control plane.

Unified verbs. Four actions:

VerbSemantics
retryRe-enqueue one task with its original payload. Mark the old record as retried_as=<new_id>.
cancelBest-effort cancel if not yet running; mark as cancelled if already running (worker decides).
bulk_retryApply retry to a filter set (state + name + time window), capped at 10k per call.
purge_queueDrop all pending messages on a named queue (dangerous; requires operator+ role).

Each engine adapter advertises its capability map in the hello frame:

"celery": {
"native_retry": true,
"native_cancel": true,
"bulk_retry": false, // polyfilled
"purge": true
}

When the brain dispatches a verb:

  1. Look up the target agent’s capability map.
  2. If native → send command frame with verb; agent calls engine API.
  3. If polyfilled → brain executes the verb by:
    • Reading the original payload from tasks + events.
    • Sending command: enqueue_task with the original payload.
    • Sending command: cancel_task on the old task (if applicable).
    • Linking old → new via retried_as.

Users see one button. The capability difference is invisible except in the agent info drawer (which engines/features the agent supports).

Every action writes a single audit_log row with action=task.retry / task.cancel / queue.bulk_retry / queue.purge. Polyfilled actions still log as one entry (not three) - the user asked for “retry”, we show “retry” in audit.

See API § tasks for exact endpoints.

  • If the agent is offline, the action queues in pending_commands (up to 10 per agent) and flushes on reconnect. UI shows “pending - agent offline”.
  • If the command times out (60s), the action records error: timeout and the audit log captures the failure. The task state is not modified.
  • Polyfilled retries that partially succeed (new task enqueued, old cancel failed) are marked retried_as + a warning; the user can manually cancel the old one.
  • No automatic retries - z4j is a control plane, not a retry orchestrator. If you want auto-retry policies, use the engine’s native retry configuration.
  • No side-effect-safety guarantees - retrying a task that already half-ran is the user’s call. z4j does not introspect idempotency.