Unified action surface

The problem

Every engine implements “retry” and “cancel” differently:

Celery - task.retry() from inside the task, or re-apply with app.send_task.
RQ - Queue.requeue(job_id), or job.requeue().
Dramatiq - no first-class retry API; middleware-based.
Huey - retries are configured per-task; no ad-hoc retry surface.
arq - no retry API; re-enqueue manually.
taskiq - per-task retry decorators.

Surfacing these differences to users ruins the value proposition. A dashboard where “Retry” does different things on different engines is not a control plane.

The z4j answer

Unified verbs. Four actions:

Verb	Semantics
`retry`	Re-enqueue one task with its original payload. Mark the old record as `retried_as=<new_id>`.
`cancel`	Best-effort cancel if not yet running; mark as `cancelled` if already running (worker decides).
`bulk_retry`	Apply `retry` to a filter set (state + name + time window), capped at 10k per call.
`purge_queue`	Drop all pending messages on a named queue (dangerous; requires `operator+` role).

Native vs. polyfilled

Each engine adapter advertises its capability map in the hello frame:

"celery": {
  "native_retry": true,
  "native_cancel": true,
  "bulk_retry": false,   // polyfilled
  "purge": true
}

When z4j dispatches a verb:

Look up the target agent’s capability map.
If native → send command frame with verb; agent calls engine API.
If polyfilled → brain executes the verb by:
- Reading the original payload from tasks + events.
- Sending command: enqueue_task with the original payload.
- Sending command: cancel_task on the old task (if applicable).
- Linking old → new via retried_as.

Users see one button. The capability difference is invisible except in the agent info drawer (which engines/features the agent supports).

Audit semantics

Every action writes a single audit_log row with action=task.retry / task.cancel / queue.bulk_retry / queue.purge. Polyfilled actions still log as one entry (not three) - the user asked for “retry”, we show “retry” in audit.

See API § tasks for exact endpoints.

Failure behavior

If the agent is offline, the action queues in pending_commands (up to 10 per agent) and flushes on reconnect. UI shows “pending - agent offline”.
If the command times out (60s), the action records error: timeout and the audit log captures the failure. The task state is not modified.
Polyfilled retries that partially succeed (new task enqueued, old cancel failed) are marked retried_as + a warning; the user can manually cancel the old one.

What we don’t do

No automatic retries - z4j is a control plane, not a retry orchestrator. If you want auto-retry policies, use the engine’s native retry configuration.
No side-effect-safety guarantees - retrying a task that already half-ran is the user’s call. z4j does not introspect idempotency.