Skip to content

Reconciliation

If a worker crashes mid-task, the engine may emit task_started but never task_success or task_failure. Without reconciliation, the dashboard shows “running” forever.

Runs in z4j every Z4J_RECONCILIATION_SWEEP_SECONDS (default 5 minutes). For each agent:

  1. Find all tasks in state started older than Z4J_RECONCILIATION_STALE_THRESHOLD_SECONDS (default 15 minutes).
  2. Ask the agent via a command: query_state frame: “is task X still running?”
  3. Agent asks its engine: Celery AsyncResult.status, RQ job.get_status(), etc.
  4. If the engine says “unknown / not found” → mark task reconciled_as=lost. Emit an audit log entry.
  5. If the engine says “still running” → extend the threshold, check again next pass.
  6. If the engine says “succeeded / failed” → record the outcome and emit the missing event.

Each engine reports differently:

  • Celery - AsyncResult is Redis-TTL-bound; after expiry it returns PENDING regardless.
  • RQ - cleanly reports lost, but requires periodic cleanup for stale jobs.
  • arq - no introspection API for historical jobs; relies on application-level result store.

The reconciliation worker owns the “was it really lost?” question so no adapter has to.

SettingDefaultMeaning
Z4J_RECONCILIATION_SWEEP_SECONDS300Seconds between reconciliation passes
Z4J_RECONCILIATION_STALE_THRESHOLD_SECONDS900How old a “started” task must be before the worker queries the agent

Every reconciliation that flips a task to lost writes an audit log entry with action=task.reconciled_lost and the reasoning (engine response). Auditable and explainable.

  • Reconciliation does not retry lost tasks. That’s a user decision - surfaced in the UI as “you have 14 lost tasks” with a bulk-retry button.
  • If the agent is offline, reconciliation skips that agent and retries next pass.
  • No retroactive reconciliation across deploys - only tasks still present in tasks + events are examined.