
Reconciliation

If a worker crashes mid-task, the engine may emit task_started but never task_success or task_failure. Without reconciliation, the dashboard shows “running” forever.

The reconciliation worker runs in the brain, once per minute. For each agent:

  1. Find all tasks in state started older than reconcile_threshold (default 30 minutes).
  2. Ask the agent via a query_state command frame: “is task X still running?”
  3. Agent asks its engine: Celery AsyncResult.status, RQ job.get_status(), etc.
  4. If the engine says “unknown / not found” → mark the task reconciled_as=lost and emit an audit log entry.
  5. If the engine says “still running” → extend the threshold, check again next pass.
  6. If the engine says “succeeded / failed” → record the outcome and emit the missing event.
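The steps above can be sketched as a single pass over one agent's tasks. This is a minimal illustration, not the brain's real code: the Task shape and the query_state callable are stand-ins for the actual internals.

```python
import time
from dataclasses import dataclass

RECONCILE_THRESHOLD = 1800  # seconds; mirrors Z4J_RECONCILE_THRESHOLD_SECONDS

@dataclass
class Task:
    id: str
    state: str = "started"
    started_at: float = 0.0

def reconcile_pass(tasks, query_state, now=None):
    """One reconciliation pass for a single agent.

    `query_state` asks the agent (which asks its engine) and returns one of
    "unknown", "running", "succeeded", or "failed".
    """
    now = time.time() if now is None else now
    outcomes = {}
    for task in tasks:
        # Step 1: only "started" tasks past the threshold are examined.
        if task.state != "started" or now - task.started_at < RECONCILE_THRESHOLD:
            continue
        answer = query_state(task.id)   # steps 2-3
        if answer == "unknown":
            task.state = "lost"         # step 4: reconciled_as=lost (audit entry elsewhere)
        elif answer == "running":
            task.started_at = now       # step 5: extend, re-check next pass
        else:
            task.state = answer         # step 6: record the missing terminal event
        outcomes[task.id] = task.state
    return outcomes
```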

Each engine reports differently:

  • Celery - AsyncResult is Redis-TTL-bound; after expiry it returns PENDING regardless.
  • RQ - cleanly reports lost, but requires periodic cleanup for stale jobs.
  • arq - no introspection API for historical jobs; relies on application-level result store.

The reconciliation worker owns the “was it really lost?” question so no adapter has to.

Setting                           Default  Meaning
Z4J_RECONCILE_ENABLED             true     Set to false to turn the worker off entirely
Z4J_RECONCILE_INTERVAL            60       Seconds between passes
Z4J_RECONCILE_THRESHOLD_SECONDS   1800     How old a “started” task must be before we query
Z4J_RECONCILE_MAX_PER_PASS        500      Cap on queries per pass, per agent
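A minimal sketch of reading these settings from the environment, falling back to the documented defaults (the helper name is hypothetical):

```python
import os

def reconcile_settings(env=os.environ):
    """Read reconciliation settings with the documented defaults."""
    return {
        "enabled": env.get("Z4J_RECONCILE_ENABLED", "true").lower() == "true",
        "interval": int(env.get("Z4J_RECONCILE_INTERVAL", "60")),
        "threshold_seconds": int(env.get("Z4J_RECONCILE_THRESHOLD_SECONDS", "1800")),
        "max_per_pass": int(env.get("Z4J_RECONCILE_MAX_PER_PASS", "500")),
    }
```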

Every reconciliation that flips a task to lost writes an audit log entry with action=task.reconciled_lost and the reasoning (the engine’s response), so each flip is auditable and explainable.
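An audit entry for a lost task might look like the following. Only the action value comes from the docs; the other field names are assumptions for illustration.

```python
import time

def lost_task_audit_entry(task_id, engine_response, now=None):
    """Build the audit record written when reconciliation marks a task lost."""
    return {
        "action": "task.reconciled_lost",   # documented action value
        "task_id": task_id,                 # assumed field name
        "reason": engine_response,          # the engine's answer, verbatim
        "at": time.time() if now is None else now,  # assumed field name
    }
```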

  • Reconciliation does not retry lost tasks. Retrying is a user decision - surfaced in the UI as “you have 14 lost tasks” with a bulk-retry button.
  • If the agent is offline, reconciliation skips that agent and retries next pass.
  • No retroactive reconciliation across deploys - only tasks still present in tasks + events are examined.