Reconciliation

The stuck-task problem

If a worker crashes mid-task, the engine may emit task_started but never task_success or task_failure. Without reconciliation, the dashboard shows “running” forever.

The reconciliation worker

Runs in z4j every Z4J_RECONCILIATION_SWEEP_SECONDS (default 5 minutes). For each agent:

Find all tasks in state started older than Z4J_RECONCILIATION_STALE_THRESHOLD_SECONDS (default 15 minutes).
Ask the agent via a command: query_state frame: “is task X still running?”
Agent asks its engine: Celery AsyncResult.status, RQ job.get_status(), etc.
If the engine says “unknown / not found” → mark task reconciled_as=lost. Emit an audit log entry.
If the engine says “still running” → extend the threshold, check again next pass.
If the engine says “succeeded / failed” → record the outcome and emit the missing event.

Why not rely on the engine alone

Each engine reports differently:

Celery - AsyncResult is Redis-TTL-bound; after expiry it returns PENDING regardless.
RQ - cleanly reports lost, but requires periodic cleanup for stale jobs.
arq - no introspection API for historical jobs; relies on application-level result store.

The reconciliation worker owns the “was it really lost?” question so no adapter has to.

Tunables

Setting	Default	Meaning
`Z4J_RECONCILIATION_SWEEP_SECONDS`	`300`	Seconds between reconciliation passes
`Z4J_RECONCILIATION_STALE_THRESHOLD_SECONDS`	`900`	How old a “started” task must be before the worker queries the agent

Audit impact

Every reconciliation that flips a task to lost writes an audit log entry with action=task.reconciled_lost and the reasoning (engine response). Auditable and explainable.

Limits

Reconciliation does not retry lost tasks. That’s a user decision - surfaced in the UI as “you have 14 lost tasks” with a bulk-retry button.
If the agent is offline, reconciliation skips that agent and retries next pass.
No retroactive reconciliation across deploys - only tasks still present in tasks + events are examined.