Reconciliation
The stuck-task problem
If a worker crashes mid-task, the engine may emit `task_started` but never `task_success` or `task_failure`. Without reconciliation, the dashboard shows “running” forever.
The reconciliation worker
Runs in the brain, once per minute. For each agent:
- Find all tasks in state `started` older than `reconcile_threshold` (default 30 minutes).
- Ask the agent via a `command: query_state` frame: “is task X still running?”
- Agent asks its engine: Celery `AsyncResult.status`, RQ `job.get_status()`, etc.
- If the engine says “unknown / not found” → mark task `reconciled_as=lost`. Emit an audit log entry.
- If the engine says “still running” → extend the threshold, check again next pass.
- If the engine says “succeeded / failed” → record the outcome and emit the missing event.
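The steps above can be sketched as a single pass over one agent's tasks. This is a minimal illustration, not the actual worker: the `Task` class, the verdict strings, the `"success"` / `"failure"` state names, and the `query_state` callable (standing in for the `command: query_state` round trip) are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Dict, List, Optional

# Hypothetical verdict strings an agent might return for a query_state frame.
UNKNOWN, RUNNING, SUCCEEDED, FAILED = "unknown", "running", "succeeded", "failed"


@dataclass
class Task:
    id: str
    state: str                          # e.g. "started"
    started_at: datetime
    reconciled_as: Optional[str] = None


def reconcile_agent(
    tasks: List[Task],
    query_state: Callable[[str], str],  # stand-in for the query_state round trip
    now: datetime,
    threshold: timedelta = timedelta(seconds=1800),
    max_per_pass: int = 500,
) -> Dict[str, list]:
    """Run one reconciliation pass over a single agent's tasks."""
    out: Dict[str, list] = {"audit": [], "events": []}
    # Only "started" tasks past the threshold that were not already reconciled.
    stale = [
        t for t in tasks
        if t.state == "started"
        and t.reconciled_as is None
        and now - t.started_at > threshold
    ]
    for task in stale[:max_per_pass]:
        verdict = query_state(task.id)
        if verdict == UNKNOWN:
            task.reconciled_as = "lost"
            out["audit"].append({
                "action": "task.reconciled_lost",
                "task_id": task.id,
                "engine_response": verdict,
            })
        elif verdict == SUCCEEDED:
            task.state = "success"
            out["events"].append(("task_success", task.id))
        elif verdict == FAILED:
            task.state = "failure"
            out["events"].append(("task_failure", task.id))
        # RUNNING: leave the task untouched so a later pass checks it again.
        # How the extended threshold is stored is not specified here.
    return out
```

Passing `now` and `query_state` in as arguments keeps the decision logic free of clocks and network calls, which makes each branch easy to exercise in isolation.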
Why not rely on the engine alone
Each engine reports differently:
- Celery - `AsyncResult` is Redis-TTL-bound; after expiry it returns `PENDING` regardless.
- RQ - cleanly reports lost, but requires periodic cleanup for stale jobs.
- arq - no introspection API for historical jobs; relies on application-level result store.
The reconciliation worker owns the “was it really lost?” question so no adapter has to.
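One way to picture the division of labour: adapters pass through whatever raw status the engine gives them, and a single table on the reconciliation side folds those into a common verdict. The sketch below is an assumption about how that could look, not the project's code. The status strings are the standard Celery task states and RQ job statuses; the verdict names, the `normalize` helper, and the choice to treat `REVOKED`, `stopped`, and `canceled` as failures are hypothetical.

```python
from typing import Optional

# Celery reports PENDING for any task id it has no record of, including
# tasks whose stored result has expired, so it maps to "unknown" here.
CELERY_VERDICTS = {
    "PENDING": "unknown",
    "RECEIVED": "running",
    "STARTED": "running",
    "RETRY": "running",
    "SUCCESS": "succeeded",
    "FAILURE": "failed",
    "REVOKED": "failed",    # assumption: treat a revoked task as failed
}

RQ_VERDICTS = {
    "queued": "running",
    "deferred": "running",
    "scheduled": "running",
    "started": "running",
    "finished": "succeeded",
    "failed": "failed",
    "stopped": "failed",    # assumption
    "canceled": "failed",   # assumption
}

TABLES = {"celery": CELERY_VERDICTS, "rq": RQ_VERDICTS}


def normalize(engine: str, raw_status: Optional[str]) -> str:
    """Fold an engine-specific status into one of the common verdicts.

    raw_status is None when the adapter could not find the job at all.
    """
    if raw_status is None:
        return "unknown"
    # Unrecognised statuses fall back to "running" so that a task is never
    # marked lost on the strength of a status this table does not know.
    return TABLES[engine].get(raw_status, "running")
```

For example, `normalize("celery", "PENDING")` yields `"unknown"`, while an unrecognised status is conservatively treated as still running and gets re-checked on a later pass.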
Tunables
| Setting | Default | Meaning |
|---|---|---|
| `Z4J_RECONCILE_ENABLED` | `true` | Set to `false` to turn the worker off entirely |
| `Z4J_RECONCILE_INTERVAL` | `60` | Seconds between passes |
| `Z4J_RECONCILE_THRESHOLD_SECONDS` | `1800` | How old, in seconds, a “started” task must be before we query |
| `Z4J_RECONCILE_MAX_PER_PASS` | `500` | Cap on queries per pass per agent |
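As an illustration of how these settings fit together, the sketch below reads them from the environment and falls back to the defaults in the table. The `ReconcileSettings` class, the `load_settings` helper, and the set of strings accepted as "true" are assumptions for the example; only the variable names and defaults come from the table.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Assumption: which spellings count as "enabled" is not specified above.
_TRUTHY = {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class ReconcileSettings:
    enabled: bool
    interval: int            # seconds between passes
    threshold_seconds: int   # minimum age of a "started" task before querying
    max_per_pass: int        # cap on queries per pass per agent


def load_settings(env: Optional[Mapping[str, str]] = None) -> ReconcileSettings:
    """Build settings from env (defaults to os.environ), using table defaults."""
    env = os.environ if env is None else env
    return ReconcileSettings(
        enabled=env.get("Z4J_RECONCILE_ENABLED", "true").strip().lower() in _TRUTHY,
        interval=int(env.get("Z4J_RECONCILE_INTERVAL", "60")),
        threshold_seconds=int(env.get("Z4J_RECONCILE_THRESHOLD_SECONDS", "1800")),
        max_per_pass=int(env.get("Z4J_RECONCILE_MAX_PER_PASS", "500")),
    )
```

Calling `load_settings({"Z4J_RECONCILE_ENABLED": "false"})` gives a settings object with `enabled=False` and every other field at its default.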
Audit impact
Every reconciliation that flips a task to lost writes an audit log entry with `action=task.reconciled_lost` and the reasoning (the engine's response), so each flip is auditable and explainable.
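A rough idea of what such an entry might carry is sketched below. Only the `action` value comes from this page; the other field names, the `lost_audit_entry` helper, and the ISO 8601 timestamp format are hypothetical.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def lost_audit_entry(
    task_id: str,
    agent_id: str,
    engine: str,
    engine_response: str,
    at: Optional[datetime] = None,
) -> dict:
    """Build a JSON-serialisable audit entry for a task reconciled as lost."""
    at = at or datetime.now(timezone.utc)
    return {
        "action": "task.reconciled_lost",
        "task_id": task_id,
        "agent_id": agent_id,
        # The reasoning: which engine was asked and what it answered.
        "reason": {"engine": engine, "response": engine_response},
        "at": at.isoformat(),
    }


entry = lost_audit_entry("t-1", "agent-a", "celery", "PENDING")
print(json.dumps(entry, indent=2))
```

Keeping the raw engine response in the entry means a reader of the log can see why the task was judged lost without re-querying the engine.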
Limits
- Reconciliation does not retry lost tasks. That’s a user decision - surfaced in the UI as “you have 14 lost tasks” with a bulk-retry button.
- If the agent is offline, reconciliation skips that agent and retries next pass.
- No retroactive reconciliation across deploys - only tasks still present in `tasks` + `events` are examined.
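The offline-agent rule can be sketched as a loop that skips unreachable agents and leaves them for the next pass. The `run_pass` function and its `is_online` and `reconcile` callables are hypothetical stand-ins, not the brain's real interfaces.

```python
from typing import Callable, Dict, Iterable, List


def run_pass(
    agents: Iterable[str],
    is_online: Callable[[str], bool],
    reconcile: Callable[[str], None],
) -> Dict[str, List[str]]:
    """Reconcile every online agent; offline ones are skipped, not failed."""
    report: Dict[str, List[str]] = {"reconciled": [], "skipped_offline": []}
    for agent_id in agents:
        if not is_online(agent_id):
            # Nothing is recorded against the agent's tasks; they stay in
            # "started" and are picked up when the agent is back.
            report["skipped_offline"].append(agent_id)
            continue
        reconcile(agent_id)
        report["reconciled"].append(agent_id)
    return report
```

Note that nothing here retries a lost task: the loop only classifies, and retrying stays a user decision.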