
Agent Runs

Every agent execution — chat reply, web-trigger result, scheduled task, child run, inbound email/Slack/Telegram, playground turn, human-review resume — is tracked as a durable agent run with a lifecycle that operators can observe and act on. This page covers the operator-facing surfaces: status panel, live event stream, and the recovery handoff.

If you only need the user-facing view: every chat surface shows a status panel above the input that reflects whether work is queued, running, waiting on a tool, paused on human input, or terminal. When a run fails, the panel offers a Resume button if the runtime can prove the next action is safe.

Run Statuses

Status              Meaning
queued              Enqueued, not yet picked up by a worker
waiting_on_lane     Another run is active on the same conversation; waiting for it to terminate
running             Worker has claimed the run and is executing
waiting_on_tool     A tool call is in flight
waiting_on_child    The run delegated work to a child agent and is waiting on it
waiting_on_human    Paused indefinitely, waiting on ask_human or after a recoverable failure that needs review
resuming            A resume_agent_run task picked up a timed_out run and is rehydrating loop state
cancel_requested    Cancel flag set; the loop will exit at the next safe boundary
cancelled           Loop exited cleanly after observing the cancel flag
completed           Final response delivered
failed              An exception terminated the run; error_message captures the details
timed_out           Heartbeat went stale and the reconciler swept the run; resume may be available

completed, failed, cancelled, and timed_out are terminal — once a run reaches one of these, the current_checkpoint_id and any tool result records remain available for inspection, but no further loop activity is possible without a new run or a manual resume.
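
As a minimal sketch of that terminal/active split (status names come from the table above; the helper itself is hypothetical, not part of the runtime), a client deciding whether to keep polling can test membership:

```python
# Run statuses from the lifecycle table above.
TERMINAL_STATUSES = {"completed", "failed", "cancelled", "timed_out"}

ACTIVE_STATUSES = {
    "queued", "waiting_on_lane", "running", "waiting_on_tool",
    "waiting_on_child", "waiting_on_human", "resuming", "cancel_requested",
}

def is_terminal(status: str) -> bool:
    """True once no further loop activity is possible without a new run or a manual resume."""
    return status in TERMINAL_STATUSES

print(is_terminal("timed_out"))  # True
print(is_terminal("resuming"))   # False
```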

Status Panel

Every chat workspace polls GET /conversations/<conversation_id>/agent-status every 1.5 seconds via HTMX while a run is non-terminal. The endpoint renders a small Bootstrap panel that shows:

  • The current status label and any short detail (e.g. claimed lane, heartbeat timeout)
  • A Cancel button when the run is in running, waiting_on_tool, or waiting_on_child
  • The recovery handoff (see below) when the run is failed, timed_out, or cancelled

Clicking Cancel posts to POST /conversations/<conversation_id>/agent-runs/cancel, which sets cancel_requested=true on the active run and re-renders the panel. The loop will exit at its next LLM-call or tool-batch boundary; the run reaches cancelled shortly after.
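
The "next safe boundary" behaviour can be sketched in miniature (this is an illustration, not the runtime's actual loop; the callables stand in for LLM calls and tool batches):

```python
def run_loop(steps, cancel_requested):
    """Hypothetical sketch of cancel-flag handling at safe boundaries.

    `steps` is a list of zero-arg callables standing in for LLM calls and
    tool batches; `cancel_requested` reads the run's cancel flag. The flag
    is only observed *between* steps; a step is never interrupted mid-flight.
    """
    for step in steps:
        if cancel_requested():  # safe boundary: before each LLM call / tool batch
            return "cancelled"
        step()
    return "completed"

# A cancel flag set during the first step is observed before the second runs.
flag = {"set": False}
trace = []
steps = [
    lambda: (trace.append("llm"), flag.update(set=True)),
    lambda: trace.append("tools"),
]
print(run_loop(steps, lambda: flag["set"]))  # cancelled; trace == ["llm"]
```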

Live Event Stream

For dashboards or any caller that wants live progress without polling, an SSE endpoint streams every lifecycle event as it lands:

GET /conversations/<conversation_id>/agent-events
    ?after_event_id=N        # default 0; works on both run and conversation channels
    &after_sequence=N        # default 0; manual cursor for run channels only
    &run_id=X                # filter to a specific run; default streams the whole conversation
    &visibility=V            # user | operator | internal; default operator

Each event is emitted as id: <AgentEvent.id>\n followed by data: {...}\n\n, so the browser’s EventSource automatically replays the cursor on reconnect via the Last-Event-ID header. Heartbeat frames (: heartbeat\n\n) fire every 15 seconds.
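
A rough sketch of reading that wire format (this is not the server's implementation, and a real client should prefer `EventSource` or an SSE library that also handles multi-line `data:` fields):

```python
import json

def parse_sse_frame(frame: str):
    """Parse one SSE frame of the shape described above.

    Returns (event_id, payload) for data frames, or None for the
    `: heartbeat` comment frames.
    """
    event_id, payload = None, None
    for line in frame.splitlines():
        if line.startswith(":"):  # comment frame (heartbeat): carries no data
            return None
        if line.startswith("id:"):
            event_id = int(line[len("id:"):].strip())
        elif line.startswith("data:"):
            payload = json.loads(line[len("data:"):].strip())
    return event_id, payload

frame = 'id: 12345\ndata: {"event_type": "run_status_changed", "run_id": 17}\n\n'
print(parse_sse_frame(frame))
print(parse_sse_frame(": heartbeat\n\n"))  # None
```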

Event payload:

{
  "id": 12345,
  "event_type": "run_status_changed",
  "run_id": 17,
  "conversation_id": 42,
  "sequence": 4,
  "visibility": "operator",
  "payload": {"from": "running", "to": "completed", "detail": null},
  "created_at": "2026-04-15T10:01:23+00:00"
}

The id field carries the globally-monotonic AgentEvent.id — same value the SSE id: line emits and the same value Last-Event-ID echoes back on reconnect.

Visibility levels:

  • user — events the end-user sees in the chat UI (status changes, terminal events)
  • operator (default) — user events plus tool start/finish, journal updates, child-run notifications
  • internal — operator events plus checkpoint_created and other infra-level signals
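
The three levels are strictly nested, so the filter reduces to an ordering check. A sketch (level names from above; the helper is illustrative, not the endpoint's code):

```python
# Visibility levels in increasing order of detail; each level includes
# everything the levels before it can see.
VISIBILITY_ORDER = ["user", "operator", "internal"]

def visible_at(event_visibility: str, requested: str) -> bool:
    """True if an event tagged `event_visibility` is streamed to a client
    that asked for `requested` (the endpoint's default is operator)."""
    return VISIBILITY_ORDER.index(event_visibility) <= VISIBILITY_ORDER.index(requested)

print(visible_at("user", "operator"))      # True: operator sees user events
print(visible_at("internal", "operator"))  # False: checkpoint_created stays hidden
```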

Cursors. SSE reconnect is standardized on AgentEvent.id for both channel types — that’s what the endpoint emits and what the browser replays via Last-Event-ID. Cursor priority on a request:

  1. Last-Event-ID header (browser EventSource reconnect) — used for both run and conversation channels.
  2. ?after_event_id=N query param — explicit manual cursor; works on both channel types.
  3. ?after_sequence=N query param — manual per-run cursor for programmatic clients that prefer the per-run monotonic counter; ignored on conversation channels because per-run sequence is not unique across runs on the same conversation.

Web clients don’t need to track or re-pass any cursor manually — the browser handles the reconnect via Last-Event-ID. Programmatic clients that drive their own reconnect should track id and pass it as after_event_id= on the next request.
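
Putting the three rules together, cursor resolution can be sketched as follows (parameter and header names come from the endpoint above; the function itself is hypothetical):

```python
def resolve_cursor(headers: dict, args: dict, is_run_channel: bool):
    """Pick the replay cursor for a stream request, in priority order.

    Returns ("event_id", N) to replay by AgentEvent.id, or ("sequence", N)
    to replay by the per-run monotonic counter (run channels only).
    """
    # 1. Browser reconnect: Last-Event-ID wins on both channel types.
    if "Last-Event-ID" in headers:
        return ("event_id", int(headers["Last-Event-ID"]))
    # 2. Explicit manual cursor, also valid on both channel types.
    if "after_event_id" in args:
        return ("event_id", int(args["after_event_id"]))
    # 3. Per-run sequence: only meaningful when streaming a single run;
    #    ignored on conversation channels, where it is not unique.
    if "after_sequence" in args and is_run_channel:
        return ("sequence", int(args["after_sequence"]))
    return ("event_id", 0)  # documented default: after_event_id=0

print(resolve_cursor({"Last-Event-ID": "12345"}, {"after_event_id": "7"}, False))
print(resolve_cursor({}, {"after_sequence": "4"}, True))
print(resolve_cursor({}, {"after_sequence": "4"}, False))  # ignored: conversation channel
```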

The endpoint requires an authenticated session — the same check used by chat surfaces. It is not yet exposed for unauthenticated Bearer-token integrations; web-trigger callers should use the callback delivery for terminal notifications.

Recovery Handoff

When a run reaches failed, timed_out, or cancelled, the status panel renders a recovery handoff:

  • The terminal status label (Run failed, Run timed out, Run cancelled)
  • The error_message if any
  • A collapsible Last known state block with the structured run journal digest — objective, current state, blockers, next action, recent verified findings
  • The journal’s handoff_next_action line, when set, surfaced prominently
  • A Resume from last checkpoint button when metadata.resume_available is true

Clicking the resume button posts to POST /agent-runs/<run_key>/resume, which enqueues resume_agent_run.delay(run_id). The Celery worker re-claims the lane (transitioning the run to resuming), loads the latest checkpoint, and continues execution. Tool calls that completed before the failure are replayed from the idempotency envelope without being re-invoked.

Cancelled runs surface the journal panel for context but never get a resume button — cancel was a deliberate user action and is not auto-undone.
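
The panel's resume-button decision therefore reduces to a short rule, sketched here (the function is illustrative; `resume_available` corresponds to metadata.resume_available from above):

```python
def show_resume_button(status: str, resume_available: bool) -> bool:
    """Sketch of the handoff panel's decision: cancelled runs never resume,
    and failed/timed_out runs resume only when the runtime marked them eligible."""
    if status == "cancelled":
        return False  # deliberate user action, not auto-undone
    return status in {"failed", "timed_out"} and resume_available

print(show_resume_button("timed_out", True))  # True
print(show_resume_button("cancelled", True))  # False
```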

Reconciler

A Celery beat task runs every five minutes and is also triggered on every worker boot. It performs three passes:

  1. Heartbeat timeout (default 5 minutes) — runs in running, waiting_on_tool, waiting_on_child, or resuming whose last_heartbeat_at is stale → timed_out. Eligibility for resume is set based on the latest checkpoint type (llm_response, tool_result, journal_update, and input are eligible; human_pause, final, and “no checkpoint” are not).
  2. Queue timeout (default 1 hour) — runs in queued or waiting_on_lane whose created_at is older than the threshold → failed with detail queue timeout. Catches lane drains that lost their dispatcher.
  3. Stale cancel (default 3 minutes) — runs whose cancel_requested=true is older than the threshold → cancelled. Catches waiting_on_human runs that the user cancelled but no loop is around to observe.

waiting_on_human runs are never reconciled by the heartbeat pass; they are paused indefinitely by design.

The reconciler is idempotent — running it twice on the same set of runs produces the same outcome.
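
The heartbeat pass and its resume-eligibility rule can be sketched as below (statuses and checkpoint types come from the passes above; the function is a simplification that models pass 1 only, not the actual task):

```python
RESUMABLE_CHECKPOINTS = {"llm_response", "tool_result", "journal_update", "input"}
SWEEPABLE_STATUSES = {"running", "waiting_on_tool", "waiting_on_child", "resuming"}

def sweep_stale_run(status, latest_checkpoint=None):
    """Sketch of the heartbeat pass: a stale active run moves to timed_out,
    with resume eligibility derived from its latest checkpoint type.
    Returns (new_status, resume_available)."""
    if status not in SWEEPABLE_STATUSES:
        return status, False  # waiting_on_human et al. are never swept by this pass
    resume_available = latest_checkpoint in RESUMABLE_CHECKPOINTS
    return "timed_out", resume_available

print(sweep_stale_run("running", "tool_result"))     # ('timed_out', True)
print(sweep_stale_run("running", None))              # ('timed_out', False)
print(sweep_stale_run("waiting_on_human", "input"))  # ('waiting_on_human', False)
```

Idempotency falls out naturally: once a run is timed_out it is no longer in a sweepable status, so a second pass leaves it untouched.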

Lane Semantics

Each run carries a lane_key. Only one run per lane may be in a non-terminal active state (running, waiting_on_tool, waiting_on_child, resuming) at a time. Lane defaults:

Surface                          Lane key                                      Notes
Chat (admin / staff / public)    conversation:<id>                             Two rapid sends serialize on the conversation
Web trigger                      conversation:<id>                             Per-trigger conversation
Email reply                      conversation:<id>
Email / Slack / Telegram ad-hoc  adhoc:<channel>:<sender_key>                  Same sender on the same channel serializes; different senders run in parallel
Inbox item / reply               inbox_item:<id>                               Replies on the same item serialize; different items run in parallel
Scheduled task                   task_run:<id>                                 Each schedule firing creates a fresh TaskRun and runs in its own lane
Child run                        child:<conversation>:<assistant>:<namespace>  Children don’t block their parent’s lane
Human-review resume              conversation:<id>                             Resumes wait on whatever’s active in the parent conversation

Lane claims use a blocking SELECT ... FOR UPDATE on the candidate queued row, with a re-check after the lock acquires. Concurrent workers contending on the same lane serialize cleanly; different lanes run fully in parallel.
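
The lane-key defaults above can be sketched as a derivation function (the lane-key formats come from the table; the surface names and keyword parameters here are illustrative, not a real API):

```python
def lane_key(surface: str, **ids) -> str:
    """Derive the lane key for a surface, following the defaults table above."""
    if surface in {"chat", "web_trigger", "email_reply", "human_review_resume"}:
        return f"conversation:{ids['conversation_id']}"
    if surface == "adhoc":  # email / Slack / Telegram ad-hoc messages
        return f"adhoc:{ids['channel']}:{ids['sender_key']}"
    if surface == "inbox":
        return f"inbox_item:{ids['inbox_item_id']}"
    if surface == "scheduled_task":
        return f"task_run:{ids['task_run_id']}"
    if surface == "child":  # children get their own lane, never the parent's
        return f"child:{ids['conversation_id']}:{ids['assistant']}:{ids['namespace']}"
    raise ValueError(f"unknown surface: {surface}")

# Two rapid chat sends on conversation 42 share a lane and serialize:
print(lane_key("chat", conversation_id=42))                   # conversation:42
print(lane_key("adhoc", channel="slack", sender_key="U123"))  # adhoc:slack:U123
```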

Inspecting a Run from the CLI

The Debug CLI is the primary tool for dumping an agent run plus all its events, tool results, checkpoints, and child runs:

docker exec teamwebai-web-1 uv run flask debug conversation <conversation_id> --children

For ad-hoc queries during incidents, the relevant tables are:

  • agent_runs — one row per execution, indexed by (conversation_id, created_at) and (lane_key, status)
  • agent_events — append-only event log, monotonic sequence per run
  • agent_checkpoints — one row per safe boundary, ordered by per-run sequence
  • agent_tool_results — idempotency envelope, unique on (run_id, tool_call_id)
  • agent_run_journals — structured journal, one row per conversation