Agent Runs
Every agent execution — chat reply, web-trigger result, scheduled task, child run, inbound email/Slack/Telegram, playground turn, human-review resume — is tracked as a durable agent run with a lifecycle that operators can observe and act on. This page covers the operator-facing surfaces: status panel, live event stream, and the recovery handoff.
If you only need the user-facing view: every chat surface shows a status panel above the input that reflects whether work is queued, running, waiting on a tool, paused on human input, or terminal. When a run fails, the panel offers a Resume button if the runtime can prove the next action is safe.
Run Statuses
| Status | Meaning |
|---|---|
| queued | Enqueued, not yet picked up by a worker |
| waiting_on_lane | Another run is active on the same conversation; waiting for it to terminate |
| running | Worker has claimed the run and is executing |
| waiting_on_tool | A tool is in flight |
| waiting_on_child | The run delegated work to a child agent and is waiting on it |
| waiting_on_human | Paused indefinitely waiting on ask_human or after a recoverable failure that needs review |
| resuming | A resume_agent_run task picked up a timed_out run and is rehydrating loop state |
| cancel_requested | Cancel flag set; the loop will exit at the next safe boundary |
| cancelled | Loop exited cleanly after observing the cancel flag |
| completed | Final response delivered |
| failed | An exception terminated the run; error_message captures the details |
| timed_out | Heartbeat went stale and the reconciler swept the run; resume may be available |
completed, failed, cancelled, and timed_out are terminal — once a run reaches one of these, the current_checkpoint_id and any tool result records remain available for inspection, but no further loop activity is possible without a new run or a manual resume.
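The status table and the terminal set can be captured in a few lines. This is a minimal sketch, not the runtime's actual model: the `RunStatus` enum and the `is_terminal` helper are hypothetical names introduced here for illustration; only the status strings come from the table above.

```python
from enum import Enum


class RunStatus(str, Enum):
    """Status values from the table above (the enum itself is illustrative)."""
    QUEUED = "queued"
    WAITING_ON_LANE = "waiting_on_lane"
    RUNNING = "running"
    WAITING_ON_TOOL = "waiting_on_tool"
    WAITING_ON_CHILD = "waiting_on_child"
    WAITING_ON_HUMAN = "waiting_on_human"
    RESUMING = "resuming"
    CANCEL_REQUESTED = "cancel_requested"
    CANCELLED = "cancelled"
    COMPLETED = "completed"
    FAILED = "failed"
    TIMED_OUT = "timed_out"


# The four statuses the page describes as terminal.
TERMINAL_STATUSES = frozenset({
    RunStatus.COMPLETED,
    RunStatus.FAILED,
    RunStatus.CANCELLED,
    RunStatus.TIMED_OUT,
})


def is_terminal(status: str) -> bool:
    """Return True when no further loop activity is expected for this status."""
    return RunStatus(status) in TERMINAL_STATUSES
```

A polling client could use a check like `is_terminal(status)` to decide when to stop refreshing.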
Status Panel
Every chat workspace polls GET /conversations/<conversation_id>/agent-status every 1.5 seconds via HTMX while a run is non-terminal. The endpoint renders a small Bootstrap panel that shows:
- The current status label and any short detail (e.g. claimed lane, heartbeat timeout)
- A Cancel button when the run is in running, waiting_on_tool, or waiting_on_child
- The recovery handoff (see below) when the run is failed, timed_out, or cancelled
Clicking Cancel sends POST /conversations/<conversation_id>/agent-runs/cancel, which sets cancel_requested=true on the active run and re-renders the panel. The loop exits at its next LLM-call or tool-batch boundary, and the run reaches cancelled shortly after.
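Cooperative cancellation of this kind can be sketched as a loop that checks the flag only at step boundaries. The `Run` dataclass and `run_loop` function below are hypothetical stand-ins, assuming each step represents one LLM call or one tool batch; the real loop is not shown on this page.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Run:
    """Hypothetical in-memory stand-in for an agent run row."""
    status: str = "running"
    cancel_requested: bool = False
    steps_done: List[str] = field(default_factory=list)


def run_loop(run: Run, steps: List[Callable[[Run], None]]) -> Run:
    """Execute steps in order, observing the cancel flag between steps only."""
    for step in steps:
        # Safe boundary: a step that already started is never interrupted.
        if run.cancel_requested:
            run.status = "cancelled"
            return run
        step(run)
    run.status = "completed"
    return run
```

Because the flag is only read between steps, a cancel that arrives mid-step takes effect after that step finishes, which matches the "shortly after" wording above.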
Live Event Stream
For dashboards or any caller that wants live progress without polling, an SSE endpoint streams every lifecycle event as it lands:
GET /conversations/<conversation_id>/agent-events
?after_event_id=N # default 0; works on both run and conversation channels
&after_sequence=N # default 0; manual cursor for run channels only
&run_id=X # filter to a specific run; default streams the whole conversation
&visibility=V # user | operator | internal; default operator
Each event is emitted as id: <AgentEvent.id>\n followed by data: {...}\n\n, so the browser’s EventSource automatically replays the cursor on reconnect via the Last-Event-ID header. Heartbeat frames (: heartbeat\n\n) fire every 15 seconds.
Event payload:
{
"id": 12345,
"event_type": "run_status_changed",
"run_id": 17,
"conversation_id": 42,
"sequence": 4,
"visibility": "operator",
"payload": {"from": "running", "to": "completed", "detail": null},
"created_at": "2026-04-15T10:01:23+00:00"
}
The id field carries the globally-monotonic AgentEvent.id, the same value the SSE id: line emits and the same value Last-Event-ID echoes back on reconnect.
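For programmatic clients that read the stream without a browser, the frame format described above can be parsed with a small generator. This is a sketch that handles only the `id:`, `data:`, and comment lines this page mentions; `parse_sse` is a hypothetical helper, not part of the product.

```python
import json
from typing import Iterable, Iterator, Optional, Tuple


def parse_sse(lines: Iterable[str]) -> Iterator[Tuple[Optional[int], dict]]:
    """Yield (event_id, payload) for each complete frame in a line stream."""
    event_id: Optional[int] = None
    data_parts = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line == "":
            # A blank line ends the frame; frames without data are dropped.
            if data_parts:
                yield event_id, json.loads("\n".join(data_parts))
            event_id, data_parts = None, []
        elif line.startswith(":"):
            # Comment line, e.g. the ": heartbeat" keep-alive.
            continue
        elif line.startswith("id:"):
            event_id = int(line[3:].strip())
        elif line.startswith("data:"):
            data_parts.append(line[5:].lstrip(" "))
```

A caller would typically remember the last yielded `event_id` and use it as the cursor when reconnecting.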
Visibility levels:
- user: events the end-user sees in the chat UI (status changes, terminal events)
- operator (default): user events plus tool start/finish, journal updates, child-run notifications
- internal: operator events plus checkpoint_created and other infra-level signals
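Since each level includes everything from the level before it, the filter amounts to a rank comparison. The sketch below assumes that nesting holds exactly as listed; `VISIBILITY_RANK` and `visible_at` are hypothetical names.

```python
# Each level includes every event visible at the levels ranked below it.
VISIBILITY_RANK = {"user": 0, "operator": 1, "internal": 2}


def visible_at(event_visibility: str, requested: str = "operator") -> bool:
    """Return True if an event tagged event_visibility is streamed at the requested level."""
    return VISIBILITY_RANK[event_visibility] <= VISIBILITY_RANK[requested]
```

For example, a stream opened with `visibility=user` would skip an event tagged `operator`, while the default stream would include it.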
Cursors. SSE reconnect is standardised on AgentEvent.id for both channel types — that’s what the endpoint emits and what the browser replays via Last-Event-ID. Cursor priority on a request:
1. Last-Event-ID header (browser EventSource reconnect): used for both run and conversation channels.
2. ?after_event_id=N query param: explicit manual cursor; works on both channel types.
3. ?after_sequence=N query param: manual per-run cursor for programmatic clients that prefer the per-run monotonic counter; ignored on conversation channels because per-run sequence is not unique across runs on the same conversation.
Web clients don’t need to track or re-pass any cursor manually — the browser handles the reconnect via Last-Event-ID. Programmatic clients that drive their own reconnect should track id and pass it as after_event_id= on the next request.
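The cursor priority can be sketched as a single resolution function. This is an illustration under two assumptions that the page does not state outright: a request counts as a run channel when `run_id` is supplied, and headers arrive as a plain mapping keyed exactly `Last-Event-ID`. `resolve_cursor` is a hypothetical name.

```python
from typing import Mapping, Optional, Tuple


def resolve_cursor(
    headers: Mapping[str, str],
    query: Mapping[str, str],
    run_id: Optional[int] = None,
) -> Tuple[str, int]:
    """Return (cursor_kind, value) following the documented priority order."""
    # 1. Browser reconnects replay the last seen event id in a header.
    last_event_id = headers.get("Last-Event-ID")
    if last_event_id is not None:
        return ("event_id", int(last_event_id))
    # 2. Explicit manual cursor, valid on both channel types.
    if "after_event_id" in query:
        return ("event_id", int(query["after_event_id"]))
    # 3. Per-run sequence cursor, honoured only on run channels (assumed: run_id given).
    if run_id is not None and "after_sequence" in query:
        return ("sequence", int(query["after_sequence"]))
    # Both cursors default to 0, i.e. stream from the beginning.
    return ("event_id", 0)
```

Note that `after_sequence` is silently ignored when no `run_id` is given, mirroring the rule that per-run sequences are not unique across a conversation.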
The endpoint requires an authenticated session — the same check used by chat surfaces. It is not yet exposed for unauthenticated Bearer-token integrations; web-trigger callers should use the callback delivery for terminal notifications.
Recovery Handoff
When a run reaches failed, timed_out, or cancelled, the status panel renders a recovery handoff:
- The terminal status label (Run failed, Run timed out, Run cancelled)
- The error_message if any
- A collapsible Last known state block with the structured run journal digest: objective, current state, blockers, next action, recent verified findings
- The journal’s handoff_next_action line, when set, surfaced prominently
- A Resume from last checkpoint button when metadata.resume_available is true
Clicking the resume button sends POST /agent-runs/<run_key>/resume, which enqueues resume_agent_run.delay(run_id). The Celery worker re-claims the lane (transitioning the run to resuming), loads the latest checkpoint, and continues execution. Tool calls that completed before the failure are replayed from the idempotency envelope without being re-invoked.
Cancelled runs surface the journal panel for context but never get a resume button — cancel was a deliberate user action and is not auto-undone.
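The replay behaviour on resume can be illustrated with a result cache keyed by run and tool call. This is an in-memory sketch only, assuming results are keyed by `(run_id, tool_call_id)`; the `ToolResultStore` class is hypothetical and stands in for whatever persistence the runtime actually uses.

```python
from typing import Any, Callable, Dict, Tuple


class ToolResultStore:
    """Hypothetical in-memory idempotency envelope for tool results."""

    def __init__(self) -> None:
        self._results: Dict[Tuple[int, str], Any] = {}

    def call(self, run_id: int, tool_call_id: str, fn: Callable[..., Any], *args: Any) -> Any:
        """Invoke fn once per (run_id, tool_call_id); later calls replay the stored result."""
        key = (run_id, tool_call_id)
        if key not in self._results:
            self._results[key] = fn(*args)
        return self._results[key]
```

On a resumed run, a tool call that already has a stored result returns it directly, so side effects such as sending an email are not repeated.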
Reconciler
A Celery beat task runs every five minutes and is also triggered on every worker boot. It performs three passes:
- Heartbeat timeout (default 5 minutes): runs in running, waiting_on_tool, waiting_on_child, or resuming whose last_heartbeat_at is stale → timed_out. Eligibility for resume is set based on the latest checkpoint type (llm_response, tool_result, journal_update, and input are eligible; human_pause, final, and “no checkpoint” are not).
- Queue timeout (default 1 hour): runs in queued or waiting_on_lane whose created_at is older than the threshold → failed with detail queue timeout. Catches lane drains that lost their dispatcher.
- Stale cancel (default 3 minutes): runs whose cancel_requested=true is older than the threshold → cancelled. Catches waiting_on_human runs that the user cancelled while no loop was active to observe the flag.
waiting_on_human runs are never reconciled by the heartbeat pass; they are paused indefinitely by design.
The reconciler is idempotent — running it twice on the same set of runs produces the same outcome.
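The three passes can be sketched as one decision function over a run record. This is an illustration rather than the actual task: runs are plain dicts, `cancel_requested_at` is a hypothetical timestamp field standing in for however the age of the cancel flag is tracked, and the passes are checked in the order listed above.

```python
from datetime import datetime, timedelta
from typing import Optional

HEARTBEAT_TIMEOUT = timedelta(minutes=5)
QUEUE_TIMEOUT = timedelta(hours=1)
STALE_CANCEL_TIMEOUT = timedelta(minutes=3)

HEARTBEAT_STATES = {"running", "waiting_on_tool", "waiting_on_child", "resuming"}
QUEUE_STATES = {"queued", "waiting_on_lane"}
TERMINAL_STATES = {"completed", "failed", "cancelled", "timed_out"}


def reconcile(run: dict, now: datetime) -> Optional[str]:
    """Return the status the reconciler would assign, or None to leave the run alone."""
    status = run["status"]
    if status in TERMINAL_STATES:
        return None  # already swept or finished, so a second pass changes nothing
    # Pass 1: stale heartbeat on actively executing runs.
    if status in HEARTBEAT_STATES and now - run["last_heartbeat_at"] > HEARTBEAT_TIMEOUT:
        return "timed_out"
    # Pass 2: runs stuck waiting to be picked up.
    if status in QUEUE_STATES and now - run["created_at"] > QUEUE_TIMEOUT:
        return "failed"
    # Pass 3: cancel flag nobody observed (cancel_requested_at is an assumed field).
    requested_at = run.get("cancel_requested_at")
    if requested_at is not None and now - requested_at > STALE_CANCEL_TIMEOUT:
        return "cancelled"
    return None
```

Returning `None` for terminal runs is what makes the sketch idempotent: once a run has been moved to `timed_out`, `failed`, or `cancelled`, another pass leaves it untouched.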
Lane Semantics
Each run carries a lane_key. Only one run per lane may be in a non-terminal active state (running, waiting_on_tool, waiting_on_child, resuming) at a time. Lane defaults:
| Surface | Lane key | Notes |
|---|---|---|
| Chat (admin / staff / public) | conversation:<id> | Two rapid sends serialize on the conversation |
| Web trigger | conversation:<id> | Per-trigger conversation |
| Email reply | conversation:<id> | |
| Email / Slack / Telegram ad-hoc | adhoc:<channel>:<sender_key> | Same sender on the same channel serializes; different senders run in parallel |
| Inbox item / reply | inbox_item:<id> | Replies on the same item serialize; different items run in parallel |
| Scheduled task | task_run:<id> | Each schedule firing creates a fresh TaskRun and runs in its own lane |
| Child run | child:<conversation>:<assistant>:<namespace> | Children don’t block their parent’s lane |
| Human-review resume | conversation:<id> | Resumes wait on whatever’s active in the parent conversation |
Lane claims use a blocking SELECT ... FOR UPDATE on the candidate queued row, with a re-check once the lock is acquired. Concurrent workers contending on the same lane serialize cleanly; different lanes run fully in parallel.
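The lock-then-re-check pattern can be illustrated without a database. The sketch below is an in-memory analogue only: a `threading.Lock` stands in for the row lock, and the `LaneTable` class and its methods are hypothetical. It is not the SQL the runtime issues.

```python
import threading
from typing import Dict

ACTIVE_STATES = {"running", "waiting_on_tool", "waiting_on_child", "resuming"}
CLAIMABLE_STATES = {"queued", "waiting_on_lane"}


class LaneTable:
    """Hypothetical in-memory stand-in for the runs table and its row lock."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.runs: Dict[int, dict] = {}

    def enqueue(self, run_id: int, lane_key: str) -> None:
        self.runs[run_id] = {"lane_key": lane_key, "status": "queued"}

    def try_claim(self, run_id: int) -> bool:
        """Claim the run if its lane is free; otherwise park it on the lane."""
        with self._lock:  # stands in for the blocking row lock
            run = self.runs[run_id]
            # Re-check after acquiring the lock: another worker may have won.
            if run["status"] not in CLAIMABLE_STATES:
                return False
            lane_busy = any(
                other["lane_key"] == run["lane_key"] and other["status"] in ACTIVE_STATES
                for other in self.runs.values()
            )
            if lane_busy:
                run["status"] = "waiting_on_lane"
                return False
            run["status"] = "running"
            return True
```

Two runs on `conversation:42` serialize, while a run on `conversation:43` is claimed immediately, which is the behaviour the lane table describes.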
Inspecting a Run from the CLI
The Debug CLI is the primary tool for dumping an agent run plus all its events, tool results, checkpoints, and child runs:
docker exec teamwebai-web-1 uv run flask debug conversation <conversation_id> --children
For ad-hoc queries during incidents, the relevant tables are:
- agent_runs: one row per execution, indexed by (conversation_id, created_at) and (lane_key, status)
- agent_events: append-only event log, monotonic sequence per run
- agent_checkpoints: one row per safe boundary, ordered by per-run sequence
- agent_tool_results: idempotency envelope, unique on (run_id, tool_call_id)
- agent_run_journals: structured journal, one row per conversation