Agent Loop

When an assistant responds to a message, it runs inside the agent loop — a supervised tool-use cycle that talks to the LLM, executes the tools the model asks for, and decides when to finalize a reply. This page covers the resilience mechanisms that keep long or tricky tasks from stalling silently.

If the short version is all you need: assistants automatically retry flaky tool calls, detect when they’re going in circles, compact the conversation when it gets too long, and fall back to asking the human when they run out of context. None of this is configurable per assistant — it applies to every agent loop invocation.

Tool Retries

External tools sometimes fail transiently — a knowledge-base lookup times out, a dataset query hits a temporary connection error, an HTTP tool gets a 503. The agent loop wraps every tool dispatch with automatic retry for these cases.

When a retry happens:

  • The failure is a transient error — network timeout, connection refused, HTTP 429 rate-limit, or HTTP 5xx
  • And the tool is marked idempotent — safe to re-invoke with the same arguments

Retries use exponential backoff (0.5s → 2s → 8s) across up to three retry attempts, after which the error is surfaced to the model so it can decide how to recover.
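
In outline, the wrapper looks something like the sketch below. The names (dispatch_with_retry, tool.invoke, tool.idempotent, is_transient) are illustrative rather than the runtime's actual API; only the backoff schedule and the retry conditions come from the behavior described above.

```python
import time

BACKOFF_SECONDS = (0.5, 2.0, 8.0)  # the schedule described above

def is_transient(exc):
    # Timeouts, refused connections, 429s, and 5xx responses count as transient.
    if isinstance(exc, (TimeoutError, ConnectionRefusedError)):
        return True
    status = getattr(exc, "status_code", None)
    return status == 429 or (status is not None and 500 <= status < 600)

def dispatch_with_retry(tool, args):
    for attempt in range(len(BACKOFF_SECONDS) + 1):
        try:
            return tool.invoke(args)
        except Exception as exc:
            # Non-transient errors, side-effecting tools, and an exhausted
            # budget all surface the failure to the model instead of retrying.
            if not (is_transient(exc) and tool.idempotent):
                raise
            if attempt == len(BACKOFF_SECONDS):
                raise
            time.sleep(BACKOFF_SECONDS[attempt])
```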

What’s idempotent, what isn’t. Read-only tools — search_knowledge, get_notes, list_*, query_dataset, read_file, and the other lookups — are idempotent by default and will be retried. Side-effecting tools like send_email, save_content, create_task, and update_note are not retried, because retrying a half-applied write could duplicate an email send or create two copies of a deliverable. When one of those fails, the error goes straight to the model and the assistant decides whether to try again explicitly.

Retry activity is logged to the conversation’s tool-call log so you can spot flaky dependencies in the API Logs.

Tool-Loop Detection

When an assistant repeatedly calls the same tool — either with identical arguments or slightly varying ones — the agent loop detects the pattern and intervenes rather than letting it burn iterations.

Two detectors run after every tool batch:

  • Identical thrash — the same (tool name, arguments) appeared at least 3 times in the last 6 calls. Typically “searched for the same query three times, got the same result.”
  • Pattern thrash — the same tool name appeared at least 4 times in the last 6 calls with differing arguments. Typically “tried five variations of the same query, nothing found.”

When a pattern fires, the loop injects a message following the escalation ladder: first a soft nudge (“try a different approach or a different tool”), then a hard directive, then a forced ask_human if the pattern keeps repeating.
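
A minimal sketch of the two detectors, assuming each tool call is recorded as a (name, arguments) pair; the function name and signature are illustrative, while the window and thresholds are the ones above.

```python
from collections import Counter

WINDOW = 6  # both detectors look at the last six tool calls

def detect_thrash(recent_calls):
    """recent_calls: (tool_name, hashable_args) tuples, oldest first.
    Returns "identical", "pattern", or None."""
    window = recent_calls[-WINDOW:]
    exact = Counter(window)                      # (name, args) pairs
    names = Counter(name for name, _ in window)  # names only
    if any(n >= 3 for n in exact.values()):
        return "identical"   # same tool, same args, >= 3 of the last 6
    if any(n >= 4 for n in names.values()):
        return "pattern"     # same tool, varying args, >= 4 of the last 6
    return None
```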

Every detection emits an agent.loop.detected log event with the tier, tool name, and current escalation level.

Auto-Compaction

Long-running tasks can fill the model’s context window with old tool results, search snippets, and intermediate reasoning — eventually leaving no room for new work. The agent loop tracks an estimate of the conversation’s token footprint and automatically compacts older messages when the estimate crosses 70% of the model’s context window.

Compaction uses a “replace-and-tail with pinned tool results” strategy:

  1. The last ten messages are preserved verbatim, so the model’s immediate working context is unchanged
  2. Everything older is summarized into a single message via a cheap LLM call (Haiku by default)
  3. The latest successful result per unique tool is pinned into the summary, so long research tasks don’t forget what they already discovered

Compaction happens silently — there’s no tool the assistant has to invoke and no visible message in the conversation. If the summary LLM call fails for any reason, the loop falls back to a deterministic truncation so work never stalls waiting for compaction.
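
Put together, the strategy looks roughly like this sketch. Here summarize_llm, latest_successful_result_per_tool, and deterministic_truncate stand in for internals not shown (a version of the pinning helper is sketched under Technical Details), and the shape of the replacement message is an assumption.

```python
def compact(messages, summarize_llm, tail=10):
    head, tail_msgs = messages[:-tail], messages[-tail:]
    if not head:
        return messages                                # nothing old enough to fold
    pinned = latest_successful_result_per_tool(head)   # step 3: pin tool results
    try:
        summary = summarize_llm(head)                  # step 2: cheap LLM summary
    except Exception:
        summary = deterministic_truncate(head)         # fallback: never stall
    body = summary + "\n\nPinned tool results:\n" + "\n".join(
        f"- {name}: {result}" for name, result in pinned.items()
    )
    # Step 1: the synthetic summary replaces the head; the tail is untouched.
    return [{"role": "user", "content": body}] + tail_msgs
```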

Auto-compaction replaced the earlier summarize_history tool. Assistants don’t need to decide whether to compact — the loop does it on their behalf whenever the context pressure justifies it.

Recovery Ladder

When compaction alone isn’t enough — a task keeps generating enough context that it re-crosses the threshold quickly — the loop escalates through a three-stage recovery ladder:

Stage | Trigger | What it does
R1 — Auto-compact | First time tokens cross 70% of context | Standard compaction (see above)
R2 — Deep prune | Still over 70% within 20 iterations of R1 | Aggressive compaction: keep only the last four messages verbatim, summarize the rest with a “deep pruned” marker
R3 — Ask the human | Still over 70% within 20 iterations of R2 | Force an ask_human tool call: “I’ve made progress but I’m losing context faster than I can recover. Keep going, change direction, or stop?”

After four total recovery actions in a single loop, further recoveries are skipped to prevent thrash — the loop runs to its iteration or time limit instead. In practice R3 almost always resolves the situation, since the human gives concrete direction about which threads to drop.
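
One way the stage selection could be expressed, using the LoopState counters described under Technical Details; the exact decision logic here is an assumption, while the thresholds are the ones in the table.

```python
def next_recovery_stage(state, token_pct):
    if token_pct < 0.70:
        return None
    if state.recoveries_used_total >= 4:
        return None      # thrash guard: run to the iteration or time limit instead
    if state.compactions_used == 0:
        return "R1"      # first crossing: standard auto-compact
    if state.prunes_used == 0:
        if state.iteration - state.last_compaction_iteration <= 20:
            return "R2"  # re-crossed soon after R1: deep prune
        return "R1"      # pressure came back much later: compact again
    if state.iteration - state.last_prune_iteration <= 20:
        return "R3"      # re-crossed soon after R2: force ask_human
    return "R1"
```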

Recovery events emit agent.compaction.run (R1) and agent.recovery.fired (R2, R3) telemetry with before/after token counts and message counts.

Time Limits

Agent-running Celery tasks are bounded by a soft time limit of 15 minutes and a hard time limit of 20 minutes:

  • At the soft limit, the loop catches the signal and immediately runs one final “summarize what you’ve done” turn, so the user gets a partial answer instead of an abrupt disconnect
  • At the hard limit, the worker is killed — this is a backstop for bugs, not a normal code path
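
Declaring these bounds on a Celery task uses standard Celery options; the sketch below assumes a hypothetical loop entry point and summary helper.

```python
from celery import shared_task
from celery.exceptions import SoftTimeLimitExceeded

@shared_task(soft_time_limit=15 * 60, time_limit=20 * 60)
def run_agent(run_id):
    try:
        agent_loop(run_id)                  # hypothetical loop entry point
    except SoftTimeLimitExceeded:
        # Soft limit reached: run one final "summarize what you've done" turn
        # so the user gets a partial answer instead of an abrupt disconnect.
        finalize_with_summary(run_id)       # hypothetical helper
```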

Most conversations finish well inside the soft limit. The limits exist so a stuck loop can’t monopolize a worker and starve the queue for other users.

Cooperative Cancellation

Every agent execution runs as a tracked agent run with a cancel_requested flag. When a user clicks Stop on the chat status panel (or any UI surface that exposes the cancel control), the runtime flips the flag — it does not kill the worker mid-tool.

The agent loop refreshes the flag at safe boundaries — before each LLM call and after each tool batch — and exits cleanly if cancellation has been requested. Cancellation latency is therefore at most one LLM call or one tool result, in exchange for never killing a worker that’s holding a half-finished side effect.
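
The boundary checks amount to polling the flag twice per iteration; all names in this sketch are illustrative.

```python
def agent_loop(run):
    while not run.finished:
        if cancel_requested(run):        # refreshed before each LLM call
            return exit_cancelled(run)
        response = call_llm(run)
        results = execute_tools(response)
        commit(results)                  # results that completed are committed
        if cancel_requested(run):        # refreshed after each tool batch
            return exit_cancelled(run)
```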

When the loop observes a cancel, the run transitions to cancelled, no apology message is written, and the channel-side reply is suppressed. Tool results that completed before the boundary are committed; in-flight tools at the boundary itself are not interrupted.

Runs that are paused at waiting_on_human have no live loop to observe the flag; they are reconciled to cancelled automatically once the cancel-request flag has been set for longer than the cancel timeout (default 3 minutes).

Run Lanes

Every agent execution claims a lane — typically conversation:<id> — and only one run per lane is active at a time. If a user fires two messages in rapid succession on the same conversation, the second message becomes a queued run that starts when the first reaches a terminal state. Different conversations run in parallel; the same conversation serializes.

Child runs (work delegated by spawn_child_runs) get their own lane keyed on the child conversation, so they don’t block the parent’s lane and aren’t blocked by sibling delegations on different conversations.
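
The lane discipline reduces to one check at enqueue time; a sketch with assumed helper names:

```python
def enqueue(run, conversation_id):
    lane = f"conversation:{conversation_id}"  # child runs key on the child conversation
    if lane_has_active_run(lane):
        queue_behind(lane, run)   # starts when the active run reaches a terminal state
    else:
        claim_and_start(lane, run)
```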

Checkpoints and Resume

The loop writes a small checkpoint record at each safe boundary — after user input is persisted, after each LLM response is logged, after each tool result is persisted, after compaction, after journal updates, and before the final response. Each checkpoint stores the iteration counter, message cursor, persisted-message count, system-prompt hash, and the IDs of completed tool calls.
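
The record is small enough to show in full; the field names in this sketch are assumptions inferred from the list above.

```python
from dataclasses import dataclass, field

@dataclass
class Checkpoint:
    kind: str                  # e.g. "llm_response", "tool_result", "final"
    iteration: int             # loop iteration counter
    message_cursor: int        # position in the message stream
    persisted_messages: int    # persisted-message count
    system_prompt_hash: str    # compared on resume to detect prompt drift
    completed_tool_calls: list[str] = field(default_factory=list)
```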

If a worker crashes — process killed, container restart, transient infra failure — the run’s heartbeat goes stale. The reconciler (which runs every five minutes and on every worker boot) marks the run timed_out and inspects the latest checkpoint:

  • Latest checkpoint is llm_response, tool_result, journal_update, or input → metadata.resume_available = true. The run is offered for resume.
  • Latest checkpoint is human_pause or final, or there is no checkpoint → resume is not eligible. The run terminates as timed_out and a human must intervene.

When a user clicks Resume on the failure panel (or an automated path enqueues resume_agent_run), the loop is rehydrated from the latest checkpoint. The state restoration verifies the system prompt hash hasn’t drifted; if it has (e.g. the assistant was edited mid-failure), the resume halts to waiting_on_human rather than guessing.

Idempotency envelope. Each tool invocation is recorded in an agent_tool_results row keyed by (run_id, tool_call_id). On resume:

  • Tools whose row has a completed_at are replayed — the stored result is returned without re-invoking the tool. This works for any tool, idempotent or not, as long as the prior call finished.
  • Tools whose row has started_at but no completed_at indicate a partial mid-flight call. If the tool is idempotent (declared safe to re-invoke), the loop runs it again. If the tool is side-effecting, the loop raises AgentResumeUnsafe and transitions the run to waiting_on_human — the runtime refuses to risk a duplicate send/create/update.

The conservative posture means resume only succeeds when the runtime can prove the next action is safe; ambiguous cases always halt for human review rather than retrying blindly.
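
The per-tool resume decision, as a sketch: the row fields mirror the agent_tool_results columns above, while everything else (function name, tool object shape) is assumed.

```python
class AgentResumeUnsafe(Exception):
    """Raised when a mid-flight side-effecting call cannot be proved safe."""

def replay_or_rerun(row, tool):
    if row.completed_at is not None:
        return row.stored_result            # replay: never re-invoke a finished call
    if row.started_at is not None:          # mid-flight when the worker died
        if tool.idempotent:
            return tool.invoke(row.args)    # declared safe to re-invoke
        raise AgentResumeUnsafe(tool.name)  # halt to waiting_on_human
    return tool.invoke(row.args)            # never started: just run it
```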

Run Journal

Every conversation has a structured run journal — durable working memory the assistant maintains between iterations. Unlike the chat transcript (which is an append-only history of turns), the journal is editable state the assistant writes to deliberately as it makes progress.

The journal has these sections:

Section | What it holds
objective | The task the assistant is working on, in its own words
current_state | A short paragraph on where the work is right now
verified_findings | List entries the assistant has confirmed (e.g. “Tool X is idempotent”)
decisions | List of choices made, along with the reasoning
artifacts | Pointers to deliverables produced (file paths, doc IDs)
open_questions | List of things the assistant flagged as unresolved
blockers | List of things stopping forward progress
handoff_next_action | One-line “what should happen next if I stop here”
The assistant maintains the journal via three tools: get_run_journal (read), update_run_journal (top-level fields), and append_run_journal_entry (list sections).
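
Shown as (name, arguments) payloads the assistant might emit, a journal-maintenance sequence could look like this; the argument shapes are assumptions based on the section table above.

```python
calls = [
    ("update_run_journal", {              # top-level fields
        "objective": "Summarize Q3 churn drivers",
        "current_state": "Cohort queries done; drafting the findings doc",
        "handoff_next_action": "Draft the doc from the pinned query results",
    }),
    ("append_run_journal_entry", {        # list sections
        "section": "verified_findings",
        "entry": "Churn spike is isolated to the trial cohort",
    }),
    ("append_run_journal_entry", {
        "section": "open_questions",
        "entry": "Does the spike predate the pricing change?",
    }),
    ("get_run_journal", {}),              # read everything back later
]
```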

The journal is injected into the assistant’s context in two ways on every continuation:

  • Compact digest in the system prompt — objective, current state, first blocker, next action, and the most recent verified findings. Always present, ~500 characters.
  • Full rendered journal as a user/assistant pair before the live messages — the complete Markdown of every section, sitting just after any history summary. This is what survives iteration-limit halts: when a turn hits MAX_AGENT_ITERATIONS and the next user message starts a fresh continuation, the rebuilt message stream still carries everything the assistant had written down. Without this, the resumed run would re-discover findings the previous turn had already confirmed.

When a run reaches a terminal failure state, the journal is what powers the recovery handoff panel: the operator sees the assistant’s last objective, what state it had reached, and what it suggested as the next action.

Observability

Every resilience mechanism emits structured log events so you can watch them on dashboards and in the API Logs view. Look for:

Event | Fired when
agent.tool.retry | A tool execution was retried after a transient failure
agent.loop.detected | Tool-loop detection escalated (tier and level in the payload)
agent.guardrail.escalated | A sync guardrail advanced up the escalation ladder
agent.compaction.run | Auto-compaction fired (R1), with before/after token counts
agent.recovery.fired | A deeper recovery stage fired (R2 deep prune or R3 ask_human)
agent.exhaustion.celery_soft | Celery soft time limit fired; a graceful summary turn was triggered
agent_run.started | Worker claimed a run from the lane and began executing
agent_run.completed | Run reached the terminal completed state
agent_run.cancelled | Run observed cancel_requested at a safe boundary and exited
agent_run.failed | Run reached the terminal failed state (full traceback in the log)
agent_run.resume_unsafe | Resume halted to waiting_on_human because the next action could not be proved safe
agent_run.reconcile | The periodic reconciler ran (counts of timed_out / failed / cancelled runs in the payload)
agent_run.dispatch_next_failed | Lane drain failed to dispatch the next queued run; logged but does not block the lane indefinitely (the next enqueue picks it up)

Frequent escalations or recoveries on a particular assistant usually indicate a design problem — the personality prompt asking the model to do something it can’t, a flaky external tool, or a knowledge base that’s too sparse. The telemetry is meant to make these problems visible rather than hidden behind silent failures.

Technical Details

Token estimator — The running token count uses tiktoken’s cl100k_base encoding as a proxy across all providers. It isn’t exact for models outside the OpenAI family (Anthropic’s included), but it is more than precise enough to drive a 70% threshold. Callers that need exact counts query the provider’s API.
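
The estimator itself is only a few lines; this sketch assumes messages carry plain-text content.

```python
import tiktoken

ENC = tiktoken.get_encoding("cl100k_base")

def estimate_tokens(messages):
    # A proxy count across providers; precise enough to drive a 70% threshold.
    return sum(len(ENC.encode(m["content"])) for m in messages)
```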

Pinning tool results by name — When compaction runs, it walks the head of the conversation building a tool_use_id → tool_name map from assistant turns, then finds the most recent successful tool_result block per tool name in the user turns. Errors ("Error executing ..." prefix) are skipped — pinning a failed result would give the next turn a misleading “this is what you know” signal. The pinned results are inlined into the synthetic summary message as plain text, not re-injected as tool_result blocks (which would reference tool_use blocks that no longer exist).
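
The walk described above, sketched against Anthropic-style message blocks; the function name and text extraction are assumptions.

```python
def latest_successful_result_per_tool(head):
    id_to_name = {}
    for msg in head:                          # pass 1: tool_use_id -> tool name
        if msg["role"] == "assistant" and isinstance(msg["content"], list):
            for block in msg["content"]:
                if block.get("type") == "tool_use":
                    id_to_name[block["id"]] = block["name"]
    pinned = {}
    for msg in head:                          # pass 2: later results overwrite earlier
        if msg["role"] != "user" or not isinstance(msg["content"], list):
            continue
        for block in msg["content"]:
            if block.get("type") != "tool_result":
                continue
            content = block.get("content", "")
            if not isinstance(content, str):  # may be a list of text blocks
                content = "".join(b.get("text", "")
                                  for b in content if b.get("type") == "text")
            if content.startswith("Error executing"):
                continue                      # never pin a failed result
            name = id_to_name.get(block.get("tool_use_id"))
            if name:
                pinned[name] = content
    return pinned
```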

Force-tool via tool_choice — When escalation reaches L2 or L3, or when the recovery ladder reaches R3, the next LLM call is issued with Anthropic’s tool_choice constraint, which forces the model to call the named tool (search_knowledge, ask_human) as its next action rather than leaving the call optional. Providers that don’t support tool_choice (e.g. Ollama) ignore the constraint.
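
The tool_choice shape is part of Anthropic's Messages API; in a call it looks like the sketch below, where the model name, tool schema, and message content are placeholders.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# `tools` must include the forced tool; this schema is a minimal placeholder.
tools = [{
    "name": "ask_human",
    "description": "Ask the human operator for direction.",
    "input_schema": {"type": "object",
                     "properties": {"question": {"type": "string"}},
                     "required": ["question"]},
}]

response = client.messages.create(
    model="claude-3-5-haiku-latest",   # placeholder model name
    max_tokens=1024,
    messages=[{"role": "user", "content": "Status check."}],
    tools=tools,
    tool_choice={"type": "tool", "name": "ask_human"},  # must call this tool
)
```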

State lives on LoopState — Per-loop counters (compactions_used, prunes_used, recoveries_used_total, last_compaction_iteration, last_prune_iteration) live on the loop’s state object. They reset for every new conversation turn, so recovery history doesn’t carry across turns. Tool-call retry counters are per-call and scoped by the Anthropic tool_use_id, so the same tool invoked twice in one turn gets two independent retry budgets.