Guardrails

Guardrails are configurable quality controls that monitor and enforce how an assistant responds. Each guardrail can be independently enabled per assistant, with its own enforcement mode and settings.

Configure guardrails from the Guardrails tab on the assistant detail page.

Enforcement Modes

Each guardrail supports one or both of these modes:

  • Log only — Record violations for review in analytics without affecting the assistant’s behavior. Use this to measure quality before enforcing.
  • Nudge — When a violation is detected, inject a corrective message that asks the assistant to try again. The assistant’s original response is intercepted before delivery. A sync guardrail in nudge mode uses a three-step escalation ladder, so a violation the assistant keeps ignoring gets progressively stronger steering instead of being silently dropped.
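As a rough illustration of the per-assistant model this implies, a guardrail's configuration could be represented along these lines (the class and field names are assumptions, not the actual schema):

```python
from dataclasses import dataclass, field
from enum import Enum


class EnforcementMode(str, Enum):
    LOG_ONLY = "log_only"  # record violations for analytics, never alter behavior
    NUDGE = "nudge"        # intercept the response and steer the assistant to retry


@dataclass
class GuardrailConfig:
    """Hypothetical per-assistant guardrail configuration (illustrative names)."""
    guardrail_type: str                               # e.g. "knowledge_grounding"
    enabled: bool = False
    mode: EnforcementMode = EnforcementMode.LOG_ONLY
    sample_rate: float = 1.0                          # used by async guardrails
    settings: dict = field(default_factory=dict)      # guardrail-specific options


# Example: grounding enforced via nudges, faithfulness observed at 20% sampling.
grounding = GuardrailConfig("knowledge_grounding", enabled=True, mode=EnforcementMode.NUDGE)
faithfulness = GuardrailConfig("faithfulness", enabled=True, sample_rate=0.2)
```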

Escalation Ladder

When a sync guardrail in Nudge mode fires and the assistant ignores the correction, the guardrail escalates through three levels on subsequent fires within the same turn:

  • L1 — Soft nudge: A corrective user-style message is injected (“please search the knowledge base before answering”). The assistant decides freely how to respond.
  • L2 — Hard directive: A stronger system-style message is injected (“You MUST call search_knowledge before answering”) and the next LLM call is constrained with tool_choice to force the relevant tool (e.g. search_knowledge for knowledge grounding).
  • L3 — Human handoff: A forced ask_human tool call is issued with a question explaining that the assistant couldn’t meet the guardrail and asking the user how to proceed.

This replaces the earlier fire-once behavior, in which a second fire of the same guardrail was silently dropped and ungrounded answers could ship. The ladder ensures every repeated violation either produces a corrected response or an explicit human check-in.

Trust the deflection. The knowledge grounding guardrail preserves an important exception at L1: if the assistant is nudged to search but chooses not to (e.g. the question is genuinely out of scope), the original text is returned as the final answer rather than escalating. Once escalation reaches L2 or L3, the hard directive takes over — at that point the model has explicitly been told to search and still hasn’t, so the original is no longer considered a correct deflection.
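A minimal sketch of the ladder logic described above, assuming a per-invocation level counter and a hypothetical is_deflection flag standing in for the trust-the-deflection test (names and return shapes are illustrative):

```python
from dataclasses import dataclass


@dataclass
class EscalationState:
    """Ladder level for one sync guardrail within a single agent-loop invocation."""
    level: int = 0  # L0: no violation handled yet in this invocation


def on_violation(state: EscalationState, original_text: str, is_deflection: bool) -> dict:
    """Advance the ladder one step each time the guardrail fires in nudge mode.

    `is_deflection` is a hypothetical stand-in for the trust-the-deflection check:
    the assistant was nudged to search but the question is genuinely out of scope.
    """
    if state.level == 0:
        # L1: soft nudge, a corrective user-style message; the assistant responds freely.
        state.level = 1
        return {"action": "inject_user_message",
                "text": "Please search the knowledge base before answering."}

    if state.level == 1:
        if is_deflection:
            # Trust the deflection: deliver the original text and reset the ladder.
            state.level = 0
            return {"action": "deliver", "text": original_text}
        # L2: hard system-style directive plus tool_choice forcing the relevant tool.
        state.level = 2
        return {"action": "inject_system_message",
                "text": "You MUST call search_knowledge before answering.",
                "tool_choice": {"type": "tool", "name": "search_knowledge"}}

    # L3: human handoff via a forced ask_human tool call.
    state.level = 3
    return {"action": "force_tool",
            "tool": "ask_human",
            "question": ("I couldn't meet the knowledge grounding guardrail for this "
                         "answer. How would you like to proceed?")}
```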

Knowledge Grounding

Checks whether the assistant searched the knowledge base before making factual claims. When enabled, this guardrail also injects knowledge boundary instructions into the system prompt, restricting the assistant to answer only from its knowledge base.

This replaces the old “Knowledge Only” mode with a more flexible approach: you can combine knowledge grounding with any assistant mode and fine-tune the enforcement level.

How it works:

  1. After the assistant responds, the guardrail classifies the response type (factual claim vs. conversational)
  2. Conversational responses — greetings, clarifications, follow-up questions — are never penalized
  3. If the response contains factual claims and no knowledge tool was called, the guardrail fires
  4. In nudge mode, the response is intercepted and the escalation ladder advances by one step: first a soft nudge, then a hard search_knowledge directive, finally a forced ask_human escalation
  5. In log only mode, the event is recorded for analytics without affecting the response
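A compact sketch of steps 1 through 3, assuming a classifier callable and a knowledge tool named search_knowledge (both are stand-ins rather than confirmed internals):

```python
from typing import Callable


def grounding_fires(response_text: str,
                    tools_called: list[str],
                    classify: Callable[[str], str]) -> bool:
    """Return True when the knowledge grounding guardrail should fire.

    `classify` stands in for the response-type classifier in step 1 and is assumed
    to return either "factual" or "conversational".
    """
    # Step 2: conversational responses (greetings, clarifications, follow-up
    # questions) are never penalized.
    if classify(response_text) == "conversational":
        return False
    # Step 3: fire when factual claims were made without any knowledge tool call.
    return "search_knowledge" not in tools_called
```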

Settings:

  • Knowledge Boundary Instructions — Custom rules injected into the system prompt when this guardrail is enabled. Leave empty for sensible defaults that restrict the assistant to answer from its knowledge base only. These instructions tell the assistant to always search before answering, to say “I don’t know” when the knowledge base has no answer, and to decline off-topic requests.
The boundary instructions are injected into the system prompt regardless of enforcement mode. Even in “log only” mode, the assistant receives instructions to search first — the enforcement mode only controls what happens if it doesn’t.
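A sketch of how that injection might look, assuming the default wording below (the real default text is not published here):

```python
DEFAULT_BOUNDARY_INSTRUCTIONS = (
    "Always search the knowledge base before answering. "
    "If the knowledge base has no answer, say you don't know. "
    "Politely decline requests that are off-topic for your knowledge base."
)


def build_system_prompt(personality_prompt: str,
                        boundary_instructions: str | None,
                        grounding_enabled: bool) -> str:
    """Append knowledge boundary instructions whenever the guardrail is enabled.

    Enforcement mode is deliberately not consulted here: the instructions are
    injected even in log-only mode.
    """
    if not grounding_enabled:
        return personality_prompt
    rules = boundary_instructions or DEFAULT_BOUNDARY_INSTRUCTIONS
    return f"{personality_prompt}\n\n{rules}"
```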

Faithfulness

Evaluates whether the assistant’s factual claims are actually supported by the context it retrieved. This catches the case where the assistant searched the knowledge base but then ignored the results and made something up.

Runs asynchronously after each response — it does not add latency to the user’s experience. Uses Claude Haiku as a binary judge (PASS/FAIL) with a one-sentence reasoning explanation.
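A sketch of what such a judge call could look like with the Anthropic Python SDK; the model ID string and prompt wording are assumptions, and the real evaluator may differ:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def judge_faithfulness(response_text: str, retrieved_context: str) -> tuple[bool, str]:
    """Ask a small judge model whether the response is supported by the context."""
    result = client.messages.create(
        model="claude-haiku-4-5",  # assumed model ID for Claude Haiku 4.5
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": (
                "Judge whether every factual claim in the RESPONSE is supported by "
                "the CONTEXT. Answer with PASS or FAIL on the first line, followed "
                "by a one-sentence reason.\n\n"
                f"CONTEXT:\n{retrieved_context}\n\nRESPONSE:\n{response_text}"
            ),
        }],
    )
    verdict, _, reason = result.content[0].text.partition("\n")
    return verdict.strip().upper().startswith("PASS"), reason.strip()
```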

Settings:

  • Enforcement mode — Log only (record scores) or Nudge (hint to the assistant on the next turn when it fails)
  • Sample rate — What percentage of responses to evaluate. Default is 100% for nudge mode, 20% for log only. Higher rates give better data but cost more.
  • Evaluation model — Which model to use for judging. Default is Claude Haiku 4.5 (fastest and cheapest). Claude Sonnet 4.6 is available for more nuanced evaluation.
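The sample rate behaves like a per-response probability gate; roughly, expressing the rate as a fraction:

```python
import random


def should_evaluate(sample_rate: float) -> bool:
    """Evaluate this response with probability `sample_rate` (e.g. 0.2 for 20%)."""
    return random.random() < sample_rate
```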

Instruction Adherence

Periodically evaluates whether the assistant is following its personality prompt and configured instructions. This is a trend metric for dashboards — it runs as log-only and is not used as a real-time gate.

Useful for detecting drift over time: is the assistant staying in character? Is it respecting scope boundaries? Is it following explicit rules in the personality prompt?

Settings:

  • Sample rate — Default 10%. This is a trend metric, so you don’t need to evaluate every response.
  • Evaluation model — Default Claude Haiku 4.5.
  • Custom rubric — Optional override for what the evaluator checks against. If empty, the assistant’s personality prompt is used as the evaluation criteria.
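A sketch of the rubric fallback and judge prompt, with wording that is purely illustrative:

```python
def build_adherence_prompt(personality_prompt: str,
                           custom_rubric: str | None,
                           response_text: str) -> str:
    """Build the adherence judge prompt; the exact wording is an assumption."""
    # A custom rubric overrides the criteria; otherwise the personality prompt is used.
    criteria = custom_rubric or personality_prompt
    return (
        "Judge whether the RESPONSE follows the CRITERIA. Answer PASS or FAIL on "
        "the first line, followed by a one-sentence reason.\n\n"
        f"CRITERIA:\n{criteria}\n\nRESPONSE:\n{response_text}"
    )
```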

Metrics

The guardrails tab shows a summary bar with metrics from the last 7 days:

  • Grounding nudges — How many times the knowledge grounding guardrail fired
  • Faithfulness pass rate — Percentage of evaluated responses that passed the faithfulness check
  • Adherence pass rate — Percentage of evaluated responses that followed instructions
  • Eval cost — Total cost of evaluation LLM calls

Per-message evaluation results are also visible as small badges on individual messages in the conversation view (admin users only). Click the badge to see the evaluator’s reasoning.

Aggregate guardrail metrics are available in the Analytics dashboard.

Technical Details

Sync vs. async guardrails — Knowledge grounding runs synchronously inside the agent loop (it can intercept and redirect responses). Faithfulness and instruction adherence run asynchronously as Celery background tasks after the response is stored — they score quality without blocking the user.
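A sketch of how an async guardrail might be queued, reusing the judge function from the Faithfulness sketch above; the task name, arguments, and broker URL are assumptions:

```python
from celery import Celery

app = Celery("guardrails", broker="redis://localhost:6379/0")  # broker URL is illustrative


@app.task
def evaluate_faithfulness(message_id: str, response_text: str, retrieved_context: str) -> dict:
    """Score a stored response without blocking the user-facing request.

    judge_faithfulness is the sketch from the Faithfulness section above; a real
    task would also persist the verdict to guardrail_evaluations.
    """
    passed, reason = judge_faithfulness(response_text, retrieved_context)
    return {"message_id": message_id, "passed": passed, "reason": reason}


# Queued fire-and-forget once the assistant's response has been stored, e.g.:
# evaluate_faithfulness.delay(message_id, response_text, retrieved_context)
```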

Evaluation cost — Each faithfulness or adherence evaluation makes one LLM call to Claude Haiku (~$0.002 per evaluation). At default sample rates, a moderately active assistant (100 messages/day) costs roughly $0.20/day for faithfulness checks and $0.02/day for adherence checks.
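The figures follow directly from the default sample rates:

```python
messages_per_day = 100
cost_per_eval = 0.002  # ~$0.002 per Claude Haiku judge call

faithfulness_cost = messages_per_day * 1.00 * cost_per_eval  # 100% sampling -> $0.20/day
adherence_cost = messages_per_day * 0.10 * cost_per_eval     # 10% sampling  -> $0.02/day
```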

Escalation state — Each sync guardrail tracks its ladder level per agent-loop invocation. The level resets when the loop returns, so a fresh conversation turn starts at L0 again. At L1, the “trust the deflection” fallback resets the ladder to L0 if the model’s follow-up doesn’t repeat the ungrounded claim — this prevents the ladder from escalating on a correctly-handled off-topic question.

Emitted telemetry — Every escalation emits an agent.guardrail.escalated log event with the guardrail type, the new level, and the forced tool (if any). These feed dashboards and make it easy to spot guardrails that frequently reach L3 (which indicates a persistent compliance issue, not a one-off).
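A sketch of emitting that event with the standard logging module; everything except the event name is an assumed field:

```python
import json
import logging

logger = logging.getLogger("agent.guardrail")


def emit_escalation(guardrail_type: str, level: int, forced_tool: str | None) -> None:
    """Emit the agent.guardrail.escalated event (field names are illustrative)."""
    logger.info("agent.guardrail.escalated %s", json.dumps({
        "guardrail": guardrail_type,  # e.g. "knowledge_grounding"
        "level": level,               # 1, 2, or 3
        "forced_tool": forced_tool,   # e.g. "search_knowledge" at L2, "ask_human" at L3
    }))
```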

Data model — Guardrail configurations are stored in the assistant_guardrails table (one row per guardrail per assistant). Evaluation results are stored in guardrail_evaluations with per-message granularity, including pass/fail, reasoning, model used, token count, cost, and duration.
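A rough sketch of the shapes these rows imply, as plain dataclasses; column names beyond those listed above are assumptions:

```python
from dataclasses import dataclass


@dataclass
class AssistantGuardrail:
    """One row per guardrail per assistant (assistant_guardrails); fields are illustrative."""
    assistant_id: str
    guardrail_type: str  # "knowledge_grounding", "faithfulness", "instruction_adherence"
    enabled: bool
    mode: str            # "log_only" or "nudge"
    settings: dict


@dataclass
class GuardrailEvaluation:
    """One row per evaluated message (guardrail_evaluations), following the list above."""
    message_id: str
    guardrail_type: str
    passed: bool         # pass/fail verdict
    reasoning: str       # judge's one-sentence explanation
    model: str           # evaluation model used
    tokens: int
    cost_usd: float
    duration_ms: int
```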