Playground

The Playground is a test bench for tuning assistant behavior. Instead of guessing whether a prompt change works, you can send a message, see the full agent response — including tool calls, token counts, and cost — and iterate on the spot.

Opening the Playground

Click the Playground tab on any assistant’s detail page. The playground opens in its own full-page layout with:

  • The compiled system prompt that would be sent to the LLM
  • Configuration toggles for tools, skills, and MCP servers
  • A test message input
  • A results area showing responses and metrics
  • A run history of past tests

System Prompt

The top of the page shows the fully compiled system prompt — the exact text the LLM will receive, assembled from the personality prompt, template tags, tool descriptions, and all other context sections.
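The assembly order described above can be sketched roughly as follows. This is an illustrative sketch only; the function and parameter names (`compile_system_prompt`, `context_sections`) are assumptions, not the real internals:

```python
def compile_system_prompt(personality: str,
                          tool_descriptions: list[str],
                          context_sections: list[str]) -> str:
    """Illustrative assembly of the compiled prompt shown in the Playground:
    personality prompt first, then context sections, then tool descriptions."""
    parts = [personality, *context_sections, "Available tools:", *tool_descriptions]
    # Skip empty sections so the compiled text has no blank gaps.
    return "\n\n".join(p for p in parts if p)
```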

Click Edit to modify the prompt for testing. Changes only apply to that test run — they are never saved back to the assistant. Click View to switch back to the read-only display.

Use the collapse button to hide the prompt section and focus on testing.

Configuration

Model Selection

Choose which model to use for the test run from the Model A dropdown. The choices come from the models available in your installation.

To compare two models side-by-side, select a second model in the Model B dropdown. This runs the same message through both models simultaneously, showing results in separate panels.

Max Tokens

Set the maximum number of tokens for the response. Defaults to 16,000.

Tools, Skills, and MCP Servers

Below the model settings, checkboxes let you toggle individual tools, skills, and MCP servers on or off for the test run. Use Toggle All to quickly select or deselect an entire group.

This lets you test how the assistant behaves with specific capabilities enabled or disabled — for example, disabling search_knowledge to see whether the assistant falls back to its training data.
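Conceptually, the toggles just restrict which tools are offered to the agent for that one run. A minimal sketch (the `build_toolset` helper and the tool list are illustrative, though `search_knowledge` appears in the docs above):

```python
ALL_TOOLS = ["search_knowledge", "save_content", "run_code"]

def build_toolset(enabled: set[str]) -> list[str]:
    """Return only the tools toggled on for this test run,
    preserving the assistant's original tool order."""
    return [t for t in ALL_TOOLS if t in enabled]

# Unchecking search_knowledge means the agent must answer
# from the model's training data alone.
build_toolset({"save_content", "run_code"})
```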

Running a Test

Type a message and click Run (or press Enter). The playground:

  1. Creates a real conversation tagged with channel_plugin="playground"
  2. Sends your message through the full agent loop — tools execute, knowledge is searched, skills activate, MCP servers respond
  3. Shows a thinking indicator while processing
  4. Displays the response with full metrics when complete

Playground runs use the complete agent loop, the same pipeline as regular chat. This means tool calls, knowledge search, code execution, delegation, and all other capabilities work exactly as they would in a real conversation.
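The steps above can be sketched as a single entry point that creates the tagged conversation and run record, then hands off to the background worker. All names here (`start_playground_run`, the dict fields, `run_agent_loop`) are illustrative, except `channel_plugin="playground"`, which the docs state directly:

```python
import uuid

def start_playground_run(assistant_id: str, message: str, model: str) -> dict:
    """Create a real conversation tagged as a playground test,
    record the run, and dispatch the agent loop asynchronously."""
    conversation = {
        "id": str(uuid.uuid4()),
        "assistant_id": assistant_id,
        "channel_plugin": "playground",  # tags the conversation as a test
    }
    run = {
        "id": str(uuid.uuid4()),
        "conversation_id": conversation["id"],
        "message": message,
        "model": model,
        "status": "pending",  # the UI shows a thinking indicator until done
    }
    # In the real system this step would enqueue a Celery task, e.g.:
    # run_agent_loop.delay(run["id"])
    return run
```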

Results

Each result panel shows:

  • Response text — The assistant’s final response, rendered as markdown
  • Tool calls — Which tools were called during the run, with input summaries
  • Model — The model that actually served the request
  • Token counts — Input and output tokens
  • Cost — Estimated cost for the run
  • Duration — How long the run took

Side-by-Side Comparison

When you select a second model in the Model B dropdown, results appear in two panels labeled A and B. Both runs use the same message and configuration but different models, making it easy to compare response quality, tool usage, speed, and cost.

Run History

The bottom of the page shows your recent test runs (up to 50) for this assistant. Each entry displays the message, model, token counts, cost, and status.

Click a run to expand its full details — response text, tool calls, metrics, and any custom prompt used. Click the delete button to remove a run and its associated conversation.

Technical Details

Real conversations — Each playground run creates a real Conversation record tagged with channel_plugin="playground". This is necessary because the agent loop’s tool context requires a conversation for operations like saving content, searching knowledge, and tracking state. These conversations appear in the normal conversation list.

Async execution — Runs are dispatched as Celery background tasks. The UI polls every 1.5 seconds until the run completes, then displays the results and refreshes the history.
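The client side of this is a simple poll-until-done loop. A minimal sketch, assuming a `fetch_status` callable that returns the run's current state (the status names and timeout are illustrative; the 1.5-second interval is from the docs):

```python
import time

def poll_run(fetch_status, interval: float = 1.5, timeout: float = 120.0) -> dict:
    """Poll the run status endpoint until the run leaves its
    in-progress states, then return the finished run."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        run = fetch_status()
        if run["status"] not in ("pending", "running"):
            return run  # complete or failed: stop polling
        time.sleep(interval)
    raise TimeoutError("playground run did not finish in time")
```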

Metrics aggregation — Token counts and costs are aggregated from all LLM calls made during the run. A single run may involve multiple LLM calls if the agent uses tools (each tool call triggers another LLM iteration), so the displayed metrics reflect the total across all iterations.
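As a sketch of that aggregation, assuming each LLM iteration reports its own token counts and cost (the field names are illustrative):

```python
def aggregate_metrics(llm_calls: list[dict]) -> dict:
    """Sum tokens and cost across every LLM iteration in one run.
    A run that uses tools makes several iterations, each billed separately."""
    return {
        "input_tokens": sum(c["input_tokens"] for c in llm_calls),
        "output_tokens": sum(c["output_tokens"] for c in llm_calls),
        "cost": round(sum(c["cost"] for c in llm_calls), 6),
    }

calls = [
    {"input_tokens": 1200, "output_tokens": 80, "cost": 0.004},   # initial call
    {"input_tokens": 1450, "output_tokens": 300, "cost": 0.007},  # after tool result
]
```

Note that the second iteration's input is larger than the first's: each tool result is appended to the context before the next LLM call, which is why multi-tool runs cost noticeably more.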

Comparison mode — In comparison mode, two independent runs share a comparison_group UUID. Each run has its own conversation and executes independently. The UI polls a group endpoint that returns both panels together.
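A sketch of how two runs might be linked, assuming hypothetical field names apart from `comparison_group`, which the docs name directly:

```python
import uuid

def create_comparison(message: str, model_a: str, model_b: str) -> list[dict]:
    """Two independent runs that share one comparison_group UUID.
    Each run gets its own conversation and executes on its own."""
    group = str(uuid.uuid4())
    return [
        {"panel": panel, "model": model, "message": message,
         "comparison_group": group}
        for panel, model in (("A", model_a), ("B", model_b))
    ]
```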

Per-run overrides — The system prompt, model, and allowed tools are stored on each PlaygroundRun record as overrides. The assistant’s saved configuration is never modified.
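The override behavior can be sketched as a non-destructive merge: per-run values win where set, and everything else falls back to the assistant's saved configuration. The helper and field names below are illustrative, not the real `PlaygroundRun` API:

```python
def effective_config(assistant: dict, run_overrides: dict) -> dict:
    """Per-run overrides take precedence over the assistant's saved
    configuration; unset overrides (None) fall back to the saved value.
    The assistant record itself is never mutated."""
    return {**assistant,
            **{k: v for k, v in run_overrides.items() if v is not None}}

saved = {"model": "default-model", "system_prompt": "You are helpful."}
override = {"model": "experimental-model", "system_prompt": None}
```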