Testing Workflows

The workflow test runner lets you test a workflow end-to-end from the visual editor, with full visibility into what the engine is doing at every step.

Launching a Test

From the workflow editor, click the Test button in the header toolbar (next to Save and Validate). The editor will:

  1. Save the current workflow
  2. Validate the graph structure – if there are errors, testing is blocked until they are fixed
  3. Create a test conversation linked to the first assistant attached to this workflow
  4. Redirect to the test page

If the workflow is linked to multiple assistants, select which assistant to test with before clicking Test.

Test conversations are flagged internally and do not appear in normal conversation lists, the dashboard, or the pending inbox. They are only visible from the workflow editor’s Test Runs tab.

The Test Page

The test page is a two-panel layout:

  • Left: Chat – A live conversation with the assistant, identical to a normal chat. Send messages and see responses in real time.
  • Right: Debug Trace – Three tabs showing execution details as they happen.

Timeline Tab

A vertical state progression showing where the workflow has been and where it is now:

  • Completed states (green dot) – shows duration, nested tool calls with success/failure indicators, guard evaluations with pass/fail and reason text, and which transition fired
  • Active state (pulsing dot) – the state currently being processed
  • Pending states (hollow dots) – states defined in the workflow that haven’t been reached yet

Tool calls are shown inline with their arguments and result. Guard evaluations show the guard type, whether it passed or failed, and the reason – useful for understanding why a transition did or didn’t fire.

Data Tab

A live key/value table of the workflow’s gathered data. Each entry shows:

  • Key – the data field name
  • Value – the current value
  • Source – where the data came from (user_stated, inferred, tool_result, deterministic_operation)

New entries flash briefly when recorded so you can see data flowing in real time.
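
For example, after the vacation scenario used later in this guide has collected both dates, the table might show entries roughly like the following (the keys, values, and sources here are purely illustrative):

Key           Value          Source
start_date    "July 1st"     user_stated
end_date      "July 5th"     user_stated
request_id    "VAC-1042"     tool_result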

Events Tab

A raw chronological log of every event the workflow engine emits. Filterable by event type:

  • State changes – Transitions between states
  • Data recorded – New gathered data entries
  • Guard evaluations – Each guard checked, pass/fail with reason
  • Tool calls – Workflow tool executions with arguments and results
  • Deterministic steps – Operations in logic states (set_data, http_request, etc.)
  • Transitions – Summary of transition evaluation (how many candidates, which won)

Click any event row to expand its full JSON payload. This is the “firehose” view for debugging edge cases where the timeline doesn’t show enough detail.
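
The exact fields vary by event type; a guard evaluation entry, for example, might expand to something roughly like this (the field names below are illustrative, not a fixed schema):

{
  "type": "guard_evaluation",
  "state": "Collect dates",
  "guard": "data_exists",
  "passed": false,
  "reason": "start_date has not been recorded yet"
}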

Test Runs Tab

Back in the workflow editor, click the Test Runs tab (next to Canvas) to see a list of past test runs for this workflow:

  • Status – Active, Completed, or Abandoned
  • Started – When the test was created
  • State – The last workflow state reached
  • View – Click to open the test page for that run

Click New Test to start another test run. Click Clean Up to delete all test conversations and their data.

Viewing Past Runs

Clicking a completed test run opens the same test page, but populated from the stored event log rather than live SSE. The timeline, data, and events tabs all show the historical execution trace. This is useful for reviewing what happened after the fact or comparing runs.

Event Persistence

Every event emitted during a workflow run is stored in the run’s event_log – an append-only list on the WorkflowRun model. This data persists permanently and powers both the live debug view (via SSE) and historical run inspection.

Guard evaluations are also stored directly in the workflow history entries, making each state transition self-documenting. You can see not just that a transition happened, but which guards were checked, which passed, which failed, and why.
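
If you want to poke at a stored run outside the UI (for example from flask shell), the events can be read straight off the model. A minimal sketch, assuming WorkflowRun is a Flask-SQLAlchemy model exposing event_log as a list of dicts (adjust the import path and field names to the actual code):

# Hypothetical sketch for `flask shell`; the model path and event fields are assumptions.
from app.models import WorkflowRun

run = WorkflowRun.query.get(42)   # ID of the run to inspect
for event in run.event_log:       # append-only list of event dicts
    print(event.get("type"), event.get("state"))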

CLI Test Runner

For repeatable, automated testing, use the command-line test runner. Define test scenarios as YAML files with a sequence of user messages and optional assertions, then run them against the real agent and LLM stack.

Scenario format

Create YAML files in the workflow_tests/ directory at the project root:

name: "Vacation - happy path"
description: "User provides both dates upfront"
workflow: "Vacation"
assistant: "Mandy"

steps:
  - message: "I'd like vacation from July 1st to July 5th"
    description: "Provide both dates"
    assert:
      state: "Output 1"
      data:
        start_date: { contains: "July 1" }
        end_date: { contains: "July 5" }
      data_keys:
        - start_date
        - end_date
      response_contains: "vacation"
      response_not_contains: "error"

Identify the workflow and assistant by name or ID (workflow_id: 2, assistant_id: 1). If only one assistant is linked to the workflow, it is selected automatically.
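
For example, a scenario header pinned to IDs instead of names might look like:

name: "Vacation - happy path"
workflow_id: 2
assistant_id: 1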

Assertions

All assertions are optional. Running without assertions shows the full execution trace (states, data, tools, responses), which is useful for exploring how a workflow behaves before writing assertions.

  • state – Current workflow state matches exactly
  • state_not – Current workflow state must not be this value
  • state_visited – List of states that must appear in the run history
  • workflow_status – Run status: active, completed, or abandoned
  • data – Key/value checks on gathered data (see matchers below)
  • data_keys – List of keys that must exist in gathered data
  • tools_called – Tool names that must appear in the step’s events
  • tools_not_called – Tool names that must not appear
  • response_contains – Substring that must appear in the assistant’s response (case-insensitive)
  • response_not_contains – Substring that must not appear
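
For example, a step that exercises several of these together might look like the following (the state name comes from the earlier vacation scenario; the tool name is a placeholder):

steps:
  - message: "I'd like vacation from July 1st to July 5th"
    assert:
      state: "Output 1"
      state_visited:
        - "Output 1"
      workflow_status: active
      data_keys:
        - start_date
        - end_date
      tools_not_called:
        - send_rejection_email   # placeholder tool name
      response_not_contains: "error"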

Data matchers

Data values in assertions support flexible comparison:

data:
  team_size: 50                         # exact match
  budget: { greater_than: 0 }           # numeric
  name: { contains: "Smith" }           # substring
  email: { matches: ".+@.+\\..+" }      # regex
  status: { not_contains: "error" }     # absence
  manager: { exists: true }             # key exists with any value

Seed data

Pre-populate gathered data before the first message:

initial_data:
  source_channel: "website"
  priority: "high"

Running scenarios

# Run a scenario
docker exec teamwebai-web-1 uv run flask workflow-test run workflow_tests/my_scenario.yaml

# Verbose output (system prompts, full events)
docker exec teamwebai-web-1 uv run flask workflow-test run workflow_tests/my_scenario.yaml -v

# Keep the test conversation for inspection
docker exec teamwebai-web-1 uv run flask workflow-test run workflow_tests/my_scenario.yaml --keep

# Stop at first failure
docker exec teamwebai-web-1 uv run flask workflow-test run workflow_tests/my_scenario.yaml --stop-on-failure

# Override workflow or assistant
docker exec teamwebai-web-1 uv run flask workflow-test run workflow_tests/my_scenario.yaml --workflow-id 5

The runner creates an isolated test conversation, sends each message through the agent, and checks assertions after each step. Output shows state transitions, gathered data, tool calls, the assistant’s response, and pass/fail for each assertion.

Exit codes: 0 = all passed, 1 = assertion failures, 2 = error.
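
Because the exit code is meaningful, a simple shell loop is enough to use scenarios as a regression suite in CI. A sketch, not a built-in command (adjust the container name and paths to your setup):

# Run every scenario; `set -e` aborts on the first run that exits non-zero.
set -e
for scenario in workflow_tests/*.yaml; do
  docker exec teamwebai-web-1 uv run flask workflow-test run "$scenario" --stop-on-failure
done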

Other commands

# Validate a scenario without running it
docker exec teamwebai-web-1 uv run flask workflow-test validate workflow_tests/my_scenario.yaml

# Export an existing conversation as a scenario
docker exec teamwebai-web-1 uv run flask workflow-test export 43 --output workflow_tests/exported.yaml

# Export with skeleton assertions from observed state/data
docker exec teamwebai-web-1 uv run flask workflow-test export 43 --with-assertions

# List available scenarios
docker exec teamwebai-web-1 uv run flask workflow-test list

The export command is useful for capturing a conversation you tested manually and turning it into a repeatable scenario.

Tips

  • Test early and often – save and test after adding each state rather than building the entire workflow first
  • Watch the guards – the Timeline tab shows guard pass/fail with reasons, which is the fastest way to understand why a transition isn’t firing
  • Check the Data tab – if a guard expects a data key that hasn’t been recorded yet, you’ll see it immediately
  • Use the Events tab for edge cases – when the timeline doesn’t show enough detail (e.g., operation-level debugging in logic states), filter the events tab to deterministic_step
  • Clean up regularly – test runs accumulate. Use the Clean Up button on the Test Runs tab to delete old test conversations
  • Use CLI scenarios for regression testing – once a workflow works correctly, export the conversation or write a scenario so you can re-verify after code changes
  • Start without assertions – run a scenario with no assert: blocks first to see what the workflow actually does, then add assertions based on the observed behavior