Datasets
Datasets are named collections of structured JSON documents within a project. While the knowledge base handles unstructured content like web pages, documents, and facts, datasets are designed for structured data — product catalogs, tour inventories, pricing tables, event listings, and similar tabular or record-oriented data.
Assistants can search datasets using both semantic search (natural language queries like “tours good for families”) and structured SQL queries (precise filtering like “tours under $2000 departing in March”).
Creating a Dataset
Navigate to a project and click the Datasets tab, then New Dataset. You’ll need to provide:
| Field | Required | Description |
|---|---|---|
| Name | Yes | Human-readable name (e.g. “Tour Catalog”, “Product Inventory”) |
| Description | No | What this dataset contains and how it should be used |
| Key Field | No | JSON field name used as a unique key for upsert uploads (e.g. tour_id, sku) |
A URL-safe slug is automatically generated from the name.
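The exact slug algorithm is not documented, but a typical slugification looks roughly like this sketch:

```python
import re

def slugify(name: str) -> str:
    """Lowercase the name, collapse runs of non-alphanumeric characters into
    hyphens, and trim leading/trailing hyphens. Illustrative only -- the exact
    algorithm TeamWeb AI uses may differ."""
    slug = re.sub(r"[^a-z0-9]+", "-", name.lower())
    return slug.strip("-")

print(slugify("Tour Catalog"))  # tour-catalog
```

The resulting slug is what appears in dataset API paths such as `/datasets/tour-catalog/records`.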
Uploading Data
Data can be uploaded through the admin UI or via the REST API.
Admin UI Upload
From the dataset detail page, use the upload form to submit a file:
- JSON file — A .json file containing an array of objects. Each object becomes one record.
- ZIP file — A .zip archive where each .json file inside is one record object. Non-JSON files and macOS resource fork entries (__MACOSX/) are silently skipped.
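To illustrate the two accepted formats, here is a sketch that builds both a JSON array file and a ZIP archive with one record per file (the file names inside the archive are arbitrary choices for illustration):

```python
import io
import json
import zipfile

records = [
    {"sku": "W-001", "name": "Widget A", "price": 29.99},
    {"sku": "W-002", "name": "Widget B", "price": 49.99},
]

# Option 1: a single .json file containing an array of objects
with open("records.json", "w") as f:
    json.dump(records, f)

# Option 2: a .zip archive where each .json file is one record object
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    for rec in records:
        zf.writestr(f"{rec['sku']}.json", json.dumps(rec))
```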
REST API Upload
Use a project API key to upload records programmatically:
POST /api/v1/projects/<project_id>/datasets/<slug>/records

Headers:

Authorization: Bearer YOUR_PROJECT_API_KEY
Content-Type: application/json

Request body:
{
"records": [
{"sku": "W-001", "name": "Widget A", "price": 29.99},
{"sku": "W-002", "name": "Widget B", "price": 49.99}
],
"mode": "replace"
}

| Field | Required | Description |
|---|---|---|
| records | Yes | Array of JSON objects (max 10,000 per request, 1 MB max per record) |
| mode | No | "replace" (default) drops all existing records first; "upsert" merges by key_field |
| key_field | Conditional | Required when mode is "upsert". Names the field used as the unique key. |
Response (202 Accepted):
{
"dataset": "tour-catalog",
"records_received": 150,
"mode": "replace",
"status": "processing",
"status_url": "/api/v1/projects/1/datasets/tour-catalog/status"
}

Records are stored immediately. Embedding runs asynchronously in the background — poll the status_url to check progress.
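A minimal client for this endpoint, using only the Python standard library (the host URL is a placeholder; the request shape follows the spec above):

```python
import json
import urllib.request

API_BASE = "https://teamweb.example.com"  # placeholder: your TeamWeb AI host
API_KEY = "YOUR_PROJECT_API_KEY"

def build_upload_request(project_id, slug, records, mode="replace", key_field=None):
    """Assemble the upload request described above as a urllib Request object."""
    body = {"records": records, "mode": mode}
    if key_field is not None:
        body["key_field"] = key_field  # required when mode is "upsert"
    return urllib.request.Request(
        f"{API_BASE}/api/v1/projects/{project_id}/datasets/{slug}/records",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_upload_request(1, "tour-catalog", [{"sku": "W-001", "name": "Widget A"}])
# response = urllib.request.urlopen(req)  # expect 202 Accepted
# then poll the returned status_url until "status" is "complete"
```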
Other API Endpoints
| Endpoint | Method | Description |
|---|---|---|
| /api/v1/projects/<id>/datasets | GET | List all datasets in the project |
| /api/v1/projects/<id>/datasets/<slug>/status | GET | Check embedding status |
| /api/v1/projects/<id>/datasets/<slug>/records | GET | Read records (paginated) |
| /api/v1/projects/<id>/datasets/<slug>/records | DELETE | Clear all records |
All API endpoints require a project API key via Authorization: Bearer header.
How Search Works
When records are uploaded, TeamWeb AI processes them in two ways:
- Text representation — Each record is flattened into readable text with dot-notation keys (e.g. name: Widget A, pricing.retail: 29.99) and embedded as a vector for semantic search
- JSONB storage — The raw JSON is stored in a PostgreSQL JSONB column with a GIN index for fast structured queries
This hybrid approach means assistants can use natural language to find conceptually relevant records and precise SQL to filter by exact field values, numeric ranges, or aggregations.
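The flattening step can be sketched as follows (illustrative only; the exact text representation TeamWeb AI produces may differ in formatting):

```python
def flatten(record, prefix=""):
    """Flatten a JSON record into dot-notation 'path: value' lines,
    using [] for arrays, roughly as described above."""
    lines = []
    for key, value in record.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            lines += flatten(value, f"{path}.")
        elif isinstance(value, list):
            for item in value:
                if isinstance(item, dict):
                    lines += flatten(item, f"{path}[].")
                else:
                    lines.append(f"{path}[]: {item}")
        else:
            lines.append(f"{path}: {value}")
    return lines

print(flatten({"name": "Widget A", "pricing": {"retail": 29.99}}))
# ['name: Widget A', 'pricing.retail: 29.99']
```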
Schema Inference
TeamWeb AI automatically infers a schema from uploaded records, including nested fields. Top-level fields are typed as string, number, boolean, object, or array. Nested objects are discovered using dot-notation (e.g. supplier.name) and arrays of objects use bracket notation (e.g. departures[].date). Inference is depth-limited to 3 levels.
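The inference described above can be approximated with a short sketch (the real implementation may differ in details such as type conflicts across records):

```python
def infer_schema(records, max_depth=3):
    """Infer a dot-notation field -> type map from sample records,
    depth-limited to 3 levels as described above."""
    schema = {}

    def visit(value, path, depth):
        if isinstance(value, dict):
            if path:
                schema.setdefault(path, "object")
            if depth < max_depth:
                for key, child in value.items():
                    visit(child, f"{path}.{key}" if path else key, depth + 1)
        elif isinstance(value, list):
            schema.setdefault(path, "array")
            if depth < max_depth:
                for item in value:
                    if isinstance(item, dict):  # arrays of objects use [] notation
                        for key, child in item.items():
                            visit(child, f"{path}[].{key}", depth + 1)
        elif isinstance(value, bool):  # check bool before number: bool is an int subtype
            schema.setdefault(path, "boolean")
        elif isinstance(value, (int, float)):
            schema.setdefault(path, "number")
        else:
            schema.setdefault(path, "string")

    for record in records:
        visit(record, "", 0)
    return schema
```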
The inferred schema is shown on the dataset detail page and included in the assistant’s system prompt.
Field Configuration
After uploading records, click the Fields button on the dataset detail page to open the Field Configuration page. This page displays all discovered fields in a hierarchical tree, showing nested objects and arrays with proper indentation.
Each field has two checkboxes:
| Option | Effect |
|---|---|
| Important | Field is included in search result summaries shown to the assistant |
| Ignore | Field is excluded from AI processing (embedding and chunking) and set to null on future uploads |
A field cannot be both important and ignored. Ignoring a parent field (e.g. an array or object) automatically ignores all of its children.
Important fields control what the assistant sees in search_knowledge results on dataset records. For records with deeply nested or complex structures, you can select specific sub-fields from arrays — for example, marking only departures[].departure_date and departures[].now as important means the assistant sees just those two values from each departure item, not the full 20+ field departure objects.
If no fields are marked important, TeamWeb AI auto-selects top-level scalar fields (skipping objects, arrays, and any ignored fields).
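That fallback selection can be sketched in a few lines (illustrative of the behavior described above, not the actual implementation):

```python
def auto_display_fields(record, ignored=()):
    """When no fields are marked important, keep top-level scalar fields,
    skipping objects, arrays, and ignored fields."""
    return [
        key for key, value in record.items()
        if key not in ignored and not isinstance(value, (dict, list))
    ]
```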
Ignored fields are stripped from the text representation used for embedding, so they don’t affect search relevance. When new records are uploaded (via the admin UI or REST API), ignored field values are set to null before storage. This is useful for excluding large or irrelevant data like internal IDs, tracking codes, or verbose nested structures that would dilute search quality.
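The nulling step applied on upload can be sketched like this (an illustrative implementation; the path syntax mirrors the schema notation, e.g. departures[].tracking):

```python
import copy

def apply_ignores(record, ignored_paths):
    """Return a copy of the record with ignored fields set to None,
    supporting dot paths and 'field[].child' array paths."""
    def null_path(node, parts):
        if not isinstance(node, dict) or not parts:
            return
        head, rest = parts[0], parts[1:]
        if head.endswith("[]") and rest:
            for item in node.get(head[:-2]) or []:  # descend into each array item
                null_path(item, rest)
        elif rest:
            null_path(node.get(head), rest)  # descend into a nested object
        elif head in node:
            node[head] = None  # leaf: null the value before storage

    out = copy.deepcopy(record)
    for path in ignored_paths:
        null_path(out, path.split("."))
    return out
```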
Saving field configuration triggers re-embedding so that the changes take effect immediately for search.
The dataset detail page shows two record views:
- Agent View — what the assistant sees: only the important fields for each record
- Full Data — the complete JSON document
Assistant Access
Datasets are enabled per assistant. On the assistant’s edit page, the Datasets section shows all datasets in the project. Check the ones you want the assistant to be able to search and query.
Once enabled, the assistant can search and query dataset records using these tools:
| Tool | Description |
|---|---|
| search_knowledge | Semantic search that includes dataset records alongside knowledge articles. Returns compact summaries using display fields. Pass the optional dataset parameter to limit results to a specific dataset. |
| query_dataset | Run a SQL SELECT query against a specific dataset for precise filtering, sorting, and aggregation |
| describe_dataset | Get the full schema, field types, sample values, and record count for a dataset. Useful for the assistant to understand what data is available before searching or querying. |
| get_record | Fetch a single complete record by its key value. Use after search_knowledge to drill into full details of a specific result. |
Dataset records are included in search_knowledge results as a unified source type — the same search quality settings (similarity threshold, re-ranking) apply to both knowledge articles and dataset records. See Tool Configuration for details.
The assistant’s system prompt includes a compact overview of each enabled dataset with its display fields. The assistant can call describe_dataset to discover the full schema at runtime.
A typical assistant workflow is: search_knowledge to find candidates, then get_record for full details of a specific result, and query_dataset for precise SQL filtering.
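Concretely, that sequence might look like the following tool calls (the argument and field names here are illustrative assumptions, not the exact tool signatures):

```json
[
  {"tool": "search_knowledge", "arguments": {"query": "affordable family tours", "dataset": "tour-catalog"}},
  {"tool": "get_record", "arguments": {"dataset": "tour-catalog", "key": "TC-104"}},
  {"tool": "query_dataset", "arguments": {
    "dataset": "tour-catalog",
    "sql": "SELECT data->>'name', (data->>'price')::numeric AS price FROM \"tour-catalog\" WHERE (data->>'price')::numeric < 2000 ORDER BY price"
  }}
]
```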
Limits
| Limit | Value |
|---|---|
| Datasets per project | 50 |
| Records per upload | 10,000 |
| Record size | 1 MB per JSON document |
Project API Keys
Project API keys authenticate requests to the dataset REST API. Unlike assistant-scoped API keys (used for Web Trigger), project API keys grant access to all datasets within a project.
Manage project API keys from Admin > API Keys in the sidebar, under the Project Keys tab.
- Keys are 64-character hex tokens generated by TeamWeb AI
- The full key is shown only once at creation — copy and store it securely
- Each key is scoped to a single project and identified by a name and an 8-character prefix
- Keys can be revoked at any time
- Multiple keys can be created per project for key rotation
Technical Details
Embedding pipeline — When records are uploaded (via UI or API), they are stored immediately in the database. A background Celery task then generates text representations, embeds them as vectors, and stores the resulting chunks in the knowledge chunk table with source_type = "dataset_record". The dataset’s embedding_status field tracks progress: pending → processing → complete (or failed).
Concurrent upload protection — The embedding task acquires a per-dataset Redis lock to prevent race conditions when multiple uploads happen in quick succession. If a second upload arrives while embedding is in progress, the new records are stored but the embedding task is re-queued with a short delay.
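The acquire-or-requeue logic can be sketched with an in-memory stand-in for the Redis lock (the real task presumably uses Redis's SET NX EX pattern; the TTL and delay values here are illustrative):

```python
import time

locks = {}  # stand-in for Redis: dataset_id -> lock expiry timestamp

def try_acquire(dataset_id, ttl=300, now=None):
    """Acquire the per-dataset lock if it is free or expired."""
    now = time.monotonic() if now is None else now
    expiry = locks.get(dataset_id)
    if expiry is None or expiry <= now:
        locks[dataset_id] = now + ttl
        return True
    return False

def embed_dataset(dataset_id, requeue):
    """Embed new records unless another embedding run holds the lock."""
    if not try_acquire(dataset_id):
        requeue(dataset_id, countdown=30)  # re-queue with a short delay
        return
    try:
        pass  # generate text representations, embed, store chunks
    finally:
        locks.pop(dataset_id, None)  # release the lock
```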
SQL query sandboxing — The query_dataset tool executes assistant-written SQL in a heavily sandboxed environment: the SQL is parsed and validated using an AST parser (sqlglot), only SELECT statements are allowed, table references are restricted to the dataset alias, function calls are checked against an allowlist of safe functions, the query runs in a read-only transaction with a 5-second timeout, and results are capped at 100 rows.
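As a toy illustration of the kinds of checks involved, consider the guard below. Note that the real validator parses the SQL into an AST with sqlglot; keyword scanning like this is NOT a safe substitute in production, and the real runner additionally enforces the read-only transaction, 5-second timeout, and 100-row cap:

```python
import re

ALLOWED_FUNCTIONS = {"count", "sum", "avg", "min", "max", "lower", "upper"}  # illustrative allowlist

def check_query(sql, dataset_table):
    """Reject queries that are not plain SELECTs over the dataset table,
    or that call functions outside the allowlist. Simplified sketch only."""
    stripped = sql.strip().rstrip(";")
    if not re.match(r"(?is)^select\b", stripped):
        raise ValueError("only SELECT statements are allowed")
    if re.search(r"(?i)\b(insert|update|delete|drop|alter|create|grant|copy)\b", stripped):
        raise ValueError("write/DDL keywords are not allowed")
    if not re.search(rf"(?i)\bfrom\s+\"?{re.escape(dataset_table)}\"?", stripped):
        raise ValueError("queries may only reference the dataset table")
    for fn in re.findall(r"(?i)\b([a-z_]+)\(", stripped):
        if fn.lower() not in ALLOWED_FUNCTIONS:
            raise ValueError(f"function {fn!r} is not on the allowlist")
```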
JSONB query syntax — Assistants write SQL using PostgreSQL JSONB operators. The system prompt includes examples: data->>'field_name' for text access, (data->>'price')::numeric for numeric comparisons. The dataset’s records are presented as a virtual table named after the dataset.
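For instance, against a dataset whose slug is tour-catalog, assistant-written queries might look like the following (the field names name, price, and region are assumptions for illustration):

```sql
-- Tours under $2000, cheapest first
SELECT data->>'name' AS name,
       (data->>'price')::numeric AS price
FROM "tour-catalog"
WHERE (data->>'price')::numeric < 2000
ORDER BY price
LIMIT 20;

-- Average price by region
SELECT data->>'region' AS region,
       avg((data->>'price')::numeric) AS avg_price
FROM "tour-catalog"
GROUP BY 1;
```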