Datasets
Datasets are named collections of structured JSON documents within a project. While the knowledge base handles unstructured content like web pages, documents, and facts, datasets are designed for structured data — product catalogs, tour inventories, pricing tables, event listings, and similar tabular or record-oriented data.
Assistants can search datasets using both semantic search (natural language queries like “tours good for families”) and structured SQL queries (precise filtering like “tours under $2000 departing in March”).
Creating a Dataset
Navigate to a project and click the Datasets tab, then New Dataset. You’ll need to provide:
| Field | Required | Description |
|---|---|---|
| Name | Yes | Human-readable name (e.g. “Tour Catalog”, “Product Inventory”) |
| Description | No | What this dataset contains and how it should be used |
| Key Field | No | JSON field name used as a unique key for upsert uploads (e.g. tour_id, sku) |
A URL-safe slug is automatically generated from the name.
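The exact slug algorithm is not documented, but a typical slugification looks roughly like this sketch:

```python
import re

def slugify(name: str) -> str:
    """Lowercase the name, collapse runs of non-alphanumeric characters into
    hyphens, and trim leading/trailing hyphens. Illustrative only -- the exact
    algorithm TeamWeb AI uses may differ."""
    slug = re.sub(r"[^a-z0-9]+", "-", name.lower())
    return slug.strip("-")

print(slugify("Tour Catalog"))  # tour-catalog
```

The resulting slug is what appears in dataset API paths such as `/datasets/tour-catalog/records`.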
Uploading Data
Data can be uploaded through the admin UI or via the REST API.
Admin UI Upload
From the dataset detail page, use the upload form to submit a file:
- JSON file — A .json file containing an array of objects. Each object becomes one record.
- ZIP file — A .zip archive where each .json file inside is one record object. Non-JSON files and macOS resource fork entries (__MACOSX/) are silently skipped.
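To illustrate the two accepted formats, here is a sketch that builds both a JSON array file and a ZIP archive with one record per file (the file names inside the archive are arbitrary choices for illustration):

```python
import io
import json
import zipfile

records = [
    {"sku": "W-001", "name": "Widget A", "price": 29.99},
    {"sku": "W-002", "name": "Widget B", "price": 49.99},
]

# Option 1: a single .json file containing an array of objects
with open("records.json", "w") as f:
    json.dump(records, f)

# Option 2: a .zip archive where each .json file is one record object
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    for rec in records:
        zf.writestr(f"{rec['sku']}.json", json.dumps(rec))
```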
REST API Upload
Use a project API key to upload records programmatically:
POST /api/v1/projects/<project_id>/datasets/<slug>/records

Headers:

Authorization: Bearer YOUR_PROJECT_API_KEY
Content-Type: application/json

Request body:
{
"records": [
{"sku": "W-001", "name": "Widget A", "price": 29.99},
{"sku": "W-002", "name": "Widget B", "price": 49.99}
],
"mode": "replace"
}

| Field | Required | Description |
|---|---|---|
| records | Yes | Array of JSON objects (max 10,000 per request, 1 MB max per record) |
| mode | No | "replace" (default) drops all existing records first; "upsert" merges by key_field |
| key_field | Conditional | Required when mode is "upsert". Names the field used as the unique key. |
Response (202 Accepted):
{
"dataset": "tour-catalog",
"records_received": 150,
"mode": "replace",
"status": "processing",
"status_url": "/api/v1/projects/1/datasets/tour-catalog/status"
}

Records are stored immediately. Embedding runs asynchronously in the background — poll the status_url to check progress.
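A minimal client for this endpoint, using only the Python standard library (the host URL is a placeholder; the request shape follows the spec above):

```python
import json
import urllib.request

API_BASE = "https://teamweb.example.com"  # placeholder: your TeamWeb AI host
API_KEY = "YOUR_PROJECT_API_KEY"

def build_upload_request(project_id, slug, records, mode="replace", key_field=None):
    """Assemble the upload request described above as a urllib Request object."""
    body = {"records": records, "mode": mode}
    if key_field is not None:
        body["key_field"] = key_field  # required when mode is "upsert"
    return urllib.request.Request(
        f"{API_BASE}/api/v1/projects/{project_id}/datasets/{slug}/records",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_upload_request(1, "tour-catalog", [{"sku": "W-001", "name": "Widget A"}])
# response = urllib.request.urlopen(req)  # expect 202 Accepted
# then poll the returned status_url until "status" is "complete"
```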
Other API Endpoints
| Endpoint | Method | Description |
|---|---|---|
| /api/v1/projects/<id>/datasets | GET | List all datasets in the project |
| /api/v1/projects/<id>/datasets/<slug>/status | GET | Check embedding status |
| /api/v1/projects/<id>/datasets/<slug>/records | GET | Read records (paginated) |
| /api/v1/projects/<id>/datasets/<slug>/records | DELETE | Clear all records |
All API endpoints require a project API key via Authorization: Bearer header.
How Search Works
When records are uploaded, TeamWeb AI processes them in two ways:
- Text representation — Each record is flattened into readable text with dot-notation keys (e.g. name: Widget A, pricing.retail: 29.99) and embedded as a vector for semantic search
- JSONB storage — The raw JSON is stored in a PostgreSQL JSONB column with a GIN index for fast structured queries
This hybrid approach means assistants can use natural language to find conceptually relevant records and precise SQL to filter by exact field values, numeric ranges, or aggregations.
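The flattening step can be sketched as follows (illustrative only; the exact text representation TeamWeb AI produces may differ in formatting):

```python
def flatten(record, prefix=""):
    """Flatten a JSON record into dot-notation 'path: value' lines,
    using [] for arrays, roughly as described above."""
    lines = []
    for key, value in record.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            lines += flatten(value, f"{path}.")
        elif isinstance(value, list):
            for item in value:
                if isinstance(item, dict):
                    lines += flatten(item, f"{path}[].")
                else:
                    lines.append(f"{path}[]: {item}")
        else:
            lines.append(f"{path}: {value}")
    return lines

print(flatten({"name": "Widget A", "pricing": {"retail": 29.99}}))
# ['name: Widget A', 'pricing.retail: 29.99']
```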
Schema Inference
TeamWeb AI automatically infers a schema from uploaded records, including nested fields. Top-level fields are typed as string, number, boolean, object, or array. Nested objects are discovered using dot-notation (e.g. supplier.name) and arrays of objects use bracket notation (e.g. departures[].date). Inference is depth-limited to 3 levels.
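The inference described above can be approximated with a short sketch (the real implementation may differ in details such as type conflicts across records):

```python
def infer_schema(records, max_depth=3):
    """Infer a dot-notation field -> type map from sample records,
    depth-limited to 3 levels as described above."""
    schema = {}

    def visit(value, path, depth):
        if isinstance(value, dict):
            if path:
                schema.setdefault(path, "object")
            if depth < max_depth:
                for key, child in value.items():
                    visit(child, f"{path}.{key}" if path else key, depth + 1)
        elif isinstance(value, list):
            schema.setdefault(path, "array")
            if depth < max_depth:
                for item in value:
                    if isinstance(item, dict):  # arrays of objects use [] notation
                        for key, child in item.items():
                            visit(child, f"{path}[].{key}", depth + 1)
        elif isinstance(value, bool):  # check bool before number: bool is an int subtype
            schema.setdefault(path, "boolean")
        elif isinstance(value, (int, float)):
            schema.setdefault(path, "number")
        else:
            schema.setdefault(path, "string")

    for record in records:
        visit(record, "", 0)
    return schema
```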
The inferred schema is shown on the dataset detail page and included in the assistant’s system prompt.
Field Configuration
After uploading records, click the Fields button on the dataset detail page to open the Field Configuration page. This page displays all discovered fields in a hierarchical tree, showing nested objects and arrays with proper indentation.
Each field has two checkboxes:
| Option | Effect |
|---|---|
| Important | Field is included in search result summaries shown to the assistant |
| Ignore | Field is excluded from AI processing (embedding and chunking) and set to null on future uploads |
A field cannot be both important and ignored. Ignoring a parent field (e.g. an array or object) automatically ignores all of its children.
Important fields control what the assistant sees in search_knowledge results on dataset records. For records with deeply nested or complex structures, you can select specific sub-fields from arrays — for example, marking only departures[].departure_date and departures[].now as important means the assistant sees just those two values from each departure item, not the full 20+ field departure objects.
If no fields are marked important, TeamWeb AI auto-selects top-level scalar fields (skipping objects, arrays, and any ignored fields).
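That fallback selection can be sketched in a few lines (illustrative of the behavior described above, not the actual implementation):

```python
def auto_display_fields(record, ignored=()):
    """When no fields are marked important, keep top-level scalar fields,
    skipping objects, arrays, and ignored fields."""
    return [
        key for key, value in record.items()
        if key not in ignored and not isinstance(value, (dict, list))
    ]
```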
Ignored fields are stripped from the text representation used for embedding, so they don’t affect search relevance. When new records are uploaded (via the admin UI or REST API), ignored field values are set to null before storage. This is useful for excluding large or irrelevant data like internal IDs, tracking codes, or verbose nested structures that would dilute search quality.
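The nulling step applied on upload can be sketched like this (an illustrative implementation; the path syntax mirrors the schema notation, e.g. departures[].tracking):

```python
import copy

def apply_ignores(record, ignored_paths):
    """Return a copy of the record with ignored fields set to None,
    supporting dot paths and 'field[].child' array paths."""
    def null_path(node, parts):
        if not isinstance(node, dict) or not parts:
            return
        head, rest = parts[0], parts[1:]
        if head.endswith("[]") and rest:
            for item in node.get(head[:-2]) or []:  # descend into each array item
                null_path(item, rest)
        elif rest:
            null_path(node.get(head), rest)  # descend into a nested object
        elif head in node:
            node[head] = None  # leaf: null the value before storage

    out = copy.deepcopy(record)
    for path in ignored_paths:
        null_path(out, path.split("."))
    return out
```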
Saving field configuration triggers re-embedding so that the changes take effect immediately for search.
The dataset detail page shows two record views:
- Agent View — what the assistant sees: only the important fields for each record
- Full Data — the complete JSON document
Assistant Access
Datasets are enabled per assistant. On the assistant’s edit page, the Datasets section shows all datasets in the project. Check the ones you want the assistant to be able to search and query.
Once enabled, the assistant can search and query dataset records using these tools:
| Tool | Description |
|---|---|
| search_knowledge | Semantic search that includes dataset records alongside knowledge articles. Returns compact summaries using display fields. Pass the optional dataset parameter to limit results to a specific dataset. |
| query_dataset | Run a SQL SELECT query against a specific dataset for precise filtering, sorting, and aggregation |
| describe_dataset | Get the full schema, field types, sample values, and record count for a dataset. Useful for the assistant to understand what data is available before searching or querying. |
| get_record | Fetch a single complete record by its key value. Use after search_knowledge to drill into full details of a specific result. |
Dataset records are included in search_knowledge results as a unified source type — the same search quality settings (similarity threshold, re-ranking) apply to both knowledge articles and dataset records. See Tool Configuration for details.
The assistant’s system prompt includes a compact overview of each enabled dataset with its display fields. The assistant can call describe_dataset to discover the full schema at runtime.
A typical assistant workflow is: search_knowledge to find candidates, then get_record for full details of a specific result, and query_dataset for precise SQL filtering.
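Concretely, that sequence might look like the following tool calls (the argument and field names here are illustrative assumptions, not the exact tool signatures):

```json
[
  {"tool": "search_knowledge", "arguments": {"query": "affordable family tours", "dataset": "tour-catalog"}},
  {"tool": "get_record", "arguments": {"dataset": "tour-catalog", "key": "TC-104"}},
  {"tool": "query_dataset", "arguments": {
    "dataset": "tour-catalog",
    "sql": "SELECT data->>'name', (data->>'price')::numeric AS price FROM \"tour-catalog\" WHERE (data->>'price')::numeric < 2000 ORDER BY price"
  }}
]
```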
Limits
| Limit | Value |
|---|---|
| Datasets per project | 50 |
| Records per upload | 10,000 |
| Record size | 1 MB per JSON document |
Project API Keys
Project API keys authenticate requests to the dataset REST API. Unlike assistant-scoped API keys (used for Web Trigger), project API keys grant access to all datasets within a project.
Manage project API keys from Admin > API Keys in the sidebar, under the Project Keys tab.
- Keys are 64-character hex tokens generated by TeamWeb AI
- The full key is shown only once at creation — copy and store it securely
- Each key is scoped to a single project and identified by a name and an 8-character prefix
- Keys can be revoked at any time
- Multiple keys can be created per project for key rotation
Technical Details
Embedding pipeline — When records are uploaded (via UI or API), they are stored immediately in the database. A background Celery task then generates text representations, embeds them as vectors, and stores the resulting chunks in the knowledge chunk table with source_type = "dataset_record". The dataset’s embedding_status field tracks progress: pending → processing → complete (or failed).
Concurrent upload protection — The embedding task acquires a per-dataset Redis lock to prevent race conditions when multiple uploads happen in quick succession. If a second upload arrives while embedding is in progress, the new records are stored but the embedding task is re-queued with a short delay.
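The acquire-or-requeue logic can be sketched with an in-memory stand-in for the Redis lock (the real task presumably uses Redis's SET NX EX pattern; the TTL and delay values here are illustrative):

```python
import time

locks = {}  # stand-in for Redis: dataset_id -> lock expiry timestamp

def try_acquire(dataset_id, ttl=300, now=None):
    """Acquire the per-dataset lock if it is free or expired."""
    now = time.monotonic() if now is None else now
    expiry = locks.get(dataset_id)
    if expiry is None or expiry <= now:
        locks[dataset_id] = now + ttl
        return True
    return False

def embed_dataset(dataset_id, requeue):
    """Embed new records unless another embedding run holds the lock."""
    if not try_acquire(dataset_id):
        requeue(dataset_id, countdown=30)  # re-queue with a short delay
        return
    try:
        pass  # generate text representations, embed, store chunks
    finally:
        locks.pop(dataset_id, None)  # release the lock
```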
SQL query sandboxing — The query_dataset tool executes assistant-written SQL in a heavily sandboxed environment: the SQL is parsed and validated using an AST parser (sqlglot), only SELECT statements are allowed, table references are restricted to the dataset alias, function calls are checked against an allowlist of safe functions, the query runs in a read-only transaction with a 5-second timeout, and results are capped at 100 rows.
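As a toy illustration of the kinds of checks involved, consider the guard below. Note that the real validator parses the SQL into an AST with sqlglot; keyword scanning like this is NOT a safe substitute in production, and the real runner additionally enforces the read-only transaction, 5-second timeout, and 100-row cap:

```python
import re

ALLOWED_FUNCTIONS = {"count", "sum", "avg", "min", "max", "lower", "upper"}  # illustrative allowlist

def check_query(sql, dataset_table):
    """Reject queries that are not plain SELECTs over the dataset table,
    or that call functions outside the allowlist. Simplified sketch only."""
    stripped = sql.strip().rstrip(";")
    if not re.match(r"(?is)^select\b", stripped):
        raise ValueError("only SELECT statements are allowed")
    if re.search(r"(?i)\b(insert|update|delete|drop|alter|create|grant|copy)\b", stripped):
        raise ValueError("write/DDL keywords are not allowed")
    if not re.search(rf"(?i)\bfrom\s+\"?{re.escape(dataset_table)}\"?", stripped):
        raise ValueError("queries may only reference the dataset table")
    for fn in re.findall(r"(?i)\b([a-z_]+)\(", stripped):
        if fn.lower() not in ALLOWED_FUNCTIONS:
            raise ValueError(f"function {fn!r} is not on the allowlist")
```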
JSONB query syntax — Assistants write SQL using PostgreSQL JSONB operators. The system prompt includes examples: data->>'field_name' for text access, (data->>'price')::numeric for numeric comparisons. The dataset’s records are presented as a virtual table named after the dataset.
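For instance, against a dataset whose slug is tour-catalog, assistant-written queries might look like the following (the field names name, price, and region are assumptions for illustration):

```sql
-- Tours under $2000, cheapest first
SELECT data->>'name' AS name,
       (data->>'price')::numeric AS price
FROM "tour-catalog"
WHERE (data->>'price')::numeric < 2000
ORDER BY price
LIMIT 20;

-- Average price by region
SELECT data->>'region' AS region,
       avg((data->>'price')::numeric) AS avg_price
FROM "tour-catalog"
GROUP BY 1;
```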