# Evaluations in TypeScript SDK
Manage evaluators, test suites, and batch evaluations programmatically with the TypeScript SDK
The TypeScript SDK provides an EvaluationClient that talks to the Evaluations API so you can manage evaluators, test suites, run configurations, trigger runs with scoring, and read results—all from code.
For full endpoint details and request/response shapes, see the Evaluations API reference.
## Setup: create a client

Instantiate `EvaluationClient` with the following options:
| Parameter | Type | Required | Description |
|---|---|---|---|
| `tenantId` | string | Yes | Your tenant (organization) ID |
| `projectId` | string | Yes | Your project ID |
| `apiUrl` | string | Yes | API base URL (e.g. `https://api.inkeep.com` or your self-hosted URL) |
| `apiKey` | string | No | Bearer token for authenticated requests |
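With those parameters, client construction looks roughly like this. The import path is an assumption; check your installed SDK package for the actual export.

```typescript
// NOTE: placeholder import path; use your SDK's actual package name.
import { EvaluationClient } from "@inkeep/agents-sdk";

const client = new EvaluationClient({
  tenantId: "your-tenant-id",
  projectId: "your-project-id",
  apiUrl: "https://api.inkeep.com",
  apiKey: process.env.INKEEP_API_KEY, // optional; include for authenticated requests
});
```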
## End-to-end example
The walkthrough below creates an evaluator, a test suite with items, a run configuration, triggers an evaluation, and checks the results. Each step builds on the previous one.
### Step 1 — Create an evaluator
An evaluator defines how to score agent output. You provide a prompt, a JSON schema for the structured result, and a model to run the evaluation.
The schema should include a numeric `score`, a boolean `passed`, and a `reasoning` string so you get both a quantitative metric and a human-readable explanation for every evaluation.
| Field | Type | Required | Description |
|---|---|---|---|
| `name` | string | Yes | Display name |
| `description` | string | No | What this evaluator checks |
| `prompt` | string | Yes | Instructions for the evaluation model |
| `schema` | object (JSON Schema) | Yes | Structure of the evaluation output — typically includes `score`, `passed`, and `reasoning` |
| `model` | object | Yes | `{ model: string, providerOptions?: object }` |
| `passCriteria` | object | No | `{ operator: "and" \| "or", conditions: [{ field, operator, value }] }`. Operators: `>`, `<`, `>=`, `<=`, `=`, `!=` |
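Putting the fields together, a sketch using the `client` from the setup section; the model identifier and the pass threshold here are illustrative assumptions.

```typescript
const evaluator = await client.createEvaluator({
  name: "Helpfulness",
  description: "Checks whether the answer actually addresses the user's question",
  prompt:
    "Rate the assistant's final answer for helpfulness on a scale of 1-5. " +
    "Return a numeric score, whether it passes, and brief reasoning.",
  schema: {
    type: "object",
    properties: {
      score: { type: "number" },
      passed: { type: "boolean" },
      reasoning: { type: "string" },
    },
    required: ["score", "passed", "reasoning"],
  },
  model: { model: "anthropic/claude-sonnet-4" }, // placeholder model ID
  passCriteria: {
    operator: "and",
    conditions: [{ field: "score", operator: ">=", value: 4 }],
  },
});
```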
### Step 2 — Create a test suite and add items

A test suite is a named collection of items. Each item has `input` (the messages sent to the agent) and an optional `expectedOutput` (reference answer).
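For example, assuming the `client` from the setup section (the message shape under `input` should match whatever your agents accept):

```typescript
const suite = await client.createDataset({ name: "Billing questions" });

await client.createDatasetItem(suite.id, {
  input: {
    messages: [{ role: "user", content: "How do I update my credit card?" }],
  },
  expectedOutput: "Points the user to Settings > Billing to update the saved card.",
});
```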
### Step 3 — Create a run configuration
A run configuration ties the test suite to one or more agents and optional default evaluators. It is a saved "recipe" you can trigger repeatedly.
| Field | Type | Required | Description |
|---|---|---|---|
| `name` | string | Yes | Display name for the config |
| `description` | string | No | Optional description |
| `datasetId` | string | Yes | The test suite to run |
| `agentIds` | string[] | No | Agents that process each item (at least one is required before triggering a run) |
| `evaluatorIds` | string[] | No | Default evaluators attached when you trigger a run |
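A sketch tying the suite from Step 2 to an agent, with the Step 1 evaluator as the default (the agent ID is a placeholder):

```typescript
const runConfig = await client.createDatasetRunConfig({
  name: "Billing suite vs. support agent",
  datasetId: suite.id,             // test suite from Step 2
  agentIds: ["support-agent-id"],  // placeholder agent ID
  evaluatorIds: [evaluator.id],    // default evaluator from Step 1
});
```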
### Step 4 — Trigger a run
Starting a run queues every item × agent combination. When evaluators are included, a batch evaluation job is automatically created for the resulting conversations.
The `evaluatorIds` option on trigger is optional — omit it to use the run configuration's default evaluators, or pass different IDs to override them.
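Continuing with the run configuration from Step 3 (the override evaluator ID is a placeholder):

```typescript
// Use the configuration's default evaluators...
const run = await client.triggerDatasetRun(runConfig.id);

// ...or override them for this run only.
const strictRun = await client.triggerDatasetRun(runConfig.id, {
  evaluatorIds: ["strictness-evaluator-id"], // placeholder evaluator ID
});
```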
### Step 5 — Check run status

Poll `getDatasetRun` or check the Test Suites page in the Visual Builder to watch items transition from `pending` to `completed` with evaluation scores.
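A minimal polling sketch; the non-terminal status values checked here are assumptions beyond the `pending` and `completed` states mentioned above.

```typescript
let current = await client.getDatasetRun(run.datasetRunId);
while (current.status === "pending" || current.status === "running") {
  await new Promise((resolve) => setTimeout(resolve, 5_000)); // wait 5s between polls
  current = await client.getDatasetRun(run.datasetRunId);
}
console.log(`Run ${run.datasetRunId} finished with status: ${current.status}`);
```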
### Step 6 (optional) — Evaluate a previous run after the fact
If you ran a test suite without evaluators and want to score those conversations later, trigger a batch evaluation scoped to the run:
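For example, to score the run from Step 4 with the Step 1 evaluator:

```typescript
await client.triggerBatchEvaluation({
  name: "Retroactive scoring for billing run",
  evaluatorIds: [evaluator.id],      // evaluator from Step 1
  datasetRunIds: [run.datasetRunId], // scope scoring to this run's conversations
});
```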
## Method reference

### Evaluators
| Method | Purpose |
|---|---|
| `createEvaluator(data)` | Create an evaluator (prompt, schema, model, optional pass criteria) |
| `listEvaluators()` | List all evaluators in the project |
| `getEvaluator(evaluatorId)` | Fetch one evaluator |
| `updateEvaluator(evaluatorId, partial)` | Update evaluator fields |
| `deleteEvaluator(evaluatorId)` | Delete an evaluator |
### Test suites and items
| Method | Purpose |
|---|---|
| `listDatasets()` | List test suites for the project |
| `getDataset(testSuiteId)` | Fetch one test suite |
| `createDataset({ name })` | Create a test suite |
| `updateDataset(testSuiteId, partial)` | Update name, etc. |
| `deleteDataset(testSuiteId)` | Delete a test suite and its items |
| `listDatasetItems(testSuiteId)` | List items |
| `getDatasetItem(testSuiteId, itemId)` | Fetch one item |
| `createDatasetItem(testSuiteId, itemData)` | Create an item (`input` required; `expectedOutput` optional) |
| `createDatasetItems(testSuiteId, items[])` | Bulk create items |
| `updateDatasetItem(testSuiteId, itemId, partial)` | Update an item |
| `deleteDatasetItem(testSuiteId, itemId)` | Delete an item |
### Run configurations and runs
| Method | Purpose |
|---|---|
| `createDatasetRunConfig({ name, datasetId, agentIds?, evaluatorIds? })` | Create a run configuration (which agents run the suite; optional default evaluators) |
| `triggerDatasetRun(runConfigId, { evaluatorIds?, branchName? }?)` | Start a run; returns `datasetRunId`, `status`, `totalItems` |
| `listDatasetRuns(testSuiteId)` | List runs for a test suite |
| `getDatasetRun(runId)` | Fetch a run with items and conversations |
### Batch evaluation
| Method | Purpose |
|---|---|
| `triggerBatchEvaluation({ evaluatorIds, name?, conversationIds?, dateRange?, datasetRunIds? })` | One-off batch evaluation over conversations |
| Option | Type | Required | Description |
|---|---|---|---|
| `evaluatorIds` | string[] | Yes | IDs of evaluators to run |
| `name` | string | No | Name for the job |
| `conversationIds` | string[] | No | Limit to these conversations |
| `dateRange` | `{ startDate, endDate }` (YYYY-MM-DD) | No | Limit to conversations in this date range |
| `datasetRunIds` | string[] | No | Limit to conversations from these test suite runs |
### Evaluation suite configs (continuous tests)

A suite config groups evaluators with an optional sample rate and agent filters for continuous tests, which automatically evaluate a fraction of live conversations.
| Method | Purpose |
|---|---|
| `createEvaluationSuiteConfig({ evaluatorIds, sampleRate?, filters? })` | Create a suite config |
| `addEvaluatorToSuiteConfig(configId, evaluatorId)` | Add an evaluator |
| `removeEvaluatorFromSuiteConfig(configId, evaluatorId)` | Remove an evaluator |
| `listEvaluationSuiteConfigEvaluators(configId)` | List evaluators on a config |
| Option | Type | Required | Description |
|---|---|---|---|
| `evaluatorIds` | string[] | Yes | At least one evaluator ID |
| `sampleRate` | number | No | Fraction of matching conversations to evaluate (0–1) |
| `filters` | object | No | Restrict scope, e.g. `{ agentIds: ["agent-id"] }` |
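For example, to continuously evaluate roughly 10% of live conversations for one agent, assuming the `client` and evaluator from the walkthrough (the agent ID is a placeholder):

```typescript
await client.createEvaluationSuiteConfig({
  evaluatorIds: [evaluator.id],
  sampleRate: 0.1,                             // evaluate ~10% of matching conversations
  filters: { agentIds: ["support-agent-id"] }, // placeholder agent ID
});
```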
To list results by job or run config, use the Evaluations API.
## Related
- Evaluations API reference — Full list of evaluation endpoints and schemas
- Visual Builder: Evaluations — Configure evaluators, batch evaluations, and continuous tests in the UI
- Test suites — How test suites work in the Visual Builder