Skip to content

TypeScript SDK

YAML remains AgentV’s canonical, portable eval format. The SDK surfaces below are for cases where you want to generate YAML-shaped definitions in code, embed eval runs inside another application, or write executable graders and prompt templates. For authoring helpers, @agentv/sdk is AgentV’s public lightweight SDK package.

AgentV currently provides two npm packages for programmatic use:

  • @agentv/sdk — user-facing SDK for evaluate(), YAML-aligned eval authoring, custom assertions, and script graders
  • @agentv/core — core implementation package and typed configuration
Terminal window
# User-facing SDK (evaluate, defineEval, graders, defineAssertion, defineScriptGrader)
npm install @agentv/sdk
# Core configuration helpers (defineConfig)
npm install @agentv/core

Use @agentv/sdk for all new TypeScript SDK code:

Terminal window
npm uninstall @agentv/eval
npm install @agentv/sdk
import { defineScriptGrader } from '@agentv/sdk';

The general policy is hard convergence for same-week or unreleased surface names: use the correct package, field, or wire name instead of carrying aliases. @agentv/eval was already published, then deprecated on npm, and has been removed from this repository. New docs, examples, scaffolds, and skills should use @agentv/sdk directly.

Use the simplest surface that matches the job:

  • YAML / JSONL first for portable eval specs you want to run from the CLI, check into a repo, or share across TypeScript and Python workflows.
  • defineEval() / evalSuite() when you want a .eval.ts file that mirrors YAML concepts and lowers back to the canonical snake_case contract.
  • evaluate({ specFile }) when you want library control around an existing YAML suite.
  • Inline evaluate({ tests }) when the eval definition truly belongs inside application code. The programmatic API mirrors YAML, but uses current TypeScript naming such as expectedOutput.
  • defineAssertion when you want a reusable assertion type discovered from .agentv/assertions/.
  • defineScriptGrader when you need a command-backed grader with explicit score and assertion-result control.
  • agentv eval <verifier.test.ts> for deterministic workspace checks that fit normal Vitest expect(...) tests.

There is no separate first-party Python authoring SDK today. Python-facing workflows should either emit canonical YAML/JSONL or implement executable graders that consume the standard snake_case wire format.

For example, the repo-local helper in examples/features/sdk-python/ can build YAML-shaped cases while keeping assert as the durable authored contract:

from agentv_py.evals import EvalDefinition, JsonlCase, write_eval_yaml, write_jsonl
def rag_faithfulness():
return {
"name": "rag-faithfulness",
"type": "llm-rubric",
"target": "grader-target",
"prompt": "Grade whether the answer is supported by the retrieved context.",
}
write_jsonl(
"evals/cases.jsonl",
[
JsonlCase(
id="grounded-answer",
input=[{"role": "user", "content": "Answer using the retrieved context."}],
expected_output=[{"role": "assistant", "content": "The answer cites the source material."}],
extra={"assert": [rag_faithfulness()]},
)
],
)
write_eval_yaml(
"evals/suite.yaml",
EvalDefinition(name="rag-suite", tests="./cases.jsonl"),
)

This is example-local/repo-local guidance, not a promise of a published Python package.

Use defineEval() from @agentv/sdk when you want TypeScript ergonomics without creating a second eval vocabulary. The helper keeps authoring in camelCase where TypeScript needs it, then lowers back to the canonical snake_case eval object contract when AgentV loads the file.

evals/greeting.eval.ts
import { defineEval, graders } from '@agentv/sdk';
export default defineEval({
name: 'hello-suite',
target: 'mock-sdk',
workspace: {
hooks: {
beforeAll: {
command: ['echo', 'suite-start'],
},
},
},
tests: [
{
id: 'hello',
input: 'Say hello',
inputFiles: ['../fixtures/per-test-note.md'],
expectedOutput: 'Hello from the mock target',
assert: [graders.contains('Hello')],
},
],
});

Useful companion helpers:

  • toEvalYamlObject() returns the canonical snake_case object.
  • serializeEvalYaml() returns YAML text using the same canonical field names.

The durable authored field remains assert. This helper does not introduce a second YAML vocabulary.

@agentv/sdk includes a small graders catalog for common deterministic and LLM-backed grader configs. These helpers return ordinary assert entries and serialize to the same canonical YAML you could write by hand.

import { defineEval, graders } from '@agentv/sdk';
export default defineEval({
name: 'grader-helper-suite',
tests: [
{
id: 'json-greeting',
input: 'Return a JSON greeting.',
assert: [
graders.contains('Hello', { name: 'mentions-hello' }),
graders.exact('{"message":"Hello"}', { name: 'exact-json', minScore: 1 }),
graders.regex(/"message"\s*:/, { name: 'message-key' }),
graders.json({ name: 'valid-json', required: true }),
graders.llmRubric(['Greets the user'], { name: 'rubric-review' }),
graders.llmRubric(undefined, {
name: 'llm-review',
prompt: 'Grade whether the answer is useful.',
target: 'grader-target',
}),
graders.scriptGrader(['bun', 'run', 'graders/check.ts'], { name: 'scripted-check' }),
],
},
],
});

The catalog covers contains, equals/exact, regex, is-json/json, llm-rubric, and script. CamelCase SDK options such as minScore, maxSteps, and rubric scoreRanges lower to min_score, max_steps, and score_ranges when AgentV loads or serializes the suite.

If you are coming from Braintrust scores or DeepEval metrics, keep the reusable logic AgentV-native: write helper factories that return graders.* configs, then let defineEval() lower them to ordinary assert entries.

import { defineEval, graders } from '@agentv/sdk';
function ragFaithfulness() {
return graders.llmRubric(undefined, {
name: 'rag-faithfulness',
target: 'grader-target',
prompt: [
'Grade whether the answer is supported by the retrieved context.',
'Use the input and expected_output fields as grounding evidence.',
].join('\n'),
});
}
export default defineEval({
name: 'rag-suite',
tests: [
{
id: 'grounded-answer',
input: 'Answer the question using the retrieved context.',
expectedOutput: 'The answer cites the source material.',
assert: [
graders.contains('source', { name: 'mentions-source' }),
ragFaithfulness(),
],
},
],
});

The helper above serializes to the same shape you could write by hand:

assert:
- name: mentions-source
type: contains
value: source
- name: rag-faithfulness
type: llm-rubric
target: grader-target
prompt: |-
Grade whether the answer is supported by the retrieved context.
Use the input and expected_output fields as grounding evidence.

Use defineAssertion from @agentv/sdk to create reusable assertion types. Place them in .agentv/assertions/ — they’re auto-discovered by filename.

This is the custom assertion path, not the custom grader path. It matches Promptfoo’s assertion terminology for normal eval checks, while extending Promptfoo’s fixed custom logic types (javascript, python, ruby, webhook) with arbitrary discovered AgentV type names.

.agentv/assertions/min-words.ts
import { defineAssertion } from '@agentv/sdk';
export default defineAssertion(({ output }) => {
const wordCount = (output ?? '').trim().split(/\s+/).filter(Boolean).length;
const pass = wordCount >= 3;
return {
pass,
score: pass ? 1 : 0,
reason: `Output has ${wordCount} words`,
};
});

Return a score (0–1) instead of pass for graded evaluation:

.agentv/assertions/efficiency.ts
import { defineAssertion } from '@agentv/sdk';
export default defineAssertion(({ output, traceSummary }) => {
const hasContent = (output ?? '').length > 0 ? 0.5 : 0;
const isEfficient = (traceSummary?.eventCount ?? 0) <= 10 ? 0.5 : 0;
return {
score: hasContent + isEfficient,
reason: 'Checks content exists and is efficient',
};
});

If only pass is given, score is 1 (pass) or 0 (fail).

Convention-based discovery maps filename → assertion type:

.agentv/assertions/min-words.ts → type: min-words
.agentv/assertions/sentiment.ts → type: sentiment

Reference directly in your eval file — no command: needed:

assert:
- type: min-words
- type: contains
value: "Hello"

Use defineScriptGrader from @agentv/sdk for full control over scoring with optional per-check results:

import { defineScriptGrader } from '@agentv/sdk';
export default defineScriptGrader(({ output, traceSummary }) => ({
pass: (output ?? '').length > 0 && (traceSummary?.eventCount ?? 0) <= 5,
score: (output ?? '').length > 0 && (traceSummary?.eventCount ?? 0) <= 5 ? 1.0 : 0.5,
reason: 'Checks answer text and tool usage',
checks: [
{
text: 'Answer is not empty',
pass: (output ?? '').length > 0,
reason: (output ?? '').length > 0 ? 'Output is non-empty' : 'Output is empty',
},
{
text: 'Efficient tool usage',
pass: (traceSummary?.eventCount ?? 0) <= 5,
reason: 'Trace event count is within limit',
},
],
}));

For deterministic workspace verifiers, prefer normal Vitest tests plus AgentV’s built-in Vitest adapter command:

graders/welcome-banner.test.ts
import { readFileSync } from 'node:fs';
import { expect, it } from 'vitest';
it('links to the dashboard', () => {
const page = readFileSync('app/page.tsx', 'utf8');
expect(page).toMatch(/href=["']\/dashboard["']/);
});
assert:
- name: vitest-welcome-banner
type: script
command: [agentv, eval, graders/welcome-banner.test.ts]

Use defineWorkspaceGrader only for tiny one-off file checks or custom score shaping:

import { defineWorkspaceGrader } from '@agentv/sdk';
export default defineWorkspaceGrader(async ({ workspace }) => [
await workspace.file('app/page.tsx').contains('Status: All systems ready'),
await workspace.file('app/page.tsx').contains('Open dashboard'),
await workspace.file('app/page.tsx').matches(/href=["']\/dashboard["']/),
await workspace.file('app/page.tsx').notMatches(/TODO/i),
]);

defineScriptGrader, defineVitestWorkspaceGrader, and defineWorkspaceGrader custom scripts are graders referenced in YAML with type: script and command: [bun, run, grader.ts]. Plain Vitest verifier files can use command: [agentv, eval, graders/check.test.ts] without a custom wrapper; use agentv eval vitest when you need adapter flags. defineAssertion uses convention-based discovery instead — just place it in .agentv/assertions/ and reference it by assertion type name.

For detailed patterns, input/output contracts, and language-agnostic examples, see Script Graders.

Raw grader stdin uses snake_case because it crosses a process boundary and may be consumed by Python, shell, jq, or external dashboards. The @agentv/sdk package converts that payload to idiomatic TypeScript camelCase before calling your handler.

Raw stdinSDK handler field
expected_outputexpectedOutput
output_pathoutputPath
trace_summarytraceSummary
token_usagetokenUsage
cost_usdcostUsd
duration_msdurationMs
workspace_pathworkspacePath

output is already the final answer string in both formats. Transcript-aware code should read messages, trace.messages, or trace.events; answer-text graders should read output.

Use evaluate() from @agentv/sdk to run evaluations as a library. The implementation is owned by @agentv/core, but the SDK re-exports it as the user-facing entrypoint. The most portable pattern is still to keep the suite in YAML and point specFile at it; inline tests are best when the eval is tightly coupled to application code.

import { evaluate } from '@agentv/sdk';
const { results, summary } = await evaluate({
tests: [
{
id: 'greeting',
input: 'Say hello',
expectedOutput: 'Hello there!',
assert: [{ type: 'contains', value: 'Hello' }],
},
],
});
console.log(`${summary.passed}/${summary.total} passed`);

A strict OR is easy with inline assertion handlers:

import { evaluate } from '@agentv/sdk';
const { summary } = await evaluate({
tests: [
{
id: 'capital',
input: 'What is the capital of France?',
expectedOutput: 'Paris',
assert: [
({ output }) => ({
name: 'capital-or-phrase',
score: ((output ?? '').includes('Paris') || /capital of france/i.test(output ?? '')) ? 1 : 0,
}),
],
},
],
task: async (input) => `Agent: ${input}`,
threshold: 0.8,
});
console.log(`${summary.passed}/${summary.total} passed`);

Auto-discovers the default target from .agentv/targets.yaml and .env credentials.

Point to an existing YAML eval instead of inlining tests:

import { evaluate } from '@agentv/sdk';
const { results, summary } = await evaluate({
specFile: './evals/my-eval.eval.yaml',
});

This is the recommended bridge when you want SDK control without creating a separate code-first eval surface.

Create agentv.config.ts at your project root for type-safe, validated configuration using defineConfig() from @agentv/core:

import { defineConfig } from '@agentv/core';
export default defineConfig({
execution: {
maxConcurrency: 5,
maxRetries: 2,
verbose: true,
},
output: { dir: './results' },
limits: { maxCostUsd: 10.0 },
});

The config file is auto-discovered by the CLI from your project root and validated with Zod at startup.

AgentV stores eval-native run bundles, transcript sidecars, metrics, grading artifacts, and summaries. For OpenTelemetry/OpenInference traces, instrument the system under test or provider wrapper to emit spans directly to the external backend, then correlate AgentV cases to those traces with safe external_trace IDs or UI URLs when available.

Bootstrap new assertions and eval files from the CLI:

Terminal window
# Create a new assertion type
agentv create assertion <name> # → .agentv/assertions/<name>.ts
# Create a new eval with test cases
agentv create eval <name> # → evals/<name>.eval.yaml + .cases.jsonl