Script Graders

Script graders are scripts that evaluate agent responses deterministically. Write them in any language — Python, TypeScript, Node, or any executable.

Use script graders when you need a command-backed scoring component with explicit score control. If you only need a reusable assertion type that can be referenced by name from .agentv/assertions/, use Custom Assertions instead.

Contract

Script graders receive eval context via stdin JSON and return a result via stdout.

Input (stdin, raw wire format):

{
  "input": [{ "role": "user", "content": "What is 15 + 27?" }],
  "input_files": [],
  "output": "The answer is 42.",
  "expected_output": [{ "role": "assistant", "content": "42" }],
  "messages": [{ "role": "assistant", "content": "The answer is 42." }],
  "trace_summary": {
    "event_count": 1,
    "tool_calls": {},
    "error_count": 0,
    "llm_call_count": 1
  }
}

Raw grader stdin is a process-boundary wire format, so keys are snake_case. TypeScript and JavaScript graders that use @agentv/sdk receive the same payload converted to camelCase. The repo-local Python helper in examples/features/sdk-python/ keeps the same snake_case field names.

Raw stdin key	TypeScript SDK field	Meaning
`output`	`output`	Final answer / scored result as a string
`messages`	`messages`	Transcript messages for transcript-aware graders
`expected_output`	`expectedOutput`	Reference answer messages
`output_path`	`outputPath`	Temp file containing large final answer JSON, when used
`trace_summary`	`traceSummary`	Lightweight metrics summary
`token_usage`	`tokenUsage`	Token usage metrics
`cost_usd`	`costUsd`	Estimated cost in USD
`duration_ms`	`durationMs`	Total execution duration
`workspace_path`	`workspacePath`	Temp workspace path, when configured

Do not treat output as a message array. Use output for answer-text checks, and use messages, trace.messages, or trace.events only when the grader intentionally evaluates transcript or tool behavior.

JSON output (full protocol)

Emit a JSON object for numeric scores or multi-aspect results:

{
  "pass": true,
  "score": 1.0,
  "reason": "Answer contains the correct value.",
  "checks": [
    { "text": "Answer contains correct value (42)", "pass": true, "reason": "42 appears in the output." }
  ]
}

Output Field	Type	Description
`pass`	`boolean`	Aggregate pass/fail decision
`score`	`number`	0.0 to 1.0
`reason`	`string`	Explanation for the aggregate decision
`checks`	`Array<{ text, pass, score?, reason, evidence? }>`	Optional per-aspect results with verdict, optional score, reason, and evidence

Plain-text output (exit-code convention)

For simple pass/fail checks, skip the JSON protocol entirely. The exit code determines the score and stdout becomes the check text:

Exit code	Score	Verdict
0	1.0	pass
non-zero (no stderr)	0.0	fail

#!/bin/bash
# check-pages.sh — passes when PDF has at least 5 pages
pages=$(pdfinfo report.pdf | grep Pages | awk '{print $2}')
if [ "$pages" -ge 5 ]; then
  echo "PDF has $pages pages (≥5 required)"
else
  echo "PDF has only $pages pages (<5 required)"
  exit 1
fi

assert:
  - type: script
    command: [bash, scripts/check-pages.sh]

Silent one-liners work too — stdout is optional:

assert:
  - type: script
    command: ["bash", "-c", "[ $(wc -l < output.txt) -ge 10 ]"]

Scripts that write to stderr and exit non-zero surface as execution errors rather than quality failures.

Python Example

This version uses the raw stdin/stdout contract and works in any Python environment:

import json, sys
data = json.load(sys.stdin)
output = data.get("output") or ""

checks = []

if "42" in output:
    checks.append({"text": "Output contains correct value (42)", "pass": True, "reason": "42 appears in the output"})
else:
    checks.append({"text": "Output contains correct value (42)", "pass": False, "reason": "42 is missing from the output"})

passed = sum(1 for check in checks if check["pass"])
score = passed / len(checks) if checks else 0.0

print(json.dumps({
    "pass": passed == len(checks),
    "score": score,
    "reason": f"{passed}/{len(checks)} checks passed",
    "checks": checks,
}))

The repo-local helper in examples/features/sdk-python/ wraps the same contract for that example checkout:

from agentv_py.grader import Check, ScriptGraderResult, define_script


def evaluate(context):
    candidate = context.output or ""
    passed = "42" in candidate
    return ScriptGraderResult(
        pass_=passed,
        score=1.0 if passed else 0.0,
        reason="Answer contains the correct value" if passed else "Answer is missing the correct value",
        checks=[
            Check(
                text="Output contains correct value (42)",
                pass_=passed,
                reason="42 appears in the output" if passed else "42 is missing from the output",
            )
        ],
    )

if __name__ == "__main__":
    define_script(evaluate)

Deprecated wire aliases like output_text, input_text, reference_answer, and expected_output_text are not accepted by the Python helper.

TypeScript Example

import { readFileSync } from "fs";

const data = JSON.parse(readFileSync("/dev/stdin", "utf-8"));
const output: string = data.output ?? "";

const checks: Array<{ text: string; pass: boolean; reason: string }> = [];

if (output.includes("42")) {
  checks.push({ text: "Output contains correct value (42)", pass: true, reason: "42 appears in the output" });
} else {
  checks.push({ text: "Output contains correct value (42)", pass: false, reason: "42 is missing from the output" });
}

const passed = checks.filter(check => check.pass).length;

console.log(JSON.stringify({
  pass: passed === checks.length,
  score: checks.length === 0 ? 0.0 : passed / checks.length,
  reason: `${passed}/${checks.length} checks passed`,
  checks,
}));

Referencing in Eval Files

assert:
  - name: my_validator
    type: script
    command: [./validators/check_answer.py]

TypeScript SDK

The @agentv/sdk package provides a declarative API with automatic stdin/stdout handling. Use defineScriptGrader to skip protocol boilerplate:

#!/usr/bin/env bun
import { defineScriptGrader } from '@agentv/sdk';

export default defineScriptGrader(({ output, criteria }) => {
  const outputText = output ?? '';
  const checks: Array<{ text: string; pass: boolean; reason: string }> = [];

  if (outputText.includes(criteria)) {
    checks.push({ text: 'Output matches expected outcome', pass: true, reason: 'Criteria text appears in the output.' });
  } else {
    checks.push({ text: 'Output matches expected outcome', pass: false, reason: 'Criteria text is missing from the output.' });
  }

  const passed = checks.filter(check => check.pass).length;
  return {
    pass: checks.length > 0 && passed === checks.length,
    score: checks.length === 0 ? 0 : passed / checks.length,
    reason: `${passed}/${checks.length} checks passed`,
    checks,
  };
});

Vitest Workspace Verifiers

For deterministic workspace checks, prefer a normal Vitest verifier file. This matches the common hidden-verifier pattern: read files from the prepared workspace and use expect(...).

import { readFileSync } from 'node:fs';
import { join } from 'node:path';
import { describe, expect, it } from 'vitest';

function readWorkspaceFile(relativePath: string) {
  return readFileSync(join(process.env.AGENTV_WORKSPACE_PATH ?? process.cwd(), relativePath), 'utf8');
}

describe('welcome banner', () => {
  const page = () => readWorkspaceFile('app/page.tsx');

  it('shows ready status text', () => {
    expect(page()).toContain('Status: All systems ready');
  });

  it('links the call to action to /dashboard', () => {
    expect(page()).toMatch(/href=["']\/dashboard["']/);
  });
});

Then use AgentV’s built-in Vitest adapter as the script command. The adapter copies verifier files into a temporary workspace-local path when needed, runs Vitest in workspace_path, reads the JSON reporter output, and maps each test outcome to an AgentV check:

assert:
  - name: vitest-welcome-banner
    type: script
    command: [agentv, eval, graders/welcome-banner.test.ts]

AgentV infers the Vitest adapter for verifier-looking files such as *.test.ts, *.spec.ts, and Vercel-style EVAL.ts. Use agentv eval vitest --in-workspace verifiers/welcome-banner.test.ts when the verifier file is already materialized inside the prepared workspace or you need other adapter options. Use the SDK’s defineVitestWorkspaceGrader() only when embedding the adapter in a custom script or custom command. See examples/features/vitest-workspace-grader/ for a runnable example.

Lower-Level Workspace Helpers

For tiny one-off file checks, defineWorkspaceGrader can resolve the workspace path, read files relative to the workspace, build checks, and aggregate the score:

#!/usr/bin/env bun
import { defineWorkspaceGrader } from '@agentv/sdk';

export default defineWorkspaceGrader(async ({ workspace }) => [
  await workspace.file('app/page.tsx').contains('Status: All systems ready'),
  await workspace.file('app/page.tsx').contains('Open dashboard'),
  await workspace.file('app/page.tsx').matches(/href=["']\/dashboard["']/),
  await workspace.file('app/page.tsx').notMatches(/TODO/i),
]);

Prefer Vitest verifiers when the checks naturally fit expect(...). Use defineWorkspaceGrader when you need a very small custom script, custom weighting, or details that do not map cleanly to individual test outcomes.

SDK exports: defineScriptGrader, defineVitestWorkspaceGrader, defineWorkspaceGrader, Message, ToolCall, Trace, TraceSummary, ScriptGraderInput, ScriptGraderResult, ScriptGraderCheck, Workspace, WorkspaceCheck

Target Access

Script graders can call an LLM through a target proxy for metrics that require multiple LLM calls (contextual precision, semantic similarity, etc.).

Configuration

Add a target block to the grader config:

assert:
  - name: contextual-precision
    type: script
    command: [bun, scripts/contextual-precision.ts]
    target:
      max_calls: 10  # Default: 50

Usage

Use createTargetClient from the SDK:

#!/usr/bin/env bun
import { createTargetClient, defineScriptGrader } from '@agentv/sdk';

export default defineScriptGrader(async ({ input, output }) => {
  const inputText = input
    .filter((message) => message.role === 'user')
    .map((message) => typeof message.content === 'string' ? message.content : '')
    .join('\n');
  const outputText = output ?? '';
  const target = createTargetClient();
  if (!target) return { pass: false, score: 0, reason: 'Target not configured' };

  const response = await target.invoke({
    question: `Is this relevant to: ${inputText}? Response: ${outputText}`,
    systemPrompt: 'Respond with JSON: { "relevant": true/false }'
  });

  const result = JSON.parse(response.rawText ?? '{}');
  return {
    pass: result.relevant === true,
    score: result.relevant === true ? 1.0 : 0.0,
    reason: result.relevant === true ? 'Response is relevant' : 'Response is not relevant',
  };
});

Use target.invokeBatch(requests) for multiple calls in parallel.

Environment variables (set automatically when target is configured):

Variable	Description
`AGENTV_TARGET_PROXY_URL`	Local proxy URL
`AGENTV_TARGET_PROXY_TOKEN`	Bearer token for authentication

Advanced Input Fields

Beyond the basic fields (input, output, expected_output), script graders receive additional structured context:

Field	Type	Description
`input`	`Message[]`	Full resolved input message array
`output`	`string \| null`	Final answer / scored result only
`messages`	`Message[]`	Transcript messages from the target execution
`expected_output`	`Message[]`	Expected/reference output messages
`output_path`	`string`	Temp file containing large final answer JSON, when `output` is omitted
`input_files`	`string[]`	Paths to input files referenced in the eval
`trace`	`Trace`	Full execution trace with messages, events, metrics, and provenance
`trace_summary`	`TraceSummary`	Lightweight execution metrics summary
`token_usage`	`{input, output}`	Token consumption
`cost_usd`	`number`	Estimated cost in USD
`duration_ms`	`number`	Total execution duration
`start_time`	`string`	ISO timestamp of first event
`end_time`	`string`	ISO timestamp of last event
`file_changes`	`string \| null`	Unified diff of workspace file changes (populated when `workspace` is configured; includes files at workspace root, changes inside nested repos, and Copilot session-state artifacts)
`workspace_path`	`string \| null`	Absolute path to the temp workspace directory (populated when `workspace` is configured)

trace_summary structure

{
  "event_count": 5,
  "tool_calls": { "search": 2, "fetch": 1 },
  "error_count": 0,
  "llm_call_count": 2
}

Field	Type	Description
`event_count`	`number`	Total tool invocations
`tool_calls`	`Record<string, number>`	Count per tool
`error_count`	`number`	Failed tool calls
`llm_call_count`	`number`	Number of LLM calls (assistant messages)

Use expected_output for reference answers and output for the actual final answer from live runs. Use messages or trace when you need tool calls, intermediate messages, or replay/provenance data.

Workspace Access

When workspace is configured in the eval YAML (via workspace.template, workspace.repos, or lifecycle hooks), script graders receive the prepared workspace path in two ways:

JSON payload: workspace_path field in the stdin input
Environment variable: AGENTV_WORKSPACE_PATH

This enables functional grading — running commands like npm test, pytest, or cargo test directly in the agent’s workspace.

What `file_changes` covers

file_changes is a unified diff built from two sources, merged in order:

Git baseline: git diff against a baseline commit taken before the agent ran. Captures edits, new files at workspace root, and changes inside any nested git repos materialized via workspace.repos or set up via a before_all hook.
Provider-reported artifacts: Copilot providers scan their session-state files/ directory after each run and append those as synthetic diffs. This surfaces files the agent wrote outside workspace_path entirely (e.g. ~/.copilot/session-state/<uuid>/files/).

Example: Deploy-and-Test Pattern

#!/usr/bin/env bun
import { readFileSync } from "fs";
import { execFileSync } from "child_process";

const input = JSON.parse(readFileSync("/dev/stdin", "utf-8"));
const cwd = input.workspace_path;

const checks: Array<{ text: string; pass: boolean; reason: string }> = [];

// Stage 1: Install dependencies
try {
  execFileSync("npm", ["install"], { cwd, stdio: "pipe" });
  checks.push({ text: "npm install", pass: true, reason: "npm install passed" });
} catch { checks.push({ text: "npm install", pass: false, reason: "npm install failed" }); }

// Stage 2: Typecheck
try {
  execFileSync("npx", ["tsc", "--noEmit"], { cwd, stdio: "pipe" });
  checks.push({ text: "typecheck", pass: true, reason: "typecheck passed" });
} catch { checks.push({ text: "typecheck", pass: false, reason: "typecheck failed" }); }

// Stage 3: Run tests
try {
  execFileSync("npm", ["test"], { cwd, stdio: "pipe" });
  checks.push({ text: "tests", pass: true, reason: "tests passed" });
} catch { checks.push({ text: "tests", pass: false, reason: "tests failed" }); }

const passed = checks.filter(check => check.pass).length;
console.log(JSON.stringify({
  pass: checks.length > 0 && passed === checks.length,
  score: checks.length > 0 ? passed / checks.length : 0,
  reason: `${passed}/${checks.length} checks passed`,
  checks,
}));

workspace:
  template: ./workspace-template   # copied into a temp dir before each run

target: my_agent

tests:
  - id: implement-feature
    input: "Implement the TODO functions in src/index.ts"
    assert:
      - Agent implements the feature correctly
      - name: functional-check
        type: script
        command: [bun, scripts/functional-check.ts]

See examples/features/functional-grading/ for a complete working example.

Examples

Example	What it demonstrates
`examples/features/functional-grading/`	`workspace_path` — deploy-and-test with `npm install` + `tsc` + `npm test`
`examples/features/file-changes/`	`file_changes` — edits, creates, and deletes captured via git baseline
`examples/features/workspace-artifact/`	`file_changes` — new file generated by agent (CSV) captured via git baseline
`examples/features/file-changes-with-repos/`	`file_changes` — workspace-root files AND changes inside nested repos both captured

Testing Locally

With `agentv eval assert`

Run a grader from .agentv/graders/ by name — no manual JSON piping required:

# Pass agent output and input directly
agentv eval assert rouge-score --agent-output "The fox jumps over the dog" --agent-input "Summarise this"

# Or pass a JSON file with { output, input } fields
agentv eval assert rouge-score --file result.json

The command:

Discovers the grader script by walking up directories looking for .agentv/graders/<name>.{ts,js,mts,mjs}
Passes { output, input, criteria } to the script via stdin
Prints the grader’s JSON result to stdout
Exits 0 if score >= 0.5, exit 1 otherwise

This is the same interface that agent-orchestrated evals use — the EVAL.yaml transpiler emits agentv eval assert instructions for script graders so external grading agents can run them directly.

With stdin pipe

Pipe JSON directly to the grader script for full control:

echo '{"input":[{"role":"user","content":"What is 2+2?"}],"input_files":[],"criteria":"4","output":"4","expected_output":[{"role":"assistant","content":"4"}]}' | python validators/check_answer.py