
Script Graders

Script graders are scripts that evaluate agent responses deterministically. Write them in any language: Python, TypeScript, JavaScript on Node, or anything else that runs as an executable.

Use script graders when you need a command-backed scoring component with explicit score control. If you only need a reusable assertion type that can be referenced by name from .agentv/assertions/, use Custom Assertions instead.

Script graders receive eval context via stdin JSON and return a result via stdout.

Input (stdin, raw wire format):

{
  "input": [{ "role": "user", "content": "What is 15 + 27?" }],
  "input_files": [],
  "output": "The answer is 42.",
  "expected_output": [{ "role": "assistant", "content": "42" }],
  "messages": [{ "role": "assistant", "content": "The answer is 42." }],
  "trace_summary": {
    "event_count": 1,
    "tool_calls": {},
    "error_count": 0,
    "llm_call_count": 1
  }
}

Raw grader stdin is a process-boundary wire format, so keys are snake_case. TypeScript and JavaScript graders that use @agentv/sdk receive the same payload converted to camelCase. The repo-local Python helper in examples/features/sdk-python/ keeps the same snake_case field names.

| Raw stdin key | TypeScript SDK field | Meaning |
| --- | --- | --- |
| output | output | Final answer / scored result as a string |
| messages | messages | Transcript messages for transcript-aware graders |
| expected_output | expectedOutput | Reference answer messages |
| output_path | outputPath | Temp file containing large final answer JSON, when used |
| trace_summary | traceSummary | Lightweight metrics summary |
| token_usage | tokenUsage | Token usage metrics |
| cost_usd | costUsd | Estimated cost in USD |
| duration_ms | durationMs | Total execution duration |
| workspace_path | workspacePath | Temp workspace path, when configured |
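
The naming rule behind the table can be sketched in a few lines of Python. This is only an illustration of the snake_case to camelCase mapping, not the SDK's implementation; `snake_to_camel` and `camelize_keys` are hypothetical helper names, and renaming top-level keys only is an assumption made for the sketch.

```python
import re


def snake_to_camel(key: str) -> str:
    """Convert a snake_case key such as 'expected_output' to 'expectedOutput'."""
    return re.sub(r"_([a-z0-9])", lambda match: match.group(1).upper(), key)


def camelize_keys(payload: dict) -> dict:
    """Rename top-level keys only (an assumption); nested values are left untouched."""
    return {snake_to_camel(key): value for key, value in payload.items()}


if __name__ == "__main__":
    raw = {"expected_output": [], "cost_usd": 0.01, "output": "42"}
    print(camelize_keys(raw))
```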

Do not treat output as a message array. Use output for answer-text checks, and use messages, trace.messages, or trace.events only when the grader intentionally evaluates transcript or tool behavior.
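
As a minimal sketch of that distinction, the two hypothetical helpers below read the final answer from `output` as plain text and reach into `messages` only when the transcript itself is being graded. The helper names are illustrative, and the message shape (`role` plus string `content`) is taken from the sample payload above.

```python
def answer_text(payload: dict) -> str:
    """Return the final answer as text; output is a string (or null), never a message array."""
    output = payload.get("output")
    return output if isinstance(output, str) else ""


def assistant_texts(payload: dict) -> list:
    """Collect assistant message text for graders that intentionally inspect the transcript."""
    return [
        message["content"]
        for message in payload.get("messages") or []
        if message.get("role") == "assistant" and isinstance(message.get("content"), str)
    ]
```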

Emit a JSON object for numeric scores or multi-aspect results:

{
  "pass": true,
  "score": 1.0,
  "reason": "Answer contains the correct value.",
  "checks": [
    { "text": "Answer contains correct value (42)", "pass": true, "reason": "42 appears in the output." }
  ]
}

| Output field | Type | Description |
| --- | --- | --- |
| pass | boolean | Aggregate pass/fail decision |
| score | number | 0.0 to 1.0 |
| reason | string | Explanation for the aggregate decision |
| checks | Array<{ text, pass, score?, reason, evidence? }> | Optional per-aspect results with verdict, optional score, reason, and evidence |
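
Before printing a result, a grader can sanity-check it against the field table. The sketch below is one way to do that; `validate_result` is a hypothetical helper, and treating `pass`, `score`, and `reason` as all required is an assumption, since the table only marks `checks` as optional.

```python
def validate_result(result: dict) -> list:
    """Return a list of problems; an empty list means the result matches the table."""
    problems = []
    if not isinstance(result.get("pass"), bool):
        problems.append("pass must be a boolean")
    score = result.get("score")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)) or not 0.0 <= score <= 1.0:
        problems.append("score must be a number from 0.0 to 1.0")
    if not isinstance(result.get("reason"), str):
        problems.append("reason must be a string")
    for index, check in enumerate(result.get("checks") or []):
        if not isinstance(check, dict):
            problems.append(f"checks[{index}] must be an object")
        elif not isinstance(check.get("text"), str) or not isinstance(check.get("pass"), bool):
            problems.append(f"checks[{index}] needs a string text and a boolean pass")
    return problems
```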

For simple pass/fail checks, skip the JSON protocol entirely. The exit code determines the score and stdout becomes the check text:

| Exit code | Score | Verdict |
| --- | --- | --- |
| 0 | 1.0 | pass |
| non-zero (no stderr) | 0.0 | fail |

#!/bin/bash
# check-pages.sh: passes when the PDF has at least 5 pages
pages=$(pdfinfo report.pdf | grep Pages | awk '{print $2}')
if [ "$pages" -ge 5 ]; then
  echo "PDF has $pages pages (≥5 required)"
else
  echo "PDF has only $pages pages (<5 required)"
  exit 1
fi

assert:
  - type: script
    command: [bash, scripts/check-pages.sh]

Silent one-liners work too — stdout is optional:

assert:
  - type: script
    command: ["bash", "-c", "[ $(wc -l < output.txt) -ge 10 ]"]

Scripts that write to stderr and exit non-zero surface as execution errors rather than quality failures.
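
The exit-code rules above can be summarised as a small decision function. This is a sketch of the documented mapping, not AgentV's actual runner code; `interpret_exit` is a hypothetical name, and the handling of stderr output on a zero exit code (ignored here) is an assumption because the text does not cover that case.

```python
def interpret_exit(returncode: int, stdout: str, stderr: str) -> dict:
    """Map a finished script to a verdict using the documented exit-code rules."""
    if returncode == 0:
        # Exit 0 scores 1.0; stdout (possibly empty) becomes the check text.
        return {"verdict": "pass", "score": 1.0, "text": stdout.strip()}
    if stderr.strip():
        # Non-zero exit with stderr output is an execution error, not a quality failure.
        return {"verdict": "error", "score": None, "text": stderr.strip()}
    # Non-zero exit with empty stderr is an ordinary failed check.
    return {"verdict": "fail", "score": 0.0, "text": stdout.strip()}
```

One consequence for script authors: a failing check should print its explanation to stdout, because anything written to stderr on a non-zero exit turns the result into an execution error.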

This version uses the raw stdin/stdout contract and works in any Python environment:

validators/check_answer.py
import json, sys

# Read the eval context from stdin; output may be null, so fall back to "".
data = json.load(sys.stdin)
output = data.get("output") or ""

checks = []
if "42" in output:
    checks.append({"text": "Output contains correct value (42)", "pass": True, "reason": "42 appears in the output"})
else:
    checks.append({"text": "Output contains correct value (42)", "pass": False, "reason": "42 is missing from the output"})

# Aggregate: the score is the fraction of checks that passed.
passed = sum(1 for check in checks if check["pass"])
score = passed / len(checks) if checks else 0.0
print(json.dumps({
    "pass": passed == len(checks),
    "score": score,
    "reason": f"{passed}/{len(checks)} checks passed",
    "checks": checks,
}))

The repo-local helper in examples/features/sdk-python/ wraps the same contract for that example checkout:

from agentv_py.grader import Check, ScriptGraderResult, define_script


def evaluate(context):
    candidate = context.output or ""
    passed = "42" in candidate
    return ScriptGraderResult(
        pass_=passed,
        score=1.0 if passed else 0.0,
        reason="Answer contains the correct value" if passed else "Answer is missing the correct value",
        checks=[
            Check(
                text="Output contains correct value (42)",
                pass_=passed,
                reason="42 appears in the output" if passed else "42 is missing from the output",
            )
        ],
    )


if __name__ == "__main__":
    define_script(evaluate)

Deprecated wire aliases like output_text, input_text, reference_answer, and expected_output_text are not accepted by the Python helper.

validators/check_answer.ts
import { readFileSync } from "fs";

const data = JSON.parse(readFileSync("/dev/stdin", "utf-8"));
const output: string = data.output ?? "";
const checks: Array<{ text: string; pass: boolean; reason: string }> = [];

if (output.includes("42")) {
  checks.push({ text: "Output contains correct value (42)", pass: true, reason: "42 appears in the output" });
} else {
  checks.push({ text: "Output contains correct value (42)", pass: false, reason: "42 is missing from the output" });
}

const passed = checks.filter(check => check.pass).length;
console.log(JSON.stringify({
  pass: passed === checks.length,
  score: checks.length === 0 ? 0.0 : passed / checks.length,
  reason: `${passed}/${checks.length} checks passed`,
  checks,
}));

assert:
  - name: my_validator
    type: script
    command: [./validators/check_answer.py]

The @agentv/sdk package provides a declarative API with automatic stdin/stdout handling. Use defineScriptGrader to skip protocol boilerplate:

#!/usr/bin/env bun
import { defineScriptGrader } from '@agentv/sdk';

export default defineScriptGrader(({ output, criteria }) => {
  const outputText = output ?? '';
  const checks: Array<{ text: string; pass: boolean; reason: string }> = [];

  if (outputText.includes(criteria)) {
    checks.push({ text: 'Output matches expected outcome', pass: true, reason: 'Criteria text appears in the output.' });
  } else {
    checks.push({ text: 'Output matches expected outcome', pass: false, reason: 'Criteria text is missing from the output.' });
  }

  const passed = checks.filter(check => check.pass).length;
  return {
    pass: checks.length > 0 && passed === checks.length,
    score: checks.length === 0 ? 0 : passed / checks.length,
    reason: `${passed}/${checks.length} checks passed`,
    checks,
  };
});

For deterministic workspace checks, prefer a normal Vitest verifier file. This matches the common hidden-verifier pattern: read files from the prepared workspace and use expect(...).

graders/welcome-banner.test.ts
import { readFileSync } from 'node:fs';
import { join } from 'node:path';
import { describe, expect, it } from 'vitest';

function readWorkspaceFile(relativePath: string) {
  return readFileSync(join(process.env.AGENTV_WORKSPACE_PATH ?? process.cwd(), relativePath), 'utf8');
}

describe('welcome banner', () => {
  const page = () => readWorkspaceFile('app/page.tsx');

  it('shows ready status text', () => {
    expect(page()).toContain('Status: All systems ready');
  });

  it('links the call to action to /dashboard', () => {
    expect(page()).toMatch(/href=["']\/dashboard["']/);
  });
});

Then use AgentV’s built-in Vitest adapter as the script command. The adapter copies verifier files into a temporary workspace-local path when needed, runs Vitest in workspace_path, reads the JSON reporter output, and maps each test outcome to an AgentV check:

assert:
  - name: vitest-welcome-banner
    type: script
    command: [agentv, eval, graders/welcome-banner.test.ts]

AgentV infers the Vitest adapter for verifier-looking files such as *.test.ts, *.spec.ts, and Vercel-style EVAL.ts. Use agentv eval vitest --in-workspace verifiers/welcome-banner.test.ts when the verifier file is already materialized inside the prepared workspace or you need other adapter options. Use the SDK’s defineVitestWorkspaceGrader() only when embedding the adapter in a custom script or custom command. See examples/features/vitest-workspace-grader/ for a runnable example.

For tiny one-off file checks, defineWorkspaceGrader can resolve the workspace path, read files relative to the workspace, build checks, and aggregate the score:

#!/usr/bin/env bun
import { defineWorkspaceGrader } from '@agentv/sdk';

export default defineWorkspaceGrader(async ({ workspace }) => [
  await workspace.file('app/page.tsx').contains('Status: All systems ready'),
  await workspace.file('app/page.tsx').contains('Open dashboard'),
  await workspace.file('app/page.tsx').matches(/href=["']\/dashboard["']/),
  await workspace.file('app/page.tsx').notMatches(/TODO/i),
]);

Prefer Vitest verifiers when the checks naturally fit expect(...). Use defineWorkspaceGrader when you need a very small custom script, custom weighting, or details that do not map cleanly to individual test outcomes.

SDK exports: defineScriptGrader, defineVitestWorkspaceGrader, defineWorkspaceGrader, Message, ToolCall, Trace, TraceSummary, ScriptGraderInput, ScriptGraderResult, ScriptGraderCheck, Workspace, WorkspaceCheck

Script graders can call an LLM through a target proxy for metrics that require multiple LLM calls (contextual precision, semantic similarity, etc.).

Add a target block to the grader config:

assert:
  - name: contextual-precision
    type: script
    command: [bun, scripts/contextual-precision.ts]
    target:
      max_calls: 10 # Default: 50

Use createTargetClient from the SDK:

#!/usr/bin/env bun
import { createTargetClient, defineScriptGrader } from '@agentv/sdk';

export default defineScriptGrader(async ({ input, output }) => {
  const inputText = input
    .filter((message) => message.role === 'user')
    .map((message) => typeof message.content === 'string' ? message.content : '')
    .join('\n');
  const outputText = output ?? '';

  const target = createTargetClient();
  if (!target) return { pass: false, score: 0, reason: 'Target not configured' };

  const response = await target.invoke({
    question: `Is this relevant to: ${inputText}? Response: ${outputText}`,
    systemPrompt: 'Respond with JSON: { "relevant": true/false }'
  });

  const result = JSON.parse(response.rawText ?? '{}');
  return {
    pass: result.relevant === true,
    score: result.relevant === true ? 1.0 : 0.0,
    reason: result.relevant === true ? 'Response is relevant' : 'Response is not relevant',
  };
});

Use target.invokeBatch(requests) for multiple calls in parallel.

Environment variables (set automatically when target is configured):

| Variable | Description |
| --- | --- |
| AGENTV_TARGET_PROXY_URL | Local proxy URL |
| AGENTV_TARGET_PROXY_TOKEN | Bearer token for authentication |
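
A grader written outside the TypeScript SDK can at least detect whether the proxy is configured by reading these two variables. The sketch below does only that; `target_proxy_settings` is a hypothetical helper, and sending the token as an `Authorization: Bearer ...` header is an assumption based on the "Bearer token" description. The proxy's request paths and body format are not described here, so the sketch deliberately stops short of making a call.

```python
import os


def target_proxy_settings(env=None):
    """Return the proxy URL and auth headers, or None when no target is configured."""
    env = os.environ if env is None else env
    url = env.get("AGENTV_TARGET_PROXY_URL")
    token = env.get("AGENTV_TARGET_PROXY_TOKEN")
    if not url or not token:
        return None
    # Assumed header shape for a bearer token; not confirmed by this page.
    return {"url": url, "headers": {"Authorization": f"Bearer {token}"}}
```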

Beyond the basic fields (input, output, expected_output), script graders receive additional structured context:

| Field | Type | Description |
| --- | --- | --- |
| input | Message[] | Full resolved input message array |
| output | string \| null | Final answer / scored result only |
| messages | Message[] | Transcript messages from the target execution |
| expected_output | Message[] | Expected/reference output messages |
| output_path | string | Temp file containing large final answer JSON, when output is omitted |
| input_files | string[] | Paths to input files referenced in the eval |
| trace | Trace | Full execution trace with messages, events, metrics, and provenance |
| trace_summary | TraceSummary | Lightweight execution metrics summary |
| token_usage | {input, output} | Token consumption |
| cost_usd | number | Estimated cost in USD |
| duration_ms | number | Total execution duration |
| start_time | string | ISO timestamp of first event |
| end_time | string | ISO timestamp of last event |
| file_changes | string \| null | Unified diff of workspace file changes (populated when workspace is configured; includes files at workspace root, changes inside nested repos, and Copilot session-state artifacts) |
| workspace_path | string \| null | Absolute path to the temp workspace directory (populated when workspace is configured) |

{
  "event_count": 5,
  "tool_calls": { "search": 2, "fetch": 1 },
  "error_count": 0,
  "llm_call_count": 2
}

| Field | Type | Description |
| --- | --- | --- |
| event_count | number | Total tool invocations |
| tool_calls | Record<string, number> | Count per tool |
| error_count | number | Failed tool calls |
| llm_call_count | number | Number of LLM calls (assistant messages) |
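
These counters are enough for simple efficiency checks without walking the full trace. The sketch below grades a `trace_summary` object on two criteria: no failed tool calls, and a cap on total tool calls. `grade_trace_summary` and the `max_tool_calls` budget are hypothetical; the field names come from the table above.

```python
def grade_trace_summary(summary: dict, max_tool_calls: int = 5) -> dict:
    """Build a script-grader result from the lightweight trace metrics."""
    tool_calls = summary.get("tool_calls") or {}
    total_calls = sum(tool_calls.values())
    error_count = summary.get("error_count", 0)
    checks = [
        {
            "text": "No failed tool calls",
            "pass": error_count == 0,
            "reason": f"error_count is {error_count}",
        },
        {
            "text": f"At most {max_tool_calls} tool calls",
            "pass": total_calls <= max_tool_calls,
            "reason": f"{total_calls} tool calls recorded",
        },
    ]
    passed = sum(1 for check in checks if check["pass"])
    return {
        "pass": passed == len(checks),
        "score": passed / len(checks),
        "reason": f"{passed}/{len(checks)} checks passed",
        "checks": checks,
    }
```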

Use expected_output for reference answers and output for the actual final answer from live runs. Use messages or trace when you need tool calls, intermediate messages, or replay/provenance data.

When workspace is configured in the eval YAML (via workspace.template, workspace.repos, or lifecycle hooks), script graders receive the prepared workspace path in two ways:

  1. JSON payload: workspace_path field in the stdin input
  2. Environment variable: AGENTV_WORKSPACE_PATH

This enables functional grading — running commands like npm test, pytest, or cargo test directly in the agent’s workspace.
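
Because the path arrives through both channels, a grader can accept either one. A minimal sketch, assuming the stdin payload has already been parsed into a dict: `resolve_workspace` is a hypothetical helper, and preferring the payload over the environment variable is an arbitrary choice, since both are expected to hold the same path.

```python
import os
from pathlib import Path
from typing import Optional


def resolve_workspace(payload: dict, env=None) -> Optional[Path]:
    """Return the prepared workspace directory, or None when no workspace is configured."""
    env = os.environ if env is None else env
    raw = payload.get("workspace_path") or env.get("AGENTV_WORKSPACE_PATH")
    return Path(raw) if raw else None
```

Returning `None` instead of falling back to the current directory keeps a functional grader from running commands such as `npm test` in the wrong place when the eval has no workspace.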

file_changes is a unified diff built from two sources, merged in order:

  1. Git baseline: git diff against a baseline commit taken before the agent ran. Captures edits, new files at workspace root, and changes inside any nested git repos materialized via workspace.repos or set up via a before_all hook.
  2. Provider-reported artifacts: Copilot providers scan their session-state files/ directory after each run and append those as synthetic diffs. This surfaces files the agent wrote outside workspace_path entirely (e.g. ~/.copilot/session-state/<uuid>/files/).
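
A grader that only needs to know which files changed can scan the diff headers rather than parse every hunk. The sketch below assumes git-style `diff --git a/<path> b/<path>` header lines; that matches ordinary `git diff` output, but whether the synthetic provider-reported diffs use the same header is an assumption. `changed_paths` is a hypothetical helper.

```python
import re

# Git-style file header, e.g. "diff --git a/src/index.ts b/src/index.ts".
_DIFF_HEADER = re.compile(r"^diff --git a/(.+?) b/(.+)$")


def changed_paths(file_changes) -> list:
    """List the paths named in diff headers, in order, without duplicates."""
    paths = []
    for line in (file_changes or "").splitlines():
        match = _DIFF_HEADER.match(line)
        if match and match.group(2) not in paths:
            paths.append(match.group(2))
    return paths
```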
#!/usr/bin/env bun
import { readFileSync } from "fs";
import { execFileSync } from "child_process";

const input = JSON.parse(readFileSync("/dev/stdin", "utf-8"));
const cwd = input.workspace_path;
const checks: Array<{ text: string; pass: boolean; reason: string }> = [];

// Stage 1: Install dependencies
try {
  execFileSync("npm", ["install"], { cwd, stdio: "pipe" });
  checks.push({ text: "npm install", pass: true, reason: "npm install passed" });
} catch { checks.push({ text: "npm install", pass: false, reason: "npm install failed" }); }

// Stage 2: Typecheck
try {
  execFileSync("npx", ["tsc", "--noEmit"], { cwd, stdio: "pipe" });
  checks.push({ text: "typecheck", pass: true, reason: "typecheck passed" });
} catch { checks.push({ text: "typecheck", pass: false, reason: "typecheck failed" }); }

// Stage 3: Run tests
try {
  execFileSync("npm", ["test"], { cwd, stdio: "pipe" });
  checks.push({ text: "tests", pass: true, reason: "tests passed" });
} catch { checks.push({ text: "tests", pass: false, reason: "tests failed" }); }

const passed = checks.filter(check => check.pass).length;
console.log(JSON.stringify({
  pass: checks.length > 0 && passed === checks.length,
  score: checks.length > 0 ? passed / checks.length : 0,
  reason: `${passed}/${checks.length} checks passed`,
  checks,
}));

suite.yaml
workspace:
  template: ./workspace-template # copied into a temp dir before each run
target: my_agent
tests:
  - id: implement-feature
    input: "Implement the TODO functions in src/index.ts"
    assert:
      - Agent implements the feature correctly
      - name: functional-check
        type: script
        command: [bun, scripts/functional-check.ts]

See examples/features/functional-grading/ for a complete working example.

| Example | What it demonstrates |
| --- | --- |
| examples/features/functional-grading/ | workspace_path: deploy-and-test with npm install + tsc + npm test |
| examples/features/file-changes/ | file_changes: edits, creates, and deletes captured via git baseline |
| examples/features/workspace-artifact/ | file_changes: new file generated by agent (CSV) captured via git baseline |
| examples/features/file-changes-with-repos/ | file_changes: workspace-root files AND changes inside nested repos both captured |

Run a grader from .agentv/graders/ by name — no manual JSON piping required:

# Pass agent output and input directly
agentv eval assert rouge-score --agent-output "The fox jumps over the dog" --agent-input "Summarise this"
# Or pass a JSON file with { output, input } fields
agentv eval assert rouge-score --file result.json

The command:

  1. Discovers the grader script by walking up directories looking for .agentv/graders/<name>.{ts,js,mts,mjs}
  2. Passes { output, input, criteria } to the script via stdin
  3. Prints the grader’s JSON result to stdout
  4. Exits with code 0 if score >= 0.5, and with code 1 otherwise

This is the same interface that agent-orchestrated evals use — the EVAL.yaml transpiler emits agentv eval assert instructions for script graders so external grading agents can run them directly.
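
The discovery step and the exit rule from the list above can be sketched as follows. This illustrates the described behaviour and is not the CLI's source: `find_grader` and `exit_code_for` are hypothetical names, and the order in which the four extensions are tried within one directory is an assumption.

```python
from pathlib import Path
from typing import Optional

GRADER_EXTENSIONS = ("ts", "js", "mts", "mjs")


def find_grader(name: str, start: Path) -> Optional[Path]:
    """Walk from start up to the filesystem root looking for .agentv/graders/<name>.<ext>."""
    for directory in (start, *start.parents):
        for extension in GRADER_EXTENSIONS:
            candidate = directory / ".agentv" / "graders" / f"{name}.{extension}"
            if candidate.is_file():
                return candidate
    return None


def exit_code_for(score: float) -> int:
    """Exit 0 when the score reaches the 0.5 threshold, otherwise 1."""
    return 0 if score >= 0.5 else 1
```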

Pipe JSON directly to the grader script for full control:

echo '{"input":[{"role":"user","content":"What is 2+2?"}],"input_files":[],"criteria":"4","output":"4","expected_output":[{"role":"assistant","content":"4"}]}' | python validators/check_answer.py
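
The same pipe can be driven from a Python test harness instead of the shell. This is a minimal sketch using only the standard library; `run_grader` is a hypothetical helper, and it assumes the grader uses the JSON protocol (a script that relies on the exit-code shortcut prints plain text and would not parse).

```python
import json
import subprocess


def run_grader(command: list, payload: dict, timeout: float = 30.0) -> dict:
    """Send payload to the grader on stdin and parse the JSON result from stdout."""
    completed = subprocess.run(
        command,
        input=json.dumps(payload),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise CalledProcessError if the grader exits non-zero
    )
    return json.loads(completed.stdout)
```

For example, `run_grader(["python", "validators/check_answer.py"], {"output": "The answer is 42."})` would return the result object printed by the validator shown earlier.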