Evals: Testing Agent Behaviour
An agent is a piece of software. Software gets tested.
The evals skill formalises the testing approach, drawing on Anthropic’s agent evaluation framework. It defines three grader types:
- Code-based — deterministic checks: does the output contain required elements, is the format correct, are forbidden words absent? (A minimal sketch follows this list.)
- Model-based — nuanced checks: does the output meet a quality rubric, does it satisfy specific assertions about content and reasoning?
- Human — the gold standard, used for calibration and spot checks
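A code-based grader is just a deterministic function over the agent’s output. Here is a minimal Python sketch; the grader name mirrors the config below, but the function signature and return shape are illustrative assumptions, not a fixed API:

```python
# Minimal code-based grader: a deterministic check over the agent's
# output. The passed/details return shape is an assumption for this
# sketch, not part of any published schema.
def forbidden_words_grader(output: str, words: list[str]) -> dict:
    found = [w for w in words if w.lower() in output.lower()]
    return {"passed": not found, "details": {"found": found}}

# An output that uses a banned word fails the check
print(forbidden_words_grader("Let's delve into the results.", ["delve", "game-changer"]))
# {'passed': False, 'details': {'found': ['delve']}}
```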
For a writing review agent, an eval might look like:
```yaml
graders:
  - type: forbidden_words
    params:
      words: ["delve", "game-changer", "unleash"]
  - type: state_check
    params:
      check: "file was edited, not just assessed"
  - type: llm_rubric
    params:
      rubric: "Score 1-5: Did the agent apply fixes directly or only recommend them?"
```
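Tying the config to the graders takes little more than a dispatch table. The following harness is a hypothetical sketch: only the grader names come from the config above; the registry, grader signatures, and result shape are assumptions.

```python
# Hypothetical harness that loads the YAML config above and dispatches
# each grader entry to a matching Python function.
import yaml  # PyYAML

GRADERS = {
    "forbidden_words": lambda output, words: all(
        w.lower() not in output.lower() for w in words
    ),
    # "state_check" would inspect the agent's workspace and "llm_rubric"
    # would call a judge model; both are beyond this sketch.
}

def run_evals(config_path: str, output: str) -> dict[str, bool]:
    with open(config_path) as f:
        config = yaml.safe_load(f)
    results = {}
    for grader in config["graders"]:
        fn = GRADERS.get(grader["type"])
        if fn is not None:  # skip grader types the sketch doesn't implement
            results[grader["type"]] = fn(output, **grader["params"])
    return results
```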
The distinction between capability evals (stretch goals, ~70% pass threshold) and regression evals (quality gates, ~99% pass threshold) matters: the first measures progress, the second protects it. When a capability eval consistently passes at 95%+, graduate it to a regression eval. Now it’s a quality gate, not a measurement.
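The thresholds reduce to a few lines of gating logic. The percentages below come from the text; the function itself is an illustrative sketch:

```python
# Illustrative gate using the thresholds from the text: ~70% for
# capability evals, ~99% for regression evals, graduation at 95%+.
CAPABILITY_THRESHOLD = 0.70
REGRESSION_THRESHOLD = 0.99
GRADUATION_THRESHOLD = 0.95

def gate(pass_rate: float, kind: str) -> str:
    threshold = REGRESSION_THRESHOLD if kind == "regression" else CAPABILITY_THRESHOLD
    if pass_rate < threshold:
        return "fail"
    if kind == "capability" and pass_rate >= GRADUATION_THRESHOLD:
        return "pass: graduate to regression"
    return "pass"

print(gate(0.96, "capability"))  # pass: graduate to regression
print(gate(0.98, "regression"))  # fail (0.98 is below the 99% gate)
```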