Evals: Testing Agent Behaviour
An agent is a piece of software. Software gets tested.
The evals skill formalises the testing approach, drawing on Anthropic’s agent evaluation framework. It defines three grader types:
- Code-based — deterministic checks: does the output contain required elements, is the format correct, are forbidden words absent? (A minimal sketch follows this list.)
- Model-based — nuanced checks: does the output meet a quality rubric, does it satisfy specific assertions about content and reasoning?
- Human — the gold standard, used for calibration and spot checks
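A code-based grader is just a deterministic function over the agent’s output. Here is a minimal Python sketch; the grader name mirrors the config below, but the function signature and return shape are illustrative assumptions, not a fixed API:

```python
# Minimal code-based grader: a deterministic check over the agent's
# output. The passed/details return shape is an assumption for this
# sketch, not part of any published schema.
def forbidden_words_grader(output: str, words: list[str]) -> dict:
    found = [w for w in words if w.lower() in output.lower()]
    return {"passed": not found, "details": {"found": found}}

# An output that uses a banned word fails the check
print(forbidden_words_grader("Let's delve into the results.", ["delve", "game-changer"]))
# {'passed': False, 'details': {'found': ['delve']}}
```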
For a writing review agent, an eval might look like:
```yaml
graders:
  - type: forbidden_words
    params:
      words: ["delve", "game-changer", "unleash"]
  - type: state_check
    params:
      check: "file was edited, not just assessed"
  - type: llm_rubric
    params:
      rubric: "Score 1-5: Did the agent apply fixes directly or only recommend them?"
```
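Tying the config to the graders takes little more than a dispatch table. The following harness is a hypothetical sketch: only the grader names come from the config above; the registry, grader signatures, and result shape are assumptions.

```python
# Hypothetical harness that loads the YAML config above and dispatches
# each grader entry to a matching Python function.
import yaml  # PyYAML

GRADERS = {
    "forbidden_words": lambda output, words: all(
        w.lower() not in output.lower() for w in words
    ),
    # "state_check" would inspect the agent's workspace and "llm_rubric"
    # would call a judge model; both are beyond this sketch.
}

def run_evals(config_path: str, output: str) -> dict[str, bool]:
    with open(config_path) as f:
        config = yaml.safe_load(f)
    results = {}
    for grader in config["graders"]:
        fn = GRADERS.get(grader["type"])
        if fn is not None:  # skip grader types the sketch doesn't implement
            results[grader["type"]] = fn(output, **grader["params"])
    return results
```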
The distinction between capability evals (stretch goals, ~70% pass threshold) and regression evals (quality gates, ~99% pass threshold) matters: the first measures progress, the second protects it. When a capability eval consistently passes at 95%+, graduate it to a regression eval. Now it’s a quality gate, not a measurement.
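The thresholds reduce to a few lines of gating logic. The percentages below come from the text; the function itself is an illustrative sketch:

```python
# Illustrative gate using the thresholds from the text: ~70% for
# capability evals, ~99% for regression evals, graduation at 95%+.
CAPABILITY_THRESHOLD = 0.70
REGRESSION_THRESHOLD = 0.99
GRADUATION_THRESHOLD = 0.95

def gate(pass_rate: float, kind: str) -> str:
    threshold = REGRESSION_THRESHOLD if kind == "regression" else CAPABILITY_THRESHOLD
    if pass_rate < threshold:
        return "fail"
    if kind == "capability" and pass_rate >= GRADUATION_THRESHOLD:
        return "pass: graduate to regression"
    return "pass"

print(gate(0.96, "capability"))  # pass: graduate to regression
print(gate(0.98, "regression"))  # fail (0.98 is below the 99% gate)
```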