cyberneticlibrary

Measure agent reliability with formal evals

eval-harness · skill · setup · L20
Sheshiyer/skill-clusters
What it does

Define pass/fail criteria and measure agent reliability with pass@k

Best for

AI-assisted workflows that need deterministic pass/fail criteria defined before implementation begins.

Inputs
  • capability/regression eval specs
  • grader type (code/model/human)
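
The two inputs above could be captured in a small typed record. This is a minimal sketch only: the card does not describe the actual spec format, so `EvalSpec`, `EvalKind`, `GraderType`, `parse_spec`, and every field name below are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class EvalKind(Enum):
    CAPABILITY = "capability"
    REGRESSION = "regression"


class GraderType(Enum):
    CODE = "code"
    MODEL = "model"
    HUMAN = "human"


@dataclass(frozen=True)
class EvalSpec:
    """Hypothetical spec record; field names are illustrative, not the skill's schema."""
    name: str
    kind: EvalKind
    grader: GraderType
    criterion: str  # the pass/fail condition, written down before implementation


def parse_spec(raw: dict) -> EvalSpec:
    """Turn a plain dict into an EvalSpec; an unknown kind or grader raises ValueError."""
    return EvalSpec(
        name=raw["name"],
        kind=EvalKind(raw["kind"]),
        grader=GraderType(raw["grader"]),
        criterion=raw["criterion"],
    )
```

Parsing through enums means a misspelled grader type fails at load time rather than partway through an eval run.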
Outputs
  • eval report with pass@1/pass@k metrics
Requires
  • Code graders (bash)
  • Model graders (Claude)
  • Human reviewer
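
A code grader of the bash kind listed above can be as simple as running a command and treating exit status 0 as a pass. The sketch below assumes that convention and that `bash` is on the PATH; `run_code_grader` and its `timeout_s` parameter are hypothetical names, not part of the skill.

```python
import subprocess


def run_code_grader(command: str, timeout_s: float = 30.0) -> bool:
    """Return True if the bash command exits 0; False on non-zero exit or timeout."""
    try:
        result = subprocess.run(
            ["bash", "-c", command],
            capture_output=True,  # keep grader output out of the eval log
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # Assumption: a grader that does not finish in time counts as a failure.
        return False
    return result.returncode == 0
```

Because the result depends only on the exit status, the same check gives the same verdict on every run, which is what makes code graders the deterministic option among the three.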
Preconditions

Success criteria articulated before coding

Failure modes

Evals that are too loose or too strict; slow evals that get skipped; regressions that go untracked; subjective graders.

Trust signals
  • pre-implementation definition phase
  • pass^k vs pass@k distinction
  • multi-grader support
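
The card names the pass^k vs pass@k distinction without defining either metric. The sketch below assumes the commonly used definitions: pass@k is the chance that at least one of k attempts passes, and pass^k is the chance that all k attempts pass. Both are estimated from c passing runs out of n recorded trials. The function names are hypothetical, and the skill's own formulas may differ.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k sampled attempts passes), given c passes in n trials."""
    _check(n, c, k)
    # 1 minus the chance that all k draws (without replacement) are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k sampled attempts pass), given c passes in n trials."""
    _check(n, c, k)
    # Chance that all k draws (without replacement) are passes.
    return comb(c, k) / comb(n, k)
```

With 7 passes in 10 trials and k = 3, `pass_at_k` gives about 0.992 while `pass_hat_k` gives about 0.292. The same agent looks nearly perfect by one measure and unreliable by the other, so a report should state which metric it is quoting.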