Measure agent reliability with formal evals
eval-harness · skill · setup · L2 · ★0
Sheshiyer/skill-clusters
What it does
Define pass/fail criteria and measure agent reliability with pass@k
Best for
AI-assisted workflows where deterministic pass/fail criteria must be defined before implementation.
Inputs
- · capability/regression eval specs
- · grader type (code/model/human)
Outputs
- · eval report with pass@1/pass@k metrics
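The card does not say how the pass@1/pass@k figures in the report are computed. A minimal sketch, assuming the widely used combinatorial estimator (c passes observed in n recorded trials, k trials sampled without replacement); the names `pass_at_k` and `build_report` and the report layout are hypothetical, not part of the skill:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k sampled trials passes,
    given c passes observed among n recorded trials."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k trials failed,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def build_report(results: dict[str, list[bool]], k: int) -> dict[str, dict[str, float]]:
    """Map each eval name to its pass@1 and pass@k estimates.

    `results` holds one list of boolean trial outcomes per eval (assumed layout).
    """
    report = {}
    for name, outcomes in results.items():
        n, c = len(outcomes), sum(outcomes)
        report[name] = {
            "pass@1": pass_at_k(n, c, 1),
            f"pass@{k}": pass_at_k(n, c, k),
        }
    return report


if __name__ == "__main__":
    trials = {"login-flow": [True, False, True, False]}
    print(build_report(trials, k=2))
```

With k = 1 the estimator reduces to the plain pass rate c / n, so pass@1 needs no special handling.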
Requires
- · Code graders (bash)
- · Model graders (Claude)
- · Human reviewer
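Of the three grader types listed above, the code grader is the easiest to illustrate. A minimal sketch, assuming a code grader is a bash command whose exit status decides pass or fail; the function name `run_code_grader` and the timeout default are hypothetical, and model and human graders are left out because the card gives no interface for them:

```python
import subprocess


def run_code_grader(command: str, timeout_s: float = 60.0) -> bool:
    """Run a bash check and return True only when it exits with status 0.

    A check that exceeds the timeout counts as a failure.
    """
    try:
        completed = subprocess.run(
            ["bash", "-c", command],
            capture_output=True,  # keep grader output out of the console
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    return completed.returncode == 0


if __name__ == "__main__":
    # Hypothetical check: pass when the word "ok" appears in the piped text.
    print(run_code_grader("echo 'status: ok' | grep -q ok"))
```

Treating a timeout as a failure rather than an error keeps the result strictly boolean, which is what a deterministic pass/fail criterion needs.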
Preconditions
Success criteria articulated before coding
Failure modes
Evals that are too loose or too strict, slow evals that get skipped, regressions that go untracked, and subjective graders
Trust signals
- · pre-implementation definition phase
- · pass^k vs pass@k distinction
- · multi-grader support
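The card names the pass^k vs pass@k distinction without defining it. A small sketch of the usual reading, assuming pass@k means at least one of k attempts passes and pass^k means all k attempts pass, with independent attempts that each succeed with probability p; the function names are hypothetical:

```python
def pass_at_k_rate(p: float, k: int) -> float:
    """Probability that at least one of k independent attempts passes."""
    return 1.0 - (1.0 - p) ** k


def pass_hat_k_rate(p: float, k: int) -> float:
    """Probability that all k independent attempts pass (pass^k)."""
    return p ** k


if __name__ == "__main__":
    p = 0.5  # illustrative per-attempt success rate, not a measured value
    for k in (1, 2, 4):
        print(k, pass_at_k_rate(p, k), pass_hat_k_rate(p, k))
```

Under these assumptions pass@k rises with k while pass^k falls, so pass@k rewards an agent that can succeed eventually and pass^k rewards one that succeeds every time.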