cyberneticlibrary

Benchmark AI agent skills

skill-benchmarking skill setup L31
christim427-rgb/ios-agent-skills
What it does

Runs skill benchmarks with strict grading.

Best for

When you need evidence-based skill evaluation across multiple models.

Inputs
  • evals.json
  • Model slug
  • Skill implementation
Outputs
  • benchmark-<model>.json
  • Grading results
  • HTML review
Requires
  • Python
  • Bash
Preconditions
  • evals.json exists
  • Grader isolation enforced
Failure modes
  • Contaminated grading context
  • Non-discriminating assertions
  • Stale benchmark data
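As an illustration of the "non-discriminating assertions" failure mode, here is a minimal Python sketch. It assumes that an assertion is non-discriminating when it passes about equally often with and without the skill; the listing does not define the term, so this reading is a guess. The data shape (assertion id mapped to a list of pass/fail outcomes), the function names, and the `min_gap` threshold are all hypothetical and are not taken from the skill or from the evals.json format.

```python
from typing import Dict, List

# Hypothetical shape: assertion id -> pass/fail outcome for each run.
Results = Dict[str, List[bool]]


def pass_rate(outcomes: List[bool]) -> float:
    """Fraction of runs in which the assertion passed (0.0 if there are no runs)."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def non_discriminating(
    with_skill: Results,
    baseline: Results,
    min_gap: float = 0.2,  # assumed threshold, tune to taste
) -> List[str]:
    """Return ids of assertions whose pass rate barely changes with the skill."""
    flagged = []
    for assertion_id in sorted(with_skill):
        gap = pass_rate(with_skill[assertion_id]) - pass_rate(
            baseline.get(assertion_id, [])
        )
        if abs(gap) < min_gap:
            flagged.append(assertion_id)
    return flagged


if __name__ == "__main__":
    # Made-up outcomes for two made-up assertions, three runs each.
    with_skill = {
        "uses_async_api": [True, True, True],
        "compiles": [True, True, True],
    }
    baseline = {
        "uses_async_api": [False, False, True],
        "compiles": [True, True, True],
    }
    print(non_discriminating(with_skill, baseline))
```

An assertion flagged this way says little about whether the skill helped, so under this reading it would be a candidate for rewriting or removal before the benchmark results are trusted.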
Trust signals
  • Non-negotiable invariants
  • 7-phase workflow
  • Assertion hygiene process