Benchmark AI agent skills
skill-benchmarking · skill · setup · L3 · ★1
christim427-rgb/ios-agent-skills
What it does
Runs skill benchmarks with strict grading.
Best for
When you need evidence-based skill evaluation across multiple models.
Inputs
- evals.json
- Model slug
- Skill implementation
Outputs
- benchmark-<model>.json
- Grading results
- HTML review
Requires
- Python
- Bash
Preconditions
- evals.json exists
- Grader isolation enforced
Failure modes
- Contaminated grading context
- Non-discriminating assertions
- Stale benchmark data
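As an illustration of the "non-discriminating assertions" failure mode, here is a minimal sketch. It assumes an assertion is non-discriminating when it produces the same pass/fail result with and without the skill. The function name, the dictionary layout, and the assertion names are all hypothetical and are not taken from the skill itself.

```python
def non_discriminating(with_skill, baseline):
    """Return the assertion names whose outcome is identical in both runs.

    Both arguments map an assertion name to True (passed) or False (failed).
    This layout is an assumption made for the example, not the skill's format.
    """
    return sorted(
        name
        for name in with_skill
        if name in baseline and with_skill[name] == baseline[name]
    )


# Hypothetical grading results for one eval, with and without the skill.
with_skill = {"uses_async_await": True, "compiles": True, "has_tests": False}
baseline = {"uses_async_await": False, "compiles": True, "has_tests": False}

print(non_discriminating(with_skill, baseline))  # ['compiles', 'has_tests']
```

Assertions flagged this way say nothing about whether the skill helped, so they are candidates for rewriting or removal.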
Trust signals
- Non-negotiable invariants
- 7-phase workflow
- Assertion hygiene process