cyberneticlibrary

Benchmark AI agent skills

skill-benchmarking · skill · setup · L3 · ★1

christim427-rgb/ios-agent-skills

What it does

Runs skill benchmarks with strict grading.

Best for

When you need evidence-based skill evaluation across multiple models.

Inputs

· evals.json
· Model slug
· Skill implementation

Outputs

· benchmark-<model>.json
· Grading results
· HTML review

Requires

· Python
· Bash

Preconditions

· evals.json exists
· Grader isolation enforced
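
The first precondition can be checked mechanically before a run. The sketch below is illustrative only: `check_preconditions`, its `workdir` argument, and the choice to place the output file next to evals.json are assumptions, not part of the skill. It confirms only that evals.json exists and parses as JSON, since the card does not describe the file's schema, and it does not attempt to verify grader isolation.

```python
import json
from pathlib import Path


def check_preconditions(workdir: str, model_slug: str) -> Path:
    """Hypothetical helper: validate evals.json and return the output path."""
    evals_path = Path(workdir) / "evals.json"
    if not evals_path.is_file():
        raise FileNotFoundError(f"missing {evals_path}")

    # Only checks that the file is valid JSON; the real schema is not
    # specified on this card, so no fields are inspected.
    with evals_path.open(encoding="utf-8") as fh:
        json.load(fh)

    # Output name follows the benchmark-<model>.json pattern under Outputs.
    # Writing it into the same directory is an assumption.
    return Path(workdir) / f"benchmark-{model_slug}.json"
```

Calling `check_preconditions(".", "example-model")` in a directory that contains a valid evals.json would return the path `benchmark-example-model.json`, where `example-model` is a placeholder slug.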

Failure modes

· Contaminated grading context
· Non-discriminating assertions
· Stale benchmark data
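
One way to catch the second failure mode is sketched below. It assumes that "non-discriminating" means an assertion whose pass/fail outcome is identical in every compared run (for example, it passes both with and without the skill), so it cannot separate good runs from bad ones. The function name and the `results` mapping are hypothetical and do not reflect the skill's actual data format.

```python
from typing import Dict, List


def non_discriminating(results: Dict[str, List[bool]]) -> List[str]:
    """Return names of assertions whose outcome is the same in every run.

    `results` maps an assertion name to its pass/fail outcome per run
    (hypothetical structure, chosen for illustration).
    """
    flagged = []
    for name, outcomes in results.items():
        # One distinct value (or none) means the assertion never varies.
        if len(set(outcomes)) <= 1:
            flagged.append(name)
    return flagged
```

Flagged assertions are candidates for tightening or removal rather than automatic deletion, because an assertion that always passes may still guard against a regression.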

Trust signals

· Non-negotiable invariants
· 7-phase workflow
· Assertion hygiene process