cyberneticlibrary

Rank agent outputs by metric or judge

eval · skill · setup · L2 · ★ 17,464

alirezarezvani/claude-skills

What it does

Evaluate and rank agent results by metric or LLM judge

Best for

Comparing multiple agent outputs for correctness, simplicity, and quality to pick a winner.

Inputs

· Session ID
· Evaluation command (metric mode) or agent diffs (judge mode)

Outputs

· Ranked results table with winner highlighted; next-step guidance

Requires

· python scripts/result_ranker.py
· git (for diffs)
· LLM judge model

Preconditions

AgentHub session with multiple agent runs

Failure modes

· Metric command fails
· Agents whose metrics fall within 10% of each other are treated as a tie and passed to the LLM judge
· Missing result post

Trust signals

· Metric and LLM judge modes supported
· Delta tracking (vs baseline)
· Hybrid mode breaks ties
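
The metric ranking, baseline delta, and tie handling described on this card can be sketched roughly as follows. This is a minimal illustration, not the code in scripts/result_ranker.py: the names `Result` and `rank_results` are hypothetical, and the sketch assumes that a higher metric is better and that the 10% threshold is measured relative to the best score.

```python
from dataclasses import dataclass


@dataclass
class Result:
    agent: str
    metric: float  # assumption: higher is better


def rank_results(results, baseline, tie_threshold=0.10):
    """Rank results by metric and flag near-ties for an LLM judge.

    Hypothetical helper, not the skill's real interface.
    Returns (rows, needs_judge). needs_judge is True when the top two
    metrics differ by at most tie_threshold, relative to the best one.
    """
    ranked = sorted(results, key=lambda r: r.metric, reverse=True)
    rows = [
        {
            "rank": i + 1,
            "agent": r.agent,
            "metric": r.metric,
            "delta": r.metric - baseline,  # delta tracking vs baseline
        }
        for i, r in enumerate(ranked)
    ]
    needs_judge = False
    if len(ranked) >= 2 and ranked[0].metric != 0:
        gap = abs(ranked[0].metric - ranked[1].metric) / abs(ranked[0].metric)
        needs_judge = gap <= tie_threshold
    return rows, needs_judge


if __name__ == "__main__":
    rows, needs_judge = rank_results(
        [Result("agent-a", 0.90), Result("agent-b", 0.85), Result("agent-c", 0.60)],
        baseline=0.70,
    )
    for row in rows:
        print(row)
    print("hand off to LLM judge:", needs_judge)
```

In this sketch the caller decides what to do when `needs_judge` is true, for example passing the tied agents' diffs to the judge model, so the metric pass stays deterministic and the judge runs only when the metric cannot separate the leaders.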