Rank agent outputs by metric or judge
eval · skill · setup · L2 · ★17,464
alirezarezvani/claude-skills
What it does
Evaluate and rank agent results by metric or LLM judge
Best for
Comparing multiple agent outputs for correctness, simplicity, and quality to pick a winner.
Inputs
- · Session ID, plus either an evaluation command (metric mode) or agent diffs (judge mode)
Outputs
- · Ranked results table with winner highlighted; next-step guidance
Requires
- · python scripts/result_ranker.py
- · git (for diffs)
- · LLM judge model
Preconditions
AgentHub session with multiple agent runs
Failure modes
- · Metric command fails
- · Agents scoring within 10% of each other count as a tie (resolved by the LLM judge)
- · Missing result post
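
The tie rule above can be sketched as follows. This is a minimal illustration, not the logic of `scripts/result_ranker.py`: the `AgentResult` type, the `rank_with_tiebreak` function, and the `judge` callback are hypothetical names, and two further assumptions are made, namely that a higher metric is better and that "within 10%" is measured relative to the best score.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class AgentResult:
    agent: str     # hypothetical agent label
    metric: float  # assumption: higher is better


def rank_with_tiebreak(
    results: Sequence[AgentResult],
    judge: Callable[[Sequence[AgentResult]], str],
    tie_margin: float = 0.10,
) -> tuple[list[AgentResult], str]:
    """Rank by metric; call the judge only when the leaders are tied."""
    if not results:
        raise ValueError("no agent results to rank")

    ranked = sorted(results, key=lambda r: r.metric, reverse=True)
    best = ranked[0]

    # Assumption: an agent is "tied" when its metric lies within
    # tie_margin (relative) of the best metric.
    tied = [
        r for r in ranked
        if abs(best.metric - r.metric) <= tie_margin * abs(best.metric)
    ]

    if len(tied) == 1:
        return ranked, best.agent  # clear winner on the metric alone

    # Stand-in for the LLM judge: it receives the tied agents and names one.
    return ranked, judge(tied)
```

Passing the judge in as a callback keeps the metric path testable without a model call; a real judge would compare the tied agents' diffs rather than their labels.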
Trust signals
- · Metric and LLM judge modes supported
- · Delta tracking (vs baseline)
- · Hybrid mode breaks ties
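
Delta tracking and the ranked table could look roughly like the sketch below. The `format_ranking` function, the column layout, and the `<- winner` marker are illustrative assumptions rather than the skill's actual output format, and a higher metric is again assumed to be better.

```python
from __future__ import annotations


def format_ranking(scores: dict[str, float], baseline: float) -> str:
    """Render a ranked table with each agent's delta against the baseline."""
    # Assumption: higher metric is better, so sort in descending order.
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

    lines = ["rank  agent      metric   delta"]
    for position, (agent, metric) in enumerate(ranked, start=1):
        delta = metric - baseline  # positive means better than baseline
        marker = " <- winner" if position == 1 else ""
        lines.append(
            f"{position:<5} {agent:<10} {metric:<8.2f} {delta:+.2f}{marker}"
        )
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical scores for two agent runs and a baseline run.
    print(format_ranking({"agent-1": 0.80, "agent-2": 0.91}, baseline=0.85))
```

Showing the signed delta next to the raw metric makes it visible when the top-ranked agent still falls below the baseline.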