cyberneticlibrary

Rank agent outputs by metric or judge

evalskillsetup L217,464
alirezarezvani/claude-skills
What it does

Evaluate and rank agent results by metric or LLM judge

Best for

Comparing multiple agent outputs for correctness, simplicity, and quality to pick a winner.

Inputs
  • Session ID, plus either an evaluation command (metric mode) or the agents' diffs (judge mode)
Outputs
  • Ranked results table with the winner highlighted, plus next-step guidance
Requires
  • python scripts/result_ranker.py
  • git (for diffs)
  • LLM judge model
Preconditions

AgentHub session with multiple agent runs

Failure modes
  • Metric command fails
  • Agents score within 10% of each other and are treated as tied (the LLM judge is used to break the tie)
  • Missing result post
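
The metric-command failure mode can be handled defensively. The sketch below is an illustration only, not the code in scripts/result_ranker.py: the function name run_metric, the timeout default, and the convention that the command prints its score as the last line of stdout are all assumptions made for this example.

```python
import subprocess
from typing import Optional


def run_metric(command: str, cwd: Optional[str] = None, timeout: int = 300) -> Optional[float]:
    """Run an evaluation command and return its score, or None on failure.

    Assumption (hypothetical convention): the command prints a single
    numeric score as the last line of its standard output.
    """
    try:
        proc = subprocess.run(
            command,
            shell=True,           # the evaluation command is given as one shell string
            cwd=cwd,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return None               # a hung command counts as a failed metric
    if proc.returncode != 0:
        return None               # non-zero exit: metric command failed
    lines = proc.stdout.strip().splitlines()
    if not lines:
        return None               # nothing printed, so there is no score to parse
    try:
        return float(lines[-1])
    except ValueError:
        return None               # last line was not a number
```

Returning None instead of raising lets a ranking step keep the failed agent out of the table and still rank the rest.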
Trust signals
  • Metric and LLM judge modes supported
  • Delta tracking (vs baseline)
  • Hybrid mode breaks ties
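
How metric ranking, delta tracking, and the 10% tie rule might fit together can be sketched as follows. This is a minimal illustration under stated assumptions, not the skill's actual implementation: AgentResult, rank_results, and TIE_THRESHOLD are hypothetical names, and the tie check here compares only the top two scores using their relative gap.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

TIE_THRESHOLD = 0.10  # relative gap treated as a tie (the 10% figure from the card)


@dataclass
class AgentResult:
    agent: str
    score: Optional[float]  # None when the metric failed or no result was posted


def rank_results(
    results: List[AgentResult],
    baseline: float,
    higher_is_better: bool = True,
) -> Tuple[List[Tuple[str, float, float]], bool, List[str]]:
    """Return (ranked rows, needs_judge flag, agents without a score).

    Each row is (agent, score, delta vs baseline), best agent first.
    """
    scored = [r for r in results if r.score is not None]
    missing = [r.agent for r in results if r.score is None]
    ranked = sorted(scored, key=lambda r: r.score, reverse=higher_is_better)
    rows = [(r.agent, r.score, r.score - baseline) for r in ranked]

    needs_judge = False
    if len(ranked) >= 2:
        top, runner_up = ranked[0].score, ranked[1].score
        scale = max(abs(top), abs(runner_up))
        # Equal scores (including 0 vs 0) are a tie; otherwise use the relative gap.
        needs_judge = scale == 0 or abs(top - runner_up) / scale <= TIE_THRESHOLD
    return rows, needs_judge, missing


if __name__ == "__main__":
    sample = [
        AgentResult("agent-a", 0.90),
        AgentResult("agent-b", 0.85),
        AgentResult("agent-c", None),
        AgentResult("agent-d", 0.60),
    ]
    rows, needs_judge, missing = rank_results(sample, baseline=0.50)
    for agent, score, delta in rows:
        print(f"{agent:10s} {score:6.2f} {delta:+6.2f}")
    print("hand off to LLM judge:", needs_judge, "| no score:", missing)
```

When needs_judge is true, a hybrid flow would pass the top candidates' diffs to the LLM judge instead of declaring a winner on the metric alone.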