cyberneticlibrary

Compare specialist agents head-to-head

«family»-agent-headtohead workflow setup L32
samjmarshall/rekurve
What it does

Runs every agent variant in a family on the same task set, scores each result, and ranks the variants.

Best for

Benchmarking agent variants when you need head-to-head evaluation on a fixed task set.

Inputs
  • family (agent variants)
  • task_set (benchmark tasks)
Outputs
  • scores per agent per task; ranked results
Requires
  • parallel execution
  • scoring agent
Preconditions

Agents must be deployable; task set must be fixed and reproducible.

Failure modes

An agent fails to complete a task; scoring is inconsistent; the task set is too small for differences between agents to be statistically significant.
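
Example sketch

The following is a minimal Python sketch of the comparison described above, not the workflow's actual implementation. It assumes each agent is a callable that takes a task string and returns an answer, and that the scoring agent is a callable returning a number for a (task, answer) pair. The names `run_one`, `head_to_head`, `Agent`, and `Scorer` are hypothetical and chosen for illustration. A run that raises an exception is recorded as `None` (the "agent fails to complete task" case) and, as an assumption of this sketch, counts as 0.0 when ranking.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical interfaces, assumed for this sketch only.
Agent = Callable[[str], str]          # task -> answer
Scorer = Callable[[str, str], float]  # (task, answer) -> score


def run_one(agent: Agent, task: str, scorer: Scorer) -> Optional[float]:
    """Run one agent on one task; None marks a failure to complete."""
    try:
        return scorer(task, agent(task))
    except Exception:
        return None


def head_to_head(
    family: Dict[str, Agent], task_set: List[str], scorer: Scorer
) -> Tuple[Dict[str, Dict[str, Optional[float]]], List[str]]:
    """Return (scores per agent per task, agent names ranked best first)."""
    if not task_set:
        raise ValueError("task_set must not be empty")

    # Every agent sees every task, so the comparison is like-for-like.
    pairs = [(name, task) for name in family for task in task_set]

    # Run all (agent, task) pairs in parallel threads.
    with ThreadPoolExecutor() as pool:
        results = list(
            pool.map(lambda p: run_one(family[p[0]], p[1], scorer), pairs)
        )

    scores: Dict[str, Dict[str, Optional[float]]] = {name: {} for name in family}
    for (name, task), score in zip(pairs, results):
        scores[name][task] = score

    def average(name: str) -> float:
        # Assumption: a failed run counts as 0.0 in the ranking.
        return mean(0.0 if s is None else s for s in scores[name].values())

    # sorted() is stable, so tied agents keep their order in `family`.
    ranked = sorted(family, key=average, reverse=True)
    return scores, ranked
```

Keeping failures as `None` in the per-task scores, rather than overwriting them with 0.0, lets a reader tell an agent that answered badly from one that never finished. Whether a failure should count as zero or be excluded from the average is a design choice this sketch does not settle.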