Select best output by behavior testing

execution-grounded-selection · skill · setup · L364
Tibsfox/gsd-skill-creator
What it does

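Picks the best candidate by execution fingerprinting (running every candidate on the same test inputs and grouping candidates by observed behavior) instead of voting on raw outputs.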
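
A minimal sketch of the idea, not the skill's actual implementation. It assumes each candidate is already loaded as a Python callable, that a fingerprint is the tuple of results across all test inputs, and that the largest behavioral cluster wins. The names `fingerprint` and `select_by_behavior` are hypothetical.

```python
from collections import defaultdict
from typing import Any, Callable, Sequence


def fingerprint(candidate: Callable[[Any], Any], test_inputs: Sequence[Any]) -> tuple:
    """Run one candidate on every test input and record what it did."""
    observed = []
    for x in test_inputs:
        try:
            observed.append(("ok", repr(candidate(x))))
        except Exception as exc:  # a crash is recorded, not propagated
            observed.append(("error", type(exc).__name__))
    return tuple(observed)


def select_by_behavior(candidates, test_inputs):
    """Return (index of the selected candidate, clusters keyed by fingerprint)."""
    clusters = defaultdict(list)
    for i, cand in enumerate(candidates):
        clusters[fingerprint(cand, test_inputs)].append(i)
    # Assumption: the largest cluster wins; ties go to the earliest candidate.
    winner = max(clusters.values(), key=lambda members: (len(members), -members[0]))
    return winner[0], dict(clusters)


# Two candidates are written differently but behave identically; one is wrong.
candidates = [lambda x: abs(x), lambda x: x if x >= 0 else -x, lambda x: x]
chosen, clusters = select_by_behavior(candidates, test_inputs=[-2, 0, 3])
print(chosen, sorted(clusters.values()))  # prints: 0 [[0, 1], [2]]
```

Voting on the candidate text would see three different candidates here, while the behavioral fingerprints put the two correct ones in the same cluster.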

Best for

Code selection, where semantic voting's reported improvement of 19-52 percentage points over output voting matters

Inputs
  • N candidate outputs (code, config, or plans)
  • diverse test inputs
Outputs
  • selected candidate
  • behavioral fingerprint cluster
Requires
  • test harness
  • execution runtime
Preconditions

Multiple candidates sampled at temperature > 0; execution is feasible and side-effect-free
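
The diversity precondition can be checked cheaply before any execution. The sketch below assumes candidates arrive as source strings and treats two candidates as identical when they match after whitespace normalization. This is a rough heuristic: it ignores indentation, so it suits deduplication only. The helper names and the `minimum` threshold are hypothetical.

```python
def distinct_candidates(sources: list[str]) -> int:
    """Count candidates that still differ after collapsing whitespace."""
    return len({" ".join(src.split()) for src in sources})


def has_diversity(sources: list[str], minimum: int = 2) -> bool:
    """True when there are enough distinct candidates to make selection meaningful."""
    return distinct_candidates(sources) >= minimum


samples = ["a  = 1", "a = 1", "a = 2"]
print(distinct_candidates(samples), has_diversity(samples))  # prints: 2 True
```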

Failure modes

Timeouts on expensive executions; deterministic generation (no candidate diversity); crashes treated as a fingerprint of their own, which conflates unrelated failures into one cluster
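
One way to guard against the timeout and crash failure modes, sketched under assumptions: candidates are Python source strings that read stdin and write stdout, each run gets a fresh interpreter with a time limit, and any fingerprint containing a failed run is excluded from voting. The names `run_candidate` and `is_votable` and the 2-second default are hypothetical. A subprocess isolates the harness from crashes but is not a sandbox, so it does not make a candidate side-effect-free.

```python
import subprocess
import sys


def run_candidate(source: str, stdin_text: str, timeout_s: float = 2.0) -> tuple[str, str]:
    """Run candidate source in a fresh interpreter; return (status, detail)."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", source],
            input=stdin_text,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return ("timeout", "")
    if proc.returncode != 0:
        # Keep the last stderr line so different crashes stay distinguishable.
        lines = proc.stderr.strip().splitlines()
        return ("crash", lines[-1] if lines else "")
    return ("ok", proc.stdout)


def is_votable(fp: tuple) -> bool:
    """Assumption: only fingerprints made entirely of successful runs may win."""
    return all(status == "ok" for status, _ in fp)
```

A fingerprint is then the tuple of `run_candidate` results over all test inputs. Filtering with `is_votable` before picking the largest cluster keeps a group of crashing candidates from winning the vote.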

Trust signals
  • arXiv:2605.08680v1 (Semantic Voting benchmark)
  • sketch-generated test inputs beat fuzzed inputs by 11 pp