Select best output by behavior testing
execution-grounded-selection · skill · setup L3 · ★64
Tibsfox/gsd-skill-creator
What it does
Pick best candidate via execution fingerprinting instead of output voting
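A minimal sketch of the idea, not the skill's actual implementation. It assumes candidates are Python callables, that a fingerprint is the tuple of per-input outcomes, and that the largest fingerprint cluster wins. The names `fingerprint` and `select_by_execution` are hypothetical.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List, Sequence, Tuple

Fingerprint = Tuple[Tuple[str, str], ...]


def fingerprint(candidate: Callable[[Any], Any], test_inputs: Sequence[Any]) -> Fingerprint:
    """Run one candidate on every test input and record what it did."""
    outcomes = []
    for x in test_inputs:
        try:
            outcomes.append(("ok", repr(candidate(x))))
        except Exception as exc:  # a crash is recorded as behavior, not hidden
            outcomes.append(("error", type(exc).__name__))
    return tuple(outcomes)


def select_by_execution(
    candidates: Sequence[Callable[[Any], Any]], test_inputs: Sequence[Any]
) -> Tuple[int, Dict[Fingerprint, List[int]]]:
    """Group candidates by identical fingerprint and pick from the largest group.

    Assumption: majority over behavioral clusters is the selection rule.
    """
    clusters: Dict[Fingerprint, List[int]] = defaultdict(list)
    for idx, cand in enumerate(candidates):
        clusters[fingerprint(cand, test_inputs)].append(idx)
    best = max(clusters, key=lambda fp: len(clusters[fp]))
    return clusters[best][0], dict(clusters)


# Two candidates differ textually but behave identically; the third disagrees.
candidates = [lambda x: x * 2, lambda x: x + x, lambda x: x ** 2]
selected, clusters = select_by_execution(candidates, test_inputs=[0, 1, 2, 3])
```

Here `x * 2` and `x + x` would split the vote under text-level output voting, but they share one fingerprint, so their cluster outvotes `x ** 2`.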
Best for
Selecting among code candidates, where semantic voting's reported 19-52 pp improvement over output voting matters
Inputs
- N candidate outputs (code/config/plans)
- diverse test inputs
Outputs
- selected candidate
- behavioral fingerprint cluster
Requires
- test harness
- execution runtime
Preconditions
Multiple candidates sampled at temperature > 0; execution is feasible and side-effect-free
Failure modes
Expensive execution or timeouts; deterministic generation (no candidate diversity); crashes conflated with distinct behavioral fingerprints
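One possible guard against the crash-related failure mode, sketched under assumptions: each fingerprint is a tuple of `(status, payload)` pairs with status `"ok"`, `"error"`, or `"timeout"`; all failures are collapsed into one outcome; and candidates that never succeed are excluded from the vote. The function names are hypothetical, and this is only one reading of how crashes could be handled.

```python
from collections import Counter
from typing import List, Optional, Sequence, Tuple

Outcome = Tuple[str, str]  # (status, payload); statuses are an assumption


def normalize(fp: Sequence[Outcome]) -> Tuple[Outcome, ...]:
    """Collapse every non-ok outcome so exception types do not split clusters."""
    return tuple(o if o[0] == "ok" else ("fail", "") for o in fp)


def vote_excluding_crashers(fingerprints: Sequence[Sequence[Outcome]]) -> Optional[int]:
    """Return the index of a candidate from the largest cluster, or None.

    Candidates that failed on every input cannot win, however many there are.
    """
    normalized = [normalize(fp) for fp in fingerprints]
    eligible: List[int] = [
        i for i, fp in enumerate(normalized) if any(status == "ok" for status, _ in fp)
    ]
    if not eligible:
        return None  # nothing ran successfully; the caller should regenerate
    counts = Counter(normalized[i] for i in eligible)
    winner_fp, _ = counts.most_common(1)[0]
    return next(i for i in eligible if normalized[i] == winner_fp)


# Three candidates fail in different ways; only the last one runs cleanly.
fps = [
    (("error", "ValueError"), ("error", "ValueError")),
    (("error", "TypeError"), ("timeout", "")),
    (("error", "KeyError"), ("error", "KeyError")),
    (("ok", "1"), ("ok", "2")),
]
winner = vote_excluding_crashers(fps)
all_failed = vote_excluding_crashers(fps[:3])
```

Without the exclusion step, the three failing candidates would form the largest cluster once their errors are normalized, and a crashing candidate would be selected.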
Trust signals
- arXiv 2605.08680v1 (Semantic Voting benchmark)
- sketch-generated test inputs beat fuzzed inputs by 11 pp