Audit skill impact with paired testing

skill-counterfactual-audit · skill · setup · L364
Tibsfox/gsd-skill-creator
What it does

Audits a skill's behavior by running paired probes: the same task is run once with the skill loaded and once without it.
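
The paired-probe idea can be sketched as below. This is a minimal illustration, not the skill's actual implementation: `PairedResult`, `run_paired_probe`, and `fake_runner` are hypothetical names, and the runner is assumed to return one behavior label per phase.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Assumption: a run is summarized as phase name -> observed behavior label.
PhaseTrace = Dict[str, str]


@dataclass
class PairedResult:
    task: str
    with_skill: PhaseTrace
    without_skill: PhaseTrace

    def changed_phases(self) -> List[str]:
        """Phases whose behavior label differs between the two runs."""
        phases = sorted(set(self.with_skill) | set(self.without_skill))
        return [
            p for p in phases
            if self.with_skill.get(p) != self.without_skill.get(p)
        ]


def run_paired_probe(
    tasks: List[str],
    runner: Callable[[str, bool], PhaseTrace],
) -> List[PairedResult]:
    """Run every probe task twice: skill loaded, then skill not loaded."""
    return [PairedResult(t, runner(t, True), runner(t, False)) for t in tasks]


def fake_runner(task: str, skill_loaded: bool) -> PhaseTrace:
    """Hypothetical stand-in for a real agent run, for illustration only."""
    trace = {"plan": "short", "execute": "on-task"}
    if skill_loaded and task == "refactor":
        trace["plan"] = "template-copy"  # the skill changed planning behavior
    return trace


results = run_paired_probe(["refactor", "bugfix"], fake_runner)
for r in results:
    print(r.task, r.changed_phases())
```

Both runs can pass the task while still differing in a phase, which is the kind of change a pass-rate check alone would not surface.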

Best for

Detect skill-induced behavior changes that stay below the pass-rate threshold: surface anchoring, template copying, excess planning, and off-task drift.

Inputs
  • probe_task_bank
  • skill_name
  • task_descriptions with phases
Outputs
  • SIP report markdown
  • phase_comparison table
  • retire/refine/keep recommendation
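
As a rough sketch of how the phase_comparison table and the recommendation might be produced: the column layout, the function names, and especially the thresholds in `recommend` are assumptions made for illustration, not values taken from the skill or the cited paper.

```python
from typing import List, Tuple


def phase_comparison_table(rows: List[Tuple[str, int, int]]) -> str:
    """Render (phase, changed_count, total_probes) rows as a markdown table."""
    lines = [
        "| Phase | Changed | Probes | Change rate |",
        "|---|---|---|---|",
    ]
    for phase, changed, total in rows:
        rate = changed / total if total else 0.0
        lines.append(f"| {phase} | {changed} | {total} | {rate:.0%} |")
    return "\n".join(lines)


def recommend(change_rate: float, pass_rate_delta: float) -> str:
    """Map audit numbers to retire/refine/keep.

    The cutoffs below are hypothetical placeholders; tune them per project.
    """
    if pass_rate_delta < 0:
        return "retire"  # skill makes outcomes worse
    if change_rate > 0.5 and pass_rate_delta < 0.01:
        return "refine"  # behavior shifts a lot for little measurable gain
    return "keep"


table = phase_comparison_table([("plan", 2, 4), ("execute", 0, 4)])
print(table)
print(recommend(change_rate=0.6, pass_rate_delta=0.003))
```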
Preconditions
  • Skill has been active ≥3 sessions
  • Probe-task bank curated (3-5 tasks)
  • Phase decomposition rules defined
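
The preconditions above could be checked before an audit starts. A minimal sketch, assuming the caller already knows the session count and has the probe bank in hand; the function name and parameters are hypothetical.

```python
from typing import List, Sequence


def unmet_preconditions(
    active_sessions: int,
    probe_tasks: Sequence[str],
    phase_rules_defined: bool,
) -> List[str]:
    """Return a message for each precondition that does not hold."""
    problems: List[str] = []
    if active_sessions < 3:
        problems.append("skill active for fewer than 3 sessions")
    if not 3 <= len(probe_tasks) <= 5:
        problems.append("probe-task bank must hold 3-5 tasks")
    if not phase_rules_defined:
        problems.append("phase decomposition rules not defined")
    return problems


# An empty list means the audit can proceed.
print(unmet_preconditions(4, ["t1", "t2", "t3"], True))
print(unmet_preconditions(1, ["t1"], False))
```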
Failure modes
  • Workflow guides (loop, schedule) trivially show task-recovery
  • No baseline exists if the skill was just created
Trust signals
  • Based on arXiv 2605.11946v1 (CTA)
  • Detects 522 behavioral changes while the pass rate moves only +0.3%