cyberneticlibrary

Design statistically valid experiments

experiment-designer · skill · setup L2327
mohitagw15856/pm-claude-skills
What it does

Designs rigorous experiments from hypotheses and interprets results for both statistical and practical significance.

Best for

Any decision backed by an A/B test. Forces pre-commitment to success criteria, prevents peeking and moving the goalposts, and separates statistical from practical significance.

Inputs
  • Hypothesis (change, metric, expected lift, reason)
  • Baseline metric value and current sample size
  • Minimum detectable effect (MDE) and acceptable sample size
Outputs
  • Experiment design (sample size, run duration, pre-defined success criteria)
  • Results interpretation (statistical and practical significance; recommendation: ship / iterate / kill / follow-up)
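A minimal sketch of how the sample size and run duration in an experiment design could be derived from the baseline and MDE, assuming a conversion-rate metric, a two-sided two-proportion z-test, and an even split between arms. The function names, the 80% power default, and the daily-traffic parameter are illustrative assumptions, not part of the skill itself.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline, relative_mde, alpha=0.05, power=0.80):
    """Approximate users needed per arm to detect a relative lift.

    Assumes a binary (conversion) metric and a two-sided z-test.
    baseline     -- control conversion rate, e.g. 0.10
    relative_mde -- minimum detectable effect as a relative lift, e.g. 0.10 for +10%
    """
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)


def run_duration_days(n_per_arm, daily_users, arms=2):
    """Days needed to fill every arm, assuming steady eligible traffic."""
    return math.ceil(n_per_arm * arms / daily_users)


if __name__ == "__main__":
    n = sample_size_per_arm(baseline=0.10, relative_mde=0.10)
    print("users per arm:", n)
    print("days at 2,000 eligible users/day:", run_duration_days(n, 2000))
```

Halving the MDE roughly quadruples the required sample, which is why the MDE and the acceptable sample size are requested together as inputs.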
Requires
  • Optional: A/B testing tool (Amplitude, LaunchDarkly, VWO, Optimizely)
Preconditions
  • Hypothesis stated as 'if we [change], we expect [metric] to [move by X%]'
  • Baseline metric and sample size available
  • Control and variant clearly defined
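One way to make the hypothesis precondition checkable is to record it as an immutable structure before the test starts. The sketch below is an assumption about how that might look; the class name `Hypothesis` and its fields are hypothetical and simply mirror the hypothesis inputs (change, metric, expected lift, reason).

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: fields cannot be changed after creation
class Hypothesis:
    change: str                    # what will be changed
    metric: str                    # the metric expected to move
    expected_relative_lift: float  # 0.05 means +5%, -0.05 means -5%
    reason: str                    # why the change should move the metric

    def statement(self) -> str:
        """Render the hypothesis in the 'if we ..., we expect ...' form."""
        direction = "increase" if self.expected_relative_lift >= 0 else "decrease"
        pct = abs(self.expected_relative_lift) * 100
        return (
            f"If we {self.change}, we expect {self.metric} "
            f"to {direction} by {pct:g}%, because {self.reason}."
        )


if __name__ == "__main__":
    h = Hypothesis(
        change="shorten the signup form",
        metric="signup conversion",
        expected_relative_lift=0.05,
        reason="fewer fields reduce friction",
    )
    print(h.statement())
```

Freezing the record means the expected lift cannot be edited in place once results arrive; a revised hypothesis has to be created as a new, visibly separate object.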
Failure modes
  • Test stopped early (peeking problem: repeated looks at interim results inflate the false-positive rate)
  • Success criteria moved after the test runs (HARKing: hypothesizing after the results are known)
  • Practical and statistical significance conflated (for example, a 2% lift can be statistically significant yet too small to act on)
  • Sample ratio mismatch (assignment is broken, so control and variant sample sizes deviate from the intended split)
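The sample ratio mismatch failure mode can be screened for with a chi-square goodness-of-fit test on the observed arm sizes. This is a hedged sketch, not the skill's own check: the function names are hypothetical, and the 0.001 alert threshold is a commonly used convention that is assumed here rather than taken from the source.

```python
import math


def srm_p_value(n_control, n_variant, expected_control_share=0.5):
    """P-value of a chi-square test (1 degree of freedom) that the
    observed split matches the intended allocation."""
    total = n_control + n_variant
    exp_control = total * expected_control_share
    exp_variant = total - exp_control
    chi2 = (
        (n_control - exp_control) ** 2 / exp_control
        + (n_variant - exp_variant) ** 2 / exp_variant
    )
    # For 1 degree of freedom, the chi-square survival function
    # reduces to erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


def has_srm(n_control, n_variant, expected_control_share=0.5, threshold=0.001):
    """Flag a likely assignment problem (threshold is an assumed convention)."""
    return srm_p_value(n_control, n_variant, expected_control_share) < threshold


if __name__ == "__main__":
    print(has_srm(5000, 5050))  # small imbalance, consistent with chance
    print(has_srm(5000, 5500))  # large imbalance, worth investigating
```

A flagged mismatch suggests the assignment mechanism should be investigated before any metric comparison is trusted.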
Trust signals
  • Pre-defined success threshold set before the test runs (no moving goalposts)
  • Design risk section flags: novelty effects, seasonal confounds, multiple testing, network effects, sample ratio mismatch
  • Interpretation separates statistical significance (p < 0.05) from practical significance (is the lift worth shipping)
  • Peeking check explicit: 'confirm test was not stopped early'
  • Recommendation logic: Ship / Iterate / Kill / Follow-up with rationale
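To illustrate how statistical and practical significance can be kept separate when interpreting results, here is a sketch using a pooled two-proportion z-test and a pre-registered minimum absolute lift. The mapping from outcomes to Ship / Iterate / Kill / Follow-up is an assumption made for illustration; the skill's actual decision rules are not specified in this listing, and the function and parameter names are hypothetical.

```python
import math
from statistics import NormalDist


def interpret(conv_c, n_c, conv_v, n_v, min_practical_lift, alpha=0.05):
    """Compare variant vs. control conversion counts.

    min_practical_lift -- smallest absolute lift worth shipping
                          (0.01 means one percentage point), fixed before the test.
    """
    p_c, p_v = conv_c / n_c, conv_v / n_v
    diff = p_v - p_c
    p_pool = (conv_c + conv_v) / (n_c + n_v)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_v))
    if se == 0:  # no variation at all: nothing can be concluded
        return {"diff": diff, "p_value": 1.0, "recommendation": "follow-up"}
    z = diff / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided
    significant = p_value < alpha

    # Assumed decision mapping (illustrative only):
    if significant and diff >= min_practical_lift:
        rec = "ship"       # real and large enough to matter
    elif significant and diff > 0:
        rec = "iterate"    # real but below the practical bar
    elif significant and diff < 0:
        rec = "kill"       # variant is measurably worse
    else:
        rec = "follow-up"  # inconclusive at this sample size
    return {"diff": diff, "p_value": p_value, "recommendation": rec}


if __name__ == "__main__":
    print(interpret(1000, 10_000, 1200, 10_000, min_practical_lift=0.01))
    print(interpret(10_000, 100_000, 10_300, 100_000, min_practical_lift=0.01))
```

The second call shows the distinction in practice: with a large sample, a 0.3-point lift clears p < 0.05 but falls short of the one-point practical threshold, so it is not recommended for shipping as is.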