Design statistically valid experiments

experiment-designer · skill · setup L2 · ★327

What it does

Design rigorous experiments from hypotheses and interpret results in terms of both statistical and practical significance.

Best for

Any decision backed by an A/B test. Forces pre-commitment to success criteria, prevents peeking and goalpost-moving, and separates statistical from practical significance.

Inputs

· Hypothesis (change, metric, expected lift, reason)
· Baseline metric value and current sample size
· Minimum detectable effect (MDE) and acceptable sample size

Outputs

· Experiment design (sample size, run duration, pre-defined success criteria)
· Results interpretation (statistical + practical significance, recommendation: ship/iterate/kill/follow-up)
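
As a rough illustration of how the design outputs could follow from the inputs, here is a minimal sketch of a sample-size and run-duration estimate. It assumes a conversion-rate metric, a two-sided two-proportion z-test, and an even traffic split. The function names, the `alpha` and `power` defaults, and the `daily_users` parameter are illustrative assumptions, not part of the skill itself.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline, relative_mde, alpha=0.05, power=0.80):
    """Approximate users needed per arm for a two-sided two-proportion z-test.

    baseline      -- control conversion rate, e.g. 0.10
    relative_mde  -- minimum detectable effect as a relative lift, e.g. 0.10 for +10%
    alpha, power  -- assumed defaults; pick these before the test starts
    """
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return math.ceil(n)


def run_days(n_per_arm, daily_users, arms=2):
    """Days needed to reach the target sample, given eligible users per day."""
    return math.ceil(n_per_arm * arms / daily_users)


if __name__ == "__main__":
    n = sample_size_per_arm(baseline=0.10, relative_mde=0.10)
    print(n, "users per arm,", run_days(n, daily_users=2000), "days")
```

Halving the MDE roughly quadruples the required sample, which is why the MDE and the acceptable sample size are best agreed together before the test starts. Rounding the duration up to whole weeks is one way to reduce the seasonal confounds flagged under Trust signals.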

Requires

· Optional: A/B testing tool (Amplitude, LaunchDarkly, VWO, Optimizely)

Preconditions

· Hypothesis stated as 'if we [change], we expect [metric] to [move by X%]'
· Baseline metric and sample size available
· Control and variant clearly defined

Failure modes

· Test stopped early (the peeking problem: repeated looks at interim results inflate the false-positive rate)
· Success criteria moved after the test runs (a form of HARKing, hypothesizing after the results are known)
· Practical and statistical significance conflated (e.g. a 2% lift can be statistically significant yet too small to act on)
· Sample ratio mismatch (assignment broken, control and variant samples imbalanced)
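
The peeking problem can be made concrete with a small simulation. The sketch below runs A/A tests (both arms share the same true conversion rate, so every "significant" result is a false positive) and compares two policies: declaring a winner at any interim look versus only at the planned final look. All names and parameter values here are illustrative assumptions.

```python
import math
import random
from statistics import NormalDist


def two_sided_p(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation observed, so no evidence of a difference
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate_false_positives(n_sims=500, looks=10, users_per_look=200,
                             rate=0.10, alpha=0.05, seed=0):
    """Return (false-positive rate with peeking, false-positive rate at final look)."""
    rng = random.Random(seed)
    peeking_hits = 0
    final_hits = 0
    for _sim in range(n_sims):
        conv_a = conv_b = n = 0
        flagged_at_any_look = False
        p = 1.0
        for _look in range(looks):
            # Both arms draw from the same true rate: there is no real effect.
            conv_a += sum(rng.random() < rate for _ in range(users_per_look))
            conv_b += sum(rng.random() < rate for _ in range(users_per_look))
            n += users_per_look
            p = two_sided_p(conv_a, n, conv_b, n)
            if p < alpha:
                flagged_at_any_look = True
        peeking_hits += flagged_at_any_look  # would have stopped at some look
        final_hits += p < alpha              # significant at the planned end only
    return peeking_hits / n_sims, final_hits / n_sims


if __name__ == "__main__":
    peeking, final = simulate_false_positives()
    print(f"peeking: {peeking:.3f}  final look only: {final:.3f}")
```

The final-look rate should sit near the nominal alpha, while the stop-at-any-look rate comes out noticeably higher. If interim looks are genuinely needed, a sequential testing method designed for them is the usual remedy rather than reusing a fixed-horizon threshold.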

Trust signals

· Pre-defined success threshold before test runs (no moving goalposts)
· Design risk section flags: novelty effects, seasonal confounds, multiple testing, network effects, sample ratio mismatch
· Interpretation separates statistical significance (p < 0.05) from practical significance (is the lift worth shipping)
· Peeking check explicit: 'confirm test was not stopped early'
· Recommendation logic: Ship / Iterate / Kill / Follow-up with rationale
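
One way the trust signals above might combine into a single interpretation step is sketched below. It checks validity first (peeking, sample ratio mismatch), then statistical significance, then the pre-defined practical threshold. The mapping from outcomes to Ship / Iterate / Kill / Follow-up, the `srm_alpha` cutoff, and every function and parameter name are assumptions made for illustration; the skill's actual decision rules are not specified here.

```python
import math
from statistics import NormalDist


def srm_p_value(n_control, n_variant, expected_share_control=0.5):
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for the traffic split."""
    total = n_control + n_variant
    exp_c = total * expected_share_control
    exp_v = total - exp_c
    chi2 = (n_control - exp_c) ** 2 / exp_c + (n_variant - exp_v) ** 2 / exp_v
    # For 1 degree of freedom, the chi-square survival function is erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


def interpret(conv_c, n_c, conv_v, n_v, min_relative_lift,
              alpha=0.05, stopped_early=False, srm_alpha=0.001):
    """Return (recommendation, rationale). Hypothetical decision mapping."""
    # Validity gates come first: a broken test cannot support any decision.
    if stopped_early:
        return "Follow-up", "test was stopped early; rerun to the planned sample size"
    if srm_p_value(n_c, n_v) < srm_alpha:
        return "Follow-up", "sample ratio mismatch; fix assignment and rerun"

    rate_c, rate_v = conv_c / n_c, conv_v / n_v
    pooled = (conv_c + conv_v) / (n_c + n_v)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_v))
    if se == 0 or rate_c == 0:
        return "Follow-up", "not enough conversions to estimate a lift"

    z = (rate_v - rate_c) / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))  # statistical significance
    lift = (rate_v - rate_c) / rate_c       # practical significance is judged on this

    if p >= alpha:
        return "Follow-up", f"inconclusive (p={p:.3f}) at the planned sample size"
    if lift < 0:
        return "Kill", f"significant negative lift of {lift:.1%}"
    if lift >= min_relative_lift:
        return "Ship", f"lift of {lift:.1%} meets the pre-defined threshold"
    return "Iterate", f"lift of {lift:.1%} is real but below the pre-defined threshold"


if __name__ == "__main__":
    print(interpret(1000, 10_000, 1200, 10_000, min_relative_lift=0.05))
```

Passing `min_relative_lift` as an explicit argument keeps the success threshold visible as something fixed before the data arrives, which is the point of the pre-commitment signal.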

Capability

ab-experimentation → compare alternatives