Validate A/B test results statistically

ab-test-analysis · skill · setup L2 · ★ 11,239

What it does

Evaluate A/B test results with statistical rigor

Best for

Data-driven product decisions where A/B test results must be checked for statistical rigor and against guardrail constraints before shipping.

Inputs

Outputs

· test validation summary (sample size adequacy, SRM check, novelty/primacy wash-out)
· statistical analysis: conversion rate, relative lift, p-value, 95% CI, significance flag
· guardrail metrics check (revenue, engagement, load time, etc.)
· interpretation table: Outcome → Recommendation (Ship / Extend / Stop / Investigate)
· analysis markdown: hypothesis, duration, sample, metrics table, recommendation, reasoning, next steps
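
As an illustration of the statistical analysis output listed above, here is a minimal sketch of a two-tailed, two-proportion z-test that reports conversion rates, relative lift, p-value, a 95% CI, and a significance flag. The function name `analyze_ab`, its parameters, and the returned keys are assumptions made for this example, not the skill's actual interface. The sketch uses a pooled standard error for the test statistic and an unpooled one for the interval, which is a common convention and not something the listing specifies.

```python
from math import sqrt
from statistics import NormalDist


def analyze_ab(control_conv, control_n, variant_conv, variant_n, alpha=0.05):
    """Two-tailed two-proportion z-test (hypothetical helper, stdlib only)."""
    if control_n <= 0 or variant_n <= 0 or control_conv <= 0:
        raise ValueError("need positive sample sizes and a non-zero control rate")

    p_c = control_conv / control_n   # control conversion rate
    p_v = variant_conv / variant_n   # variant conversion rate
    diff = p_v - p_c                 # absolute difference in rates

    # Relative lift: (variant - control) / control x 100
    lift_pct = diff / p_c * 100

    # Pooled standard error under H0 (no difference) for the test statistic
    p_pool = (control_conv + variant_conv) / (control_n + variant_n)
    se_pool = sqrt(p_pool * (1 - p_pool) * (1 / control_n + 1 / variant_n))
    z = diff / se_pool
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-tailed

    # Unpooled standard error for the confidence interval on the difference
    se = sqrt(p_c * (1 - p_c) / control_n + p_v * (1 - p_v) / variant_n)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (diff - z_crit * se, diff + z_crit * se)

    return {
        "control_rate": p_c,
        "variant_rate": p_v,
        "relative_lift_pct": lift_pct,
        "z": z,
        "p_value": p_value,
        "ci_abs_diff": ci,
        "significant": p_value < alpha,
    }


if __name__ == "__main__":
    # Made-up counts, purely for illustration
    print(analyze_ab(1000, 10_000, 1100, 10_000))
```

The interval here is on the absolute difference in conversion rate. An interval that excludes zero agrees with the significance flag, but says nothing on its own about whether the lift is large enough to matter in practice.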

Requires

Preconditions

· Raw data (CSV/Excel) OR data already normalized (conversion rates + sample sizes)
· Hypothesis clearly stated
· Primary metric identified
· Test duration specified (at least 1–2 full business cycles to avoid novelty effects)

Failure modes

· Underpowered test (sample too small to reach 80% power): recommendation should be Extend, not Ship
· Sample Ratio Mismatch not checked (biased randomization goes undetected)
· Novelty effects not washed out (test too short, recommendation unsound)
· Guardrail metrics ignored (variant shipped on a primary-metric lift despite degraded guardrails)
· Practical significance conflated with statistical significance (p < 0.05 is not enough)
· Confidence intervals not computed (no uncertainty bound)
· p-value not calculated (can't assess significance)
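
The failure modes above suggest an order in which checks should gate a recommendation. The sketch below is one possible encoding of that order into the four outcomes (Ship / Extend / Stop / Investigate). The listing does not publish the skill's decision table, so the mapping, the function name `recommend`, and every parameter are assumptions for illustration only.

```python
def recommend(significant, lift_pct, min_practical_lift_pct,
              adequately_powered, srm_ok, guardrails_ok):
    """Map check results to one of four outcomes (hypothetical mapping)."""
    # A sample ratio mismatch makes every downstream number suspect.
    if not srm_ok:
        return "Investigate"
    # An underpowered test should be extended, never shipped.
    if not adequately_powered:
        return "Extend"
    # Statistical significance alone is not enough: the lift must also
    # clear a practical threshold chosen by the team.
    if significant and lift_pct >= min_practical_lift_pct:
        # A primary-metric win with degraded guardrails needs a closer look.
        return "Ship" if guardrails_ok else "Investigate"
    # Adequately powered, but no practically meaningful positive effect.
    return "Stop"


if __name__ == "__main__":
    print(recommend(significant=True, lift_pct=10.0, min_practical_lift_pct=2.0,
                    adequately_powered=True, srm_ok=True, guardrails_ok=True))
```

Placing the SRM and power checks first means a flattering p-value can never override a broken or undersized experiment.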

Trust signals

· Sample size power formula: n = (Z_{α/2}² × 2 × p × (1 − p)) / MDE²
· SRM (Sample Ratio Mismatch) check named explicitly
· Guardrail metrics section (three examples: revenue, engagement, page load time)
· Four-outcome recommendation logic (Ship / Extend / Stop / Investigate) with decision table
· 95% confidence interval calculation (not just p-value)
· Relative lift formula: (variant - control) / control × 100
· Two-tailed z-test or chi-squared test specified
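
To make the sample size formula and the SRM check concrete, here is a small sketch. It assumes the MDE is an absolute difference in conversion rate and that n is the size of each variant. The formula as listed contains no power term, so the optional `power` argument, which adds Z_β as in the standard two-proportion formula, is an addition for this example. The helper names and the 0.001 SRM threshold are likewise assumptions, not taken from the skill.

```python
from math import ceil, erfc, sqrt
from statistics import NormalDist


def sample_size_per_variant(baseline_rate, mde_abs, alpha=0.05, power=None):
    """n = ((Z_{alpha/2} + Z_beta)^2 * 2 * p * (1 - p)) / MDE^2.

    With power=None, Z_beta is 0 and this reduces to the formula as listed.
    Passing power=0.8 adds the usual power term (an assumption of this sketch).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power) if power is not None else 0.0
    n = ((z_alpha + z_beta) ** 2 * 2 * baseline_rate * (1 - baseline_rate)) / mde_abs ** 2
    return ceil(n)


def srm_check(control_n, variant_n, expected_control_share=0.5, threshold=0.001):
    """Chi-squared goodness-of-fit test (1 degree of freedom) on the split."""
    total = control_n + variant_n
    exp_c = total * expected_control_share
    exp_v = total - exp_c
    chi2 = (control_n - exp_c) ** 2 / exp_c + (variant_n - exp_v) ** 2 / exp_v
    # Survival function of chi-squared with 1 df: erfc(sqrt(x / 2))
    p_value = erfc(sqrt(chi2 / 2))
    return {"chi2": chi2, "p_value": p_value, "mismatch": p_value < threshold}


if __name__ == "__main__":
    # Illustrative inputs: 10% baseline rate, 1 percentage point MDE
    print(sample_size_per_variant(0.10, 0.01))             # formula as listed
    print(sample_size_per_variant(0.10, 0.01, power=0.8))  # with power term
    print(srm_check(5000, 5400))                           # intended 50/50 split
```

Including the power term roughly doubles the required sample in this example, which is why a test sized without it can look adequate while still being underpowered.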