Validate A/B test results statistically
analyze-test · command · setup L2 · ★ 11,239
phuryn/pm-skills
Causal-lift measurements
ab-experimentation: 0pp vs no-skill baseline (with-skill 100% · baseline 100%)
Measured by running the task with and without this artifact, K=5, graded by deterministic checks — no LLM judging.
What it does
Analyze A/B test results for significance and lift
Best for
Validating A/B test results with statistical rigor before deciding to ship a variant.
Inputs
- · Test name and context
- · Variant data (control + treatment groups, metrics)
- · Sample sizes, conversion rates, or aggregated events
Outputs
- · Statistical significance test (chi-square, t-test)
- · Lift calculation (relative % difference, treatment vs. control)
- · Confidence intervals
- · Actionable recommendations
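The outputs listed above can be sketched for a conversion metric as follows. This is a minimal illustration, not the skill's actual implementation: the function name `analyze_conversion_test` and the counts are hypothetical, and it uses only the Python standard library. It runs a two-sided two-proportion z-test, which for a 2x2 table matches the Pearson chi-square test without continuity correction, and reports relative lift plus a Wald confidence interval on the absolute difference.

```python
from math import sqrt
from statistics import NormalDist


def analyze_conversion_test(control_conv, control_n, treat_conv, treat_n, alpha=0.05):
    """Two-sided two-proportion z-test with relative lift and a Wald CI.

    Assumes randomly assigned arms and a non-zero control conversion rate.
    """
    p_c = control_conv / control_n
    p_t = treat_conv / treat_n
    diff = p_t - p_c

    # Pooled rate under the null hypothesis of no difference between arms.
    p_pool = (control_conv + treat_conv) / (control_n + treat_n)
    se_pool = sqrt(p_pool * (1 - p_pool) * (1 / control_n + 1 / treat_n))
    z = diff / se_pool
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # Unpooled standard error for the interval on the absolute difference.
    se_diff = sqrt(p_c * (1 - p_c) / control_n + p_t * (1 - p_t) / treat_n)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)

    return {
        "control_rate": p_c,
        "treatment_rate": p_t,
        "relative_lift_pct": 100 * diff / p_c,
        "z": z,
        "p_value": p_value,
        "ci_abs_diff": (diff - z_crit * se_diff, diff + z_crit * se_diff),
        "significant": p_value < alpha,
    }


# Hypothetical counts: 200/2000 conversions in control, 250/2000 in treatment.
result = analyze_conversion_test(200, 2000, 250, 2000)
for key, value in result.items():
    print(key, value)
```

A confidence interval that excludes zero agrees with the significance flag here; whether the lift is large enough to ship is a separate, practical judgment.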
Requires
- · Python (scipy, pandas) for statistical testing
Preconditions
- · A/B test data with control + treatment arms
- · Metric definition (conversion, ARPU, CTR, etc.)
- · Sample size > ~100 per arm (rule of thumb)
Failure modes
- · Small sample sizes → low power, wide confidence intervals
- · Multiple comparisons (p-hacking) may inflate false positives
- · Assumes random assignment; results from observational data are not valid causal estimates
- · Statistical significance ≠ practical importance (a small lift may be statistically significant yet too small to matter)
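One common guard against the multiple-comparisons failure mode is a Bonferroni correction, sketched below. The source does not say which correction, if any, the skill applies, so treat this as an assumption; the helper name `bonferroni_significant` and the p-values are hypothetical.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag each metric as significant at a Bonferroni-adjusted threshold.

    p_values: dict mapping metric name -> raw p-value from its own test.
    """
    if not p_values:
        return {}
    # Divide the overall alpha by the number of comparisons made.
    threshold = alpha / len(p_values)
    return {name: p < threshold for name, p in p_values.items()}


# Hypothetical raw p-values for three metrics tested on the same experiment.
raw = {"conversion": 0.012, "ctr": 0.030, "arpu": 0.200}
flags = bonferroni_significant(raw)
print(flags)
```

With three comparisons the per-metric threshold drops to roughly 0.0167, so a raw p-value of 0.030 no longer counts as significant. Bonferroni is conservative; less strict procedures exist, but the adjustment idea is the same.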
Trust signals
- · Chi-square test (nonparametric, for proportions) or t-test (parametric, for means)
- · P-values and confidence intervals
- · Lift % and practical significance commentary
- · Power analysis + sample size adequacy check
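The sample size adequacy check can be approximated with the standard normal-approximation formula for comparing two proportions. The sketch below is an assumption about how such a check might look, not the skill's own code; `required_sample_size_per_arm` and its parameters are hypothetical names, and the defaults (5% two-sided alpha, 80% power) are conventional choices rather than values taken from the source.

```python
from math import ceil
from statistics import NormalDist


def required_sample_size_per_arm(baseline_rate, min_detectable_lift, alpha=0.05, power=0.8):
    """Approximate per-arm sample size for a two-sided two-proportion test.

    min_detectable_lift is relative, e.g. 0.25 means a 25% lift over baseline.
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + min_detectable_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    # Sum of the binomial variances of the two arms.
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


# Hypothetical scenario: 10% baseline conversion, 25% relative lift to detect.
n_needed = required_sample_size_per_arm(0.10, 0.25)
n_small_lift = required_sample_size_per_arm(0.10, 0.05)
print(n_needed, n_small_lift)
```

Comparing the required size with the observed per-arm counts shows whether a non-significant result reflects a true absence of effect or simply an underpowered test. Smaller target lifts require far larger samples, which is why a fixed threshold such as ~100 per arm is only a rough rule of thumb.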