Validate A/B test results statistically
ab-test-analysis · skill · setup L2 · ★11,239
phuryn/pm-skills
What it does
Evaluate A/B test results with statistical rigor
Best for
Data-driven product decisions where A/B test results must be checked for statistical validity and guardrail regressions before shipping.
Inputs
- · A/B test CSV, Excel, or analytics export
- · hypothesis statement
- · variant description (what was changed)
- · primary metric and guardrail metrics
- · test duration
- · traffic split (control/variant %)
Outputs
- · test validation summary (sample size adequacy, SRM check, novelty/primacy wash-out)
- · statistical analysis: conversion rate, relative lift, p-value, 95% CI, significance flag
- · guardrail metrics check (revenue, engagement, load time, etc.)
- · interpretation table: Outcome → Recommendation (Ship / Extend / Stop / Investigate)
- · analysis markdown: hypothesis, duration, sample, metrics table, recommendation, reasoning, next steps
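The statistical analysis output listed above (conversion rate, relative lift, p-value, 95% CI, significance flag) can be sketched as a two-tailed two-proportion z-test. This is a minimal illustration, not the skill's own code: the function name `two_proportion_ztest`, its parameters, the returned keys, and the example counts are all hypothetical, and it uses only the Python standard library rather than pandas or numpy. It also assumes the control rate is non-zero.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_ztest(conv_c, n_c, conv_v, n_v, alpha=0.05):
    """Two-tailed z-test for a difference in conversion rates (hypothetical helper)."""
    p_c = conv_c / n_c          # control conversion rate
    p_v = conv_v / n_v          # variant conversion rate
    diff = p_v - p_c            # absolute difference in rates

    # Pooled standard error under H0 (equal rates), used for the test statistic.
    p_pool = (conv_c + conv_v) / (n_c + n_v)
    se_pooled = sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_v))
    z = diff / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # Unpooled standard error, used for the CI on the absolute difference.
    se_unpooled = sqrt(p_c * (1 - p_c) / n_c + p_v * (1 - p_v) / n_v)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (diff - z_crit * se_unpooled, diff + z_crit * se_unpooled)

    return {
        "control_rate": p_c,
        "variant_rate": p_v,
        # Relative lift: (variant - control) / control x 100
        "relative_lift_pct": diff / p_c * 100,
        "z": z,
        "p_value": p_value,
        "ci_abs_diff": ci,
        "significant": p_value < alpha,
    }


# Made-up example counts: 500/10,000 control vs 560/10,000 variant conversions.
result = two_proportion_ztest(conv_c=500, n_c=10_000, conv_v=560, n_v=10_000)
```

In this made-up example the relative lift is 12%, yet the interval on the absolute difference still spans zero, which is the kind of case where reporting the CI alongside the p-value matters.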
Requires
- · Python with pandas and numpy (for statistical calculations when raw data is provided)
- · z-test or chi-squared test (built-in statistical functions)
Preconditions
- · Raw data (CSV/Excel) OR data already normalized (conversion rates + sample sizes)
- · Hypothesis clearly stated
- · Primary metric identified
- · Test duration specified (at least 1–2 full business cycles to avoid novelty effects)
Failure modes
- · Underpowered test (sample too small to reach 80% power); the recommendation should be Extend, not Ship
- · Sample Ratio Mismatch not checked (a mismatch signals biased randomization)
- · Novelty effects not washed out (test too short, recommendation unsound)
- · Guardrail metrics ignored (shipping on a primary-metric lift even though guardrails degraded)
- · Practical significance conflated with statistical significance (p < 0.05 is not enough)
- · Confidence intervals not computed (no uncertainty bound)
- · p-value not calculated (can't assess significance)
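A Sample Ratio Mismatch check, one of the failure modes above, is commonly done as a chi-squared goodness-of-fit test of the observed group sizes against the planned traffic split. The sketch below is one way to do it under stated assumptions: the name `srm_check`, its parameters, and the 0.001 alert threshold are illustrative choices, not values taken from the skill. With two groups the test has one degree of freedom, so the p-value can be computed with `math.erfc` and no external library.

```python
from math import erfc, sqrt


def srm_check(n_control, n_variant, expected_control_share=0.5, threshold=0.001):
    """Chi-squared goodness-of-fit test for Sample Ratio Mismatch (hypothetical helper)."""
    total = n_control + n_variant
    exp_c = total * expected_control_share   # expected control count
    exp_v = total - exp_c                    # expected variant count

    chi2 = (n_control - exp_c) ** 2 / exp_c + (n_variant - exp_v) ** 2 / exp_v

    # For 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(chi2 / 2))

    # The threshold is an assumed convention; a low p-value suggests the split is off.
    return {"chi2": chi2, "p_value": p_value, "srm_detected": p_value < threshold}


balanced = srm_check(10_000, 10_000)   # matches a 50/50 plan exactly
skewed = srm_check(10_000, 10_500)     # made-up counts drifting from 50/50
```

If `srm_detected` is true, the downstream lift and p-value are suspect, so the sensible outcome is Investigate rather than Ship.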
Trust signals
- · Sample size power formula: n = (Z_{α/2}² × 2 × p × (1 − p)) / MDE²
- · SRM (Sample Ratio Mismatch) check named explicitly
- · Guardrail metrics section (three examples: revenue, engagement, page load time)
- · Four-outcome recommendation logic (Ship / Extend / Stop / Investigate) with decision table
- · 95% confidence interval calculation (not just p-value)
- · Relative lift formula: (variant - control) / control × 100
- · Two-tailed z-test or chi-squared specified
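The sample size formula listed above can be sketched as follows, assuming `p` is the baseline conversion rate and MDE is an absolute difference in rates (the card does not say which). The function name `sample_size_per_arm` and its parameters are hypothetical. As written, the formula has no power term, which corresponds to Z_β = 0 (50% power); the optional `power` argument is an addition of this sketch that uses the standard (Z_{α/2} + Z_β)² numerator instead.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(p, mde_abs, alpha=0.05, power=None):
    """Per-group sample size for a two-tailed test on proportions (hypothetical helper).

    power=None reproduces the formula as listed: n = Z_{a/2}^2 * 2 * p * (1 - p) / MDE^2.
    Passing a power (e.g. 0.8) adds Z_beta, an extension assumed by this sketch.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power) if power is not None else 0.0
    n = (z_alpha + z_beta) ** 2 * 2 * p * (1 - p) / mde_abs ** 2
    return ceil(n)  # round up to whole users


# Made-up inputs: 5% baseline rate, 1 percentage point absolute MDE.
n_as_listed = sample_size_per_arm(p=0.05, mde_abs=0.01)
n_with_power = sample_size_per_arm(p=0.05, mde_abs=0.01, power=0.8)
```

For these inputs the power-aware figure comes out roughly twice the listed-formula figure, so which version a test was sized with changes whether it counts as underpowered.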