Validate A/B test results statistically

ab-test-analysis
phuryn/pm-skills
What it does

Evaluate A/B test results with statistical rigor

Best for

Data-driven product decisions where A/B test results must be checked for statistical validity and against guardrail constraints before shipping.

Inputs
  • · A/B test CSV, Excel, or analytics export
  • · hypothesis statement
  • · variant description (what was changed)
  • · primary metric and guardrail metrics
  • · test duration
  • · traffic split (control/variant %)
Outputs
  • · test validation summary (sample size adequacy, SRM check, novelty/primacy wash-out)
  • · statistical analysis: conversion rate, relative lift, p-value, 95% CI, significance flag
  • · guardrail metrics check (revenue, engagement, load time, etc.)
  • · interpretation table: Outcome → Recommendation (Ship / Extend / Stop / Investigate)
  • · analysis markdown: hypothesis, duration, sample, metrics table, recommendation, reasoning, next steps
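
The statistical analysis output listed above (conversion rate, relative lift, p-value, 95% CI, significance flag) can be sketched with the standard library alone. This is a minimal illustration, not the skill's own implementation: the function name `analyze_ab`, its parameters, and the example counts are assumptions, and it uses a two-tailed two-proportion z-test with a Wald interval on the absolute difference.

```python
from math import sqrt
from statistics import NormalDist


def analyze_ab(control_conv, control_n, variant_conv, variant_n, alpha=0.05):
    """Two-tailed two-proportion z-test plus a Wald CI on the absolute difference.

    Hypothetical helper for illustration; assumes binary conversions and a
    non-zero control conversion rate.
    """
    p_c = control_conv / control_n
    p_v = variant_conv / variant_n
    diff = p_v - p_c

    # Pooled standard error under the null hypothesis of no difference.
    p_pool = (control_conv + variant_conv) / (control_n + variant_n)
    se_pooled = sqrt(p_pool * (1 - p_pool) * (1 / control_n + 1 / variant_n))
    z = diff / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # Unpooled standard error for the confidence interval.
    se = sqrt(p_c * (1 - p_c) / control_n + p_v * (1 - p_v) / variant_n)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (diff - z_crit * se, diff + z_crit * se)

    return {
        "control_rate": p_c,
        "variant_rate": p_v,
        # Relative lift: (variant - control) / control x 100
        "relative_lift_pct": (p_v - p_c) / p_c * 100,
        "z": z,
        "p_value": p_value,
        "ci_abs_diff": ci,
        "significant": p_value < alpha,
    }


# Illustrative counts only, not data from a real test.
result = analyze_ab(control_conv=500, control_n=10_000,
                    variant_conv=560, variant_n=10_000)
print(result)
```

With these made-up counts the relative lift is about 12%, yet the interval on the absolute difference still spans zero, which is the kind of case where a headline lift alone would be misleading.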
Requires
  • · Python pandas, numpy (for statistical calculations when raw data provided)
  • · z-test or chi-squared test (built-in statistical functions)
Preconditions
  • · Raw data (CSV/Excel) OR data already normalized (conversion rates + sample sizes)
  • · Hypothesis clearly stated
  • · Primary metric identified
  • · Test duration specified (at least 1–2 full business cycles to avoid novelty effects)
Failure modes
  • · Underpowered test (sample too small to reach 80% power): the recommendation should be Extend, not Ship
  • · Sample Ratio Mismatch not checked (biased randomization)
  • · Novelty effects not washed out (test too short, recommendation unsound)
  • · Guardrail metrics ignored (primary metric lifts while guardrails degrade, yet the change ships anyway)
  • · Practical significance conflated with statistical significance (p < 0.05 is not enough)
  • · Confidence intervals not computed (no uncertainty bound)
  • · p-value not calculated (can't assess significance)
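
A Sample Ratio Mismatch check, one of the failure modes listed above, is commonly done as a chi-squared goodness-of-fit test on the observed group sizes. The sketch below is an assumption about how such a check might look, not the skill's code: `srm_check`, its parameters, and the 0.001 alert threshold (a frequently used convention) are all illustrative choices.

```python
from math import erfc, sqrt


def srm_check(control_n, variant_n, expected_control_share=0.5, threshold=0.001):
    """Chi-squared goodness-of-fit test of observed vs. planned traffic split.

    Hypothetical helper; the threshold is an assumed convention, not a value
    taken from the skill.
    """
    total = control_n + variant_n
    expected_c = total * expected_control_share
    expected_v = total - expected_c

    chi2 = ((control_n - expected_c) ** 2 / expected_c
            + (variant_n - expected_v) ** 2 / expected_v)

    # Survival function of a chi-squared distribution with 1 degree of
    # freedom, written with erfc so no third-party library is needed.
    p_value = erfc(sqrt(chi2 / 2))

    return {"chi2": chi2, "p_value": p_value, "srm_detected": p_value < threshold}


# Illustrative group sizes for a planned 50/50 split.
balanced = srm_check(10_000, 10_000)
skewed = srm_check(10_000, 10_600)
print(balanced, skewed)
```

A detected mismatch suggests the randomization or logging is biased, so the test results should be investigated before any lift is interpreted.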
Trust signals
  • · Sample size power formula: n = 2 × (Z_α/2)² × p × (1 − p) / MDE²
  • · SRM (Sample Ratio Mismatch) check named explicitly
  • · Guardrail metrics section (three examples: revenue, engagement, page load time)
  • · Four-outcome recommendation logic (Ship / Extend / Stop / Investigate) with decision table
  • · 95% confidence interval calculation (not just p-value)
  • · Relative lift formula: (variant - control) / control × 100
  • · Two-tailed z-test or chi-squared specified
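
The sample size formula named in the trust signals can be evaluated as follows. As listed, the formula carries only the Z_α/2 term; a commonly used version adds a Z_β term for the target power, giving n = 2 × (Z_α/2 + Z_β)² × p × (1 − p) / MDE². The sketch computes both. The function name, the reading of MDE as an absolute difference in conversion rate, and the interpretation of n as a per-variant count are assumptions.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline_rate, mde_abs, alpha=0.05, power=None):
    """Approximate sample size per variant for a two-tailed test on proportions.

    Hypothetical helper. With power=None it evaluates the formula as listed
    (Z_alpha/2 only); with a power value it adds the Z_beta term.
    mde_abs is assumed to be an absolute difference (0.01 = one point).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power) if power is not None else 0.0
    n = (2 * (z_alpha + z_beta) ** 2
         * baseline_rate * (1 - baseline_rate) / mde_abs ** 2)
    return ceil(n)


# Illustrative inputs: 5% baseline conversion, 1 point absolute MDE.
n_as_listed = sample_size_per_variant(0.05, 0.01)
n_powered = sample_size_per_variant(0.05, 0.01, power=0.80)
print(n_as_listed, n_powered)
```

Adding the power term roughly doubles the required sample in this example, which is why a test sized without it can look complete while still being underpowered.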