cyberneticlibrary

Validate A/B test results statistically

analyze-test · command · setup · L211,239
phuryn/pm-skills

Causal-lift measurements

ab-experimentation: 0pp vs no-skill baseline (with-skill 100% · baseline 100%)

Measured by running the task with and without this artifact, K=5, graded by deterministic checks — no LLM judging.

What it does

Analyze A/B test results for significance and lift

Best for

Validating A/B test results with statistical rigor before deciding to ship a variant.

Inputs
  • · Test name and context
  • · Variant data (control + treatment groups, metrics)
  • · Sample sizes, conversion rates, or aggregated events
Outputs
  • · Statistical significance test (chi-square, t-test)
  • · Lift calculation (relative % difference, treatment vs. control)
  • · Confidence intervals
  • · Actionable recommendations
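
A minimal sketch of how the outputs listed above could be computed for a conversion metric. It assumes aggregated counts per arm and uses a two-proportion z-test, which for a 2x2 table matches a chi-square test without continuity correction. The function name analyze_ab, its parameters, and the sample counts are hypothetical. Only the Python standard library is used so the sketch runs without extra dependencies; scipy.stats.chi2_contingency could be substituted for the test itself.

```python
from math import sqrt
from statistics import NormalDist


def analyze_ab(control_conv, control_n, treat_conv, treat_n, alpha=0.05):
    """Two-proportion z-test with relative lift and a CI on the absolute difference.

    Hypothetical helper for illustration; inputs are aggregated counts per arm.
    """
    p_c = control_conv / control_n
    p_t = treat_conv / treat_n
    diff = p_t - p_c

    # Pooled standard error for the hypothesis test (H0: both rates are equal).
    p_pool = (control_conv + treat_conv) / (control_n + treat_n)
    se_pool = sqrt(p_pool * (1 - p_pool) * (1 / control_n + 1 / treat_n))
    z = diff / se_pool
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # Unpooled standard error for the confidence interval on the difference.
    se = sqrt(p_c * (1 - p_c) / control_n + p_t * (1 - p_t) / treat_n)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (diff - z_crit * se, diff + z_crit * se)

    return {
        "control_rate": p_c,
        "treatment_rate": p_t,
        "lift_pct": 100 * diff / p_c,  # relative lift vs. control
        "z": z,
        "p_value": p_value,
        "ci_abs_diff": ci,
        "significant": p_value < alpha,
    }


# Made-up example data: 200/2000 conversions in control, 250/2000 in treatment.
result = analyze_ab(control_conv=200, control_n=2000, treat_conv=250, treat_n=2000)
for key, value in result.items():
    print(key, value)
```

The sketch does not guard against arms with zero conversions or zero users. A continuous metric such as ARPU would call for a t-test on per-user values instead.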
Requires
  • · Python (scipy, pandas) for statistical testing
Preconditions
  • · A/B test data with control + treatment arms
  • · Metric definition (conversion, ARPU, CTR, etc.)
  • · Sample size > ~100 per arm (rule of thumb)
Failure modes
  • · Small sample sizes → low power, wide confidence intervals
  • · Multiple comparisons or repeated peeking at results (p-hacking) inflate the false-positive rate
  • · Assumes random assignment; results from observational data are not valid causal estimates
  • · Significance ≠ practical importance (a small lift may still be statistically significant)
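
One common guard against the multiple-comparisons failure mode is to adjust the significance threshold when several metrics or variants are tested at once. The sketch below uses the Holm-Bonferroni step-down procedure as an illustration; the source does not say which correction, if any, the skill applies. The function name holm_bonferroni and the p-values are hypothetical.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return one boolean per p-value: True where H0 is rejected after Holm correction."""
    m = len(p_values)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is compared against alpha / (m - k).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # step-down: once one test fails, all larger p-values fail too
    return reject


# Made-up p-values for four metrics measured in the same experiment.
p_values = [0.012, 0.030, 0.041, 0.200]
decisions = holm_bonferroni(p_values)
print(decisions)
```

In this example three of the four p-values fall below 0.05 when read individually, but only the smallest survives the correction.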
Trust signals
  • · Chi-square (nonparametric) or t-test (parametric)
  • · P-values and confidence intervals
  • · Lift % and practical significance commentary
  • · Power analysis + sample size adequacy check
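
The power analysis and sample size adequacy check can be sketched with the standard normal-approximation formula for comparing two proportions. This is an illustration under assumed defaults (two-sided alpha of 0.05, power of 0.80), not the skill's confirmed method; the function name required_n_per_arm, its parameters, and the example numbers are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def required_n_per_arm(baseline_rate, min_relative_lift, alpha=0.05, power=0.80):
    """Approximate users needed per arm to detect a given relative lift.

    Normal approximation for a two-sided test on two proportions, equal arm sizes.
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + min_relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


# Made-up scenario: 10% baseline conversion, smallest lift worth detecting is 10% relative.
n_needed = required_n_per_arm(baseline_rate=0.10, min_relative_lift=0.10)
actual_n_per_arm = 2000  # hypothetical observed sample size
is_adequate = actual_n_per_arm >= n_needed
print(n_needed, is_adequate)
```

For small lifts on low baseline rates the required sample is far larger than the ~100-per-arm rule of thumb, which is why the adequacy check is worth running before trusting a non-significant result.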