cyberneticlibrary

Validate A/B test results statistically

analyze-test · command · setup L2 · ★ 11,239

phuryn/pm-skills

Causal-lift measurements

ab-experimentation: 0pp vs no-skill baseline (with-skill 100% · baseline 100%)

Measured by running the task with and without this artifact, K=5, graded by deterministic checks — no LLM judging.

What it does

Analyze A/B test results for significance and lift

Best for

Validating A/B test results with statistical rigor before deciding to ship a variant.

Inputs

· Test name and context
· Variant data (control + treatment groups, metrics)
· Sample sizes, conversion rates, or aggregated events

Outputs

· Statistical significance test (chi-square, t-test)
· Lift calculation (% difference)
· Confidence intervals
· Actionable recommendations
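
As a rough illustration of how the first three outputs fit together for a conversion metric, here is a minimal sketch. It is not the skill's actual implementation. The function name `analyze_conversion` and the example counts are hypothetical. It uses only the standard library so it runs anywhere; with scipy installed, `scipy.stats.chi2_contingency(table, correction=False)` should give the same statistic for a 2x2 table. The confidence interval is a simple Wald interval on the absolute difference, which is an assumption of this sketch.

```python
import math
from statistics import NormalDist

def analyze_conversion(control_conv, control_n, treat_conv, treat_n, alpha=0.05):
    """Pearson chi-square (no continuity correction), relative lift, Wald CI.

    Hypothetical helper for illustration; assumes both arms have at least
    one conversion and one non-conversion overall.
    """
    observed = [
        [control_conv, control_n - control_conv],
        [treat_conv, treat_n - treat_conv],
    ]
    total = control_n + treat_n
    row_totals = [control_n, treat_n]
    col_totals = [control_conv + treat_conv,
                  total - control_conv - treat_conv]

    # Sum (observed - expected)^2 / expected over the four cells.
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected

    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))

    p_c = control_conv / control_n
    p_t = treat_conv / treat_n
    diff = p_t - p_c
    lift_pct = diff / p_c * 100  # relative lift over control, in percent

    # Wald interval for the absolute difference in conversion rates.
    se = math.sqrt(p_c * (1 - p_c) / control_n + p_t * (1 - p_t) / treat_n)
    z = NormalDist().inv_cdf(1 - alpha / 2)

    return {
        "chi2": chi2,
        "p_value": p_value,
        "lift_pct": lift_pct,
        "diff_ci": (diff - z * se, diff + z * se),
        "significant": p_value < alpha,
    }

# Made-up counts: 100/1000 conversions in control, 130/1000 in treatment.
result = analyze_conversion(100, 1000, 130, 1000)
print(result)
```

A recommendation would normally weigh the interval on the difference alongside the p-value, since a significant result with an interval that barely clears zero is weak grounds for shipping.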

Requires

· Python (scipy, pandas) for statistical testing

Preconditions

· A/B test data with control + treatment arms
· Metric definition (conversion, ARPU, CTR, etc.)
· Sample size > ~100 per arm (rule of thumb)

Failure modes

· Small sample sizes → low power, wide confidence intervals
· Multiple comparisons without correction (a form of p-hacking) inflate false positives
· Assumes random assignment; the tests are not valid for observational data
· Significance ≠ practical importance (a small lift may be statistically significant)
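
One common guard against the multiple-comparisons failure mode is to adjust the decision threshold when several metrics or variants are tested at once. The sketch below uses the Holm-Bonferroni step-down procedure. The source does not say which correction, if any, the skill applies, so this is an illustrative choice; the function name `holm_bonferroni` and the p-values are hypothetical.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return one boolean per p-value: True where the null is rejected.

    Hypothetical helper. Holm's step-down rule: compare the k-th smallest
    p-value (k starting at 0) against alpha / (m - k), and stop at the
    first failure.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # every larger p-value also fails
    return reject

# Made-up p-values for four metrics measured in the same experiment.
decisions = holm_bonferroni([0.010, 0.040, 0.030, 0.200])
print(decisions)  # [True, False, False, False]
```

Three of these four p-values fall below 0.05 on their own, but only the first survives the correction.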

Trust signals

· Chi-square (nonparametric) or t-test (parametric)
· P-values and confidence intervals
· Lift % and practical significance commentary
· Power analysis + sample size adequacy check
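
A sample size adequacy check for a conversion metric could look roughly like the following. This is a sketch under the usual normal approximation for a two-sided test of two proportions, not the skill's own code. The names `required_n_per_arm` and `is_adequately_powered`, the defaults (alpha 0.05, power 0.80), and the example rates are all assumptions.

```python
import math
from statistics import NormalDist

def required_n_per_arm(baseline_rate, target_rate, alpha=0.05, power=0.80):
    """Approximate per-arm sample size to detect baseline_rate vs target_rate.

    Hypothetical helper using the normal approximation:
    n = (z_{1-alpha/2} + z_{power})^2 * (p1(1-p1) + p2(1-p2)) / (p2 - p1)^2
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = (baseline_rate * (1 - baseline_rate)
                + target_rate * (1 - target_rate))
    delta = target_rate - baseline_rate
    return math.ceil((z_alpha + z_beta) ** 2 * variance / delta ** 2)

def is_adequately_powered(n_per_arm, baseline_rate, target_rate, **kwargs):
    """True if each arm has at least the approximate required sample size."""
    return n_per_arm >= required_n_per_arm(baseline_rate, target_rate, **kwargs)

# Made-up scenario: 10% baseline, hoping to detect a move to 13%.
needed = required_n_per_arm(0.10, 0.13)
print(needed, is_adequately_powered(1000, 0.10, 0.13))
```

Under these assumptions the required size is well above the ~100 per arm rule of thumb, which is why an explicit check is more informative than a fixed threshold.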

Capability

ab-experimentation → compare alternatives