A/B Testing

Optimizely

click goals
measures how often visitors click an element
url targeting
where the experiment runs
audiences
who sees the experiment

The setup is straightforward, and the interface makes it easy for non-technical people to create experiments related to simple content and visual changes.

For example, you can change CSS and copy. When changing markup, a developer should probably be involved, as other code or cascading styles might rely on the document structure.

Setup

You'll need to paste the Optimizely snippet (a single line of code) into the <head> tag of any page you want to include in your experiment.

Because Optimizely will actually control how your page displays, it’s important to put the snippet as high in the <head> tag as possible. This lets Optimizely load variations in real time, before the visitor even sees the page.

A/B Testing

Null-hypothesis tests

The p-value is used in the context of null hypothesis testing in order to quantify the idea of statistical significance of evidence. [2]

A common mistake is to run multiple null hypothesis tests as the data are coming in and decide to stop the test early on the first significant result. [1]

If you run experiments: the best way to avoid repeated significance testing errors is to not test significance repeatedly. Decide on a sample size in advance and wait until the experiment is over before you start believing the “chance of beating original” figures that the A/B testing software gives you. [3]
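
As a sanity check on the quote above, here is a minimal simulation sketch (not from the cited articles) of what repeated peeking does: both variants share the same made-up 5% conversion rate, yet checking a two-proportion z-test after every batch of visitors flags a "significant" difference far more often than the nominal 5% error rate.

# Simulation sketch: peeking at a z-test after every batch inflates
# false positives even though A and B are identical (assumed 5% CTR).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
true_ctr = 0.05        # same rate for both variants, so the null is true
batch = 1000           # visitors per variant between "peeks"
n_peeks = 20
n_experiments = 2000
alpha = 0.05
z_crit = norm.ppf(1 - alpha / 2)

false_positives = 0
for _ in range(n_experiments):
    clicks_a = clicks_b = views = 0
    for _ in range(n_peeks):
        views += batch
        clicks_a += rng.binomial(batch, true_ctr)
        clicks_b += rng.binomial(batch, true_ctr)
        p_a, p_b = clicks_a / views, clicks_b / views
        p_pool = (clicks_a + clicks_b) / (2 * views)
        se = np.sqrt(p_pool * (1 - p_pool) * 2 / views)
        if abs(p_a - p_b) > z_crit * se:   # "significant" at this peek
            false_positives += 1
            break

print(false_positives / n_experiments)  # far above the nominal 5%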

Sample Size Calculator
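
A rough sketch of what such a calculator computes, using the standard normal-approximation formula for two proportions; the baseline rate and MDE below are illustrative inputs, not values from the cited articles.

# Per-variant sample size for a two-proportion test (normal approximation).
from scipy.stats import norm

def sample_size(baseline, mde, alpha=0.05, power=0.8):
    p1 = baseline
    p2 = baseline * (1 + mde)          # relative minimum detectable effect
    z_a = norm.ppf(1 - alpha / 2)      # critical value for significance
    z_b = norm.ppf(power)              # critical value for power
    var = p1 * (1 - p1) + p2 * (1 - p2)
    return int(round(var * (z_a + z_b) ** 2 / (p2 - p1) ** 2))

print(sample_size(baseline=0.01, mde=0.10))  # visitors needed per variant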

Issues with null-hypothesis method: [4]

  • Even if preliminary evidence says that one version is terrible, we will keep losing conversions until we hit an arbitrary threshold.
  • If we hit that threshold without having reached statistical proof, we cannot continue the experiment.
  • Naive attempts to fix the previous problems by using the same statistical test multiple times lead to far more mistakes than we are willing to accept.

A/B Split Test Significance Calculator
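
A minimal sketch of the calculation such a calculator performs: a two-sided two-proportion z-test. The click/view counts reuse the example data from the Python section below.

# Two-proportion z-test, the calculation a split-test significance
# calculator typically reports.
from math import sqrt
from scipy.stats import norm

def z_test(clicks_a, views_a, clicks_b, views_b):
    p_a, p_b = clicks_a / views_a, clicks_b / views_b
    p_pool = (clicks_a + clicks_b) / (views_a + views_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / views_a + 1 / views_b))
    z = (p_b - p_a) / se
    p_value = 2 * norm.sf(abs(z))      # two-sided p-value
    return z, p_value

print(z_test(450, 56000, 345, 49000))  # same counts as the example below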

Bayesian A/B testing

Bayesian A/B testing is an alternative to the Student's t-test (t-distributions) and p-value-based significance tests, which require large sample sizes.

  • unlike the Student's t-test, you can stop the test early if there is a clear winner or run it for longer if you need more samples. While this is generally true, A/B Testing with Limited Data shows a workaround.

bayesian_ab_test.py

priors
a distribution that encodes your prior belief about the parameter-of-interest (represent what we believe before we run the test)
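
One way to pick the alpha/beta values used in the code below is to match a Beta prior's mean, alpha / (alpha + beta), to the CTR you believe is typical. The helper and the "strength" pseudo-count here are illustrative assumptions, not how the cited post derived 1.1 and 14.2.

# Sketch: turning a prior belief about CTR into Beta prior parameters.
# "strength" acts like a pseudo-count of previously observed impressions.
def beta_prior(expected_ctr, strength):
    alpha = expected_ctr * strength
    beta = (1 - expected_ctr) * strength
    return alpha, beta

# e.g. "we believe CTR is around 7%, weighted like ~15 prior views"
print(beta_prior(0.07, 15))   # roughly (1.05, 13.95), close to the 1.1 / 14.2 used below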

Advantages of Bayesian Testing

  1. Easier-to-interpret results; p-values are confusing. Try to follow A/B Testing with Limited Data without your brain melting.
  2. "measuring the probability at time t that B is better than A (or vice versa). You can look at the data, check if the test is finished, and stop the test early if the result is highly conclusive." [5]
  3. You can use your current posteriors as new priors for what is essentially the start of a new test, without any major interruption in your development flow (see the sketch after this list). [5] This is probably the worst thing you can do with traditional hypothesis testing.
  4. Bayesian A/B test achieves the same lift as the standard procedure, but typically uses fewer data points. [5]
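
A minimal sketch of the prior-recycling idea in point 3, using the standard Beta-Binomial update (the same one implied by the code below); the second test's counts are made up.

# Sketch: the Beta posterior from one test becomes the prior for the next.
# With a Beta(alpha, beta) prior and observed clicks/views, the posterior
# is Beta(alpha + clicks, beta + views - clicks).
def update(alpha, beta, clicks, views):
    return alpha + clicks, beta + (views - clicks)

alpha0, beta0 = 1.1, 14.2                            # prior before test 1
alpha1, beta1 = update(alpha0, beta0, 450, 56000)    # posterior of test 1
alpha2, beta2 = update(alpha1, beta1, 120, 15000)    # ...reused as prior for test 2 (made-up counts)
print(alpha2, beta2)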

TODO Re-learn Calculus

Test Procedure

CTR
click through rate
MDE
minimum detectable effect size
significance/alpha
the % of time a difference will be detected, assuming one does not exist
power/beta
the % of time the MDE will be detected, assuming it exists
impression
number of times X is exposed to a potential viewer
lift
is a measure of the performance of a targeting model (association rule) at predicting or classifying cases as having an enhanced response (with respect to the population as a whole), measured against a random choice targeting model.
likelihood
a function that encodes how likely your data is given a range of possible parameters
posterior
a distribution of the parameter-of-interest given your data, combining the prior and likelihood
  • stop if P(CTR_B > CTR_A) is below a threshold.
  • if P(CTR_B < CTR_A) ≈ 0.5, the banners perform about the same
  • run the test until the expected cost drops below a threshold
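
A minimal sketch of those stopping checks, drawing Monte Carlo samples from Beta posteriors in the same way as the code below; "expected cost" is interpreted here as the Bayesian expected loss of picking B, and the thresholds are illustrative assumptions.

# Sketch of the stopping checks above: posterior probability that B beats A,
# and the expected loss (cost) of choosing B, from Beta posterior samples.
import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 1.1, 14.2                       # same priors as the code below
clicks_a, views_a = 450, 56000
clicks_b, views_b = 345, 49000

a = rng.beta(clicks_a + alpha, views_a - clicks_a + beta, 100000)
b = rng.beta(clicks_b + alpha, views_b - clicks_b + beta, 100000)

p_b_beats_a = np.mean(b > a)
expected_loss_b = np.mean(np.maximum(a - b, 0))   # CTR given up if we pick B but A is better

# illustrative thresholds
if p_b_beats_a < 0.05:
    print("stop: B is very unlikely to beat A")
elif expected_loss_b < 0.0001:
    print("stop: picking B costs almost nothing even if it is wrong")
else:
    print("keep collecting data", p_b_beats_a, expected_loss_b)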

Calculations in Python

[6]

from numpy.random import beta as beta_dist
import numpy as np
N_samp = 10000 # number of samples to draw
clicks_A = 450 # insert your own data here
views_A = 56000
clicks_B = 345 # ditto
views_B = 49000
alpha = 1.1 # just for the example - set your own!
beta = 14.2
A_samples = beta_dist(clicks_A+alpha, views_A-clicks_A+beta, N_samp)
B_samples = beta_dist(clicks_B+alpha, views_B-clicks_B+beta, N_samp)

np.mean(A_samples > B_samples) # posterior probability that CTR_A > CTR_B

np.mean( 100. *(A_samples - B_samples)/B_samples > 3 ) # probability that the lift of A relative to B is at least 3%

# You can set alpha and beta priors to 1 when you believe all values are equally likely

[1] http://ewulczyn.github.io/How_Naive_AB_Testing_Goes_Wrong/
[2] http://en.wikipedia.org/wiki/P-value
[3] http://www.evanmiller.org/how-not-to-run-an-ab-test.html
[4] http://elem.com/~btilly/ab-testing-multiple-looks/part1-rigorous.html
[5] http://www.bayesianwitch.com/blog/2014/bayesian_ab_test.html
[6] http://engineering.richrelevance.com/bayesian-ab-tests/