A/B Testing

Optimizely

click goals
measures how often visitors click an element
url targeting
where the experiment runs
audiences
who sees the experiment

The setup is straightforward, and the interface makes it easy for non-technical people to create experiments related to simple content and visual changes.

For example, you can change CSS and copy. When changing markup, a developer should probably be involved, as other code or cascading styles might rely on the document structure.

Setup

You'll need to paste the Optimizely snippet (a single line of code) into the <head> tag of any page you want to include in your experiment.

Because Optimizely will actually control how your page displays, it’s important to put the snippet as high in the <head> tag as possible. This lets Optimizely load variations in real time, before the visitor even sees the page.

A/B Testing

Null-hypothesis tests

The p-value is used in the context of null hypothesis testing in order to quantify the idea of statistical significance of evidence. [2]

A common mistake is to run multiple null hypothesis tests as the data are coming in and decide to stop the test early on the first significant result. [1]

If you run experiments: the best way to avoid repeated significance testing errors is to not test significance repeatedly. Decide on a sample size in advance and wait until the experiment is over before you start believing the “chance of beating original” figures that the A/B testing software gives you. [3]
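
As a sanity check on the quote above, here is a minimal simulation sketch (not from the cited articles) of what repeated peeking does: both variants share the same made-up 5% conversion rate, yet checking a two-proportion z-test after every batch of visitors flags a "significant" difference far more often than the nominal 5% error rate.

# Simulation sketch: peeking at a z-test after every batch inflates
# false positives even though A and B are identical (assumed 5% CTR).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
true_ctr = 0.05        # same rate for both variants, so the null is true
batch = 1000           # visitors per variant between "peeks"
n_peeks = 20
n_experiments = 2000
alpha = 0.05
z_crit = norm.ppf(1 - alpha / 2)

false_positives = 0
for _ in range(n_experiments):
    clicks_a = clicks_b = views = 0
    for _ in range(n_peeks):
        views += batch
        clicks_a += rng.binomial(batch, true_ctr)
        clicks_b += rng.binomial(batch, true_ctr)
        p_a, p_b = clicks_a / views, clicks_b / views
        p_pool = (clicks_a + clicks_b) / (2 * views)
        se = np.sqrt(p_pool * (1 - p_pool) * 2 / views)
        if abs(p_a - p_b) > z_crit * se:   # "significant" at this peek
            false_positives += 1
            break

print(false_positives / n_experiments)  # far above the nominal 5%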

Sample Size Calculator
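
A rough sketch of what such a calculator computes, using the standard normal-approximation formula for two proportions; the baseline rate and MDE below are illustrative inputs, not values from the cited articles.

# Per-variant sample size for a two-proportion test (normal approximation).
from scipy.stats import norm

def sample_size(baseline, mde, alpha=0.05, power=0.8):
    p1 = baseline
    p2 = baseline * (1 + mde)          # relative minimum detectable effect
    z_a = norm.ppf(1 - alpha / 2)      # critical value for significance
    z_b = norm.ppf(power)              # critical value for power
    var = p1 * (1 - p1) + p2 * (1 - p2)
    return int(round(var * (z_a + z_b) ** 2 / (p2 - p1) ** 2))

print(sample_size(baseline=0.01, mde=0.10))  # visitors needed per variant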

Issues with null-hypothesis method: [4]

  • Even if preliminary evidence says that one version is terrible, we will keep losing conversions until we hit an arbitrary threshold.
  • If we hit that threshold without having reached statistical proof, we cannot continue the experiment.
  • Naive attempts to fix the previous problems by using the same statistical test multiple times lead to far more mistakes than we are willing to accept.

A/B Split Test Significance Calculator
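
A minimal sketch of the calculation such a calculator performs: a two-sided two-proportion z-test. The click/view counts reuse the example data from the Python section below.

# Two-proportion z-test, the calculation a split-test significance
# calculator typically reports.
from math import sqrt
from scipy.stats import norm

def z_test(clicks_a, views_a, clicks_b, views_b):
    p_a, p_b = clicks_a / views_a, clicks_b / views_b
    p_pool = (clicks_a + clicks_b) / (views_a + views_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / views_a + 1 / views_b))
    z = (p_b - p_a) / se
    p_value = 2 * norm.sf(abs(z))      # two-sided p-value
    return z, p_value

print(z_test(450, 56000, 345, 49000))  # same counts as the example below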

Bayesian A/B testing

Bayesian A/B testing is an alternative to the Student's t-test (t-distributions) and p-value-based significance tests, which require large sample sizes.

  • unlike the Student's t-test, you can stop the test early if there is a clear winner or run it for longer if you need more samples. While this is generally true, A/B Testing with Limited Data shows a workaround.

bayesian_ab_test.py

priors
a distribution that encodes your prior belief about the parameter-of-interest (represent what we believe before we run the test)
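
One way to pick the alpha/beta values used in the code below is to match a Beta prior's mean, alpha / (alpha + beta), to the CTR you believe is typical. The helper and the "strength" pseudo-count here are illustrative assumptions, not how the cited post derived 1.1 and 14.2.

# Sketch: turning a prior belief about CTR into Beta prior parameters.
# "strength" acts like a pseudo-count of previously observed impressions.
def beta_prior(expected_ctr, strength):
    alpha = expected_ctr * strength
    beta = (1 - expected_ctr) * strength
    return alpha, beta

# e.g. "we believe CTR is around 7%, weighted like ~15 prior views"
print(beta_prior(0.07, 15))   # roughly (1.05, 13.95), close to the 1.1 / 14.2 used below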

Advantages of Bayesian Testing

  1. Easier-to-interpret results; p-values are confusing. Try to follow A/B Testing with Limited Data without your brain melting.
  2. "measuring the probability at time t that B is better than A (or vice versa). You can look at the data, check if the test is finished, and stop the test early if the result is highly conclusive." [5]
  3. You can use your current posteriors as new priors for what is essentially the start of a new test, without any major interruption in your development flow (see the sketch after this list). [5] This is probably the worst thing you can do with traditional hypothesis testing.
  4. Bayesian A/B test achieves the same lift as the standard procedure, but typically uses fewer data points. [5]
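
A minimal sketch of the prior-recycling idea in point 3, using the standard Beta-Binomial update (the same one implied by the code below); the second test's counts are made up.

# Sketch: the Beta posterior from one test becomes the prior for the next.
# With a Beta(alpha, beta) prior and observed clicks/views, the posterior
# is Beta(alpha + clicks, beta + views - clicks).
def update(alpha, beta, clicks, views):
    return alpha + clicks, beta + (views - clicks)

alpha0, beta0 = 1.1, 14.2                            # prior before test 1
alpha1, beta1 = update(alpha0, beta0, 450, 56000)    # posterior of test 1
alpha2, beta2 = update(alpha1, beta1, 120, 15000)    # ...reused as prior for test 2 (made-up counts)
print(alpha2, beta2)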

TODO Re-learn Calculus

Test Procedure

CTR
click through rate
MDE
minimum detectable effect size
significance/alpha
the % of time a difference will be detected, assuming one does not exist
power/beta
the % of time the MDE will be detected, assuming it exists
impression
number of times X is exposed to a potential viewer
lift
is a measure of the performance of a targeting model (association rule) at predicting or classifying cases as having an enhanced response (with respect to the population as a whole), measured against a random choice targeting model.
likelihood
a function that encodes how likely your data is given a range of possible parameters
posterior
a distribution of the parameter-of-interest given your data, combining the prior and likelihood
  • stop if P(CTR_B > CTR_A) is below a threshold.
  • if P(CTR_B < CTR_A) ≈ 0.5, the banners perform about the same
  • run the test until the expected cost drops below a threshold
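
A minimal sketch of those stopping checks, drawing Monte Carlo samples from Beta posteriors in the same way as the code below; "expected cost" is interpreted here as the Bayesian expected loss of picking B, and the thresholds are illustrative assumptions.

# Sketch of the stopping checks above: posterior probability that B beats A,
# and the expected loss (cost) of choosing B, from Beta posterior samples.
import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 1.1, 14.2                       # same priors as the code below
clicks_a, views_a = 450, 56000
clicks_b, views_b = 345, 49000

a = rng.beta(clicks_a + alpha, views_a - clicks_a + beta, 100000)
b = rng.beta(clicks_b + alpha, views_b - clicks_b + beta, 100000)

p_b_beats_a = np.mean(b > a)
expected_loss_b = np.mean(np.maximum(a - b, 0))   # CTR given up if we pick B but A is better

# illustrative thresholds
if p_b_beats_a < 0.05:
    print("stop: B is very unlikely to beat A")
elif expected_loss_b < 0.0001:
    print("stop: picking B costs almost nothing even if it is wrong")
else:
    print("keep collecting data", p_b_beats_a, expected_loss_b)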

Calculations in Python

[6]

from numpy.random import beta as beta_dist
import numpy as np
N_samp = 10000 # number of samples to draw
clicks_A = 450 # insert your own data here
views_A = 56000
clicks_B = 345 # ditto
views_B = 49000
alpha = 1.1 # just for the example - set your own!
beta = 14.2
A_samples = beta_dist(clicks_A+alpha, views_A-clicks_A+beta, N_samp)
B_samples = beta_dist(clicks_B+alpha, views_B-clicks_B+beta, N_samp)

np.mean(A_samples > B_samples) # posterior probability that CTR_A > CTR_B

np.mean( 100. *(A_samples - B_samples)/B_samples > 3 ) # probability that the lift of A relative to B is at least 3%

# You can set alpha and beta priors to 1 when you believe all values are equally likely

[1] http://ewulczyn.github.io/How_Naive_AB_Testing_Goes_Wrong/
[2] http://en.wikipedia.org/wiki/P-value
[3] http://www.evanmiller.org/how-not-to-run-an-ab-test.html
[4] http://elem.com/~btilly/ab-testing-multiple-looks/part1-rigorous.html
[5] http://www.bayesianwitch.com/blog/2014/bayesian_ab_test.html
[6] http://engineering.richrelevance.com/bayesian-ab-tests/