Near the end of my first statistics class, I was introduced to the concept of a hypothesis test: given two sample sets, determine the probability that they were drawn from the same population. If this probability (the p-value) is less than your chosen significance level (typically 5% or less, depending on your field), you can reject the null hypothesis and accept the alternative hypothesis that the two samples are indeed from different populations.
This was presented to me in the context of the social sciences, but it comes up in programming quite often as well, whether you're A/B-testing a user interface or just want to compare benchmark results properly.
I was taught one of the most common methods for doing this: the t-test. This is a parametric test, because it works by estimating and comparing the parameters of a normal distribution for both samples. That is, it only works properly when the sampled variable is normally distributed. If you don't know that for sure, you first need to run a different kind of test, and if it's not, you need a different testing method altogether.
Non-parametric tests are different (duh!). They work with variables of any distribution, although this comes with a few disadvantages. For one, they typically have lower test power, meaning the observed difference (or the sample set) needs to be larger for them to reach the same confidence as a parametric test. Generally, they're also more computationally expensive.
One such non-parametric test is the permutation test.
Given a sample of total size n, divided into two groups of n1 and n2 variates, respectively (so n = n1 + n2), we should first note that there are n! permutations of all values; the order of samples within each group doesn't matter, so dividing by n1!*n2! gives us the total number of possibilities to divide the sample into groups of those sizes. Only one such configuration will have all values in one group be less than all values in the other group, so if we detect this case, we know the probability of the samples being from the same distribution is the inverse of that number, P = (n1!*n2!) / n! (for n1 = n2 = 5, for example, that's 1 in 252, or about 0.4%).
We can generalize this idea for less clear-cut samples. We'll use the sum of all values in the smaller of the two sample sets as a test statistic: T = (n1 <= n2) ? sum(Group1) : sum(Group2). We can now ask for the probability of T being this large or larger if we were to assign values to groups randomly, a question we can answer exactly by evaluating T for all n! / (n1!*n2!) possible configurations:
/* One-sided permutation test for `H0 : u1 <= u2` vs. `H1 : u1 > u2`.
 */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint64_t perm_test_rec(int ni, int n, const int32_t * data,
                              int64_t T_orig, int64_t T_part);

// Plain factorial; as we'll see below, this overflows for n >= 21.
static uint64_t fact(int n) {
    uint64_t r = 1;
    for (int i = 2; i <= n; ++i) r *= i;
    return r;
}

double perm_test(int n1, int n, int32_t * data) {
    assert(data != NULL);
    assert(0 < n1 && n1 < n);
    int64_t T = 0;
    uint64_t z; // number of configurations with T >= T_orig
    int n2 = n - n1;
    if (n1 <= n2) {
        for (int i = 0; i < n1; ++i) T += data[i];
        z = perm_test_rec(n1, n, data, T, 0);
    } else {
        for (int i = n1; i < n; ++i) T += data[i];
        z = perm_test_rec(n2, n, data, T, 0);
    }
    uint64_t num_configs = fact(n) / (fact(n1) * fact(n2));
    return (double)z / (double)num_configs;
}

static uint64_t perm_test_rec(int ni, int n, const int32_t * data,
                              int64_t T_orig, int64_t T_part)
{
    if (ni == 0) return (T_part >= T_orig);
    uint64_t z = 0;
    // Pick the position of the next group member among the first n - ni + 1
    // elements, then recurse on everything after it.
    for (int i = 0; i < (n - ni + 1); ++i) {
        z += perm_test_rec(ni - 1, n - i - 1, data + i + 1,
                           T_orig, T_part + data[i]);
    }
    return z;
}
I really like this pattern of recursion for dealing with combinatorics, for some reason.
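To try it out, a minimal driver might look something like this (the numbers are made up, and the helper data here is purely hypothetical):

#include <stdint.h>
#include <stdio.h>

int main(void) {
    // group 1 is the first 4 values, group 2 the remaining 5
    int32_t data[] = { 14, 12, 15, 13,  9, 11, 8, 10, 9 };
    double p = perm_test(4, 9, data);
    printf("one-sided p = %f\n", p);
    return 0;
}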
In case you're worried about the run time of this code: don't be. For n1 = n2 = 10, this will evaluate the base case of the helper function less than 200,000 times.
A modern CPU will crunch through the few million required operations in a few
milliseconds, tops.
There's something way worse about this code than its performance: when n = 21, fact(n) actually overflows, meaning this won't even give correct results! Of course, we can pull some tricks to increase the critical value of n:
uint64_t num_configs = 1;
int j = 2;
// Start the product above max(n1, n2), so fact(max(n1, n2)) never has to be
// divided out later (max()/min() are assumed to be the usual macros)
for (int i = max(n1, n2) + 1; i <= n; ++i) {
num_configs *= i;
// Weave in divisions by factors of fact(min(n1,n2)) when possible
if (j <= min(n1, n2) && num_configs % j == 0) {
num_configs /= j++;
}
}
// Perform remaining divisions
for (; j <= min(n1, n2); ++j) num_configs /= j;
I used Rust to test this algorithm because of its built-in overflow checking - the first time I actually wanted that for unsigned integers, ever.
With a bit of number theory and ingenuity, you can probably get rid of the check for num_configs % j == 0 somehow, maybe even the "remaining divisions" loop, but I don't think you can postpone overflow much further than this; this will actually handle values up to n1 = 30, n2 = 32 correctly, after which we might as well give up, since even the correct result values will soon no longer fit into a 64-bit integer.
Alternatively, you could just count the number of permutations you evaluated in perm_test_rec, but that's not a viable way to calculate this when you don't want to enumerate them all anyway, as we'll do later on.
For now, this is more than enough to start worrying about the test completing before you're old and gray.
Rather than evaluating T for billions or trillions of combinations, we can instead estimate the value of z by evaluating it for some number M of random configurations. Luckily, we already know how to pick those fairly:
#define M 10000
double perm_test(int n1, int n, int32_t * data) {
assert(data != NULL);
assert(0 < n1 && n1 < n);
int64_t T = 0;
int n2 = n - n1;
int ni;
if (n1 <= n2) {
ni = n1; for (int i = 0; i < n1; ++i) T += data[i];
} else {
ni = n2; for (int i = n1; i < n; ++i) T += data[i];
}
int32_t * sample_config = (int32_t *) malloc(ni * sizeof(int32_t));
uint64_t z = 0; // number of configurations with T >= T_orig
for (int i = 0; i < M; ++i) {
rand_select(data, n, sample_config, ni, sizeof(int32_t));
int64_t t = 0;
for (int j = 0; j < ni; ++j) t += sample_config[j];
if (t >= T) z++;
}
free(sample_config);
return (double)z / (double)M;
}
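In case you don't have that rand_select handy, here's a rough stand-in; the signature is my guess from the call above, and it just does selection sampling with plain rand(), modulo bias and all:

#include <stdlib.h>
#include <string.h>

// Stand-in for rand_select (interface assumed from the call above): copy a
// uniformly random subset of k out of n elements from src into dst using
// selection sampling. rand() is assumed to be seeded elsewhere.
static void rand_select(const void * src, int n, void * dst, int k, size_t size)
{
    const char * in = (const char *)src;
    char * out = (char *)dst;
    for (int i = 0; i < n && k > 0; ++i) {
        // keep element i with probability (picks still needed) / (elements left)
        if (rand() % (n - i) < k) {
            memcpy(out, in + (size_t)i * size, size);
            out += size;
            --k;
        }
    }
}

For serious use you'd want a better random number generator than rand(), but for a sketch it does the job.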
Note, however, that we have introduced a few new sources of possible errors just now. Most importantly, we might just get terribly unlucky and draw an unrepresentative sequence of configurations, possibly yielding results that are much too high or too low. There isn't much we can do about that - it's in the nature of randomized algorithms, after all. If you're a bit paranoid (like I initially was) you can attempt some kind of correction, rather than just returning z / M. Standard practice seems to be to simply ignore this, since for large enough M, even a high-confidence interval around the estimate is negligibly narrow.
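If you do want some kind of correction anyway, one option is to treat z as a binomial count, include the observed configuration itself as one of the draws, and report a conservative upper bound instead of the raw estimate - a sketch (the helper name is arbitrary):

#include <math.h>
#include <stdint.h>

// Conservative estimate: the observed configuration counts as one draw, plus
// a ~99% normal-approximation upper bound on the binomial sampling error.
static double corrected_p(uint64_t z, uint64_t n_draws)
{
    double p_hat = (double)(z + 1) / (double)(n_draws + 1);
    double se = sqrt(p_hat * (1.0 - p_hat) / (double)n_draws);
    double p = p_hat + 2.58 * se;
    return p > 1.0 ? 1.0 : p;
}

For large M the difference to the raw estimate is small, which is exactly why this is usually skipped.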
We can, however, dynamically switch to the exact method when the number of possible configurations is small enough:
double perm_test(int n1, int n, int32_t * data) {
assert(data != NULL);
assert(0 < n1 && n1 < n);
int64_t T = 0;
int n2 = n - n1;
int ni;
if (n1 <= n2) {
ni = n1; for (int i = 0; i < n1; ++i) T += data[i];
} else {
ni = n2; for (int i = n1; i < n; ++i) T += data[i];
}
if (ni < 32) {
uint64_t num_configs = 1; {
int j = 2;
// Start above max(n1, n2) so fact(max(n1, n2)) never needs dividing out later
for (int i = max(n1, n2) + 1; i <= n; ++i) {
num_configs *= i;
// Weave in divisions by factors of fact(min(n1,n2)) when possible
if (j <= min(n1, n2) && num_configs % j == 0) {
num_configs /= j++;
}
}
// Perform remaining divisions
for (; j <= min(n1, n2); ++j) num_configs /= j;
}
if (num_configs < M) {
uint64_t z = perm_test_rec(ni, n, data, T, 0);
return (double)z / (double)num_configs;
}
}
int32_t * sample_config = (int32_t *) malloc(ni * sizeof(int32_t));
uint64_t z = 0; // number of configurations with T >= T_orig
for (int i = 0; i < M; ++i) {
rand_select(data, n, sample_config, ni, sizeof(int32_t));
int64_t t = 0;
for (int j = 0; j < ni; ++j) t += sample_config[j];
if (t >= T) z++;
}
free(sample_config);
double p_estimate = (double)z / (double)M;
/*
* Optional correction ...
*/
return p_estimate;
}
While non-parametric tests are more widely applicable than parametric ones, you still need to be careful with what data you use this on, and what you can conclude from the results. For example, we shouldn't use this test to compare the benchmark results of different algorithms from my last post; each measurement comes from a distinct distribution, because each one was measured under distinct test conditions.
For a conclusive test, we might either change our method of measurement (e.g. recording the total time taken for all test cases, and doing that a few dozen times), or transform the data somehow (e.g. by choosing one of the algorithms as the "control group" and dividing each measurement by the corresponding measurement from that group).
This test also assumes the samples are independent; for dependent (e.g. paired) measurements, there's a very similar kind of test you can use instead. And of course, the usual rules of data collection still apply: make sure you're measuring what you think you're measuring, and be mindful of selection bias.
Also note that repeated testing inflates your probability of error; say you want to compare some number of groups to a common control group. If you do a simple test like this for each comparison, each with probability of error p, you're looking at a probability of 1 - pow(1 - p, k) that at least one of the k comparisons gives a wrong result - for p = 0.05 and five comparisons that's already about 22.6%. You can either adjust your critical value of p to account for that, or reach for a fancier test. (This is true for all hypothesis tests of this kind; you would do an ANOVA where you might otherwise do multiple t-tests, for example.)
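If you go the "adjust your critical value" route, the Šidák correction is one standard option - a small sketch (the helper is mine):

#include <math.h>

// Per-comparison significance level such that the family-wise error rate
// over k independent comparisons stays at alpha (Šidák correction).
static double sidak_alpha(double alpha, int k)
{
    return 1.0 - pow(1.0 - alpha, 1.0 / (double)k);
}

For alpha = 0.05 and k = 5, this comes out to roughly 0.0102 per comparison.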
There's another reason the code as shown above wouldn't work with that data: the measurements are 64-bit integers, because cycle counts can easily go into the hundreds of billions. For large n, we can expect this to overflow T. Luckily, algebra tells us that multiplying by a constant k > 0 maintains the ordering of T, so with 0 < k < 1 we'll lose some precision, but will at least still get a good approximation of the correct result. Since T = x_1 + x_2 + ... + x_n, and multiplication distributes over addition, we can instead calculate kT = k*x_1 + k*x_2 + ... + k*x_n and thus avoid overflow. We can gain back some accuracy by complicating the code a bit and going with kT = k*(x_1 + ... + x_m) + k*(x_(m+1) + ... + x_2m) + ... + k*( ... + x_n). We can derive suitable values of k and m from the maximum measurement:
uint64_t T = 0;
uint64_t max = 0;
for (int i = 0; i < n; ++i) if (data[i] > max) max = data[i];
// ilog2 : index of highest set bit
// (assumes max needs well over 32 bits, as cycle counts here do, so the shift stays in range)
int m = 1 << (63 - ilog2(max)); // values per batch: a batch sum can't overflow
int k = 1 + ilog2(ni);          // scale factor 2^-k keeps the total from overflowing
int i = 0, base = (n1 <= n2) ? 0 : n1;
for (; i + m <= ni; i += m) {
    uint64_t dT = 0;
    for (int j = i; j < i + m; ++j) dT += data[base + j];
    T += dT >> k;
}
for (; i < ni; i += 1) T += data[base + i] >> k;
You could also do the calculation in floating point, though you'd probably still want to do it in batches like this to minimize rounding error.
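That might look something like this (reusing data, base, ni and m from the snippet above):

// Floating-point variant: each batch sum stays exact in integer arithmetic
// (by the choice of m above), and only the accumulation into a double rounds.
double T = 0.0;
int i = 0;
for (; i + m <= ni; i += m) {
    uint64_t dT = 0;
    for (int j = i; j < i + m; ++j) dT += data[base + j];
    T += (double)dT;
}
for (; i < ni; ++i) T += (double)data[base + i];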
In some cases, you may not have a hypothesis that group 1 has a higher (or lower) average than group 2, and may just want to test whether there's any kind of significant difference between the two at all. In those cases, you need to count not only the permutations where T >= T_orig, but also those where T <= S - T_orig, where S is the test statistic computed over both groups combined. This does not count cases where the groups are switched around, so return 2.0 * (double)z / (double)num_configs.
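In code, the base case of perm_test_rec would then become something like this (a sketch; S has to be passed down alongside T_orig):

// Two-sided base case: S is the test statistic over both groups combined.
if (ni == 0) return (T_part >= T_orig) || (T_part <= S - T_orig);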
The necessary changes to the randomized test are analogous.
If you want to find out more about this kind of approach to statistics, I'd recommend the 30-minute talk Statistics for Hackers by Jake Vanderplas as an introduction, and the free online book Resampling: The New Statistics by Julian L. Simon as further reading.