Near the end of my first statistics class, I was introduced to the concept of a hypothesis test: given two sample sets, determine the probability that they were drawn from the same population. If this probability (the p-value) is less than your chosen significance level (typically 5% or less, depending on your field), you can reject the null hypothesis and accept the alternative hypothesis that the two samples are indeed from different populations.
This was presented to me in the context of the social sciences, but it comes up in programming quite often as well, whether you're A/B-testing a user interface or just want to compare benchmark results properly.
I was taught one of the most common methods for doing this: the t-test. This is a parametric test, because it works by estimating and comparing the parameters of a normal distribution for both samples. That is, it only works properly when the sampled variable is normally distributed. If you don't know that for sure, you first need to run a different kind of test, and if it's not, you need a different testing method altogether.
Non-parametric tests are different (duh!). They work with variables of any distribution, although this comes with a few disadvantages. For one, they typically have lower test power, meaning the observed difference (or the sample set) needs to be larger for them to reach the same confidence as a parametric test. Generally, they're also more computationally expensive.
One such non-parametric test is the permutation test.
Given a sample of total size n, divided into two groups of n1 and n2 variates, respectively (so n = n1 + n2), we should first note that there are n! permutations of all values; the order of samples within each group doesn't matter, so dividing by n1!*n2! gives us the total number of possibilities to divide the sample into groups of those sizes. Only one such configuration will have all values in one group be less than all values in the other group, so if we detect this case, we know the probability of the samples being from the same distribution is the inverse of that number, P = (n1!*n2!) / n! (for n1 = n2 = 5, for example, that's 1 in 252, or about 0.4%).
We can generalize this idea for less clear-cut samples. We'll use the sum of all values in the smaller of the two sample sets as a test statistic: T = (n1 <= n2) ? sum(Group1) : sum(Group2). We can now ask for the probability of T being this large or larger if we were to assign values to groups randomly, a question we can answer exactly by evaluating T for all n! / (n1!*n2!) possible configurations:
/* One-sided permutation test for `H0 : u1 <= u2` vs. `H1 : u1 > u2`.
 */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint64_t perm_test_rec(int ni, int n, const int32_t * data,
                              int64_t T_orig, int64_t T_part);

// Plain factorial; as we'll see below, this overflows for n >= 21.
static uint64_t fact(int n) {
    uint64_t r = 1;
    for (int i = 2; i <= n; ++i) r *= i;
    return r;
}

double perm_test(int n1, int n, int32_t * data) {
    assert(data != NULL);
    assert(0 < n1 && n1 < n);
    int64_t T = 0;
    uint64_t z; // number of configurations with T >= T_orig
    int n2 = n - n1;
    if (n1 <= n2) {
        for (int i = 0; i < n1; ++i) T += data[i];
        z = perm_test_rec(n1, n, data, T, 0);
    } else {
        for (int i = n1; i < n; ++i) T += data[i];
        z = perm_test_rec(n2, n, data, T, 0);
    }
    uint64_t num_configs = fact(n) / (fact(n1) * fact(n2));
    return (double)z / (double)num_configs;
}

static uint64_t perm_test_rec(int ni, int n, const int32_t * data,
                              int64_t T_orig, int64_t T_part)
{
    if (ni == 0) return (T_part >= T_orig);
    uint64_t z = 0;
    // Pick the position of the next group member among the first n - ni + 1
    // elements, then recurse on everything after it.
    for (int i = 0; i < (n - ni + 1); ++i) {
        z += perm_test_rec(ni - 1, n - i - 1, data + i + 1,
                           T_orig, T_part + data[i]);
    }
    return z;
}
I really like this pattern of recursion for dealing with combinatorics, for some reason.
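To try it out, a minimal driver might look something like this (the numbers are made up, and the helper data here is purely hypothetical):

#include <stdint.h>
#include <stdio.h>

int main(void) {
    // group 1 is the first 4 values, group 2 the remaining 5
    int32_t data[] = { 14, 12, 15, 13,  9, 11, 8, 10, 9 };
    double p = perm_test(4, 9, data);
    printf("one-sided p = %f\n", p);
    return 0;
}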
In case you're worried about the run time of this code: don't be. For n1 = n2 = 10, this will evaluate the base case of the helper function less than 200,000 times.
A modern CPU will crunch through the few million required operations in a few
milliseconds, tops.
There's something way worse about this code than its performance: when n = 21, fact(n) actually overflows, meaning this won't even give correct results! Of course, we can pull some tricks to increase the critical value of n:
uint64_t num_configs = 1;
int j = 2;
// Start the product above max(n1, n2), so fact(max(n1, n2)) never has to be
// divided out later (max()/min() are assumed to be the usual macros)
for (int i = max(n1, n2) + 1; i <= n; ++i) {
num_configs *= i;
// Weave in divisions by factors of fact(min(n1,n2)) when possible
if (j <= min(n1, n2) && num_configs % j == 0) {
num_configs /= j++;
}
}
// Perform remaining divisions
for (; j <= min(n1, n2); ++j) num_configs /= j;
I used Rust to test this algorithm because of its built-in overflow checking - the first time I actually wanted that for unsigned integers, ever.
With a bit of number theory and ingenuity, you can probably get rid of the check for num_configs % j == 0 somehow, maybe even the "remaining divisions" loop, but I don't think you can postpone overflow much further than this; this will actually handle values up to n1 = 30, n2 = 32 correctly, after which we might as well give up, since even the correct result values will soon no longer fit into a 64-bit integer.
Alternatively, you could just count the number of permutations you evaluated in perm_test_rec, but that's not a viable way to calculate this when you don't want to enumerate them all anyway, as we'll do later on.
For now, this is more than enough to start worrying about the test completing before you're old and gray.
Rather than evaluating T for billions or trillions of combinations, we can instead estimate the value of z by evaluating it for some number M of random configurations. Luckily, we already know how to pick those fairly:
#define M 10000
double perm_test(int n1, int n, int32_t * data) {
assert(data != NULL);
assert(0 < n1 && n1 < n);
int64_t T = 0;
int n2 = n - n1;
int ni;
if (n1 <= n2) {
ni = n1; for (int i = 0; i < n1; ++i) T += data[i];
} else {
ni = n2; for (int i = n1; i < n; ++i) T += data[i];
}
int32_t * sample_config = (int32_t *) malloc(ni * sizeof(int32_t));
uint64_t z = 0; // number of configurations with T >= T_orig
for (int i = 0; i < M; ++i) {
rand_select(data, n, sample_config, ni, sizeof(int32_t));
int64_t t = 0;
for (int j = 0; j < ni; ++j) t += sample_config[j];
if (t >= T) z++;
}
free(sample_config);
return (double)z / (double)M;
}
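In case you don't have that rand_select handy, here's a rough stand-in; the signature is my guess from the call above, and it just does selection sampling with plain rand(), modulo bias and all:

#include <stdlib.h>
#include <string.h>

// Stand-in for rand_select (interface assumed from the call above): copy a
// uniformly random subset of k out of n elements from src into dst using
// selection sampling. rand() is assumed to be seeded elsewhere.
static void rand_select(const void * src, int n, void * dst, int k, size_t size)
{
    const char * in = (const char *)src;
    char * out = (char *)dst;
    for (int i = 0; i < n && k > 0; ++i) {
        // keep element i with probability (picks still needed) / (elements left)
        if (rand() % (n - i) < k) {
            memcpy(out, in + (size_t)i * size, size);
            out += size;
            --k;
        }
    }
}

For serious use you'd want a better random number generator than rand(), but for a sketch it does the job.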
Note, however, that we have introduced a few new sources of possible errors just now. Most importantly, we might just get terribly unlucky and draw an unrepresentative sequence of configurations, possibly yielding results that are much too high or too low. There isn't much we can do about that - it's in the nature of randomized algorithms, after all. If you're a bit paranoid (like I initially was) you can attempt some kind of correction, rather than just returning z / M. Standard practice seems to be to simply ignore this, since for large enough M, even a high-confidence interval around the estimate is negligibly narrow.
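If you do want some kind of correction anyway, one option is to treat z as a binomial count, include the observed configuration itself as one of the draws, and report a conservative upper bound instead of the raw estimate - a sketch (the helper name is arbitrary):

#include <math.h>
#include <stdint.h>

// Conservative estimate: the observed configuration counts as one draw, plus
// a ~99% normal-approximation upper bound on the binomial sampling error.
static double corrected_p(uint64_t z, uint64_t n_draws)
{
    double p_hat = (double)(z + 1) / (double)(n_draws + 1);
    double se = sqrt(p_hat * (1.0 - p_hat) / (double)n_draws);
    double p = p_hat + 2.58 * se;
    return p > 1.0 ? 1.0 : p;
}

For large M the difference to the raw estimate is small, which is exactly why this is usually skipped.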
We can, however, dynamically switch to the exact method when the number of possible configurations is small enough:
double perm_test(int n1, int n, int32_t * data) {
assert(data != NULL);
assert(0 < n1 && n1 < n);
int64_t T = 0;
int n2 = n - n1;
int ni;
if (n1 <= n2) {
ni = n1; for (int i = 0; i < n1; ++i) T += data[i];
} else {
ni = n2; for (int i = n1; i < n; ++i) T += data[i];
}
if (ni < 32) {
uint64_t num_configs = 1; {
int j = 2;
// Start above max(n1, n2) so fact(max(n1, n2)) never needs dividing out later
for (int i = max(n1, n2) + 1; i <= n; ++i) {
num_configs *= i;
// Weave in divisions by factors of fact(min(n1,n2)) when possible
if (j <= min(n1, n2) && num_configs % j == 0) {
num_configs /= j++;
}
}
// Perform remaining divisions
for (; j <= min(n1, n2); ++j) num_configs /= j;
}
if (num_configs < M) {
uint64_t z = perm_test_rec(ni, n, data, T, 0);
return (double)z / (double)num_configs;
}
}
int32_t * sample_config = (int32_t *) malloc(ni * sizeof(int32_t));
uint64_t z = 0; // number of configurations with T >= T_orig
for (int i = 0; i < M; ++i) {
rand_select(data, n, sample_config, ni, sizeof(int32_t));
int64_t t = 0;
for (int j = 0; j < ni; ++j) t += sample_config[j];
if (t >= T) z++;
}
free(sample_config);
double p_estimate = (double)z / (double)M;
/*
* Optional correction ...
*/
return p_estimate;
}
While non-parametric tests are more widely applicable than parametric ones, you still need to be careful with what data you use this on, and what you can conclude from the results. For example, we shouldn't use this test to compare the benchmark results of different algorithms from my last post; each measurement comes from a distinct distribution, because each one was measured under distinct test conditions.
For a conclusive test, we might either change our method of measurement (e.g. recording the total time taken for all test cases, and doing that a few dozen times), or transform the data somehow (e.g. by choosing one of the algorithms as the "control group" and dividing each measurement by the corresponding measurement from that group).
This test also assumes the samples are independent; for dependent (e.g. paired) measurements, there's a very similar kind of test you can use instead. And of course, the usual rules of data collection still apply: make sure you're measuring what you think you're measuring, and be mindful of selection bias.
Also note that repeated testing inflates your probability of error; say you want to compare some number of groups to a common control group. If you do a simple test like this for each comparison, each with probability of error p, you're looking at a probability of 1 - pow(1 - p, k) that at least one of the k comparisons gives a wrong result - for p = 0.05 and five comparisons that's already about 22.6%. You can either adjust your critical value of p to account for that, or reach for a fancier test. (This is true for all hypothesis tests of this kind; you would do an ANOVA where you might otherwise do multiple t-tests, for example.)
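If you go the "adjust your critical value" route, the Šidák correction is one standard option - a small sketch (the helper is mine):

#include <math.h>

// Per-comparison significance level such that the family-wise error rate
// over k independent comparisons stays at alpha (Šidák correction).
static double sidak_alpha(double alpha, int k)
{
    return 1.0 - pow(1.0 - alpha, 1.0 / (double)k);
}

For alpha = 0.05 and k = 5, this comes out to roughly 0.0102 per comparison.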
There's another reason the code as shown above wouldn't work with that data: the measurements are 64-bit integers, because cycle counts can easily go into the hundreds of billions. For large n, we can expect this to overflow T. Luckily, algebra tells us that multiplying by a constant k > 0 maintains the ordering of T, so with 0 < k < 1 we'll lose some precision, but will at least still get a good approximation of the correct result. Since T = x_1 + x_2 + ... + x_n, and multiplication distributes over addition, we can instead calculate kT = k*x_1 + k*x_2 + ... + k*x_n and thus avoid overflow. We can gain back some accuracy by complicating the code a bit and going with kT = k*(x_1 + ... + x_m) + k*(x_(m+1) + ... + x_2m) + ... + k*( ... + x_n). We can derive suitable values of k and m from the maximum measurement:
uint64_t T = 0;
uint64_t max = 0;
for (int i = 0; i < n; ++i) if (data[i] > max) max = data[i];
// ilog2 : index of highest set bit
// (assumes max needs well over 32 bits, as cycle counts here do, so the shift stays in range)
int m = 1 << (63 - ilog2(max)); // values per batch: a batch sum can't overflow
int k = 1 + ilog2(ni);          // scale factor 2^-k keeps the total from overflowing
int i = 0, base = (n1 <= n2) ? 0 : n1;
for (; i + m <= ni; i += m) {
    uint64_t dT = 0;
    for (int j = i; j < i + m; ++j) dT += data[base + j];
    T += dT >> k;
}
for (; i < ni; i += 1) T += data[base + i] >> k;
You could also do the calculation in floating point, though you'd probably still want to do it in batches like this to minimize rounding error.
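That might look something like this (reusing data, base, ni and m from the snippet above):

// Floating-point variant: each batch sum stays exact in integer arithmetic
// (by the choice of m above), and only the accumulation into a double rounds.
double T = 0.0;
int i = 0;
for (; i + m <= ni; i += m) {
    uint64_t dT = 0;
    for (int j = i; j < i + m; ++j) dT += data[base + j];
    T += (double)dT;
}
for (; i < ni; ++i) T += (double)data[base + i];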
In some cases, you may not have a hypothesis that group 1 has a higher (or lower) average than group 2, and may just want to test whether there's any kind of significant difference between the two at all. In those cases, you need to count not only the permutations where T >= T_orig, but also those where T <= S - T_orig, where S is the test statistic computed over both groups combined. This does not count cases where the groups are switched around, so return 2.0 * (double)z / (double)num_configs.
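In code, the base case of perm_test_rec would then become something like this (a sketch; S has to be passed down alongside T_orig):

// Two-sided base case: S is the test statistic over both groups combined.
if (ni == 0) return (T_part >= T_orig) || (T_part <= S - T_orig);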
The necessary changes to the randomized test are analogous.
If you want to find out more about this kind of approach to statistics, I'd recommend the 30-minute talk Statistics for Hackers by Jake Vanderplas as an introduction, and the free online book Resampling: The New Statistics by Julian L. Simon as further reading.