Handling sensitive datasets securely while also sharing them among teams and services is a tension that many researchers struggle with. What's more, developing statistical analysis code on the real dataset can bias researchers: they may stumble upon associations that arose by chance and were not part of the initial hypothesis, and these spurious associations may find their way into the eventual publication. Below is an idea to help with both problems.
Simple English description:
- We take a dataset, strip out unnecessarily identifying information, and scramble all the other columns so there are no associations along the rows; we then use that as a dummy dataset while we develop the analysis for the research project. When we are happy with our analysis we run it once, and only once, on the real, unscrambled dataset. Additionally (bullet point 4 of the technical description below), before we run the final analysis we repeatedly re-scramble the dataset and run the analysis on each scrambled copy. The results of these repeated analyses should approximate the null distribution of whatever statistical test you use. If they do not, you have probably introduced some bias into your analysis somewhere!
Technical Description:
- Take some sensitive dataset D with x rows and y columns, where each row is a subject whose identity and information (the y columns) we want to protect.
- Perform a function f() on D, where f(D) does the following (a minimal sketch is given after this list):
  - Removes any identifying and/or irrelevant columns (name, address, DOB etc.), call them yi.
  - Randomly re-orders (shuffles) every remaining column yr independently, except the index column y0.
- We then develop our analysis A() around f(D). This gives A(f(D)) the following properties (see the second sketch after this list):
  - The data will be properly anonymised and un-attributable should it fall into the wrong hands.
  - Any per-column descriptive statistic of the sample f(D) will equal that of D, since shuffling only re-orders values within columns.
  - Any test statistic should have only an alpha chance of being significant, because the shuffling destroys the associations between columns.
- When the analysis code A() is finalised, we run A(f(D)) n times, each time with a fresh random shuffle, where n is a large number (see the third sketch after this list):
  - We record the test statistic over the n trials and confirm that it approximates the test's null distribution.
  - If it does not, this suggests there is some mistake in A() that is likely to bias the results (although it would not detect confounders etc.).
- We perform the analysis A(D) once on the real, unscrambled dataset and publish the results (see the final sketch below).
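
A minimal sketch of f() (steps 1 and 2 above), assuming the data sit in a pandas DataFrame. The column names "subject_id", "name", "address" and "dob" are hypothetical placeholders for the index column y0 and the identifying columns yi.

```python
import numpy as np
import pandas as pd

def scramble(df: pd.DataFrame,
             id_cols=("name", "address", "dob"),
             index_col="subject_id",
             seed=None) -> pd.DataFrame:
    """f(D): drop identifying columns, then shuffle each remaining column
    independently so that no within-row associations survive."""
    rng = np.random.default_rng(seed)
    out = df.drop(columns=[c for c in id_cols if c in df.columns])
    for col in out.columns:
        if col == index_col:
            continue  # leave the index column y0 untouched
        out[col] = rng.permutation(out[col].to_numpy())
    return out
```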
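
To illustrate step 3, here is a toy analysis A() developed against the scrambled data, together with a check of the descriptive-statistic property. The made-up dataset, the "exposure"/"outcome" columns and the choice of a Pearson correlation test are all assumptions made purely for illustration; the real A() is whatever the study calls for.

```python
from scipy import stats

def analysis(df: pd.DataFrame):
    """A(): return (test statistic, p-value) for the study hypothesis."""
    return stats.pearsonr(df["exposure"], df["outcome"])

# A made-up "real" dataset standing in for D.
rng = np.random.default_rng(0)
real = pd.DataFrame({"subject_id": np.arange(500),
                     "exposure": rng.normal(size=500),
                     "outcome": rng.normal(size=500)})
dummy = scramble(real, id_cols=(), index_col="subject_id", seed=1)

# Per-column descriptive statistics are unchanged by the shuffle.
assert np.isclose(real["exposure"].mean(), dummy["exposure"].mean())
assert np.isclose(real["outcome"].std(), dummy["outcome"].std())
```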
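
For step 4, one way to check that the repeated analyses "approximate the test distribution" is to verify that the p-values from n fresh scrambles are roughly Uniform(0, 1), i.e. that about an alpha fraction come out significant. The Kolmogorov-Smirnov check below is one such formalisation, continuing the sketch above; it is an assumption, not part of the original description.

```python
def calibrate(df, n=1000, alpha=0.05):
    """Run A(f(D)) n times with fresh shuffles and summarise the p-values."""
    pvals = []
    for i in range(n):
        _, p = analysis(scramble(df, id_cols=(), index_col="subject_id", seed=i))
        pvals.append(p)
    pvals = np.asarray(pvals)
    ks_stat, ks_p = stats.kstest(pvals, "uniform")
    print(f"fraction significant at alpha={alpha}: {(pvals < alpha).mean():.3f}")
    print(f"KS test of p-values against Uniform(0,1): p = {ks_p:.3f}")
    return pvals
```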
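
Finally, step 5 under the same assumptions: calibrate on scrambled copies, then run the finalised analysis exactly once on the real data.

```python
calibrate(real, n=1000)

# The single, final run of A(D) on the unscrambled dataset.
stat, p = analysis(real)
print(f"final result: statistic = {stat:.3f}, p = {p:.3g}")
```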