Handling sensitive datasets securely while also sharing them among teams and services is a tension that many researchers struggle with. What's more, developing statistical analysis code on the real dataset can bias researchers: they may stumble upon associations that arose by chance and were not part of the initial hypothesis, and these spurious associations may find their way into the eventual publication. Below is an idea to help with both problems.
Simple English description:
- We take a dataset, strip out unnecessarily identifying information, and scramble all the other columns so there are no associations along the rows; we then use that as a dummy dataset while we develop the analysis for the research project. When we are happy with our analysis we run it once, and only once, on the real, unscrambled dataset. Additionally (bullet point 4 of the technical description below), before we run the final analysis we repeatedly re-scramble the dataset and run the analysis on each scrambled copy. The results of these repeated analyses should approximate the null distribution of whatever statistical test you use. If they do not, you have probably introduced some bias into your analysis somewhere!
Technical Description:
- Take some sensitive dataset D with x rows and y columns, where each row is a subject whose identity and information (the y columns) we want to protect.
- Perform a function f() on D, where f(D) does the following (a minimal sketch is given after this list):
  - Removes any identifying and/or irrelevant columns (name, address, DOB etc.), call them yi.
  - Randomly re-orders (shuffles) every remaining column yr independently, except the index column y0.
- We then develop our analysis A() around f(D). This gives A(f(D)) the following properties (see the second sketch after this list):
  - The data will be properly anonymised and un-attributable should it fall into the wrong hands.
  - Any per-column descriptive statistic of the sample f(D) will equal that of D, since shuffling only re-orders values within columns.
  - Any test statistic should have only an alpha chance of being significant, because the shuffling destroys the associations between columns.
- When the analysis code A() is finalised, we run A(f(D)) n times, each time with a fresh random shuffle, where n is a large number (see the third sketch after this list):
  - We record the test statistic over the n trials and confirm that it approximates the test's null distribution.
  - If it does not, this suggests there is some mistake in A() that is likely to bias the results (although it would not detect confounders etc.).
- We perform the analysis A(D) once on the real, unscrambled dataset and publish the results (see the final sketch below).
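
A minimal sketch of f() (steps 1 and 2 above), assuming the data sit in a pandas DataFrame. The column names "subject_id", "name", "address" and "dob" are hypothetical placeholders for the index column y0 and the identifying columns yi.

```python
import numpy as np
import pandas as pd

def scramble(df: pd.DataFrame,
             id_cols=("name", "address", "dob"),
             index_col="subject_id",
             seed=None) -> pd.DataFrame:
    """f(D): drop identifying columns, then shuffle each remaining column
    independently so that no within-row associations survive."""
    rng = np.random.default_rng(seed)
    out = df.drop(columns=[c for c in id_cols if c in df.columns])
    for col in out.columns:
        if col == index_col:
            continue  # leave the index column y0 untouched
        out[col] = rng.permutation(out[col].to_numpy())
    return out
```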
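
To illustrate step 3, here is a toy analysis A() developed against the scrambled data, together with a check of the descriptive-statistic property. The made-up dataset, the "exposure"/"outcome" columns and the choice of a Pearson correlation test are all assumptions made purely for illustration; the real A() is whatever the study calls for.

```python
from scipy import stats

def analysis(df: pd.DataFrame):
    """A(): return (test statistic, p-value) for the study hypothesis."""
    return stats.pearsonr(df["exposure"], df["outcome"])

# A made-up "real" dataset standing in for D.
rng = np.random.default_rng(0)
real = pd.DataFrame({"subject_id": np.arange(500),
                     "exposure": rng.normal(size=500),
                     "outcome": rng.normal(size=500)})
dummy = scramble(real, id_cols=(), index_col="subject_id", seed=1)

# Per-column descriptive statistics are unchanged by the shuffle.
assert np.isclose(real["exposure"].mean(), dummy["exposure"].mean())
assert np.isclose(real["outcome"].std(), dummy["outcome"].std())
```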
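
For step 4, one way to check that the repeated analyses "approximate the test distribution" is to verify that the p-values from n fresh scrambles are roughly Uniform(0, 1), i.e. that about an alpha fraction come out significant. The Kolmogorov-Smirnov check below is one such formalisation, continuing the sketch above; it is an assumption, not part of the original description.

```python
def calibrate(df, n=1000, alpha=0.05):
    """Run A(f(D)) n times with fresh shuffles and summarise the p-values."""
    pvals = []
    for i in range(n):
        _, p = analysis(scramble(df, id_cols=(), index_col="subject_id", seed=i))
        pvals.append(p)
    pvals = np.asarray(pvals)
    ks_stat, ks_p = stats.kstest(pvals, "uniform")
    print(f"fraction significant at alpha={alpha}: {(pvals < alpha).mean():.3f}")
    print(f"KS test of p-values against Uniform(0,1): p = {ks_p:.3f}")
    return pvals
```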
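
Finally, step 5 under the same assumptions: calibrate on scrambled copies, then run the finalised analysis exactly once on the real data.

```python
calibrate(real, n=1000)

# The single, final run of A(D) on the unscrambled dataset.
stat, p = analysis(real)
print(f"final result: statistic = {stat:.3f}, p = {p:.3g}")
```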