Geospatial privacy initial cheatsheet
There's more to it than this, but it's a decent starting point.
- If there are personal identifiers in the dataset, the safest approach is to remove them entirely. If you need to keep them to enable analytics on an 'anonymous user over time', don't rely on a plain, unsalted hash like MD5: when the set of possible identifiers is small, an attacker can hash every candidate and recover the originals. See the NYC Taxicab debacle for an example.
  - If you use hashes, use a long, secret salt and a cryptographically sound hash like SHA-512.
  - Or randomize the data order and assign serial (increasing integer) numbers to identifiers.
  - But really, removing any kind of identifiable ID is better than trying to obscure it.
- Set lower bounds for aggregation: if someone filters an API down to a single record, you may want to return nothing. Otherwise someone can craft a well-filtered query and see just one user's data. See for example the Strava Heatmap privacy quandary: one of the outcomes was to make sure that traces wouldn't be shown in an area if only one user was making them.
- Quantizing data is also a good first step to reduce disclosure risk. For research purposes people rarely need foot-accuracy GPS information or second-accuracy time. Quantizing these values to a 5-minute interval or 500m block (or some other value) can reduce some risk without affecting the overall statistics.
  - Similarly, fuzzing data with Gaussian noise can give you better anonymity without affecting the big picture.
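As a rough illustration of the identifier and quantization advice above, here is a minimal Python sketch using only the standard library. Every name in it (`SECRET_KEY`, `pseudonymize`, `serial_ids`, `quantize_time`, `quantize_metres`, `add_gaussian_noise`) is made up for this example. It swaps a plain salted hash for HMAC-SHA512 with a secret key, which is one common way to do keyed hashing, and it assumes coordinates are already projected to metres (for example UTM) so that a 500m block is a simple floor operation.

```python
import hashlib
import hmac
import random
import secrets
from datetime import datetime

# Hypothetical secret: generate once, keep it out of the published dataset.
SECRET_KEY = secrets.token_bytes(32)

def pseudonymize(user_id: str, key: bytes = SECRET_KEY) -> str:
    """Keyed hash of an identifier; cannot be brute-forced without the key."""
    return hmac.new(key, user_id.encode("utf-8"), hashlib.sha512).hexdigest()

def serial_ids(user_ids):
    """Shuffle the distinct identifiers, then number them 1..n."""
    distinct = list(set(user_ids))
    random.SystemRandom().shuffle(distinct)
    return {uid: n for n, uid in enumerate(distinct, start=1)}

def quantize_time(ts: datetime, minutes: int = 5) -> datetime:
    """Round a timestamp down to its N-minute bucket (N should divide 60)."""
    bucket = (ts.minute // minutes) * minutes
    return ts.replace(minute=bucket, second=0, microsecond=0)

def quantize_metres(value: float, block: float = 500.0) -> float:
    """Snap a projected coordinate (in metres) down to the origin of its block."""
    return (value // block) * block

def add_gaussian_noise(value: float, sigma: float = 50.0) -> float:
    """Fuzz a value with zero-mean Gaussian noise; sigma is an arbitrary choice."""
    return value + random.gauss(0.0, sigma)
```

The serial mapping has to be kept private or thrown away after use, and the same goes for the key. The bucket sizes and `sigma` are placeholders; the right values depend on how dense the data is.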
This is just a starting point. The really tricky part of deanonymization is that it's not just about your dataset, but about how combining your dataset with other datasets can produce a greater-than-the-sum-of-its-parts increase in knowledge. For example:
- You release trip data for your transportation network that includes X personal detail
- Another transportation network releases trip data that includes Y personal detail
- A user transfers from one transportation network to the other in a predictable fashion, so their records in the two datasets can be correlated
- Then with X + Y data, the user may be identifiable.
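To make the transfer scenario above concrete, here is a small Python sketch of how two separately pseudonymized releases might be linked. The record layout (pseudonym, stop, timestamp), the stop name, the 10-minute window, and the function name `link_transfers` are all assumptions invented for illustration, and the data is made up.

```python
from collections import Counter
from datetime import datetime, timedelta

def link_transfers(alights, boards, max_gap=timedelta(minutes=10)):
    """Count how often a rider on network A alights at a stop shortly
    before a rider on network B boards at the same stop."""
    pairs = Counter()
    for a_id, a_stop, a_time in alights:
        for b_id, b_stop, b_time in boards:
            gap = b_time - a_time
            if a_stop == b_stop and timedelta(0) <= gap <= max_gap:
                pairs[(a_id, b_id)] += 1
    return pairs

# Made-up releases: (pseudonym, stop, timestamp)
network_a = [
    ("a1", "Central", datetime(2024, 3, 4, 8, 0)),
    ("a1", "Central", datetime(2024, 3, 5, 8, 1)),
    ("a2", "Central", datetime(2024, 3, 4, 8, 2)),
]
network_b = [
    ("b7", "Central", datetime(2024, 3, 4, 8, 5)),
    ("b7", "Central", datetime(2024, 3, 5, 8, 4)),
    ("b9", "Central", datetime(2024, 3, 4, 17, 0)),
]

matches = link_transfers(network_a, network_b)
# The pair that co-occurs on more than one day is the likeliest same person.
print(matches.most_common(1))
```

A single coincidence means little, but a pair of pseudonyms that lines up day after day is probably one commuter, at which point the X and Y details can be joined.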
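The main academic / rigorous reference for this is k-anonymity, especially for a query API: a release is k-anonymous if every record is indistinguishable from at least k-1 others on its identifying attributes. I'd recommend running anything past an adversarial-minded developer friend to see if they can get any interesting results before things go live.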
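A k-anonymity-style check can be sketched in a few lines of Python. This is a simplified illustration, not a full treatment: the field names (`user`, `cell`, `hour`) and function names are hypothetical, and it counts distinct users per group rather than rows, on the assumption that one user can contribute many records to a trip dataset.

```python
from collections import defaultdict

def users_per_group(records, quasi_identifiers, user_field="user"):
    """Map each combination of quasi-identifier values to the set of users in it."""
    groups = defaultdict(set)
    for r in records:
        key = tuple(r[q] for q in quasi_identifiers)
        groups[key].add(r[user_field])
    return groups

def smallest_group(records, quasi_identifiers, user_field="user"):
    """The k the data currently achieves: the fewest distinct users in any group."""
    groups = users_per_group(records, quasi_identifiers, user_field)
    return min((len(users) for users in groups.values()), default=0)

def suppress_small_groups(records, quasi_identifiers, k, user_field="user"):
    """Drop every record whose group contains fewer than k distinct users."""
    groups = users_per_group(records, quasi_identifiers, user_field)
    return [
        r for r in records
        if len(groups[tuple(r[q] for q in quasi_identifiers)]) >= k
    ]

# Made-up trips, already quantized to a grid cell and an hour of day.
trips = [
    {"user": "u1", "cell": "A", "hour": 8},
    {"user": "u2", "cell": "A", "hour": 8},
    {"user": "u3", "cell": "A", "hour": 8},
    {"user": "u1", "cell": "B", "hour": 23},
    {"user": "u1", "cell": "B", "hour": 23},
]

print(smallest_group(trips, ["cell", "hour"]))
safe = suppress_small_groups(trips, ["cell", "hour"], k=2)
```

For a query API, the same check would run on the result of each filtered query before anything is returned, which is one way to enforce the lower bound on aggregation described earlier.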