If JSON data is to be shared for analysis, it's often necessary to depersonalize the data first. Depersonalization is different from anonymization. If data is anonymized, all information about the user is removed. Depersonalization, however, changes the user information to values that can't be recognized as being related to the real user. Because the same user always maps to the same replacement value, it's still possible to see relationships within the data.
To use the depersonalize.sh shell script, either give the name of the input file to be depersonalized as an argument:
depersonalize.sh events.jsonl
Or pipe the input into the script through stdin, which allows it to be used as part of more complex pipelines:
grep tarfu events.jsonl | depersonalize.sh
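As a rough illustration of how a script can accept either form of input, the sketch below reads from the file named by the first argument, or from stdin when no argument is given. The function name read_input is hypothetical and is not taken from depersonalize.sh; the real script may handle its input differently.

```shell
#!/bin/sh
# Hypothetical helper (not from depersonalize.sh): emit the contents of
# the file named by the first argument, or of stdin when no argument
# is given.
read_input() {
    if [ "$#" -gt 0 ]; then
        cat -- "$1"
    else
        cat
    fi
}
```

With a helper like this, `read_input events.jsonl` and `grep tarfu events.jsonl | read_input` both feed the same downstream commands.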
The script uses jq to extract the .actor.name property from each object found in the JSONL input. It passes those names through the md5 program to produce a hash for each name. Then jq is used again to format the list of names and hashes to JSON of the form:
{
"name1": "name1_md5_hash",
"name2": "name2_md5_hash",
"nameN": "nameN_md5_hash"
}
That JSON of names and hashes is written to a temporary file for use in the next step.
Note: These first two steps wouldn't need to be separate calls to jq if the program had built-in support for MD5 or allowed calls to external programs.
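The first two steps described above could be sketched roughly as follows. This is not the content of depersonalize.sh: the function names hash_name and build_map are hypothetical, the fallback between md5sum and md5 is an assumption about the platform, and the sketch hashes each name without a trailing newline, which may not match what the real script does (so the hashes it produces may differ). It also assumes names contain no tab characters.

```shell
#!/bin/sh
# Hypothetical helper: print the MD5 hex digest of the first argument.
# Assumption: md5sum (GNU coreutils) is used when present, otherwise
# the BSD/macOS md5 program in quiet mode.
hash_name() {
    if command -v md5sum >/dev/null 2>&1; then
        printf '%s' "$1" | md5sum | cut -d ' ' -f 1
    else
        printf '%s' "$1" | md5 -q
    fi
}

# Hypothetical helper: read JSONL from the named files (or stdin) and
# print one JSON object that maps each distinct .actor.name to its hash.
build_map() {
    # Step 1: extract the names, skipping objects that have none.
    jq -r '.actor.name // empty' "$@" |
        sort -u |
        while IFS= read -r name; do
            # Emit "name<TAB>hash" for each distinct name.
            printf '%s\t%s\n' "$name" "$(hash_name "$name")"
        done |
        # Step 2: turn the tab-separated lines into a single JSON object.
        jq -R -n '[inputs | split("\t") | {(.[0]): .[1]}] | add // {}'
}
```

For example, `build_map events.jsonl > names.json` would write the map to a file named names.json (a name chosen here only for illustration).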
Finally, jq is called again, with the JSON of names and hashes assigned to a variable. A short program replaces each .actor.name value with its hash from the variable. Additionally, the example data uses the users' names as part of the .actor.@id property, so a substitution replaces that part of the value.
The program sends the output to stdout, so be sure to redirect it to a file to keep it.
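The final substitution step could look something like the sketch below. It is an illustration under assumptions, not the script's actual jq program: the function name depersonalize_with_map is hypothetical, loading the map with jq's --slurpfile option is one of several ways to assign it to a variable, and the sketch assumes every object has string values for .actor.name and .actor.@id, with each name present in the map.

```shell
#!/bin/sh
# Hypothetical helper: the first argument is the JSON file of names and
# hashes; any remaining arguments are JSONL files (stdin if none).
depersonalize_with_map() {
    map_file=$1
    shift
    # --slurpfile binds $map to an array holding the file's JSON values,
    # so the map object itself is $map[0].
    jq -c --slurpfile map "$map_file" '
        .actor.name as $name
        | $map[0][$name] as $hash
        | .actor.name = $hash
        # Replace the name only where it ends the @id value.
        | .actor["@id"] |= (if endswith($name)
                            then rtrimstr($name) + $hash
                            else . end)
    ' "$@"
}
```

Matching the name as a suffix, rather than with a regular expression, avoids having to escape names that contain regex metacharacters. As with the script itself, the result goes to stdout, for example `depersonalize_with_map names.json events.jsonl > depersonalized.jsonl` (both file names are illustrative).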
An example of an object from the JSONL input:
{"actor":{"@id":"https://example.edu/#profile:tarfu","@context":"http://purl.imsglobal.org/ctx/caliper/v1/Context","@type":"http://purl.imsglobal.org/caliper/v1/lis/Person","name":"tarfu"}}
Although the temporary JSON file isn't visible when the program executes, for that JSONL input it would contain the name and its hash, like:
{
"tarfu": "e2b87e98602ac8fc95f49fbc3f5c7b1d"
}
The final output of the program is:
{"@id":"https://example.edu/#profile:e2b87e98602ac8fc95f49fbc3f5c7b1d","@context":"http://purl.imsglobal.org/ctx/caliper/v1/Context","@type":"http://purl.imsglobal.org/caliper/v1/lis/Person","name":"e2b87e98602ac8fc95f49fbc3f5c7b1d"}
🚧 To-do:
.actor.name (line 6 of depersonalize.sh).
md5 (line 11 of depersonalize.sh).