@tomsing1
Created August 22, 2023 15:49
Datalad tutorial, based on its amazing documentation

Installation on Mac OS X

brew install datalad wget

Creating a new dataset

First, we create a new, empty dataset. (Reminder: a dataset refers to a folder of files, not any single file.)

datalad create \
  --description "course on DataLad-101 on my private Laptop" \
  -c text2git DataLad-101

Each DataLad dataset is initialized with git, and changes are tracked. We can check the current - very short - history of this dataset with git log.

cd DataLad-101
git log

Adding local files to the dataset

Great! Next, let's add some files to populate the dataset. First, we create a subdirectory

mkdir books
tree

and then we download some (free) PDFs into it

cd books
wget -q https://sourceforge.net/projects/linuxcommand/files/TLCL/19.01/TLCL-19.01.pdf/download \
  -O TLCL.pdf
wget -q https://homepages.uc.edu/~becktl/byte_of_python.pdf \
  -O byte-of-python.pdf
cd ../
tree

Datalad does not automatically track everything we add to a dataset. Let's check the current status of our dataset:

datalad status

Let's add our new files to the dataset with the datalad save command.

datalad save -m "add books on Python and Unix to read later"

🚨 In contrast to the usual git workflow, DataLad skips the staging area: the datalad save command combines the git add and git commit actions.
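As a sketch in plain git (a throwaway demo repository, not DataLad itself), the single datalad save step corresponds to these two git steps:

```shell
# Sketch: `datalad save -m MSG` roughly corresponds to `git add` + `git commit`.
# (Throwaway demo repository; names are illustrative only.)
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo"
git config user.email "demo@example.com"
echo "hello" > notes.txt
git add notes.txt              # stage the new file
git commit -q -m "add notes"   # record it in history
git log --oneline              # one commit, message "add notes"
```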

A DataLad dataset can be small, e.g. a single text file, or very large, e.g. containing files many GB in size. Large content is tracked in an annex that is automatically created and handled by DataLad. (This behavior distinguishes DataLad from other version control tools, including git.)

git log -p -n 1

Let's add another book to our collection:

cd books
wget -q https://github.com/progit/progit2/releases/download/2.1.154/progit.pdf
cd ../

This time, we are specifying which file to add to the dataset explicitly (as is best practice).

datalad save -m "add reference book about git" books/progit.pdf
git log --oneline

Adding remote files to the dataset

So far, we have downloaded the PDF files first and then added them to our dataset. But we can do even better: the datalad download-url command retrieves and adds a file in one step - and records the remote URL automatically. The command supports various URL schemes, including http, https, ftp and s3.

datalad download-url \
  http://www.tldp.org/LDP/Bash-Beginners-Guide/Bash-Beginners-Guide.pdf \
  --dataset . \
  -m "add beginners guide on bash" \
  -O books/bash_guide.pdf
tree

The new PDF has been added to the dataset:

git log --oneline

Modifying content

Let’s see what happens if we add content, and then modify it.

cat << EOT > notes.txt
One can create a new dataset with 'datalad create [--description] PATH'.
The dataset is created empty

EOT
datalad status
datalad save -m "Add notes on datalad create"

Next, we modify the tracked notes.txt file, e.g. we append a new paragraph:

cat << EOT >> notes.txt
The command "datalad save [-m] PATH" saves the file (modifications) to
history.
Note to self: Always use informative, concise commit messages.

EOT
datalad status
datalad save -m "add note on datalad save"
git log -p -n 2 

Modifying the dataset

Installing existing datasets

Let's start by creating a new directory to store audio recordings

mkdir recordings

The datalad clone command takes the location of an existing dataset to clone, e.g. a URL, a path to a local directory, or an SSH server address. Here, we clone a dataset that is publicly available on GitHub.

datalad clone --dataset . \
 https://github.com/datalad-datasets/longnow-podcasts.git recordings/longnow
tree -d  # show directories

🚨 If we had not specified the path recordings/longnow as a destination for the dataset clone, the command would have installed the dataset into the root of the DataLad-101 dataset, and instead of longnow it would have used the name of the remote repository “longnow-podcasts”.

🧐 Why did the installation complete so quickly?

Upon installation of a DataLad dataset, DataLad retrieves only small files, e.g. text files and small metadata files, for the dataset. It has not downloaded any large files (yet).

cd recordings
du -sh

The .mp3 files listed in the subdirectories are actually symbolic links to the hidden .git/annex directory of the dataset. The real files don't exist on our system, yet.

ls -lh longnow/Long_Now__Conversations_at_The_Interval | head
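What such a listing reveals can be sketched with plain shell tools (the paths and the KEY name below are made up for illustration, not real git-annex keys):

```shell
# Sketch: an annexed file is just a symbolic link into .git/annex.
tmp=$(mktemp -d)
cd "$tmp"
mkdir -p .git/annex/objects
echo "audio bytes" > .git/annex/objects/KEY.mp3   # stand-in for annexed content
ln -s .git/annex/objects/KEY.mp3 episode.mp3      # the "file" in the worktree
readlink episode.mp3                              # prints the link target
cat episode.mp3                                   # content resolves through the link
```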

The files tracked in the annex would be 15.4 GB in size:

cd longnow
datalad status --annex all

To actually retrieve a file, we can use the datalad get command, e.g. the first mp3 file:

datalad get \
  Long_Now__Seminars_About_Long_term_Thinking/2003_11_15__Brian_Eno__The_Long_Now.mp3

To download all files, we can refer to the full dataset with a period: datalad get .

Let's download a few more files (and notice that existing files are not downloaded again):

datalad get \
  Long_Now__Seminars_About_Long_term_Thinking/2003_11_15__Brian_Eno__The_Long_Now.mp3 \
  Long_Now__Seminars_About_Long_term_Thinking/2003_12_13__Peter_Schwartz__The_Art_Of_The_Really_Long_View.mp3 \
  Long_Now__Seminars_About_Long_term_Thinking/2004_01_10__George_Dyson__There_s_Plenty_of_Room_at_the_Top__Long_term_Thinking_About_Large_scale_Computing.mp3
  
datalad status --annex all

If you decide that you do not need a copy of a file locally, you can remove it from your system with the datalad drop command.

datalad drop \
  Long_Now__Seminars_About_Long_term_Thinking/2004_01_10__George_Dyson__There_s_Plenty_of_Room_at_the_Top__Long_term_Thinking_About_Large_scale_Computing.mp3

datalad status --annex all

The file will remain in the dataset, and you can retrieve it from its source again any time.

datalad get \
  Long_Now__Seminars_About_Long_term_Thinking/2004_01_10__George_Dyson__There_s_Plenty_of_Room_at_the_Top__Long_term_Thinking_About_Large_scale_Computing.mp3

The datalad get command is also useful to update a previously retrieved dataset, e.g. because it has been modified (at its source) or because new files have been added. We can inspect the git log to see the latest changes.

git log --reverse  # list in chronological order instead of newest->oldest

Let's make a note of what we have learned and save the updated notes.txt to our dataset.

cd ../../
cat << EOT >> notes.txt
The command 'datalad clone URL/PATH [PATH]' installs a dataset from
e.g., a URL or a path. If you install a dataset into an existing
dataset (as a subdataset), remember to specify the root of the
superdataset with the '-d' option.

EOT
datalad save -m "Add note on datalad clone"

Nested datasets

By cloning the longnow dataset into a subfolder of our DataLad-101 dataset, we created a nested dataset. DataLad supports arbitrarily deep nesting of datasets and maintains them as separate entities, e.g. each of them has its own history.

Let's see what the top level of our dataset has recorded in its git log:

git log -p -n 3

We have stored what DataLad calls a subproject commit of the subdataset:

[truncated]
commit 9a18ee49a00fabe615501d3858e81cd32a0a8dfd
Author: Thomas Sandmann <[email protected]>
Date:   Sun Jun 18 12:18:04 2023 -0700

    [DATALAD] Added subdataset

diff --git a/.gitmodules b/.gitmodules
new file mode 100644
index 0000000..9bc9ee9
--- /dev/null
+++ b/.gitmodules
@@ -0,0 +1,5 @@
+[submodule "recordings/longnow"]
+       path = recordings/longnow
+       url = https://github.com/datalad-datasets/longnow-podcasts.git
+       datalad-id = b3ca2718-8901-11e8-99aa-a0369f7c647e
+       datalad-url = https://github.com/datalad-datasets/longnow-podcasts.git
diff --git a/recordings/longnow b/recordings/longnow
new file mode 160000
index 0000000..dcc34fb
--- /dev/null
+++ b/recordings/longnow
@@ -0,0 +1 @@
+Subproject commit dcc34fbe669b06ced84ced381ba0db21cf5e665f
[truncated]

The commit also contains the exact hash of the subdataset (shown again below for clarity):

+Subproject commit dcc34fbe669b06ced84ced381ba0db21cf5e665f

This hash refers to a specific version of the longnow dataset, and we can find it in its own git log:

cd recordings/longnow
git log --oneline
dcc34fb (HEAD -> master, origin/master, origin/HEAD) Update aggregated metadata
36a30a1 [DATALAD RUNCMD] Update from feed
bafdc04 Uniformize JSON-LD context with DataLad's internal extractors
004e484 [DATALAD RUNCMD] .datalad/maint/make_readme.py
7ee3ded Sort episodes newest-first
[truncated]

At this point, our nested dataset has the following structure:

[Figure: nested dataset structure]

Hint: The `-C` argument allows you to execute `git` or `datalad` commands in a specific subdirectory, so you don't have to change directories as often:
cd ../../
git -C recordings/longnow log --oneline

Performing scripted analyses

First, we create a new directory - outside of the longnow subdataset - to store our analysis code.

# make sure you are in the DataLad-101 directory!
mkdir code
tree -d

Then we add a shell script to extract metadata from the filenames of the .mp3 files

cat << EOT > code/list_titles.sh
for i in recordings/longnow/Long_Now__Seminars*/*.mp3; do
   # get the filename
   base=\$(basename "\$i");
   # strip the extension
   base=\${base%.mp3};
   # date as yyyy-mm-dd
   printf "\${base%%__*}\t" | tr '_' '-';
   # name and title without underscores
   printf "\${base#*__}\n" | tr '_' ' ';
done
EOT
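The parameter expansions used in the script can be checked on a single sample filename (the filename below is illustrative):

```shell
# Demo of the parameter expansions used in code/list_titles.sh:
i="recordings/longnow/Long_Now__Seminars/2003_11_15__Brian_Eno__The_Long_Now.mp3"
base=$(basename "$i")              # strip the directory part
base=${base%.mp3}                  # strip the extension
echo "${base%%__*}" | tr '_' '-'   # date part: 2003-11-15
echo "${base#*__}" | tr '_' ' '    # speaker and title, underscores as spaces
```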

and add it to the main dataset:

datalad status
datalad save -m "Add short script to write a list of podcast speakers and titles"

We could run the bash script manually, redirect its output to a file - and then add it to the dataset ourselves. Or we could perform all of the above in a single step, recording the provenance of the output automatically with the datalad run command: it records a shell command, and saves all changes this command triggered in the dataset.

datalad run \
  -m "create a list of podcast titles" \
  "bash code/list_titles.sh > recordings/podcasts.tsv"

git log -p -n 1

The log entry contains a section delimited by

=== Do not change lines below ===

and

^^^ Do not change lines above ^^^

This section holds the run record of the execution (in this case a very simple summary), which is essential for the datalad rerun command (see below).
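Since the run record is plain JSON embedded in the commit message, the recorded command can be fished out with standard tools - a sketch using sed on a sample record (this is not a datalad feature, just ordinary text processing):

```shell
# Sketch: extract the recorded command from a run-record commit message.
# The sample message mirrors the JSON that datalad run embeds.
msg='=== Do not change lines below ===
{
 "chain": [],
 "cmd": "bash code/list_titles.sh > recordings/podcasts.tsv",
 "exit": 0
}
^^^ Do not change lines above ^^^'
echo "$msg" | sed -n 's/^ *"cmd": "\(.*\)",$/\1/p'
```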

Note: Only commands that actually lead to changes in the dataset (e.g. modified, added or deleted files) will be saved to its history. For example, re-executing the command below produces exactly the same recordings/podcasts.tsv as before - nothing changes, and nothing is added to the dataset:

datalad run \
  -m "Try again to create a list of podcast titles" \
  "bash code/list_titles.sh > recordings/podcasts.tsv"

git log --oneline

Rerunning scripts

Let's examine the output file we created and added to the dataset above.

head recordings/podcasts.tsv

At this point, the file only contains information about podcasts in one of the longnow subdataset's subdirectories. Let's modify the bash script to list all of the files instead:

cat << EOT >| code/list_titles.sh

for i in recordings/longnow/Long_Now*/*.mp3; do

   # get the filename
   base=\$(basename "\$i");
   # strip the extension
   base=\${base%.mp3};
   printf "\${base%%__*}\t" | tr '_' '-';
   # name and title without underscores
   printf "\${base#*__}\n" | tr '_' ' ';

done
EOT

Because the script is part of the dataset, DataLad now reports it as modified

datalad status

and we save it to the dataset:

datalad save -m "BF: list both directories content" \
  code/list_titles.sh

Next, we look up the hash of the previous execution of the script in the git log output:

git log -n 2
[truncated]
commit b846a4a00b666765352d68913a72610408e2dcb8
Author: Thomas Sandmann <[email protected]>
Date:   Sun Jun 18 12:47:53 2023 -0700

    [DATALAD RUNCMD] create a list of podcast titles

    === Do not change lines below ===
    {
     "chain": [],
     "cmd": "bash code/list_titles.sh > recordings/podcasts.tsv",
     "dsid": "43c85bc0-391c-4a4c-8f4b-9863a6a8e156",
     "exit": 0,
     "extra_inputs": [],
     "inputs": [],
     "outputs": [],
     "pwd": "."
    }
    ^^^ Do not change lines above ^^^

and pass it to the datalad rerun command:

datalad rerun b846a4a00b666765352d68913a72610408e2dcb8
git log -n 1

The datalad rerun command looked up the command in the run record section of the log (the JSON between the Do not change delimiters) and re-executed it in exactly the same way. Because the command updated an existing file within the dataset, the modified recordings/podcasts.tsv was saved to the dataset automatically.

The datalad diff command reports the differences between two commits, e.g.

datalad diff --to HEAD~1

and reports that the podcasts.tsv file has indeed changed. To see the details of the change, we can use git diff instead:

git diff HEAD~1

To quickly extract information about who and what created or modified a file, we can use the -- separator followed by the path to the file of interest:

git log -- recordings/podcasts.tsv
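The effect of the -- separator can be sketched in a throwaway plain-git repository (file names below are illustrative):

```shell
# Sketch: `git log -- PATH` lists only the commits that touched PATH.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo"
git config user.email "demo@example.com"
echo a > a.txt; git add a.txt; git commit -q -m "add a"
echo b > b.txt; git add b.txt; git commit -q -m "add b"
git log --oneline -- b.txt    # only the "add b" commit appears
```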

Let's add this new knowledge to our notes.txt file:

cat << EOT >> notes.txt
There are two useful functions to display changes between two
states of a dataset: "datalad diff -f/--from COMMIT -t/--to COMMIT"
and "git diff COMMIT COMMIT", where COMMIT is a shasum of a commit
in the history.

EOT

datalad save -m "add note on datalad and git diff"

🚨 datalad rerun can re-execute the run records of both datalad run and datalad rerun commands, but not any other type of command in your history, e.g. a plain datalad save. Therefore, it is best practice to record the execution of scripts by wrapping them in a call to datalad run.

cat << EOT >> notes.txt
The datalad run command can record the impact a script or command has
on a Dataset. In its simplest form, datalad run only takes a commit
message and the command that should be executed.

Any datalad run command can be re-executed by using its commit shasum
as an argument in datalad rerun CHECKSUM. DataLad will take
information from the run record of the original commit, and re-execute
it. If no changes happen with a rerun, the command will not be written
to history. Note: you can also rerun a datalad rerun command!

EOT

datalad save -m "add note on basic datalad run and datalad rerun"

Specifying inputs and outputs

Next, we compose more complicated (and powerful) datalad run commands. As an example, we will use ImageMagick to resize images that are included with the longnow DataLad dataset.

The original files are in a hidden directory:

ls recordings/longnow/.datalad/feed_metadata/*jpg

The following datalad run call uses ImageMagick's convert command to resize an image and write the output to the recordings directory.

datalad run \
  -m "Resize logo for slides" \
  "convert -resize 400x400 recordings/longnow/.datalad/feed_metadata/logo_salt.jpg recordings/salt_logo_small.jpg"

🧐 But wait - that didn't work! We haven't retrieved the actual image files with datalad get, yet!

This is where the -i/--input option for a datalad run becomes useful: The content of everything that is specified as an input will be retrieved prior to running the command.

datalad run -m "Resize logo for slides" \
  --input "recordings/longnow/.datalad/feed_metadata/logo_salt.jpg" \
  "convert -resize 400x400 recordings/longnow/.datalad/feed_metadata/logo_salt.jpg recordings/salt_logo_small.jpg"

By default, all files added to the annex are locked, i.e. they cannot be modified unless they are first unlocked with the datalad unlock command (and locked again when the dataset is saved).

For existing files in the dataset that are modified by a datalad run command, this can be achieved automatically by specifying the output file(s).

For example, the following command will update the (saved and locked) files output by the previous datalad run command:

datalad run -m "Resize logo for slides" \
  --input "recordings/longnow/.datalad/feed_metadata/logo_interval.jpg" \
  --output "recordings/interval_logo_small.jpg" \
  "convert -resize 450x450 recordings/longnow/.datalad/feed_metadata/logo_interval.jpg recordings/interval_logo_small.jpg"
cat << EOT >> notes.txt
You should specify all files that a command takes as input with an
-i/--input flag. These files will be retrieved prior to the command
execution. Any content that is modified or produced by the command
should be specified with an -o/--output flag. Upon a run or rerun of
the command, the contents of these files will get unlocked so that
they can be modified.

EOT

datalad save -m "add additional notes on run options"

[Figure: inputs and outputs of datalad run]

Placeholders

To avoid retyping the same parameters over again, you can use placeholders in the shell command:

datalad run -m "Resize logo for slides" \
  --input "recordings/longnow/.datalad/feed_metadata/logo_interval.jpg" \
  --output "recordings/interval_logo_small.jpg" \
  "convert -resize 450x450 {inputs} {outputs}"

🚨 If multiple inputs / outputs are specified, they are substituted as a single space-separated list. To access a specific item by position, the {inputs[0]} notation can be used.
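The distinction can be illustrated with POSIX positional parameters (an analogy only - datalad performs the actual placeholder substitution itself):

```shell
# POSIX-shell analogue of {inputs} vs {inputs[0]} (illustration, not datalad code):
set -- logo_salt.jpg logo_interval.jpg   # pretend these were the two --input files
echo "$@"    # like {inputs}:    all items, space-separated
echo "$1"    # like {inputs[0]}: the first item only
```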

Dry-runs

The --dry-run=basic or --dry-run=command arguments can be added to datalad run to report on its consequences without actually modifying any files.

Clean datasets

By default, datalad run will only execute if there are no pending changes in the dataset.

cat << EOT >> notes.txt
Important! If the dataset is not "clean" (i.e. the datalad status
output is not empty), datalad run will not work - you will have to
save the modifications present in your dataset first.
EOT
Alternatively, the `--explicit` argument can be used, which will allow `datalad run` to
execute, but _only_ allows it to modify the indicated output files.

cat << EOT >> notes.txt
A suboptimal alternative is the --explicit flag, used to record only
those changes done to the files listed with --output flags.

EOT

datalad save -m "add note on clean datasets"

Further reading

Issues in the DataLad book:

  • 2.3. Input and output: Alternative code presented as "or shorter" in the second datalad code example fails because it would generate the same output file again - but didn't unlock / specify the --output argument for it. (That is introduced on the same page but in a later example.)