brew install datalad wget
First, we create a new, empty dataset. (Reminder: a dataset refers to a folder of files, not any single file.)
datalad create \
--description "course on DataLad-101 on my private Laptop" \
-c text2git DataLad-101
Each DataLad dataset is initialized with git, and changes are tracked. We can check the current - very short - history of this dataset with git log.
cd DataLad-101
git log
Great! Next, let's add some files to populate the dataset. First, we create a subdirectory:
mkdir books
tree
and then we download some (free) PDFs into it:
cd books
wget -q https://sourceforge.net/projects/linuxcommand/files/TLCL/19.01/TLCL-19.01.pdf/download \
-O TLCL.pdf
wget -q https://homepages.uc.edu/~becktl/byte_of_python.pdf \
-O byte-of-python.pdf
cd ../
tree
DataLad does not automatically track everything we add to a dataset. Let's check the current status of our dataset:
datalad status
Let's add our new files to the dataset with the datalad save command.
datalad save -m "add books on Python and Unix to read later"
🚨 As opposed to the usual git workflow, DataLad skips the staging area. The datalad save command combines the git add and git commit actions.
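To make that comparison concrete, here is a minimal plain-git sketch - in a throwaway repository with made-up paths and messages, purely for illustration - of the two steps that a single datalad save collapses into:

```shell
# Throwaway demo repository (hypothetical paths, for illustration only).
tmp=$(mktemp -d)
cd "$tmp"
git init -q demo
cd demo
echo "some notes" > notes.txt
git add notes.txt                 # step 1: stage the change (datalad save skips this)
git -c user.name=demo -c user.email=demo@example.com \
    commit -q -m "add notes"      # step 2: record it in history
git rev-list --count HEAD         # -> 1 (one commit so far)
```

With datalad save, the staging and committing above happen in one command.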
A DataLad dataset could be small, e.g. a single text file, or very large, e.g. including files being many GB in size. Large content is tracked in an annex that is automatically created and handled by DataLad. (This behavior distinguishes DataLad from other version control tools, including git.)
git log -p -n 1
Let's add another book to our collection:
cd books
wget -q https://github.com/progit/progit2/releases/download/2.1.154/progit.pdf
cd ../
This time, we are specifying which file to add to the dataset explicitly (as is best practice).
datalad save -m "add reference book about git" books/progit.pdf
git log --oneline
So far, we have downloaded the PDF files first, and then added them to our dataset. But we can do even better: the datalad download-url command retrieves and adds a file in one step - and it records the remote URL automatically. The command supports various URL schemes, including e.g. http, https, ftp or s3.
datalad download-url \
http://www.tldp.org/LDP/Bash-Beginners-Guide/Bash-Beginners-Guide.pdf \
--dataset . \
-m "add beginners guide on bash" \
-O books/bash_guide.pdf
tree
The new PDF has been added to the dataset:
git log --oneline
Let’s see what happens if we add content, and then modify it.
cat << EOT > notes.txt
One can create a new dataset with 'datalad create [--description] PATH'.
The dataset is created empty.
EOT
datalad status
datalad save -m "Add notes on datalad create"
Next, we modify the tracked notes.txt file, e.g. we append a new paragraph:
cat << EOT >> notes.txt
The command "datalad save [-m] PATH" saves the file (modifications) to
history.
Note to self: Always use informative, concise commit messages.
EOT
datalad status
datalad save -m "add note on datalad save"
git log -p -n 2
Let's start by creating a new directory to store audio recordings:
mkdir recordings
The datalad clone command takes the location of an existing dataset to clone, e.g. a URL, a path to a local directory, or an SSH server location. Here, we clone a dataset that is publicly available on GitHub:
datalad clone --dataset . \
https://github.com/datalad-datasets/longnow-podcasts.git recordings/longnow
tree -d # show directories
🚨 If we had not specified the path recordings/longnow as a destination for the dataset clone, the command would have installed the dataset into the root of the DataLad-101 dataset, and instead of longnow it would have used the name of the remote repository “longnow-podcasts”.
🧐 Why did the installation complete so quickly?
Upon installation of a DataLad dataset, DataLad retrieves only small files, e.g. text files and small metadata files, for the dataset. It has not downloaded any large files (yet).
cd recordings
du -sh
The .mp3 files listed in the subdirectories are actually symbolic links into the hidden .git/annex directory of the dataset. The actual file contents don't exist on our system, yet.
ls -lh longnow/Long_Now__Conversations_at_The_Interval | head
The files tracked in the annex would be 15.4 GB in size:
cd longnow
datalad status --annex all
To actually retrieve a file, we can use the datalad get command, e.g. for the first mp3 file:
datalad get \
Long_Now__Seminars_About_Long_term_Thinking/2003_11_15__Brian_Eno__The_Long_Now.mp3
To download all files, we can use a period to refer to the full dataset, e.g. datalad get .
Let's download a few more files (and notice that existing files are not downloaded again):
datalad get \
Long_Now__Seminars_About_Long_term_Thinking/2003_11_15__Brian_Eno__The_Long_Now.mp3 \
Long_Now__Seminars_About_Long_term_Thinking/2003_12_13__Peter_Schwartz__The_Art_Of_The_Really_Long_View.mp3 \
Long_Now__Seminars_About_Long_term_Thinking/2004_01_10__George_Dyson__There_s_Plenty_of_Room_at_the_Top__Long_term_Thinking_About_Large_scale_Computing.mp3
datalad status --annex all
If you decide that you do not need a copy of a file locally, you can remove it from your system with the datalad drop command.
datalad drop \
Long_Now__Seminars_About_Long_term_Thinking/2004_01_10__George_Dyson__There_s_Plenty_of_Room_at_the_Top__Long_term_Thinking_About_Large_scale_Computing.mp3
datalad status --annex all
The file will remain in the dataset, and you can retrieve it from its source again any time.
datalad get \
Long_Now__Seminars_About_Long_term_Thinking/2004_01_10__George_Dyson__There_s_Plenty_of_Room_at_the_Top__Long_term_Thinking_About_Large_scale_Computing.mp3
The datalad get command is also useful to update a previously retrieved dataset, e.g. because it has been modified (at its source) or because new files have been added.
We can inspect the git log
to see the latest changes.
git log --reverse # list in chronological order instead of newest->oldest
Let's make a note of what we have learned and save the updated notes.txt to our dataset.
cd ../../
cat << EOT >> notes.txt
The command 'datalad clone URL/PATH [PATH]' installs a dataset from
e.g., a URL or a path. If you install a dataset into an existing
dataset (as a subdataset), remember to specify the root of the
superdataset with the '-d' option.
EOT
datalad save -m "Add note on datalad clone"
By cloning the longnow dataset into a subfolder of our DataLad-101 dataset, we created a nested dataset. DataLad supports arbitrarily deep nesting of datasets and maintains them as separate entities, e.g. each of them has its own history. Let's see what the top level of our dataset recorded in its git log:
git log -p -n 3
We have stored what DataLad calls a subproject commit of the subdataset:
[truncated]
commit 9a18ee49a00fabe615501d3858e81cd32a0a8dfd
Author: Thomas Sandmann <[email protected]>
Date: Sun Jun 18 12:18:04 2023 -0700
[DATALAD] Added subdataset
diff --git a/.gitmodules b/.gitmodules
new file mode 100644
index 0000000..9bc9ee9
--- /dev/null
+++ b/.gitmodules
@@ -0,0 +1,5 @@
+[submodule "recordings/longnow"]
+ path = recordings/longnow
+ url = https://github.com/datalad-datasets/longnow-podcasts.git
+ datalad-id = b3ca2718-8901-11e8-99aa-a0369f7c647e
+ datalad-url = https://github.com/datalad-datasets/longnow-podcasts.git
diff --git a/recordings/longnow b/recordings/longnow
new file mode 160000
index 0000000..dcc34fb
--- /dev/null
+++ b/recordings/longnow
@@ -0,0 +1 @@
+Subproject commit dcc34fbe669b06ced84ced381ba0db21cf5e665f
[truncated]
The commit also contains the exact hash of the subdataset (shown again below for clarity):
+Subproject commit dcc34fbe669b06ced84ced381ba0db21cf5e665f
This hash refers to a specific version of the longnow dataset, and we can find it in its own git log:
cd recordings/longnow
git log --oneline
dcc34fb (HEAD -> master, origin/master, origin/HEAD) Update aggregated metadata
36a30a1 [DATALAD RUNCMD] Update from feed
bafdc04 Uniformize JSON-LD context with DataLad's internal extractors
004e484 [DATALAD RUNCMD] .datalad/maint/make_readme.py
7ee3ded Sort episodes newest-first
[truncated]
From the top level of our nested dataset, we can also inspect the subdataset's history without changing directories:
cd ../../
git -C recordings/longnow log --oneline
First, we create a new directory - outside of the longnow subdataset - to store our analysis code.
# make sure you are in the DataLad-101 directory!
mkdir code
tree -d
Then we add a shell script to extract metadata from the filenames of the .mp3 files:
cat << EOT > code/list_titles.sh
for i in recordings/longnow/Long_Now__Seminars*/*.mp3; do
# get the filename
base=\$(basename "\$i");
# strip the extension
base=\${base%.mp3};
# date as yyyy-mm-dd
printf "\${base%%__*}\t" | tr '_' '-';
# name and title without underscores
printf "\${base#*__}\n" | tr '_' ' ';
done
EOT
and add it to the main dataset:
datalad status
datalad save -m "Add short script to write a list of podcast speakers and titles"
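As an aside, the parameter expansions used in the script can be checked in isolation on a single, made-up filename that follows the same date/speaker/title layout as the podcast files:

```shell
# Walk through the expansions on one hypothetical filename.
base="2003_11_15__Brian_Eno__The_Long_Now.mp3"
base=${base%.mp3}                            # strip the extension
printf '%s\n' "${base%%__*}" | tr '_' '-'    # date part  -> 2003-11-15
printf '%s\n' "${base#*__}"  | tr '_' ' '    # rest       -> Brian Eno  The Long Now
```

`%%__*` removes the longest suffix starting at the first `__` (leaving the date), while `#*__` removes the shortest prefix ending at the first `__` (leaving speaker and title).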
We could run the bash script manually, redirect its output to a file - and then add the file to the dataset ourselves. Or we could perform all of the above in a single step, recording the provenance of the output automatically, with the datalad run command: it records a shell command and saves all changes this command triggered in the dataset.
datalad run \
-m "create a list of podcast titles" \
"bash code/list_titles.sh > recordings/podcasts.tsv"
git log -p -n 1
The commit message (shown in the git log above) contains a section delimited by
=== Do not change lines below === and
^^^ Do not change lines above ^^^.
It holds the run record of the execution (in this case a very simple summary) that is essential for the datalad rerun command (see below).
Note: Only commands that actually led to changes in the dataset (e.g. modified, added or deleted files) will be saved in the dataset. For example, the following command re-creates recordings/podcasts.tsv with identical content - nothing changes, so no new commit is added:
datalad run \
-m "Try again to create a list of podcast titles" \
"bash code/list_titles.sh > recordings/podcasts.tsv"
git log --oneline
Let's examine the output file we created and added to the dataset above.
head recordings/podcasts.tsv
At this point, the file only contains information about podcasts in one of the longnow subdataset's subdirectories. Let's modify the bash script to list all of the files instead:
cat << EOT >| code/list_titles.sh
for i in recordings/longnow/Long_Now*/*.mp3; do
# get the filename
base=\$(basename "\$i");
# strip the extension
base=\${base%.mp3};
printf "\${base%%__*}\t" | tr '_' '-';
# name and title without underscores
printf "\${base#*__}\n" | tr '_' ' ';
done
EOT
Because the script is part of the dataset, it now shows up as modified
datalad status
and we save it to the dataset:
datalad save -m "BF: list both directories content" \
code/list_titles.sh
Next, we look up the hash of the previous execution of the script in the git log output:
git log -n 2
[truncated]
commit b846a4a00b666765352d68913a72610408e2dcb8
Author: Thomas Sandmann <[email protected]>
Date: Sun Jun 18 12:47:53 2023 -0700
[DATALAD RUNCMD] create a list of podcast titles
=== Do not change lines below ===
{
"chain": [],
"cmd": "bash code/list_titles.sh > recordings/podcasts.tsv",
"dsid": "43c85bc0-391c-4a4c-8f4b-9863a6a8e156",
"exit": 0,
"extra_inputs": [],
"inputs": [],
"outputs": [],
"pwd": "."
}
^^^ Do not change lines above ^^^
and pass it to the datalad rerun command:
datalad rerun b846a4a00b666765352d68913a72610408e2dcb8
git log -n 1
The datalad rerun command looked up the command in the run record section of the commit message (the section between the Do not change lines delimiters) and re-executed it in the exact same way. Because the command updated an existing file within the dataset, the modified recordings/podcasts.tsv was saved to the dataset automatically.
The datalad diff command reports the differences between two commits, e.g.
datalad diff --to HEAD~1
and reports that the podcasts.tsv file has indeed changed. To see the details of the change, we can use git diff instead:
git diff HEAD~1
To quickly extract information about who and what created or modified a file, we can use the -- prefix and provide the path to the file of interest:
git log -- recordings/podcasts.tsv
Let's add this new knowledge to our notes.txt file:
cat << EOT >> notes.txt
There are two useful functions to display changes between two
states of a dataset: "datalad diff -f/--from COMMIT -t/--to COMMIT"
and "git diff COMMIT COMMIT", where COMMIT is a shasum of a commit
in the history.
EOT
datalad save -m "add note datalad and git diff"
🚨 datalad rerun can re-execute the run records of both datalad run and datalad rerun commands, but not any other type of datalad command in your history, e.g. datalad save. Therefore, it is best practice to record the execution of scripts by wrapping them in a call to datalad run.
cat << EOT >> notes.txt
The datalad run command can record the impact a script or command has
on a Dataset. In its simplest form, datalad run only takes a commit
message and the command that should be executed.
Any datalad run command can be re-executed by using its commit shasum
as an argument in datalad rerun CHECKSUM. DataLad will take
information from the run record of the original commit, and re-execute
it. If no changes happen with a rerun, the command will not be written
to history. Note: you can also rerun a datalad rerun command!
EOT
datalad save -m "add note on basic datalad run and datalad rerun"
Next, we compose more complicated (and powerful) datalad run commands. As an example, we will use ImageMagick to resize images that are included with the longnow DataLad dataset.
The original files are in a hidden directory:
ls recordings/longnow/.datalad/feed_metadata/*jpg
The following command uses ImageMagick's convert tool to resize the images and write the output to the recordings directory:
datalad run \
-m "Resize logo for slides" \
"convert -resize 400x400 recordings/longnow/.datalad/feed_metadata/logo_salt.jpg recordings/salt_logo_small.jpg"
🧐 But wait - that didn't work! We haven't retrieved the actual image files with datalad get, yet!
This is where the -i/--input option for datalad run becomes useful: the content of everything that is specified as an input will be retrieved prior to running the command.
datalad run -m "Resize logo for slides" \
--input "recordings/longnow/.datalad/feed_metadata/logo_salt.jpg" \
"convert -resize 400x400 recordings/longnow/.datalad/feed_metadata/logo_salt.jpg recordings/salt_logo_small.jpg"
By default, all files added to the git-annex are locked, e.g. they cannot be modified unless they are first unlocked with the datalad unlock command (and relocked when the dataset is saved).
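A sketch of that manual unlock-modify-save cycle, assuming datalad and ImageMagick are installed and we are inside the DataLad-101 dataset (the guards turn it into a no-op otherwise; the filename is the resized logo created above):

```shell
# Manual unlock -> modify -> save cycle (hedged sketch).
command -v datalad >/dev/null 2>&1 || { echo "datalad not installed; skipping"; exit 0; }
[ -d .datalad ] || { echo "not inside a DataLad dataset; skipping"; exit 0; }
datalad unlock recordings/salt_logo_small.jpg   # make the annexed file writable
convert -resize 200x200 recordings/salt_logo_small.jpg recordings/salt_logo_small.jpg
datalad save -m "shrink the salt logo further"  # saving re-locks the file
```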
For files in the dataset that are modified by a datalad run command, this unlocking can be done automatically by specifying the output file(s) with the -o/--output option. For example, the following command creates another resized logo and declares it as an output, so later re-executions can update the (saved and locked) file:
datalad run -m "Resize logo for slides" \
--input "recordings/longnow/.datalad/feed_metadata/logo_interval.jpg" \
--output "recordings/interval_logo_small.jpg" \
"convert -resize 450x450 recordings/longnow/.datalad/feed_metadata/logo_interval.jpg recordings/interval_logo_small.jpg"
cat << EOT >> notes.txt
You should specify all files that a command takes as input with an
-i/--input flag. These files will be retrieved prior to the command
execution. Any content that is modified or produced by the command
should be specified with an -o/--output flag. Upon a run or rerun of
the command, the contents of these files will get unlocked so that
they can be modified.
EOT
datalad save -m "add additional notes on run options"
To avoid retyping the same paths over and over again, you can use placeholders in the shell command:
datalad run -m "Resize logo for slides" \
--input "recordings/longnow/.datalad/feed_metadata/logo_interval.jpg" \
--output "recordings/interval_logo_small.jpg" \
"convert -resize 450x450 {inputs} {outputs}"
🚨 If multiple inputs / outputs are specified, they will be expanded separated by spaces. To access specific elements by position, the {inputs[0]} notation can be used.
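As a hedged sketch of the positional notation (assuming datalad is installed and we are inside the DataLad-101 dataset; the guards make it a no-op otherwise), a command with two inputs could retrieve both but resize only the first:

```shell
# {inputs} would expand to both files separated by a space;
# {inputs[0]} expands to the first input only.
command -v datalad >/dev/null 2>&1 || { echo "datalad not installed; skipping"; exit 0; }
[ -d .datalad ] || { echo "not inside a DataLad dataset; skipping"; exit 0; }
datalad run -m "Resize only the salt logo" \
  --input "recordings/longnow/.datalad/feed_metadata/logo_salt.jpg" \
  --input "recordings/longnow/.datalad/feed_metadata/logo_interval.jpg" \
  --output "recordings/salt_logo_small.jpg" \
  "convert -resize 400x400 {inputs[0]} {outputs}"
```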
The --dry-run=basic or --dry-run=command arguments can be added to datalad run to report on its consequences without actually modifying any files.
datalad run will only execute if there are no pending changes to the dataset.
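For example, a dry run of the resize command above could look like this (a hedged sketch; the guards make it a no-op when datalad or the dataset is not available):

```shell
# Preview what a run would do - inputs, outputs, expanded command -
# without retrieving or modifying anything.
command -v datalad >/dev/null 2>&1 || { echo "datalad not installed; skipping"; exit 0; }
[ -d .datalad ] || { echo "not inside a DataLad dataset; skipping"; exit 0; }
datalad run --dry-run=basic \
  -m "Resize logo for slides" \
  --input "recordings/longnow/.datalad/feed_metadata/logo_interval.jpg" \
  --output "recordings/interval_logo_small.jpg" \
  "convert -resize 450x450 {inputs} {outputs}"
```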
cat << EOT >> notes.txt
Important! If the dataset is not "clean" (i.e. the datalad status output
is not empty), datalad run will not work - you will have to save
modifications present in your dataset first.
EOT
Alternatively, the `--explicit` argument can be used: `datalad run` will then execute even in an unclean dataset, but _only_ changes to the indicated output files will be recorded.
cat << EOT >> notes.txt
A suboptimal alternative is the --explicit flag, used to record only
those changes done to the files listed with --output flags.
EOT
datalad save -m "add note on clean datasets"
- 2.3. Input and output: The alternative code presented as "or shorter" in the second datalad code example fails because it would generate the same output file again - but doesn't unlock it or specify the --output argument for it. (That is introduced on the same page, but in a later example.)