The users and contributors of the Internet Archive are what make Archive.org what it is today. Without contributions from our users, we would have nothing, and without users accessing our digital materials, it would mean nothing.
This document gives a brief overview of how to get data into, and out of, Archive.org.
internetarchive (https://github.com/jjjake/internetarchive) is a Python library and command-line interface to Archive.org. It is a tool for getting data into and out of the Internet Archive.
Binaries for the CLI are available here: https://archive.org/details/ia-pex
It can also be installed via pip install internetarchive.
To get started, simply download a binary and configure the ia command.
.. code:: bash

    $ curl -L https://archive.org/download/ia-pex/ia-0.8.2-py2.pex > ia
    $ chmod +x ia
    $ ./ia configure
You will be prompted to enter your Archive.org credentials. After doing so, a config file will be saved to your computer with everything you need to start uploading and modifying metadata via the ia command.
There are about 15 million public items on Archive.org. Over 2 million of those items have been uploaded using the internetarchive library. Below is a brief overview of uploading using the CLI.
$ ./ia upload <identifier> <files>... --metadata=collection:test_collection --metadata='title:My Title'
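When an upload is driven from a script, it can help to assemble the command from a dictionary of metadata rather than by hand. The sketch below only builds the argument list for the ia upload syntax shown above. The helper name build_upload_command, the identifier, and the file name are hypothetical, and nothing is actually uploaded.

```python
import shlex


def build_upload_command(identifier, files, metadata, ia="./ia"):
    """Return the argv list for an `ia upload` call (hypothetical helper)."""
    args = [ia, "upload", identifier, *files]
    for key, value in metadata.items():
        # One --metadata=key:value option per field, as in the command above.
        args.append(f"--metadata={key}:{value}")
    return args


cmd = build_upload_command(
    "my-test-item",   # hypothetical identifier
    ["song.mp3"],     # hypothetical local file
    {"collection": "test_collection", "title": "My Title"},
)

# shlex.join quotes arguments that contain spaces, so the printed line
# can be pasted into a shell as-is.
print(shlex.join(cmd))
```

Passing the list to subprocess.run(cmd, check=True) would run the upload without any shell quoting concerns, assuming the ia binary has already been configured.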
See youtube2ia.sh for a more advanced example of how you might use ia upload in a bash script to mirror a YouTube channel to Archive.org.
You can also use a spreadsheet for uploading a batch of files. See metadata.csv for an example of the required format.
$ ./ia upload --spreadsheet=metadata.csv
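A spreadsheet like this can be generated with Python's csv module. The sketch below assumes one row per file, with identifier and file columns followed by one column per metadata field. Those column names, the helper name, and the sample rows are assumptions for illustration; metadata.csv remains the authoritative reference for the required format.

```python
import csv
import io


def write_upload_spreadsheet(rows, out):
    """Write one CSV row per file to upload (column names are assumed)."""
    fieldnames = ["identifier", "file", "title", "collection"]
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical items and files, for illustration only.
rows = [
    {"identifier": "my-test-item-1", "file": "one.mp3",
     "title": "First Title", "collection": "test_collection"},
    {"identifier": "my-test-item-2", "file": "two.mp3",
     "title": "Second Title", "collection": "test_collection"},
]

buffer = io.StringIO()
write_upload_spreadsheet(rows, buffer)
print(buffer.getvalue())
```

To write a real file instead of an in-memory buffer, open it with open("metadata.csv", "w", newline="") and pass the file object as out.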
Downloading files via the ia command is easy:
$ ./ia download nasa
And flexible:
$ ./ia download Sita_Sings_the_Blues --format="Ogg Vorbis" --destdir="$HOME/Downloads"
You can even glob for files to download:
$ ./ia download OTRR_X_Minus_One_Singles --glob="*mp3"
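To get a feel for what a pattern such as "*mp3" selects, the standard-library fnmatch module can be applied to a list of file names. This is only an illustration of shell-style pattern matching; that ia applies exactly the same rules is an assumption, and the file names are made up.

```python
from fnmatch import fnmatchcase


def select_files(names, pattern):
    """Return the names that match a shell-style pattern, case-sensitively."""
    return [name for name in names if fnmatchcase(name, pattern)]


# Hypothetical file listing for an item.
names = ["episode_01.mp3", "episode_01.ogg", "episode_02.mp3", "cover.jpg"]

# "*mp3" matches any name that ends in "mp3".
print(select_files(names, "*mp3"))
```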
Modifying and retrieving metadata for items can be done with the ia metadata command.
Retrieving the metadata for an Archive.org item in JSON is as easy as:
$ ./ia metadata nasa
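The JSON printed by this command can be consumed from a script. The sketch below assumes the record has top-level metadata and files keys, which is the shape the jq filters later in this document rely on. The helper name and the sample record are hypothetical, not real output.

```python
import json


def summarize_item(raw_json):
    """Pull a few fields out of an item record given as a JSON string."""
    item = json.loads(raw_json)
    metadata = item.get("metadata", {})
    files = item.get("files", [])
    return {
        "identifier": metadata.get("identifier"),
        "title": metadata.get("title"),
        "file_names": [f.get("name") for f in files],
    }


# Hypothetical record, shaped like the output assumed above.
sample = json.dumps({
    "metadata": {"identifier": "example-item", "title": "Example Item"},
    "files": [{"name": "example.txt"}, {"name": "example_meta.xml"}],
})

print(summarize_item(sample))
```

In practice the string would come from the command's standard output, for example by capturing it with subprocess.run(["./ia", "metadata", "nasa"], capture_output=True, text=True).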
To modify the metadata for an item, you could use a command such as the following:
$ ./ia metadata iacli-test-item60 --modify='title:My New Title' --modify='foo:bar'
iamine (https://github.com/jjjake/iamine) is a Python library and command-line tool for mining Archive.org metadata and search results.
Binaries for the CLI are available here: https://archive.org/details/iamine-pex
It can also be installed via pip install iamine. iamine requires Python 3.
$ curl -L https://archive.org/download/iamine-pex/ia-mine-0.3.0-py3.pex > ia-mine
$ chmod +x ia-mine
$ ./ia-mine --configure
With ia-mine you can do things like...
Concurrently download an entire Archive.org collection using GNU Parallel:
$ ./ia-mine --search 'collection:freemusicarchive' --itemlist | parallel 'ia download {}'
ia-mine is especially powerful when used with a command-line JSON parsing tool, such as jq:
$ ./ia-mine --all --mine-ids | jq -r '.metadata.identifier as $id | .files | map("\($id)\t\(.name)\t\(.sha1)") | join("\n")'
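For readers less familiar with jq, here is roughly what that filter does, written in Python: for each JSON record (one per line), it emits one tab-separated row of identifier, file name, and SHA-1 per file. The function names and the sample record are hypothetical, and missing fields come out as the string "None" here rather than jq's "null".

```python
import io
import json


def file_rows(line):
    """Turn one JSON record into 'identifier<TAB>name<TAB>sha1' rows."""
    record = json.loads(line)
    identifier = record["metadata"]["identifier"]
    rows = []
    for f in record.get("files", []):
        rows.append("\t".join([identifier, str(f.get("name")), str(f.get("sha1"))]))
    return rows


def rows_from_stream(stream):
    """Process a stream of JSON lines, skipping blank lines."""
    rows = []
    for line in stream:
        if line.strip():
            rows.extend(file_rows(line))
    return rows


# Hypothetical record standing in for one line of ia-mine output.
sample_line = json.dumps({
    "metadata": {"identifier": "example-item"},
    "files": [
        {"name": "a.txt", "sha1": "0000"},
        {"name": "b.txt", "sha1": "1111"},
    ],
})

for row in rows_from_stream(io.StringIO(sample_line + "\n")):
    print(row)
```

Passing sys.stdin to rows_from_stream would let a script like this sit at the end of the pipeline in place of jq.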
Find all of the EXE files that are on Archive.org:
$ ./ia-mine --search 'format:exe' --mine-ids | jq -r '.metadata.identifier as $i | .files | map(select(.format == "exe") | "https://archive.org/download/\($i)/\(.name)") | join("\n")'
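The same selection can be sketched in Python: keep only the files whose format field equals "exe" and build a download URL for each one. The helper name and the sample record are hypothetical. The URL layout is the one used in the jq filter above.

```python
import json


def download_urls(line, file_format="exe"):
    """Return download URLs for the files of one record with the given format."""
    record = json.loads(line)
    identifier = record["metadata"]["identifier"]
    return [
        f"https://archive.org/download/{identifier}/{f['name']}"
        for f in record.get("files", [])
        if f.get("format") == file_format
    ]


# Hypothetical record standing in for one line of ia-mine output.
sample_record = json.dumps({
    "metadata": {"identifier": "example-item"},
    "files": [
        {"name": "setup.exe", "format": "exe"},
        {"name": "readme.txt", "format": "Text"},
    ],
})

for url in download_urls(sample_record):
    print(url)
```

Like the jq version, this does not percent-encode file names. If names may contain spaces or other special characters, urllib.parse.quote can be applied to each name before it is placed in the URL.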
Monitor progress and transfer rate with Pipe Viewer:
$ ./ia-mine --search 'collection:usenet' --mine-ids 2> errors.json | pv -acbrl > usenet-metadata.json