@behrangsa
Last active November 29, 2019 00:09
download-arxiv.sh
#!/bin/bash
# A simple script for downloading files from the arXiv S3 bucket (s3://arxiv)
#
# Author: Behrang Saeedzadeh
# Copyright (c) 2019, Behrang Saeedzadeh
# All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
# ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
# WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
# ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
# (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
# LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
# SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
# The views and conclusions contained in the software and documentation are those
# of the authors and should not be interpreted as representing official policies,
# either expressed or implied, of the download-arxiv.sh project.
set -eux
#
# Variable declarations
#
MANIFEST_FILENAME="arXiv_pdf_manifest.xml"
#
# Pre-conditions
#
if ! [ -x "$(command -v xmlstarlet)" ]; then
    echo 'Error: xmlstarlet is not installed.' >&2
    exit 1
fi
if ! [ -x "$(command -v aws)" ]; then
    echo 'Error: aws is not installed.' >&2
    exit 1
fi
#
# Main logic
#
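# Fetch the manifest; the requester pays for the transfer.
aws s3 cp --request-payer requester \
    "s3://arxiv/pdf/${MANIFEST_FILENAME}" .
# Download every file listed in the manifest into the current directory.
# The loop relies on word splitting, so it assumes keys contain no whitespace.
for filename in $(xmlstarlet sel -t -v "/arXivPDF/file/filename" "${MANIFEST_FILENAME}")
do
    aws s3 cp --request-payer requester \
        "s3://arxiv/${filename}" .
done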
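
The script above downloads every file in the manifest on each run, and every transfer is billed to the requester. One possible refinement, sketched below, is to skip keys whose file already exists locally. The function name `missing_files` and the example key names are hypothetical and not part of the original script. The sketch only checks that a file with the same basename exists, so it will not detect a partial or corrupted download.

```shell
#!/bin/bash
# Sketch only: print the S3 keys (one per line on stdin) whose basename is
# not yet present in the target directory (default: current directory).
# "missing_files" is a hypothetical helper, not part of the original script.
missing_files() {
    local target_dir="${1:-.}"
    local key
    # The "|| [ -n ... ]" part keeps a final line that has no trailing newline.
    while IFS= read -r key || [ -n "$key" ]; do
        if [ -z "$key" ]; then
            continue
        fi
        if [ ! -e "${target_dir}/$(basename "$key")" ]; then
            printf '%s\n' "$key"
        fi
    done
}

# Possible use in place of the download loop (left commented out here):
#
# xmlstarlet sel -t -v "/arXivPDF/file/filename" "${MANIFEST_FILENAME}" \
#     | missing_files . \
#     | while IFS= read -r filename; do
#           aws s3 cp --request-payer requester "s3://arxiv/${filename}" .
#       done
```

Reading keys line by line also avoids the word splitting that the `for ... in $(...)` form depends on.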