I was trying to download the Google Landmarks dataset from here. To speed up the download, I wrote a shell script that downloads several files from a list simultaneously:
# first, generate a URL_LIST containing the URLs of all files you want to download
# NOTE: separate the URLs with whitespace; see https://stackoverflow.com/a/28806991/6769366 for more details.
for i in {100..200}
do
URL_LIST="$URL_LIST https://s3.amazonaws.com/google-landmark/train/images_$i.tar"
done
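The loop above only covers the three-digit indices 100 to 200, but the archive names are zero-padded, so smaller indices need a leading-zero pad (the first todo item below). A minimal sketch using `printf`; the `0..499` range is my assumption about how many train archives exist, so adjust it as needed:

```shell
# Build the URL list with zero-padded indices.
URL_LIST=""
for i in $(seq 0 499)
do
  # printf "%03d" pads the index to three digits: 0 -> 000, 42 -> 042
  n=$(printf "%03d" "$i")
  URL_LIST="$URL_LIST https://s3.amazonaws.com/google-landmark/train/images_$n.tar"
done
```
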
# NOTE: git-bash/mingw on Windows (10) does not come with `wget`.
# To install latest `wget`, check https://gist.github.com/evanwill/0207876c3243bbb6863e65ec5dc3f058.
# To learn how `xargs` works, see https://stackoverflow.com/a/11850469/6769366
# `-e` for `wget` sets the proxy. See https://superuser.com/a/526779 for details. Bear in mind that `set proxy=127.0.0.1:1080` does not work in git-bash on Windows.
# `-q` for `wget` mutes its output.
echo $URL_LIST | xargs -n 1 -P 6 wget -e https_proxy=127.0.0.1:1080 -q
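If you are on git-bash and would rather not install `wget`, the same fan-out should work with `curl`, which git-bash ships by default. A sketch under that assumption (the proxy address is the same local example as above):

```shell
# Same parallel download with curl instead of wget:
# -s silences output, -O saves under the remote file name, -x sets the proxy
echo $URL_LIST | xargs -n 1 -P 6 curl -s -O -x http://127.0.0.1:1080
```
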
You can run this script on Linux or Windows. In particular, to run it on Windows you need either WSL (Windows Subsystem for Linux) or git-bash with wget
installed. Furthermore, if you run the script in git-bash, you need to run it using sh,
like:
sh download.sh
See https://stackoverflow.com/a/44884649/6769366 for details.
- [x] Implement adding a leading 0 when `$i` is less than 100.
- [ ] Since the verbose output of `wget` is muted with `-q`, a decent progress bar is needed.
- [ ] Perform checksums in a multiprocessing manner.
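The checksum item can reuse the same `xargs` fan-out. A minimal sketch, assuming one checksum file per archive named like `md5.images_000.tar.txt` in `md5sum -c` format (the naming scheme is my assumption; adjust it to whatever the dataset actually publishes):

```shell
# Verify all downloaded archives with up to 6 parallel md5sum processes.
# Each checksum file is assumed to contain a line "<md5>  <archive name>".
ls images_*.tar | xargs -P 6 -I {} md5sum -c "md5.{}.txt"
```

Note that `-I {}` already implies one argument per invocation, so `-n 1` is not needed here; `md5sum -c` prints `<archive>: OK` on success and exits nonzero on a mismatch.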