@leepro
Last active August 29, 2015 14:11
#!/bin/bash
#
# Author: D. Lee
#
#
# Assumes each line of the input file has the format:
# XXXXX ID URL YYYYY
#
#
# After every 1K URLs, create a new folder for WARCs and check the free disk space,
# so each folder holds at most 1K WARC files.
#
DISK="disk0s2"
MAXFILE=1000
# Keep only the ID and URL columns of the input file.
awk '{ print $2, $3 }' "$1" > "$1.filtered"
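check_free_space_stop()
{
    # Column 5 of `df -P` is Capacity (percent used), so free = 100 - used.
    USED=$(df -P | grep "$DISK" | awk '{ print $5; exit }' | sed -e 's/%//')
    # If DISK is not listed, treat it as full so the caller stops.
    FREE=$((100 - ${USED:-100}))
    # If less than 10% of DISK is free, stop.
    if [ "$FREE" -gt 10 ]
    then
        echo "Enough"
    else
        echo "Stop"
    fi
}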
i=0
fid=0
mkdir -p "./temp/00"
while read -r taskid url
do
    echo "$url"
    # Save each page as an uncompressed WARC named after its task ID.
    wget "$url" --warc-file="temp/$(printf '%02d' "$fid")/$taskid" --no-warc-compression -O /tmp/wget_temp_file
    i=$((i + 1))
    if [ $((i % MAXFILE)) -eq 0 ]
    then
        if [ "$(check_free_space_stop)" = "Stop" ]
        then
            exit 1
        fi
        # Advance to the next folder before creating it.
        fid=$((fid + 1))
        mkdir -p "./temp/$(printf '%02d' "$fid")"
    fi
done < "$1.filtered"
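
The two calculations the script relies on, the free-space percentage and the folder a given download belongs to, can be checked in isolation. The sketch below is only an illustration: `free_percent_from_df` and `bucket_dir` are hypothetical helper names that are not part of the script above, and the parser assumes the POSIX `df -P` layout, where column 5 is Capacity (percent used).

```shell
#!/bin/sh
# Hypothetical helper: read `df -P` output on stdin and print the free-space
# percentage of the first filesystem whose device name contains $1.
free_percent_from_df() {
    awk -v dev="$1" '
        NR > 1 && index($1, dev) {
            sub(/%/, "", $5)   # column 5 is Capacity, e.g. "93%"
            print 100 - $5     # free percent = 100 - used percent
            exit
        }'
}

# Hypothetical helper: folder for the download with 0-based index $1
# when each folder holds $2 WARC files.
bucket_dir() {
    printf 'temp/%02d' $(( $1 / $2 ))
}
```

It could be used as `df -P | free_percent_from_df disk0s2`, and `bucket_dir "$i" "$MAXFILE"` yields the folder for the i-th download. Deriving the folder from the counter keeps the folder name and the file count in step without a separate folder index.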