When you have to manually kill an ArchiveBot web scraping job on one of your pipeline servers, or if the job crashes on its own, the incomplete WARC files do usually move over to FOS, but the log.gz file does not. You have to manually find the proper file, rename it in just the right way, and then rsync it yourself.
-
Make a note somewhere of the job id of the stuck job, such as aqz8ac6ar202mulnvn8xpzv3f. Also make a note of the way the WARC and JSON files are named, such as www.gog.com-inf-20180603-063227-aqz8a.json. Note that the first five characters of the job id are the last five characters of the filename, just before the extension. (The log files do not follow the same naming convention.)
-
Kill the stuck job with kill -9.
-
Watch the ArchiveBot dashboard to make sure the incomplete WARC and JSON files do indeed upload to FOS and the job is done.
-
Go into the ~/ArchiveBot/pipeline/ directory. Look at the various blahblahblah.log.gz files in there. It is probably impossible to tell just by looking which of these log files corresponds to the just-flushed job.
-
One by one, run
zcat SOMETHING.log.gz | head -2
on each of the log.gz files. For example,
zcat tmp-wpull-warc-2o0loq4o.log.gz | head -2
Look at the output; the second line should contain the job id. Manually check it against the job id you noted to see if this is the right log file. NOTE: checking only the job id might not be enough, especially for jobs that were aborted and then requeued shortly thereafter.
-
If it's the right log file, rename it to the same pattern as the WARC and JSON files. For example,
mv tmp-wpull-warc-2o0loq4o.log.gz www.gog.com-inf-20180603-063227-aqz8a.log.gz
-
Then use rsync to upload this log file to FOS. You cannot just move it into the ~/warcs4fos/ directory because the uploader running in there doesn't know what to do with log files yet. So run
rsync -tv --timeout=300 --contimeout=300 --progress --ignore-existing YOUR-LOGFILE-HERE rsync://fos.textfiles.com/archivebot/
where YOUR-LOGFILE-HERE is replaced by the name of this log file, such as
rsync -tv --timeout=300 --contimeout=300 --progress --ignore-existing www.gog.com-inf-20180603-063227-aqz8a.log.gz rsync://fos.textfiles.com/archivebot/
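
If you would rather not zcat every log by hand, the lookup and renaming steps above can be roughly sketched as a few shell helpers. This is only a sketch, not part of ArchiveBot: the function names find_job_log, job_suffix and log_name_for are made up for illustration, and it assumes (as described above) that the job id shows up within the first two lines of the right log. It uses gzip -dc, which does the same thing as zcat. It does not handle the requeued-job caveat, so still eyeball the result before renaming anything.

```shell
# Sketch only: hypothetical helpers, not part of ArchiveBot.

# find_job_log JOB_ID [DIR]
# Prints every *.log.gz in DIR (default: current directory) whose first
# two decompressed lines mention JOB_ID.
find_job_log() {
    local job_id=$1 dir=${2:-.} f
    for f in "$dir"/*.log.gz; do
        [ -e "$f" ] || continue   # no log.gz files at all
        if gzip -dc "$f" 2>/dev/null | head -2 | grep -qF -- "$job_id"; then
            printf '%s\n' "$f"
        fi
    done
}

# job_suffix JOB_ID
# Prints the first five characters of the job id, which should match the
# end of the WARC/JSON filename.
job_suffix() {
    printf '%.5s\n' "$1"
}

# log_name_for JSON_NAME
# Prints the name the log should be renamed to: the JSON filename with
# .json swapped for .log.gz.
log_name_for() {
    printf '%s.log.gz\n' "${1%.json}"
}
```

For example, find_job_log aqz8ac6ar202mulnvn8xpzv3f ~/ArchiveBot/pipeline lists the candidate log files, job_suffix aqz8ac6ar202mulnvn8xpzv3f prints aqz8a, and log_name_for www.gog.com-inf-20180603-063227-aqz8a.json prints www.gog.com-inf-20180603-063227-aqz8a.log.gz. The mv and rsync steps are deliberately left for you to run by hand.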