Last active
May 18, 2016 22:04
# This will give all the ICSE2015/2016 papers filenames extracted from the titles of the PDF metadata,
# and put them in a folder on your desktop. It's been tested on OSX and depends on wget and pdfinfo,
# available via the macports xpdf package on pre-Yosemite OSX, and here for Yosemite users:
#
# ftp://ftp.foolabs.com/pub/xpdf/xpdfbin-mac-3.04.tar.gz
#
# It's a manual install, unfortunately, but I trust you're geared up for that.
#
# The username and password are available from these public-facing pdfs:
# http://atlantis.isti.cnr.it/ICSE2015ProgramBrochureOnLineVersion.pdf (2015 - icse15/conf15)
# http://2016.icse.cs.txstate.edu/static/downloads/conference-brochure.pdf (2016 - icse16/conf16)
# The "grep [ab]" part is what filters out the papers from schedules, posters and other PDFs, which
# may or may not have Title metadata. Because someone thought "we'll use the fifth character in the
# filename as meaningful metadata. That is entirely sane and reasonable."
#
# This script was first written in 2015, and has been updated for ICSE 2016 mostly by replacing all
# instances of "15" with "16".
#
# I wish I was joking.
wget -c --user=icse16 --password=conf16 http://conferences.computer.org/icse/2016/content/icsefull.zip && unzip -d icsefull icsefull.zip
cd icsefull/content/papers/
mkdir sanity && cd sanity
for x in `ls ../*.pdf | grep '[ab]'`; do cp "$x" "`pdfinfo -meta "$x" | grep "^Title:" | sed 's/^Title: *//' | sed 's/\//-/g'`.pdf" ; done
cd .. && mv sanity ~/Desktop/ICSE2016
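The title-to-filename step in the loop above can be tried on its own, without any PDFs. This is only a sketch: `title_to_filename` is a made-up helper name, not part of the script, and it assumes pdfinfo prints the title as `Title:` followed by padding spaces.

```shell
#!/bin/bash
# title_to_filename is a hypothetical helper, not part of the original script.
# It reads pdfinfo-style output on stdin and prints a filename-safe title.
title_to_filename() {
    # Take the first Title line only, drop the label and its padding,
    # then turn slashes into hyphens so the title stays a single path component.
    sed -n '/^Title:/{s/^Title: *//;s#/#-#g;p;q;}'
}

# Fake pdfinfo output, for illustration only.
printf 'Title:          Input/Output Models\nAuthor:         Someone\n' | title_to_filename
# prints: Input-Output Models
```

If a PDF has no Title line at all, the helper prints nothing, so the caller still has to decide what to name that file.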
thanks for checking, 🍰 for you!
(also, one last thing: given that it's a 400M file, it might be worth adding the -c flag to wget, so you can resume in case of a disconnect, ctrl-c, or whatever)
Done!
Here's an alternative which:
- adds a dependency on fuse-zip so the archive never needs to be decompressed (except when actually reading a PDF from within it), avoiding the file size tripling of the original (zip + decompressed + cp'd);
- adds a few sanity checks;
- roughly triples the script size (well, you can't hug all the cats)
#!/bin/bash
# This will give all the ICSE2015 papers filenames extracted from the titles of the PDF metadata,
# and put them in a folder on your desktop. It's been tested on OSX and depends on wget and xpdf.
#
# The username and password are available from this public-facing pdf:
# http://atlantis.isti.cnr.it/ICSE2015ProgramBrochureOnLineVersion.pdf
# The "grep [ab]" part is what filters out the papers from schedules, posters and other PDFs, which may or may not have Title metadata.
# Because someone thought "we'll use the fifth character in the filename as metadata, that is sane and sensible."

declare user pass url dir_src dir_dst tools sz x fn fn_b64 sed_prog
declare -A fns
shopt -s extglob

dir_src=${dir_src:-icse-2015}
dir_dst=${dir_dst:-${dir_src}-goodnames}
  # override these at invocation time to specify different dir names
user='icse15'
pass='conf15'
url='http://conferences.computer.org/icse/2015/content/icsefull.zip'

if tools='ln sed wget pdfinfo base64 fuse-zip' && ! which $tools ; then
  echo "could not find one or more required tools ($tools)"
  exit 1
elif ((BASH_VERSINFO[0] < 4)) ; then
  echo "need bash 4.x (for associative arrays), found $BASH_VERSION"
  exit 2
elif ! wget -c --user="$user" --password="$pass" "$url" ; then
  echo "failed to wget archive"
  exit 3
elif sz=$(stat -c %s "${url##*/}") && [[ $sz != 420278862 ]] ; then
  echo "archive has unexpected size $sz"
  exit 4
elif ! mkdir -vp "$dir_src" "$dir_dst" ; then
  echo "failed to create dirs $dir_src and/or $dir_dst"
  exit 5
elif ! fuse-zip -r "${url##*/}" "$dir_src" ; then
  echo "failed to fuse-zip mount archive"
  exit 6
fi

sed_prog='/^Title:/{'         # for every title metadatum...
sed_prog+='s#^Title: ##;'     # truncate to just the datum
sed_prog+='s#[–—/]#-#g;'      # convert en-/em-dashes and slashes to hyphens
sed_prog+='s#’#'"'"'#g;'      # quotes to apostrophes (apostropocalyptic!)
sed_prog+='s#[“”]#"#g;'       # "smart" quotes to smarten-the-hell-up quotes
sed_prog+='s# \+# #g;'        # collapse runs of multiple spaces
sed_prog+='p;'                # print the result...
sed_prog+='q;'                # ...and skip any further processing
sed_prog+='}'

for x in "${dir_src}/content/papers/"????[ab]*.pdf ; do
  fn=$( \
    pdfinfo -meta "$x" | \
    sed -ne "$sed_prog" \
  )
  [[ $fn == @(Publisher?s\ Information|Author\ Index) ]] && continue
    # skip conference proceedings, pay attention only to papers;
    # the ????[ab]*.pdf pattern should already skip most of those
    # but let's just make doubly sure.
  [[ -z $fn ]] && fn='Untitled'
    # some PDFs lack title metadata so let's try a little harder
  fn_b64=$(echo "$fn" | base64)
  ((++fns[$fn_b64] > 1)) && fn+=".${fns["$fn_b64"]}"
    # mostly untitled PDFs, but also some others, have duplicated
    # names; let's make sure to catch those too
  ln -sv "$x" "${dir_dst}/${fn}.pdf"
    # the symlinks will persist even when FUSE is umounted but
    # they will only work when it is actually mounted. Such is
    # the price of saving some disk space: extra complexity.
done
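For anyone who wants to poke at the duplicate-name handling in isolation, here is a rough sketch. It is not the code above: `unique_name`, `seen` and `unique` are names invented for the example, and it keys the associative array on the title itself instead of a base64 copy of it. Like the script, it needs bash 4 or later.

```shell
#!/bin/bash
declare -A seen   # title -> how many times it has been seen so far

# unique_name is a hypothetical helper. It stores its result in $unique
# instead of printing it, because a $(...) subshell would lose the counts.
unique_name() {
    local title=${1:-Untitled}              # empty title falls back to Untitled
    local n=$(( ${seen[$title]:-0} + 1 ))
    seen[$title]=$n
    unique=$title
    if (( n > 1 )); then
        unique+=".$n"                       # second copy gets .2, third .3, ...
    fi
}

for t in 'Paper A' '' 'Paper A' ''; do
    unique_name "$t"
    printf '%s.pdf\n' "$unique"
done
# prints:
#   Paper A.pdf
#   Untitled.pdf
#   Paper A.2.pdf
#   Untitled.2.pdf
```

The first occurrence keeps the plain title and later ones get a numeric suffix, which is the same naming scheme the loop above produces.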
Also, I've added a command-line flag to that unzip command that solves the unzip thing, sorry about that.