
@mhoye
Last active May 18, 2016 22:04
# This gives all the ICSE 2015/2016 papers filenames extracted from the Title field of their PDF
# metadata, and puts them in a folder on your desktop. It's been tested on OS X and depends on wget
# and pdfinfo, available via the MacPorts xpdf package on pre-Yosemite OS X, and here for Yosemite users:
#
# ftp://ftp.foolabs.com/pub/xpdf/xpdfbin-mac-3.04.tar.gz
#
# It's a manual install, unfortunately, but I trust you're geared up for that.
#
# The username and password are available from these public-facing pdfs:
# http://atlantis.isti.cnr.it/ICSE2015ProgramBrochureOnLineVersion.pdf (2015 - icse15/conf15)
# http://2016.icse.cs.txstate.edu/static/downloads/conference-brochure.pdf (2016 - icse16/conf16)
# The "grep [ab]" part is what filters out the papers from schedules, posters and other PDFs, which
# may or may not have Title metadata. Because someone thought "we'll use the fifth character in the
# filename as meaningful metadata. That is entirely sane and reasonable."
#
# This script was first written in 2015, and has been updated for ICSE 2016 mostly by replacing all
# instances of "15" with "16".
#
# I wish I was joking.
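#
# For reference, the rename loop below keys on pdfinfo output that looks roughly
# like this (the filename and title here are invented for illustration):
#   $ pdfinfo -meta 5589a042.pdf | grep "^Title:"
#   Title:          An Invented Paper Title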
wget -c --user=icse16 --password=conf16 http://conferences.computer.org/icse/2016/content/icsefull.zip && unzip -d icsefull icsefull.zip
cd icsefull/content/papers/
mkdir sanity && cd sanity
for x in $(ls ../*.pdf | grep "[ab]"); do cp "$x" "$(pdfinfo -meta "$x" | grep '^Title:' | sed 's/^Title: *//' | sed 's/\//-/g').pdf" ; done
cd .. && mv sanity ~/Desktop/ICSE2016
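# Quick sanity check once it finishes: the folder should now be full of
# readably-named PDFs.
#   ls ~/Desktop/ICSE2016/*.pdf | wc -l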

mhoye commented May 22, 2015

Also, I've added a command-line flag to that unzip command that solves the unzip thing, sorry about that.


ivanov commented May 22, 2015

thanks for checking, 🍰 for you!

(also, one last thing, given that it's a 400M file, might be worth it to add the -c flag to wget, so you can resume in case of disconnect, ctrl-c, or whatever)
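
A sketch of what that resume looks like in practice — it's just the same command, re-run after the interruption:

wget -c --user=icse15 --password=conf15 http://conferences.computer.org/icse/2015/content/icsefull.zip
# wget notices the partial icsefull.zip and continues from its current size instead of starting over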


mhoye commented May 22, 2015

Done!


gnomon- commented May 27, 2015

Here's an alternative which:

  1. adds a dependency on fuse-zip so the archive never needs to be decompressed (except when actually reading a PDF from within it), avoiding the file size tripling of the original (zip + decompressed + cp'd);
  2. adds a few sanity checks;
  3. roughly triples the script size (well, you can't hug all the cats)
#!/bin/bash
# This gives all the ICSE 2015 papers filenames extracted from the Title field of their PDF
# metadata, and puts them in a folder on your desktop. It's been tested on OS X and depends on wget and xpdf.
#
# The username and password are available from this public-facing pdf:
#    http://atlantis.isti.cnr.it/ICSE2015ProgramBrochureOnLineVersion.pdf
# The "grep [ab]" part is what filters out the papers from schedules, posters and other PDFs, which may or may not have Title metadata.
# Because someone thought "we'll use the fifth character in the filename as metadata, that is sane and sensible."

declare    user pass url dir_src dir_dst tools sz x fn fn_b64 sed_prog
declare -A fns  # counts how many times each (base64-encoded) title has been seen
shopt   -s extglob

dir_src=${dir_src:-icse-2015}
dir_dst=${dir_dst:-${dir_src}-goodnames}
  # override these at invocation time to specify different dir names
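  # e.g. (hypothetical invocation, assuming this file is saved as icse-rename.sh):
  #   dir_src=papers dir_dst=papers-goodnames bash icse-rename.sh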

user='icse15'
pass='conf15'
url='http://conferences.computer.org/icse/2015/content/icsefull.zip'

if tools='ln sed wget pdfinfo base64 fuse-zip' && ! which $tools ; then
    echo "could not find one or more required tools ($tools)"
    exit 1
elif ((BASH_VERSINFO[0] < 4)) ; then
    echo "need bash 4.x (for associative arrays), found $BASH_VERSION"
    exit 2
elif ! wget -c --user="$user" --password="$pass" "$url" ; then
    echo "failed to wget archive"
    exit 3
elif sz=$(stat -c %s ${url##*/}) && [[ $sz != 420278862 ]] ; then
    echo "archive has unexpected size $sz"
    exit 4
elif ! mkdir -vp "$dir_src" "$dir_dst" ; then
    echo "failed to create dirs $dir_src and/or $dir_dst"
    exit 5
elif ! fuse-zip -r "${url##*/}" "$dir_src" ; then
    echo "failed to fuse-zip mount archive"
    exit 6
fi

sed_prog='/^Title:/{'        # for every title metadatum...
    sed_prog+='s#^Title:          ##;' # truncate to just the datum
    sed_prog+='s#[–—/]#-#g;' # convert en-/em-dashes and slashes to hyphens
    sed_prog+='s#’#'"'"'#g;' # quotes to apostrophes (apostropocalyptic!)
    sed_prog+='s#[“”]#"#g;'  # "smart" quotes to smarten-the-hell-up quotes
    sed_prog+='s#  \+# #g;'  # collapse runs of multiple spaces
    sed_prog+='p;'           # print the result...
    sed_prog+='q;'           # ...and skip any further processing
sed_prog+='}'
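
# A quick way to sanity-check the sed program on its own; this Title line is
# invented for illustration (note the ten-space padding matching pdfinfo's output):
#   echo 'Title:          An “Example” Title — with/slashes' | sed -ne "$sed_prog"
#   # -> An "Example" Title - with-slashes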

for x in "${dir_src}/content/papers/"????[ab]*.pdf ; do
    fn=$( \
      pdfinfo -meta "$x" | \
      sed -ne "$sed_prog"  \
    )
    [[ $fn == @(Publisher?s\ Information|Author\ Index) ]] && continue
      # skip conference proceedings, pay attention only to papers;
      # the ????[ab]*.pdf pattern should already skip most of those
      # but let's just make doubly sure.
    [[ -z $fn ]] && fn='Untitled'
      # some PDFs lack title metadata so let's try a little harder
    fn_b64=$(echo "$fn" | base64)
    ((++fns[$fn_b64] > 1)) && fn+=".${fns["$fn_b64"]}"
      # mostly untitled PDFs, but also some others, have duplicated
      # names; let's make sure to catch those too
    ln -sv "$x" "${dir_dst}/${fn}.pdf"
      # the symlinks will persist even when FUSE is umounted but
      # they will only work when it is actually mounted.  Such is
      # the price of saving some disk space: extra complexity.
done
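
One caveat worth repeating outside the comments: the symlinks only resolve while the archive is FUSE-mounted. A sketch of unmounting and remounting later, assuming the default dir names:

fusermount -u icse-2015              # Linux; on OS X: umount icse-2015
fuse-zip -r icsefull.zip icse-2015   # remount read-only and the symlinks resolve again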
