@paveljurca
Last active October 11, 2015 16:32
Borrow from mlp.cz
#!/bin/bash
# EXAMPLE URL list to scrape (the file 'seznam' holds pipe-separated URLs on one line)
# echo 'http://search.mlp.cz/cz/titul/faunovo-odpoledne-a-jine-basne/56699/|http://search.mlp.cz/cz/titul/podivuhodne-pribehy-z-dob-nedavno-minulych/2750824/|http://search.mlp.cz/cz/titul/zpevy-pastyrske/2430570/|http://search.mlp.cz/cz/titul/kam-odchazis-kraso/2192687/|http://search.mlp.cz/cz/titul/plac-housli/2330679/|http://search.mlp.cz/cz/titul/rethinking-the-university/57801/|http://search.mlp.cz/cz/titul/zapisky-stareho-prasaka/2488292/' > seznam
# BEWARE!
# quick and dirty
# HTTP and I/O wasted: every URL is fetched twice, once per pass
# actual web scraping: pass 1 extracts the author of each title
cat seznam | perl -nE '`wget -qO- $_ 2>/dev/null` =~ m!class="ignac"[^<]+<strong>([^<]+)!sm && say "$1<br>" for split /\|/' > autori.html
# HTTP requests wasted: pass 2 fetches the same pages again for the book titles (text before </h1>)
cat seznam | perl -nE '`wget -qO- $_ 2>/dev/null` =~ m!([^>]+)</h1>! && say "$1<br>" for split /\|/' > knihy.html
# output: print each author, then the matching book title, then a '#' separator
w3m knihy.html -dump | perl -wE 'print for map { ("$_\n", scalar <>,"#\n") } split /\n/, `w3m autori.html -dump`'
# remove TMP files
rm 'knihy.html' 'autori.html' &>/dev/null