@marcinlawnik
Created March 16, 2014 13:13
Scraping the Darwin Awards page
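# Download the yearly index pages; curl expands the [1993-2013] range and '#1' becomes the matched year.
curl 'http://darwinawards.com/darwin/darwin[1993-2013].html' -o '#1.html'
# Collect the unique link targets that contain "darwin" and a hyphen but no '/', and save them to url.txt.
cat *.html | grep -o -E 'href="([^"#]+)"' | cut -d'"' -f2 | sort -u | grep 'darwin' | grep -v '/' | grep -e '-' > url.txt
# Remove the downloaded index pages; url.txt is kept.
rm *.html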
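The filter chain above can be tried without touching the network. The sketch below wraps it in a function and feeds it a few made-up lines of HTML. The function name `extract_story_links` and the sample input are assumptions for illustration, not part of the original gist, and the guess that hyphenated names are the individual story pages is unverified.

```shell
#!/bin/sh
# Hypothetical helper (not in the original gist): reads HTML on stdin
# and prints the unique link targets that survive the filters.
extract_story_links() {
    grep -o -E 'href="[^"#]+"' |   # keep href attributes that have no '#'
        cut -d'"' -f2 |            # strip the href="..." wrapper
        sort -u |                  # sort and de-duplicate
        grep 'darwin' |            # keep targets that mention "darwin"
        grep -v '/' |              # drop anything containing a path separator
        grep -e '-'                # keep only names containing a hyphen
}

# Made-up sample input, for illustration only.
printf '%s\n' \
    '<a href="darwin1999-38.html">story</a>' \
    '<a href="../index.html">home</a>' \
    '<a href="darwin1999.html">year index</a>' \
    '<a href="darwin1999-38.html">duplicate</a>' |
    extract_story_links
```

Only the hyphenated name passes: the relative path has no "darwin" in it, the year index has no hyphen, and the duplicate is collapsed by `sort -u`.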
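#!/bin/sh
# Prefix each file name in url.txt with the site's base URL, then download them all.
: > url2.txt  # start from an empty list so reruns do not append duplicates
while IFS= read -r line
do
    printf '%s\n' "http://darwinawards.com/darwin/$line" >> url2.txt
done < url.txt
wget -i url2.txt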
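As a possible alternative to the loop, the prefixing step can be done with a single `sed` call and the result piped straight into `wget`, which reads URLs from standard input when given `-i -`. This is a minimal sketch: the names `BASE_URL` and `prefix_urls` are hypothetical, and the sample file name is made up.

```shell
#!/bin/sh
# Base URL copied from the script above; the variable name is an assumption.
BASE_URL='http://darwinawards.com/darwin/'

# Hypothetical helper: prefixes every non-empty line on stdin with BASE_URL.
prefix_urls() {
    # Delete blank lines first so they do not become bare base URLs,
    # then insert the base URL at the start of each remaining line.
    sed -e '/^$/d' -e "s|^|$BASE_URL|"
}

# Offline demonstration with a made-up file name.
printf '%s\n' 'darwin1999-38.html' | prefix_urls

# Usage sketch (needs network access, so it is left commented out):
# prefix_urls < url.txt | wget -i -
```

This avoids the intermediate url2.txt file, at the cost of not keeping the full URL list on disk.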