@sbrin
Last active December 19, 2015 08:39
Wget sitemap spider
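# Crawl the site recursively in spider mode (nothing is saved), ignoring robots.txt, waiting 3 s between requests, and logging server responses to wget.log.
wget --spider -o wget.log -e robots=off --wait 3 -r -p -S http://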
# Write the unique URLs from the log to site_map.txt, skipping files/ paths and static assets.
grep -i 'http://' wget.log | grep -E -v '(files/|\.jpg|\.jpeg|\.gif|\.css|\.js|\.pdf|\.png|\.xls)' | awk '{print $3}' | sort -u > site_map.txt
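The two commands above can be wrapped into reusable functions so that the start URL and log file become parameters. This is only a sketch: the function names `crawl_site` and `extract_urls` and the example URL are hypothetical and not part of the gist. It assumes, as the original `awk '{print $3}'` does, that the URL is the third whitespace-separated field of the matching log lines.

```shell
#!/bin/sh
# Sketch only: function names and the example URL are hypothetical.

# Crawl a site in spider mode (no files saved), logging to the given file.
crawl_site() {
    # $1 = start URL, $2 = log file
    wget --spider -o "$2" -e robots=off --wait 3 -r -p -S "$1"
}

# Print the unique URLs found in a wget log, skipping files/ paths and static assets.
extract_urls() {
    # $1 = wget log file
    grep -i 'http://' "$1" \
        | grep -E -v '(files/|\.jpg|\.jpeg|\.gif|\.css|\.js|\.pdf|\.png|\.xls)' \
        | awk '{print $3}' \
        | sort -u
}

# Usage (hypothetical URL):
#   crawl_site "http://example.com/" wget.log
#   extract_urls wget.log > site_map.txt
```

Keeping the crawl and the extraction separate makes it possible to re-run the filter with a different exclusion list without crawling the site again.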
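# Count the unique first fields (typically client addresses) in the access log given as $1 whose lines mention yandsearch, ignoring static files, bots, /upload/ paths and 404/301/302 responses.
grep -i -E -v '(\.jpg|\.jpeg|\.gif|\.css|\.js|\.pdf|\.png|\.xls|\.ico|\.txt|\.doc|yandexbot|googlebot|YandexDirect|\/upload\/|" 404 |" 301 |" 302 )' "$1" | perl -MURI::Escape -lne 'print uri_unescape($_)' | grep yandsearch | awk '{print $1}' | sort -u | wc -l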