@pi0
Last active October 2, 2017 22:15
Fully working crawler using wget
wget -nv -t3 -c -nH -r -l0 -k -p -np -e robots=off --reject-regex "\/\?(.*)" [url]
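The same command spelled out with long option names; this form also adds --retry-connrefused, so that flaky hosts are retried as well: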
wget \
--no-verbose \
--tries=3 \
--continue \
--retry-connrefused \
--no-host-directories \
--recursive \
--level 0 \
--convert-links \
--page-requisites \
--no-parent \
-e robots=off \
--reject-regex "\/\?(.*)" \
[url]
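For reference, a quick gloss of each flag, per the wget manual:

# --no-verbose           terse output (errors and basic info only)
# --tries=3              retry each URL up to 3 times
# --continue             resume partially-downloaded files
# --retry-connrefused    treat "connection refused" as transient and retry
# --no-host-directories  don't prefix saved paths with the host name
# --recursive            follow links and download recursively
# --level 0              unlimited recursion depth
# --convert-links        rewrite links in saved pages to point at local copies
# --page-requisites      also fetch the assets (CSS, images) each page needs
# --no-parent            never ascend above the starting directory
# -e robots=off          ignore robots.txt
# --reject-regex         skip URLs matching the regex; "\/\?(.*)" rejects
#                        query strings that directly follow a slash (e.g.
#                        Apache index sort links like /?C=N;O=D)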
# fix_names.sh: rename files like foo.eot?v=4.2.0 to foo.eot
# (handles spaces in names; assumes no newlines in file names)
find . -name "*\?*" | while IFS= read -r f; do
  mv -v "$f" "${f%%\?*}"
done
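Run it from the mirror root; with GNU mv the output looks roughly like this (the path here is illustrative):

$ sh fix_names.sh
'./fonts/fontawesome-webfont.eot?v=4.2.0' -> './fonts/fontawesome-webfont.eot'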

pi0 commented Oct 2, 2017

Some notes:

  • For an IE fix, stylesheets usually reference fonts with a fake query parameter, e.g. fontawesome-webfont.eot?v=4.2.0. We need wget to crawl such URLs too, while still skipping directory-listing sort parameters, so the reject-regex is deliberately permissive: "\/\?(.*)" only rejects URLs where the query string directly follows a slash. To strictly prevent crawling any URL with a query string, use "(.*)\?(.*)" instead (see the sketch after these notes). fix_names.sh then removes the leftover query strings from file names.

  • The --adjust-extension (-E) flag is optional. It can be useful for sites that serve friendly URLs without a .html extension, but use it with caution: it may also misbehave, e.g. appending .html to CSS files (second sketch below).
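A minimal sketch of the strict variant from the first note; [url] stays a placeholder and the remaining flags match the command above:

wget --no-verbose --tries=3 --continue --retry-connrefused \
  --no-host-directories --recursive --level 0 --convert-links \
  --page-requisites --no-parent -e robots=off \
  --reject-regex "(.*)\?(.*)" \
  [url]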

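And the same command with --adjust-extension added, everything else unchanged; after the crawl, verify that no CSS or font files were renamed to .html:

wget --no-verbose --tries=3 --continue --retry-connrefused \
  --no-host-directories --recursive --level 0 --convert-links \
  --page-requisites --no-parent --adjust-extension -e robots=off \
  --reject-regex "\/\?(.*)" \
  [url]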