Fully working crawler using wget
```sh
wget -nv -t3 -c -nH -r -l0 -k -p -np -e robots=off --reject-regex "\/\?(.*)" [url]
```
The same command with long options:

```sh
# Mirror [url]: recurse with no depth limit (--level 0), grab page
# requisites (CSS, images, fonts), rewrite links for offline browsing,
# and ignore robots.txt. The reject regex only skips query strings that
# directly follow a slash (directory-index sort links like /?C=N;O=D),
# so asset URLs like font.eot?v=4.2.0 still get crawled.
wget \
  --no-verbose \
  --tries=3 \
  --continue \
  --retry-connrefused \
  --no-host-directories \
  --recursive \
  --level 0 \
  --convert-links \
  --page-requisites \
  --no-parent \
  -e robots=off \
  --reject-regex "\/\?(.*)" \
  [url]
```
To clean up the saved file names afterwards (`fix_names.sh`), as a one-liner:

```sh
find . -name '*\?*' | while IFS= read -r f; do mv -v "$f" "$(echo "$f" | sed 's/?.*//')"; done
```
The same, formatted:

```sh
# Rename every downloaded file that still has a "?query" suffix,
# stripping everything from the first "?" onward.
find . -name '*\?*' | while IFS= read -r f; do
  mv -v "$f" "$(echo "$f" | sed 's/?.*//')"
done
```
Some notes:
For an IE fix, stylesheets usually reference fonts with a fake query param like `fontawesome-webfont.eot?v=4.2.0`. We need wget to crawl such URLs too, while still blocking directory-listing sort params, so the reject regex is deliberately loose. If you want to strictly prevent crawling any URL with params, use `"(.*)\?(.*)"` instead.
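A sketch of that strict variant, reusing the same short flags as the one-liner above (`[url]` is the site to mirror):

```sh
# Reject any URL containing a query string, including versioned
# assets like fontawesome-webfont.eot?v=4.2.0.
wget -nv -t3 -c -nH -r -l0 -k -p -np -e robots=off \
  --reject-regex "(.*)\?(.*)" [url]
```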
`fix_names.sh` helps remove the query string from the saved file names.
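For example, run from the directory where the crawl was started, it would rename the webfont from the note above roughly like this (the `mv -v` output format varies by platform):

```sh
sh fix_names.sh
# './fontawesome-webfont.eot?v=4.2.0' -> './fontawesome-webfont.eot'
```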
The `--adjust-extension` (or `-E`) flag is optional. It can be useful for sites that use friendly URLs without an `.html` extension, but use it with caution, as it may also misbehave, e.g. by appending `.html` to CSS files! See the sketch below.
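A sketch of the long form with `--adjust-extension` added, if you decide you need it:

```sh
# Same crawl as above, plus --adjust-extension so pages served at
# extension-less "friendly" URLs are saved with an .html suffix.
# Caution (per the note above): it can also mis-rename other files.
wget --no-verbose --tries=3 --continue --retry-connrefused \
  --no-host-directories --recursive --level 0 --convert-links \
  --page-requisites --no-parent --adjust-extension \
  -e robots=off --reject-regex "\/\?(.*)" \
  [url]
```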