$ wget -e robots=off -r -np 'http://example.com/folder/'
- -e robots=off causes it to ignore robots.txt for that domain
- -r makes it recursive
- -np (--no-parent) keeps it from following links up into the parent directory
-r -l 0 removes the maximum depth limit (-l 0 is treated as infinite depth).
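Putting those flags together, a sketch for grabbing a single folder with no depth limit (the URL is just a placeholder):
$ wget -e robots=off -r -l 0 -np 'http://example.com/folder/'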
For entire websites:
$ wget -e robots=off -r -np --page-requisites --convert-links
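--page-requisites pulls in the CSS, images, and scripts each page needs, and --convert-links rewrites the saved pages so their links point at the local copies. A sketch of a browsable offline mirror; the URL and the extra --adjust-extension flag (which renames saved files to end in .html where needed) are my additions, not part of the original command:
$ wget -e robots=off -r -l 0 -np --page-requisites --convert-links --adjust-extension 'http://example.com/'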
Thanks for sharing.
I'm still getting "no-follow attribute found in $URL. Will not follow any links on this page" after using wget -e robots=off -r -np --page-requisites --convert-links $SITE. Is this a bug?
Yes, this is a bug; it should be fixed in the next version of wget: https://git.savannah.gnu.org/cgit/wget.git/commit/?id=f1cccd2c454fb416e75a22b358b0a11266642007
See https://www.reddit.com/r/DataHoarder/comments/mprq89/wget_respects_nofollow_attribute_despite_e/guct2s5/ for more details.
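Until a release containing that commit ships, one workaround is building wget from the git source; a rough sketch, assuming the usual autotools build dependencies are already installed:
$ git clone https://git.savannah.gnu.org/git/wget.git   # clone the development tree with the fix
$ cd wget
$ ./bootstrap     # fetch gnulib bits and generate the configure script
$ ./configure
$ make && sudo make install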
not fixed
what is the recursive thing?
Thanks for sharing <3
Remember that the -r option has a default maximum depth of 5. I think --mirror is, overall, a better choice.
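For reference, --mirror is currently shorthand for -r -N -l inf --no-remove-listing, i.e. recursion with infinite depth plus timestamping. A sketch combining it with the earlier flags (URL is a placeholder):
$ wget --mirror -e robots=off -np --page-requisites --convert-links 'http://example.com/folder/'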