@simonw
Created December 9, 2016 06:38
Recursive wget ignoring robots
$ wget -e robots=off -r -np 'http://example.com/folder/'
  • -e robots=off causes it to ignore robots.txt for that domain
  • -r makes it recursive
  • -np = no parents, so it doesn't follow links up to the parent folder
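The same command can be spelled out with wget's long option names, which some readers find easier to scan. This is only a sketch: `mirror_folder` and the `DRY_RUN` variable are hypothetical names invented for illustration, and `--wait=1` is an added assumption (a one-second pause between requests) that is not part of the original command.

```shell
#!/bin/sh
# mirror_folder is a hypothetical helper, not part of wget.
# It runs the command above using wget's long option names.
mirror_folder() {
    url=$1
    # --execute robots=off : long form of -e robots=off (ignore robots.txt)
    # --recursive          : long form of -r
    # --no-parent          : long form of -np
    # --wait=1             : assumption, not in the original command;
    #                        waits one second between requests
    set -- wget --execute robots=off --recursive --no-parent --wait=1 "$url"
    if [ -n "${DRY_RUN:-}" ]; then
        # Print the command instead of running it.
        echo "$*"
    else
        "$@"
    fi
}
```

Calling `DRY_RUN=1 mirror_folder 'http://example.com/folder/'` prints the command without touching the network; dropping `DRY_RUN=1` runs it.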
@taoyichen

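For websites, add --page-requisites (also fetch the images, CSS, and scripts each page needs) and --convert-links (rewrite links so the local copy can be browsed offline):
wget -e robots=off -r -np --page-requisites --convert-links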

@tumelo-mapheto

Thanks for sharing.

@fsiler

fsiler commented Feb 28, 2021

I'm still getting "no-follow attribute found in $URL. Will not follow any links on this page" after using wget -e robots=off -r -np --page-requisites --convert-links $SITE. Is this a bug?

@NilsIrl

NilsIrl commented Apr 16, 2021

> I'm still getting "no-follow attribute found in $URL. Will not follow any links on this page" after using wget -e robots=off -r -np --page-requisites --convert-links $SITE. Is this a bug?

Yes, this is a bug. It should be fixed in the next version of wget: https://git.savannah.gnu.org/cgit/wget.git/commit/?id=f1cccd2c454fb416e75a22b358b0a11266642007

See https://www.reddit.com/r/DataHoarder/comments/mprq89/wget_respects_nofollow_attribute_despite_e/guct2s5/ for more details
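To check whether a given page is likely to trigger that message, one rough approach is to look for a robots meta tag containing nofollow in its HTML. The sketch below assumes the message comes from such a tag, as the linked thread discusses. `has_robots_nofollow` is a hypothetical helper name, and the pattern is a heuristic rather than a real HTML parser (it expects "robots" to appear before "nofollow" inside the same tag).

```shell
#!/bin/sh
# has_robots_nofollow is a hypothetical helper: it reads HTML on stdin
# and succeeds if a <meta ... robots ... nofollow ...> tag is present.
has_robots_nofollow() {
    # Join lines so a tag split across lines still matches,
    # then search case-insensitively. Heuristic only.
    tr '\n' ' ' | grep -Eiq '<meta[^>]*robots[^>]*nofollow'
}
```

For example, `wget -q -O - "$URL" | has_robots_nofollow && echo "page asks crawlers not to follow links"`.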

@thewhitegrizzli

not fixed

@jimsy3

jimsy3 commented Dec 6, 2023

what is the recursive thing?

@ibrahemesam

Thanks for sharing <3
