```sh
# One liner
wget --recursive --page-requisites --adjust-extension --span-hosts --convert-links --restrict-file-names=windows --domains yoursite.com --no-parent yoursite.com

# Explained
wget \
     --recursive \ # Download the whole site.
     --page-requisites \ # Get all assets/elements (CSS/JS/images).
     --adjust-extension \ # Save files with .html on the end.
     --span-hosts \ # Include necessary assets from offsite as well.
     --convert-links \ # Update links to still work in the static version.
     --restrict-file-names=windows \ # Modify filenames to work in Windows as well.
     --domains yoursite.com \ # Do not follow links outside this domain.
     --no-parent \ # Don't follow links outside the directory you pass in.
     yoursite.com/whatever/path # The URL to download
```
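Heads up: the "Explained" version is for reading only. In bash, the inline `# comments` break the backslash line continuations, so copy the one-liner, or use a comment-free multi-line form like this:

```sh
wget \
  --recursive \
  --page-requisites \
  --adjust-extension \
  --span-hosts \
  --convert-links \
  --restrict-file-names=windows \
  --domains yoursite.com \
  --no-parent \
  yoursite.com/whatever/path
```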
Aggregating this command with other blog posts on the internet, I ended up using:

```sh
wget --mirror --no-clobber --page-requisites --adjust-extension --span-hosts --convert-links --restrict-file-names=windows --domains {{DOMAINS}} --no-parent {{URL}}
```

@FazleArefin thanks
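For anyone filling in the template above, here's one hypothetical substitution (example.com and its CDN host are purely illustrative; `--domains` takes a comma-separated list):

```sh
wget --mirror --no-clobber --page-requisites --adjust-extension --span-hosts --convert-links --restrict-file-names=windows --domains example.com,cdn.example.com --no-parent https://example.com/
```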
My file names end with @ver=xx. How do I fix this?
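In case it helps: the `@ver=xx` suffix is the URL's query string (`?ver=xx`). With `--restrict-file-names=windows`, wget uses `@` instead of `?` to separate the query portion of the filename, since `?` isn't legal in Windows filenames. A minimal cleanup sketch, assuming the suffix can simply be dropped and no filenames contain newlines:

```sh
# Rename e.g. "style.css@ver=5.8" back to "style.css". The -n flag skips
# the move if a file without the suffix already exists, to avoid clobbering.
find . -type f -name '*@ver=*' | while IFS= read -r f; do
  mv -n -- "$f" "${f%%@ver=*}"
done
```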
This command:

```sh
wget --mirror --convert-links --adjust-extension --page-requisites --no-parent https://web.archive.org/web/20210628062523/https://www.ps-survival.com/PS/Hydro-Power/index.htm
```

will download the PDFs. But if I change the URL to https://web.archive.org/web/20220118034512/https://ps-survival.com/PS/index.htm, it doesn't recurse down and download the PDFs. Could someone tell me why that is? I'm trying to download all the PDFs.
@iceguru I'd try using an archive downloader. Wget doesn't play nicely with how the Wayback Machine is set up:
https://github.com/hartator/wayback-machine-downloader
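For example, something along these lines should pull the PDFs straight out of the archive (the `--only` filter is a regex over file paths; check `wayback_machine_downloader --help` for your version's exact flags):

```sh
gem install wayback_machine_downloader
# Hypothetical invocation: fetch only the archived PDFs for the site.
wayback_machine_downloader https://ps-survival.com --only "/\.pdf$/i"
```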
I'm also eyeing this package, since it can be hooked up to LLMs directly instead of shelling out to wget: https://pypi.org/project/pywebcopy/
Recently discovered --random-wait as an option (via https://gist.github.com/stvhwrd/985dedbe1d3329e68d70); it should be included to make things less sus.
Just realized that --no-clobber and --mirror conflict (--mirror implies -N timestamping, which can't be combined with --no-clobber), so you should use --recursive -l inf instead: https://stackoverflow.com/questions/13092229/cant-resume-wget-mirror-with-no-clobber-c-f-b-unhelpful
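So a resumable crawl would spell the recursion out instead of using --mirror, roughly like this (example.com is a placeholder):

```sh
# --mirror expands to -r -N -l inf --no-remove-listing; the -N (timestamping)
# part is what clashes with --no-clobber. Spelling it out avoids the clash:
wget --recursive --level=inf --no-clobber --page-requisites \
     --adjust-extension --convert-links --no-parent https://example.com/
```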
If you're going to use --recursive, then you need to set --level (wget's default depth is 5), and you should probably be polite and use --wait.
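Putting those together, a politer run might look like this sketch (the depth and delay are placeholders to tune per site):

```sh
# Cap the recursion depth and pause between requests; --random-wait varies
# the delay between 0.5x and 1.5x of --wait so the crawl looks less bot-like.
wget --recursive --level=3 --wait=2 --random-wait \
     --page-requisites --adjust-extension --convert-links \
     --no-parent https://example.com/
```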