https://superuser.com/a/1415765

The wget command you'll need is fairly lengthy, as explained below. You may wish to save it to a file like wholesite.sh, make it executable, and run it (see the example after the command). It will create a directory named after the URL, with subdirectories holding the site's assets, including images, JS, CSS, etc.

wget \
     --recursive \
     --level 5 \
     --no-clobber \
     --page-requisites \
     --adjust-extension \
     --span-hosts \
     --convert-links \
     --restrict-file-names=windows \
     --domains yoursite.com \
     --no-parent \
         yoursite.com
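
For example, assuming you've saved the command above to wholesite.sh (the filename is just the suggestion from the text), you can make it executable and run it like this:

     chmod +x wholesite.sh
     ./wholesite.sh
     # the mirror ends up in a directory named after the domain, e.g. ./yoursite.com/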

Explanation

--recursive Retrieve the site recursively, following links into subdirectories (assets like images are often kept in subdirectories of the site). The default maximum recursion depth is 5 levels; you can change that with the --level flag just below.

--level 5 Recurse up to 5 levels deep. Increase or decrease this if the target site is larger or smaller, respectively.

--no-clobber Don't overwrite existing files.

--page-requisites Causes wget to download all the files necessary to properly display a given HTML page, including images, CSS, JS, etc.

--adjust-extension Adds the proper file extension (.html, .css, etc.) to downloaded files that would otherwise lack one.

--span-hosts Allow wget to fetch necessary assets from other hosts as well.

--convert-links Rewrites links in the downloaded pages so they point at the local copies, letting you browse the site offline (see the local-preview sketch after this list).

--restrict-file-names=windows Modify filenames to work in Windows as well, in case you're using this command on a Windows system.

--domains yoursite.com Do not follow links outside this domain.

--no-parent Don't ascend to the parent directory; links above the directory you pass in won't be followed.

yoursite.com # The URL to download
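
Once the download finishes, one quick way to sanity-check the mirror is to serve the output directory locally. This is only a sketch; the directory name follows from the domain above and the port is arbitrary:

     cd yoursite.com
     # serve the mirror at http://localhost:8000 using Python's built-in HTTP server
     python3 -m http.server 8000

Thanks to --convert-links you can also simply open the downloaded index.html directly in a browser.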

#!/bin/bash
# Use a modified version of the commands in the readme.
# Don't use Windows filenames, and don't adjust extensions, as we're going to be using the output
# to generate a tree of links to the original site, instead of using the actual site locally.
wget \
     --recursive \
     --level 10 \
     --page-requisites \
     --span-hosts \
     --convert-links \
     --no-parent \
     --domains www.quakersaustralia.info \
         https://www.quakersaustralia.info
# Now turn the scraped directory of files into a nice tree in HTML. The first argument
# is the base URL that links will point back to, so you can actually get to the real site
# from this index.
tree -H https://www.quakersaustralia.info ./www.quakersaustralia.info/ > tree.html
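
Note that the script depends on the tree utility, which isn't installed by default on many systems. A minimal sketch of a guard you could add near the top of the script, plus a way to open the generated index afterwards (xdg-open is Linux-specific; macOS uses open instead):

# abort early if tree isn't available (place near the top of the script)
command -v tree >/dev/null 2>&1 || { echo "tree is required but not installed" >&2; exit 1; }

# after the script finishes, open the generated index in a browser
xdg-open tree.html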