#!/bin/bash
# `wget` script to mirror / back up a site and its subresources even if they span across different domains
#
# Author: Florian Bender
# Updated: 2015-08-31
# License: MIT
#
# Copyright (c) 2015 Florian Bender <[email protected]>
#
# Permission is hereby granted, free of charge, to any person
# obtaining a copy of this software and associated documentation
# files (the "Software"), to deal in the Software without
# restriction, including without limitation the rights to use,
# copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the
# Software is furnished to do so, subject to the following
# conditions:
#
# The above copyright notice and this permission notice shall be
# included in all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
# EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
# OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
# NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
# HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
# WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
# OTHER DEALINGS IN THE SOFTWARE.
# some sources:
# - `man wget`
# - https://darcynorman.net/2011/12/24/archiving-a-wordpress-website-with-wget/
# - http://superuser.com/questions/129085/make-wget-download-page-resources-on-a-different-domain
selfname=$(basename "$0")

if [ -z "$1" ] ; then
    echo "Usage: ${selfname}" '$URL [$SPAN_HOST ...]' >&2
    exit 1
fi
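# Example invocation (hypothetical host names; assuming the script is saved
# as mirror.sh): mirror example.com and also fetch assets from two CDN hosts:
#   ./mirror.sh example.com cdn.example.net static.example.org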
# save first parameter as domain to crawl (and initialize other vars), then shift parameters
source_host="$1"
span_hosts=""
wget_hosts=""
shift
# wget flags (from `man`; options that are set explicitly `+` or implicitly `=`)
# -D --domains=<domains> Set domains to be followed. <domains> is a comma-separated list of domains. See [-H]
# + -E --adjust-extension […] this option will cause the suffix .html to be appended to the local filename.
# + -e .wgetrc equivalent, here: disrespect robots.txt & rel=nofollow
# -H --span-hosts Enable spanning across hosts when doing recursive retrieving.
# -K --backup-converted keeps .orig before conversion
# + -k --convert-links {}
# = -l --level=<N> Specify recursion maximum depth level depth.
# + -m --mirror equiv. of -r -N -l inf --no-remove-listing
# = -N --timestamping Turn on time-stamping.
# + -np --no-parent do not traverse above current directory
# + -p --page-requisites {}
# = -r --recursive {}
# = --no-remove-listing Don't remove the temporary .listing files generated by FTP retrievals.
# = --html-extension [DEPRECATED] use -E instead
# Specify the host to crawl.
wget_hosts="$source_host"
# Since we shifted the parameters, we check for the second (and more) parameters
# by checking the (now) first parameter. If there are more parameters, join them
# into a comma-separated list for consumption by `wget` and overwrite the existing
# variable.
# specifies wget parameters:
# + -D --domains=<domains> Set domains to be followed. <domains> is a comma-separated list of domains. See [-H]
# + -H --span-hosts Enable spanning across hosts when doing recursive retrieving.
if [ -n "$1" ] ; then
    # Automatically enable host spanning (-H) and build the domain whitelist (-D).
    # "$@" keeps each host as its own word; printf appends a comma after every
    # host, and cut drops the trailing comma by selecting fields 1 to $#.
    span_hosts=$(printf "%s," "$@" | cut -d "," -f 1-$#)
    wget_hosts="-H -D $source_host,$span_hosts $source_host"
fi
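# Worked example of the expansion above (hypothetical host names): given
#   ./mirror.sh example.com cdn.example.net static.example.org
# the variables end up as
#   span_hosts="cdn.example.net,static.example.org"
#   wget_hosts="-H -D example.com,cdn.example.net,static.example.org example.com"
# i.e. `wget` recurses from example.com and is allowed to span into the two
# whitelisted asset hosts.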
# Create and announce the command to be executed.
wget_command="wget -E -e robots=off -k -m -np -p $wget_hosts"
echo "$selfname: [DEBUG] will run '$wget_command'" >&2
# Start crawling and exit with the success code.
$wget_command
wget_result=$?
echo "$selfname: [DEBUG] finished '$wget_command'" >&2
exit $wget_result