@dotherightthing
Last active August 17, 2019 18:53
[Scraping Archive.org] Recovering old web projects. #scraping #waybackmachine #portfolio

Scraping Archive.org

Created: 2017.04.06

  1. Go to the Wayback Machine, enter the site's URL, and note the timestamps of the first and last snapshots for the period in which the site looked the way you want to recover
  2. Install wayback-machine-downloader (a Ruby gem)
  3. Download the site:
$ cd download-directory
$ wayback_machine_downloader http://www.dansmith.co.nz/ --from 20081014030155 --to 20100526160621 --directory /Volumes/DanBackup/Websites/_archive/www.dansmith.co.nz/public/  

264 files to download
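
The `--from` and `--to` values above are Wayback Machine timestamps in the full 14-digit form `YYYYMMDDhhmmss`, as they appear in snapshot URLs. As a rough sketch, the helper functions below (hypothetical names, not part of wayback_machine_downloader) check that a value has that form and print it in a readable way before it is pasted into the command:

```shell
#!/bin/sh
# Hypothetical helpers for reading Wayback Machine timestamps.
# Assumption: the full 14-digit form (YYYYMMDDhhmmss) is being used.

# Succeeds only if the argument is exactly 14 digits.
is_wayback_ts() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;
  esac
  [ "${#1}" -eq 14 ]
}

# Print a timestamp as "YYYY-MM-DD hh:mm:ss".
format_wayback_ts() {
  is_wayback_ts "$1" || { echo "not a 14-digit timestamp: $1" >&2; return 1; }
  printf '%s\n' "$1" |
    sed 's/^\(....\)\(..\)\(..\)\(..\)\(..\)\(..\)$/\1-\2-\3 \4:\5:\6/'
}

format_wayback_ts 20081014030155   # prints 2008-10-14 03:01:55
format_wayback_ts 20100526160621   # prints 2010-05-26 16:06:21
```
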
  4. Create a local host for this site (e.g. dansmith.dan), so that absolute URLs load correctly
  5. Add this .htaccess file to the root of download-directory:
RewriteEngine on
# Serve requests for .html URLs from the matching .php file
RewriteRule ^(.*)\.html$ $1.php [nc]

# Allow directory browsing
Options +Indexes 

# Redirecting external resource requests (e.g. http://www.dansmith.co.nz/_resources/ui/images/web/telecom-20081004.jpg)
# is not possible here, as .htaccess only handles incoming requests
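
The local host step does not say how the host is created. One possible setup, assuming Apache 2.4 and that `dansmith.dan` should resolve to this machine, is a hosts entry plus a virtual host whose document root is the download directory. The sketch below only prints both pieces so they can be reviewed and added by hand; the function names are hypothetical, and the path is the `--directory` value from the download command.

```shell
#!/bin/sh
# Print (do not install) the config a local host would need.
# Assumptions: Apache 2.4, host name dansmith.dan, and the --directory
# path from the download command as the document root.

SITE_HOST="dansmith.dan"
SITE_ROOT="/Volumes/DanBackup/Websites/_archive/www.dansmith.co.nz/public"

# Line to append to /etc/hosts so the name resolves to this machine.
hosts_entry() {
  printf '127.0.0.1 %s\n' "$1"
}

# Minimal VirtualHost; AllowOverride All lets the .htaccess file take effect.
vhost_block() {
  cat <<EOF
<VirtualHost *:80>
    ServerName $1
    DocumentRoot "$2"
    <Directory "$2">
        AllowOverride All
        Require all granted
    </Directory>
</VirtualHost>
EOF
}

hosts_entry "$SITE_HOST"
vhost_block "$SITE_HOST" "$SITE_ROOT"
```

Without `AllowOverride` permitting it, Apache ignores the `RewriteRule` and `Options` lines in the .htaccess file, which is why the sketch enables it for the document root.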
  6. Load a page, e.g. http://dansmith.dan/life/
  7. The page loads in full (x-scraping-archive-org-1.png)
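
Before opening the page in a browser, a quick status check can confirm that the local host and the rewrite rule respond at all. This is a minimal sketch that assumes `curl` is installed; `verdict` and `check_page` are hypothetical helper names. It only reports the HTTP status, so whether the page renders in full still needs a visual check.

```shell
#!/bin/sh
# Report the HTTP status of a page on the local host.
# Assumption: curl is installed.

# Map an HTTP status code to a short verdict.
verdict() {
  case "$1" in
    200) echo "ok" ;;
    000) echo "no response (is the local host set up?)" ;;
    *)   echo "unexpected status $1" ;;
  esac
}

# Fetch a URL, discard the body, and report the status code.
check_page() {
  status=$(curl -s -o /dev/null -w '%{http_code}' "$1")
  printf '%s: %s\n' "$1" "$(verdict "$status")"
}

# Usage: check_page http://dansmith.dan/life/
```
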