dotherightthing/scraping-archive-org.md

Last active August 17, 2019 18:53


[Scraping Archive.org] Recovering old web projects. #scraping #waybackmachine #portfolio

Scraping Archive.org

Created: 2017.04.06

Go to the Wayback Machine, enter a URL, and note the start and end dates of the period during which the site's snapshots looked a particular way.
Download and install the wayback-machine-downloader (a Ruby gem that provides the wayback_machine_downloader command).
Download the site:

$ cd download-directory
$ wayback_machine_downloader http://www.dansmith.co.nz/ --from 20081014030155 --to 20100526160621 --directory /Volumes/DanBackup/Websites/_archive/www.dansmith.co.nz/public/  

264 files to download
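
The --from and --to values above are 14-digit timestamps (YYYYMMDDhhmmss), the same form that appears in Wayback Machine snapshot URLs. As a minimal sketch, the timestamp could be pulled out of a copied snapshot URL like this. The function name wayback_timestamp is hypothetical, and the example URL is illustrative only, built from the --from value used above:

```shell
# Hypothetical helper: print the timestamp embedded in a Wayback Machine
# snapshot URL of the form https://web.archive.org/web/<timestamp>/<original-url>.
# Prints nothing if the argument does not look like a snapshot URL.
wayback_timestamp() {
  printf '%s\n' "$1" |
    sed -n 's#^https\{0,1\}://web\.archive\.org/web/\([0-9][0-9]*\).*#\1#p'
}

# Illustrative usage (the URL is an assumption, not taken from the notes):
# wayback_timestamp 'https://web.archive.org/web/20081014030155/http://www.dansmith.co.nz/'
```

The printed value can then be passed to --from or --to as-is.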

Create a local host for this site (e.g. dansmith.dan) so that absolute URLs load correctly.
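
One way to make a name like dansmith.dan resolve locally is a hosts-file entry. The sketch below covers only that part and assumes the web server is separately configured to serve the download directory for that name. The function add_host_entry is hypothetical, and writing to /etc/hosts requires root privileges:

```shell
# Hypothetical helper: map a hostname to 127.0.0.1 in a hosts file,
# skipping the append if the hostname is already listed on a
# non-comment line.
# Usage: add_host_entry <hosts_file> <hostname>
add_host_entry() {
  hosts_file="$1"
  name="$2"
  # Compare each hostname field exactly, so "dansmith.dan" does not
  # match "www.dansmith.dan".
  if awk -v h="$name" '
       $1 !~ /^#/ { for (i = 2; i <= NF; i++) if ($i == h) found = 1 }
       END { exit !found }
     ' "$hosts_file" 2>/dev/null; then
    return 0
  fi
  printf '127.0.0.1\t%s\n' "$name" >> "$hosts_file"
}

# Illustrative usage (run as root):
# add_host_entry /etc/hosts dansmith.dan
```

Running it twice leaves a single entry, so it is safe to repeat.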
Add this .htaccess file to the root of download-directory:

RewriteEngine on
# Serve the matching .php file when a .html URL is requested
RewriteRule ^(.*)\.html$ $1.php [nc]

# Allow directory browsing
Options +Indexes 

# redirect external resource requests eg http://www.dansmith.co.nz/_resources/ui/images/web/telecom-20081004.jpg
# - not possible as htaccess only handles incoming requests

Load a page, e.g. http://dansmith.dan/life/
The page loads in full (see x-scraping-archive-org-1.png).
