@CodeBrauer
Last active August 15, 2018 13:34
Get all snapshots from archive.org as list
{
    "require": {
        "fabpot/goutte": "^3.2",
        "digitalnature/php-ref": "^1.2"
    }
}
<?php
require_once 'vendor/autoload.php';

use Goutte\Client;

$url = "http://medoo.in/";

/**
 * get first snapshot url from archive.org
 * @param string $url url of the archived website
 * @param integer $year year of the first snapshot (default: 1996)
 * @return string|false snapshot url, or false if none was found
 * @author Leksat <http://stackoverflow.com/a/11699301/1990745>
 */
function getFirstSnapshot($url, $year = 1996) {
    $waybackurl = "https://web.archive.org/web/$year/$url"; // redirects to the first snapshot
    $ch = curl_init($waybackurl);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_HEADER         => true,  // keep the response headers in the result
        CURLOPT_FOLLOWLOCATION => false, // only the redirect target is needed, not the page
    ]);
    $response = curl_exec($ch);
    curl_close($ch);
    if ($response === false) {
        return false; // the request itself failed
    }
    // header names are case-insensitive, hence the "i" flag
    preg_match_all('/^Location:(.*)$/mi', $response, $matches);
    return !empty($matches[1]) ? trim($matches[1][0]) : false;
}
$first_url = getFirstSnapshot($url);
if (!$first_url) {
    die('Could not find first snapshot');
}
// the snapshot timestamp follows "/web/"; its first four digits are the year
if (!preg_match('#/web/(\d{4})#', $first_url, $yearMatch)) {
    die('Could not find year in first snapshot url');
}
$firstFoundYear = (int)$yearMatch[1];

$foundUrls = [];
$client = new Client();
foreach (range($firstFoundYear, (int)date('Y')) as $year) {
    echo ">> Year:" . $year . PHP_EOL;
    // '*' after the year => show calendar
    $crawler = $client->request('GET', "https://web.archive.org/web/$year*/$url");
    $crawler->filter('.day a[href^="/web/"][href$="' . $url . '"]')->each(
        // capture by reference, otherwise the outer $foundUrls stays empty
        function ($node) use (&$foundUrls) {
            $href = $node->attr('href');
            $foundUrls[] = $href;
            echo $href . PHP_EOL;
        }
    );
}
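
The loop above depends on the HTML of the calendar page (the .day a selector), so it finds nothing if that markup changes or is filled in by JavaScript. A possible alternative, sketched here under assumptions, is the Wayback CDX server API, which returns the capture list as JSON. The helper names buildCdxUrl and cdxJsonToPaths are hypothetical, the sample body is illustrative rather than a recorded response, and the code assumes PHP 7.1 or newer:

```php
<?php
// Build a CDX query URL for one site and a range of years.
// url, output, fl, from and to are CDX server query parameters.
function buildCdxUrl(string $url, int $fromYear, int $toYear): string
{
    return 'https://web.archive.org/cdx/search/cdx?' . http_build_query([
        'url'    => $url,
        'output' => 'json',
        'fl'     => 'timestamp,original', // only return these two fields
        'from'   => $fromYear,
        'to'     => $toYear,
    ]);
}

// Turn a CDX JSON body into "/web/<timestamp>/<original>" paths.
// With output=json the first row holds the field names, so it is skipped.
function cdxJsonToPaths(string $json): array
{
    $rows = json_decode($json, true);
    if (!is_array($rows) || count($rows) < 2) {
        return [];
    }
    $paths = [];
    foreach (array_slice($rows, 1) as [$timestamp, $original]) {
        $paths[] = "/web/$timestamp/$original";
    }
    return $paths;
}

// Illustrative sample body (an assumption, not real API output).
$sample = '[["timestamp","original"],["20130403013843","http://medoo.in/"]]';
print_r(cdxJsonToPaths($sample));
```

In a real run the body would come from fetching buildCdxUrl($url, $firstFoundYear, (int)date('Y')), for example with the same curl setup as getFirstSnapshot but without CURLOPT_HEADER.
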
@CodeBrauer

Example output:

>> Year:2013
/web/20130403013843/http://medoo.in/
/web/20130501041949/http://medoo.in/
/web/20130516125616/http://medoo.in/
/web/20130613153655/http://medoo.in/
/web/20130621033258/http://medoo.in/
/web/20130708010021/http://medoo.in/
/web/20130709212627/http://medoo.in/
/web/20130727125806/http://medoo.in/
/web/20131023113617/http://medoo.in/
/web/20131202193450/http://medoo.in/
>> Year:2014
/web/20140105231359/http://medoo.in/
/web/20140207210314/http://medoo.in/
/web/20140208003059/http://medoo.in/
/web/20140517210600/http://medoo.in/
/web/20140625034952/http://medoo.in/
/web/20140909050403/http://medoo.in/
/web/20140920190041/http://medoo.in/
/web/20140929224459/http://medoo.in/
/web/20141007084054/http://medoo.in/
/web/20141015091103/http://medoo.in/
/web/20141025123652/http://medoo.in/
/web/20141102105426/http://medoo.in/
/web/20141111111538/http://medoo.in/
/web/20141208233916/http://medoo.in/
/web/20141217174946/http://medoo.in/
/web/20141219225452/http://medoo.in/
/web/20141229103307/http://medoo.in/
>> Year:2015
/web/20150107011340/http://medoo.in/
/web/20150205221451/http://medoo.in/
/web/20150207000759/http://medoo.in/
/web/20150215050024/http://medoo.in/
/web/20150222010513/http://medoo.in/
/web/20150302051614/http://medoo.in/
/web/20150310072844/http://medoo.in/
/web/20150313003257/http://medoo.in/
/web/20150314191740/http://medoo.in/
/web/20150315013016/http://medoo.in/
/web/20150318021038/http://medoo.in/
/web/20150326001907/http://medoo.in/
/web/20150403223003/http://medoo.in/
/web/20150423175935/http://medoo.in/
/web/20150430030749/http://medoo.in/
/web/20150507072556/http://medoo.in/
/web/20150515230029/http://medoo.in/
/web/20150801231401/http://medoo.in/
/web/20150814004911/http://medoo.in/
/web/20150821024614/http://medoo.in/
/web/20151006020405/http://medoo.in/
/web/20151013102549/http://medoo.in/
/web/20151015202258/http://medoo.in/
/web/20151022125938/http://medoo.in/
/web/20151029234804/http://medoo.in/
/web/20151030094047/http://medoo.in/
/web/20151106032649/http://medoo.in/
/web/20151107055939/http://medoo.in/
/web/20151113042930/http://medoo.in/
/web/20151115151501/http://medoo.in/
/web/20151120063802/http://medoo.in/
/web/20151126054614/http://medoo.in/
/web/20151127081910/http://medoo.in/
/web/20151204094807/http://medoo.in/
/web/20151211130307/http://medoo.in/
/web/20151218170129/http://medoo.in/
/web/20151225175159/http://medoo.in/
/web/20151226174601/http://medoo.in/
>> Year:2016
/web/20160101201018/http://medoo.in/
/web/20160106124118/http://medoo.in/
/web/20160109004908/http://medoo.in/
/web/20160111135746/http://medoo.in/
/web/20160115123459/http://medoo.in/
/web/20160116133233/http://medoo.in/
/web/20160123182542/http://medoo.in/
/web/20160126044455/http://medoo.in/
/web/20160130161702/http://medoo.in/
/web/20160203075850/http://medoo.in/
/web/20160210132532/http://medoo.in/
/web/20160212043615/http://medoo.in/
/web/20160217171740/http://medoo.in/
/web/20160220105855/http://medoo.in/
/web/20160225072739/http://medoo.in/
/web/20160226134010/http://medoo.in/
/web/20160227030256/http://medoo.in/
/web/20160304012341/http://medoo.in/
/web/20160305003626/http://medoo.in/
/web/20160306063037/http://medoo.in/
/web/20160312033402/http://medoo.in/
/web/20160313223248/http://medoo.in/
/web/20160319093915/http://medoo.in/
/web/20160321000757/http://medoo.in/
/web/20160326131817/http://medoo.in/
/web/20160329121958/http://medoo.in/
/web/20160402121311/http://medoo.in/
/web/20160406105856/http://medoo.in/
/web/20160409133448/http://medoo.in/
/web/20160414082919/http://medoo.in/
/web/20160416154109/http://medoo.in/
/web/20160421212251/http://medoo.in/
/web/20160423170146/http://medoo.in/
/web/20160430134822/http://medoo.in/
/web/20160505072115/http://medoo.in/
/web/20160508212635/http://medoo.in/
/web/20160517143806/http://medoo.in/
/web/20160602091257/http://medoo.in/
/web/20160609113656/http://medoo.in/
/web/20160616135035/http://medoo.in/
/web/20160629164649/http://medoo.in/
/web/20160708183845/http://medoo.in/
/web/20160812052156/http://medoo.in/
/web/20160819124441/http://medoo.in/
/web/20160826121036/http://medoo.in/
/web/20161007124748/http://medoo.in/
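
Each line of this output has the shape /web/<14-digit timestamp>/<original url>, where the timestamp reads YYYYMMDDhhmmss. A minimal sketch for splitting such a line into its parts, assuming PHP 7.1 or newer and treating the timestamp as UTC; parseSnapshotPath is a hypothetical helper, not part of the script:

```php
<?php
// Split "/web/<timestamp>/<url>" into timestamp, parsed time and original url.
// Returns null when the path does not have that shape.
function parseSnapshotPath(string $path): ?array
{
    if (!preg_match('#^/web/(\d{14})/(.+)$#', $path, $m)) {
        return null;
    }
    // assumption: the timestamp is interpreted as UTC
    $time = DateTimeImmutable::createFromFormat('YmdHis', $m[1], new DateTimeZone('UTC'));
    if ($time === false) {
        return null;
    }
    return ['timestamp' => $m[1], 'time' => $time, 'url' => $m[2]];
}

$snapshot = parseSnapshotPath('/web/20130403013843/http://medoo.in/');
echo $snapshot['time']->format('Y-m-d H:i:s') . ' ' . $snapshot['url'] . PHP_EOL;
// prints "2013-04-03 01:38:43 http://medoo.in/"
```
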

@CodeBrauer

After getting this far, I just added this small part:

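foreach ($foundUrls as $furl) {
    // the first run of digits in "/web/<timestamp>/<url>" is the snapshot timestamp
    if (!preg_match('/\d+/', $furl, $match)) {
        continue;
    }
    $timestamp = $match[0];
    // escape the url so the generated line is safe to pipe into a shell
    echo 'wayback_machine_downloader ' . escapeshellarg($url) . " -t $timestamp -d $timestamp" . PHP_EOL;
}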

The output can then be executed directly like this: php index.php | bash, and that got me all my stuff.
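
One caveat when piping into bash: every line the script writes to standard output reaches the shell, including the ">> Year:" headers and the snapshot paths echoed by the loop. A minimal sketch, assuming PHP 7.1 or newer on the command line, that builds the commands in a function and keeps progress messages on STDERR. buildDownloadCommands is a hypothetical helper name, and the -t and -d flags are copied from the snippet above:

```php
<?php
/**
 * Build one wayback_machine_downloader command per snapshot path.
 * Hypothetical helper; paths are expected to look like "/web/<timestamp>/<url>".
 */
function buildDownloadCommands(string $url, array $foundUrls): array
{
    $commands = [];
    foreach ($foundUrls as $furl) {
        // skip anything without a 14-digit snapshot timestamp
        if (!preg_match('#/web/(\d{14})/#', $furl, $match)) {
            continue;
        }
        $timestamp = $match[1];
        $commands[] = 'wayback_machine_downloader ' . escapeshellarg($url)
            . " -t $timestamp -d $timestamp";
    }
    return $commands;
}

$commands = buildDownloadCommands('http://medoo.in/', [
    '/web/20130403013843/http://medoo.in/',
    '/web/20130501041949/http://medoo.in/',
]);

// progress goes to STDERR, so "php index.php | bash" only receives the commands
fwrite(STDERR, count($commands) . ' commands generated' . PHP_EOL);
echo implode(PHP_EOL, $commands) . PHP_EOL;
```

In the full script, the two hard-coded paths would be replaced by $foundUrls, and the echo calls inside the crawl loop would be switched to fwrite(STDERR, ...) in the same way.
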