Name | Language | Platform |
---|---|---|
Heritrix | Java | Linux |
Nutch | Java | Cross-platform |
Scrapy | Python | Cross-platform |
DataparkSearch | C++ | Cross-platform |
GNU Wget | C | Linux |
GRUB | C#, C, Python, Perl | Cross-platform |
ht://Dig | C++ | Unix |
HTTrack | C/C++ | Cross-platform |
/*
FILE ARCHIVED ON 8:24:05 Feb 21, 2012 AND RETRIEVED FROM THE
INTERNET ARCHIVE ON 3:48:08 Oct 27, 2014.
JAVASCRIPT APPENDED BY WAYBACK MACHINE, COPYRIGHT INTERNET ARCHIVE.
#!/bin/bash
url=http://redefininggod.com
webarchive=https://web.archive.org
wget="wget -e robots=off -nv"
tab="$(printf '\t')"
additional_url=url.list
# Construct listing.txt from url.list
# The list of archived pages, including some wildcard url
var casper = require("casper").create({
    verbose: true,
    logLevel: "debug"
});
var response = [], url, site, viewportWidth = 1280, viewportHeight = 1024;
var captureSize = {top: 0, left: 0, width: viewportWidth, height: viewportHeight};
// Extract the screenshot folder name from the url
var re = /url=(\w+)\./;
The Common Crawl dataset lives on Amazon S3 as part of the Amazon Public Datasets program. Downloading it is free from any Amazon EC2 instance, via either S3 or HTTP.
As the Common Crawl Foundation has evolved over the years, so have the format and metadata that accompany the crawls.
- [ARC] Archived Crawl #1 - s3://aws-publicdatasets/common-crawl/crawl-001/ - crawl data from 2008/2010
- [ARC] Archived Crawl #2 - s3://aws-publicdatasets/common-crawl/crawl-002/ - crawl data from 2009/2010
- [ARC] Archived Crawl #3 - s3://aws-publicdatasets/common-crawl/parse-output/ - crawl data from 2012
- [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-20/
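Since the data is described as reachable over HTTP as well as S3, the sketch below shows one way an `s3://` location from the list might be rewritten as an HTTPS URL. It assumes the standard virtual-hosted-style S3 addressing (`https://<bucket>.s3.amazonaws.com/<key>`); the helper name `s3_to_http` is hypothetical, and whether the bucket is still served at that address is not confirmed here.

```shell
#!/bin/bash
# Hypothetical helper: rewrite s3://bucket/key as a virtual-hosted-style HTTPS URL.
# Assumes the URI contains a key after the bucket name.
s3_to_http() {
  local uri=$1
  local rest=${uri#s3://}    # drop the s3:// scheme
  local bucket=${rest%%/*}   # everything before the first slash
  local key=${rest#*/}       # everything after the first slash
  printf 'https://%s.s3.amazonaws.com/%s\n' "$bucket" "$key"
}

s3_to_http "s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-20/"
```

For the WARC entry above, this prints `https://aws-publicdatasets.s3.amazonaws.com/common-crawl/crawl-data/CC-MAIN-2013-20/`.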
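#!/bin/bash
# Every 15 seconds, rebuild output.txt with all links found in the IRC logs.
while true
do
  echo "Collect the links..."
  # Read file names NUL-delimited so paths with spaces survive;
  # tac reverses each file's matches so the newest links come first.
  find /home/user/irclogs/ -name "*.log" -print0 |
    while IFS= read -r -d '' x ; do
      grep -E --only-matching "http(s?):\/\/[^ \"\(\)\<\>]*" "$x" | tac
    done > output.txt
  echo "Let's wait for some seconds..."
  sleep 15
done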
#!/usr/bin/env bash
line=""
started=""
# Recreate the FIFO used to feed commands to the IRC connection
rm -f botfile
mkfifo botfile
tail -f botfile | nc irc.cat.pdx.edu 6667 | while true ; do
  # Register with the server once, on the first pass through the loop
  if [ -z "$started" ] ; then
    echo "USER bdbot 0 bdbot :I iz a bot" > botfile
    echo "NICK bdbot" >> botfile
Mini projects by Maxime Euzière (xem), subzey, Martin Kleppe (aemkei), Mathieu Henri (p01), Litterallylara, Tommy Hodgins (innovati), Veu(beke), Anders Kaare, Keith Clark, Addy Osmani, bburky, rlauck, cmoreau, maettig, thiemowmde, ilesinge, adlq, solinca, xen_the,...
(For more info and other projects, visit http://xem.github.io)
(Official Slack room: http://jsgolf.club / join us on http://register.jsgolf.club)
#!/bin/sh
# Copyright 2014 Vivien Didelot <[email protected]>
# Licensed under the terms of the GNU GPL v3, or any later version.
NICK=irccat42
SERVER=irc.freenode.net
PORT=6667
CHAN="#irccat"
{