pokeball99

Name	Language	Platform
Heritrix	Java	Linux
Nutch	Java	Cross-platform
Scrapy	Python	Cross-platform
DataparkSearch	C++	Cross-platform
GNU Wget	C	Linux
GRUB	C#, C, Python, Perl	Cross-platform
ht://Dig	C++	Unix
HTTrack	C/C++	Cross-platform

Tracing the origin of an internet quote

So I saw this quote on Twitter:

"Justice will only be achieved when those who are not injured by crime feel as indignant as those who are."
–Solomon (635-577 B.C.)

Which I thought was a really effective quote relating the

Data Location

The Common Crawl dataset lives on Amazon S3 as part of the Amazon Public Datasets program. Downloading it is free from any instance on Amazon EC2, via either S3 or HTTP.

As the Common Crawl Foundation has evolved over the years, so have the format and metadata that accompany the crawls themselves.

[ARC] Archived Crawl #1 - s3://aws-publicdatasets/common-crawl/crawl-001/ - crawl data from 2008/2010
[ARC] Archived Crawl #2 - s3://aws-publicdatasets/common-crawl/crawl-002/ - crawl data from 2009/2010
[ARC] Archived Crawl #3 - s3://aws-publicdatasets/common-crawl/parse-output/ - crawl data from 2012
[WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-20/
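Since the data is said to be reachable over HTTP as well as S3, here is a minimal Python sketch of turning one of the `s3://` locations above into an HTTP address. It assumes the generic virtual-hosted S3 URL form `https://<bucket>.s3.amazonaws.com/<key>`; the helper name `s3_to_http` is made up for this note, and whether this particular bucket still answers on that endpoint is not something the list above confirms.

```python
from urllib.parse import urlparse


def s3_to_http(s3_url):
    """Map s3://bucket/key to https://bucket.s3.amazonaws.com/key.

    Assumption: the bucket is publicly readable through the generic
    virtual-hosted S3 endpoint. This only rewrites the string and does
    not contact Amazon.
    """
    parsed = urlparse(s3_url)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("expected an s3://bucket/key URL, got %r" % s3_url)
    # netloc is the bucket name, path is the key (with its leading slash)
    return "https://%s.s3.amazonaws.com%s" % (parsed.netloc, parsed.path)


if __name__ == "__main__":
    print(s3_to_http("s3://aws-publicdatasets/common-crawl/crawl-001/"))
```

The function is pure string handling, so it can be tried without AWS credentials.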

codegolf JS

Mini projects by Maxime Euzière (xem), subzey, Martin Kleppe (aemkei), Mathieu Henri (p01), Litterallylara, Tommy Hodgins (innovati), Veu(beke), Anders Kaare, Keith Clark, Addy Osmani, bburky, rlauck, cmoreau, maettig, thiemowmde, ilesinge, adlq, solinca, xen_the,...

(For more info and other projects, visit http://xem.github.io)

(Official Slack room: http://jsgolf.club / join us on http://register.jsgolf.club)






	/*
	FILE ARCHIVED ON 8:24:05 Feb 21, 2012 AND RETRIEVED FROM THE
	INTERNET ARCHIVE ON 3:48:08 Oct 27, 2014.
	JAVASCRIPT APPENDED BY WAYBACK MACHINE, COPYRIGHT INTERNET ARCHIVE.

	#!/bin/bash

	url=http://redefininggod.com
	webarchive=https://web.archive.org
	wget="wget -e robots=off -nv"
	tab="$(printf '\t')"
	additional_url=url.list

	# Construct listing.txt from url.list
	# The list of archived pages, including some wildcard url

	var casper = require("casper").create({
	verbose: true,
	logLevel: "debug"
	});
	var response = [], url, site, viewportWidth = 1280, viewportHeight = 1024;
	var captureSize = {top: 0, left: 0, width: viewportWidth, height: viewportHeight};

	// to get screenshot folder name from url
	var re = /url=(\w+)\./;
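To see what that folder-name pattern picks out, here is a small Python sketch of the same regular expression. The function name `folder_name` and the sample addresses are made up for illustration; `re.ASCII` is used so that `\w` matches the same characters as in JavaScript.

```python
import re

# Same pattern as the CasperJS snippet: the word characters that follow
# "url=" up to the first dot.
FOLDER_PATTERN = re.compile(r"url=(\w+)\.", re.ASCII)


def folder_name(address):
    """Return the captured folder name, or None when there is no match."""
    match = FOLDER_PATTERN.search(address)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(folder_name("capture.html?url=example.com"))
    print(folder_name("capture.html?url=http://example.com"))
```

Note that a value such as `url=http://example.com` does not match, because `:` is not a word character and the pattern needs a dot directly after the captured word.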

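	#!/bin/bash
	# Every 15 seconds, rebuild output.txt with the links found in the IRC logs.
	while true
	do
	echo "Collect the links..."
	# -print0 / read -d '' keeps file names with spaces intact
	find /home/user/irclogs/ -name "*.log" -print0 |
	while IFS= read -r -d '' x ; do grep -E --only-matching "https?://[^ \"()<>]*" "$x" | tac ; done > output.txt
	echo "Let's wait for some seconds..."
	sleep 15
	done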
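The same link extraction can be sketched in Python, which makes the pattern easier to try on a sample log line. This is only an illustration: the name `extract_links` and the sample text are made up, and the character class carries a repetition so that the whole URL is captured rather than just its first character.

```python
import re

# http:// or https:// followed by anything that is not whitespace,
# a double quote, a parenthesis or an angle bracket.
URL_PATTERN = re.compile(r"https?://[^\s\"()<>]+")


def extract_links(log_text):
    """Return the URLs found in log_text, last match first.

    The reversal mirrors piping the grep output through tac.
    """
    return URL_PATTERN.findall(log_text)[::-1]


if __name__ == "__main__":
    sample = (
        '12:00 <bob> see http://example.com/a\n'
        '12:01 <amy> also (https://example.org/b)\n'
    )
    for link in extract_links(sample):
        print(link)
```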

	#!/usr/bin/env bash

	line=""
	started=""
	rm -f botfile
	mkfifo botfile
	tail -f botfile | nc irc.cat.pdx.edu 6667 | while true ; do
	if [ -z "$started" ] ; then
	echo "USER bdbot 0 bdbot :I iz a bot" > botfile
	echo "NICK bdbot" >> botfile
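The two lines written to the FIFO are the usual IRC registration messages. As a rough sketch, this Python helper builds the same pair of messages for any nick. The function name `registration_lines` is made up for this note, the messages are only built as strings (nothing is sent), and the CRLF terminator follows the IRC protocol rather than the plain newline that `echo` writes.

```python
def registration_lines(nick, realname):
    """Build the USER and NICK messages, in the order the script sends them."""
    for value in (nick, realname):
        # A line break inside a parameter would inject an extra IRC command.
        if "\r" in value or "\n" in value:
            raise ValueError("IRC parameters must not contain line breaks")
    return [
        "USER %s 0 %s :%s\r\n" % (nick, nick, realname),
        "NICK %s\r\n" % nick,
    ]


if __name__ == "__main__":
    for line in registration_lines("bdbot", "I iz a bot"):
        print(repr(line))
```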

	#!/bin/sh
	# Copyright 2014 Vivien Didelot <[email protected]>
	# Licensed under the terms of the GNU GPL v3, or any later version.

	NICK=irccat42
	SERVER=irc.freenode.net
	PORT=6667
	CHAN="#irccat"

	{