@wenhuizhang
wenhuizhang / web_crawler.md
Last active April 13, 2020 18:43
web crawler
| Name | Language | Platform |
|------|----------|----------|
| Heritrix | Java | Linux |
| Nutch | Java | Cross-platform |
| Scrapy | Python | Cross-platform |
| DataparkSearch | C++ | Cross-platform |
| GNU Wget | C | Linux |
| GRUB | C#, C, Python, Perl | Cross-platform |
| ht://Dig | C++ | Unix |
| HTTrack | C/C++ | Cross-platform |
@jbtule
jbtule / justice-will-only-be-acheived.md
Last active April 5, 2023 13:33
Tracing the origin of an internet quote

So I saw this quote on twitter

"Justice will only be achieved when those who are not injured by crime feel as indignant as those who are."
–Solomon (635-577 B.C.)

Which I thought was a really effective quote relating the

@gmcharlt
gmcharlt / del-linkroll.js
Created October 27, 2014 03:56
Resurrecting the Delicious Linkroll
/*
FILE ARCHIVED ON 8:24:05 Feb 21, 2012 AND RETRIEVED FROM THE
INTERNET ARCHIVE ON 3:48:08 Oct 27, 2014.
JAVASCRIPT APPENDED BY WAYBACK MACHINE, COPYRIGHT INTERNET ARCHIVE.
@mildred
mildred / download.sh
Created October 20, 2014 10:03
Download from archive.org Wayback Machine
#!/bin/bash
url=http://redefininggod.com
webarchive=https://web.archive.org
wget="wget -e robots=off -nv"
tab="$(printf '\t')"
additional_url=url.list
# Construct listing.txt from url.list
# The list of archived pages, including some wildcard url
@neilhawkins
neilhawkins / wayback-scraper.js
Created August 12, 2014 06:37
wayback-scraper.js
var casper = require("casper").create({
verbose: true,
logLevel: "debug"
});
var response = [], url, site, viewportWidth = 1280, viewportHeight = 1024;
var captureSize = {top: 0, left: 0, width: viewportWidth, height: viewportHeight};
// to get screenshot folder name from url
var re = /url=(\w+)\./;
@Smerity
Smerity / gist:2704d3d65aa191ff5f27
Last active May 1, 2017 19:45
About the data

Data Location

The Common Crawl dataset lives on Amazon S3 as part of the Amazon Public Datasets program. Downloading it is free from any instance on Amazon EC2, via either S3 or HTTP.

As the Common Crawl Foundation has evolved over the years, so have the format and metadata that accompany the crawls themselves.

  • [ARC] Archived Crawl #1 - s3://aws-publicdatasets/common-crawl/crawl-001/ - crawl data from 2008/2010
  • [ARC] Archived Crawl #2 - s3://aws-publicdatasets/common-crawl/crawl-002/ - crawl data from 2009/2010
  • [ARC] Archived Crawl #3 - s3://aws-publicdatasets/common-crawl/parse-output/ - crawl data from 2012
  • [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-20/
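
Since the data is described as reachable over both S3 and HTTP, the `s3://` locations listed above can be turned into HTTP URLs. The following is a minimal sketch, assuming the standard S3 virtual-hosted URL form `https://<bucket>.s3.amazonaws.com/<key>`. The helper name `s3_to_http` is hypothetical and not part of the original gist, and the sketch only rewrites the string; it does not check that the bucket still serves these paths.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the original gist): rewrite an
# s3://bucket/key URI as a virtual-hosted-style HTTPS URL.
# Assumes the URI contains a bucket name followed by "/" and a key.
s3_to_http() {
  local uri="$1"
  local rest="${uri#s3://}"     # drop the s3:// scheme
  local bucket="${rest%%/*}"    # everything before the first slash
  local key="${rest#*/}"        # everything after the first slash
  printf 'https://%s.s3.amazonaws.com/%s\n' "$bucket" "$key"
}

# Example: convert one of the crawl locations listed above.
s3_to_http "s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-20/"
```

The resulting URL can then be passed to any HTTP client, such as `wget` or `curl`.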
@Avyd
Avyd / irc_link_collector
Last active May 10, 2017 16:40
irc_link_collector
@hunner
hunner / bashbot.sh
Created May 31, 2014 01:42
Bash irc bot
#!/usr/bin/env bash
line=""
started=""
rm -f botfile
mkfifo botfile
tail -f botfile | nc irc.cat.pdx.edu 6667 | while true ; do
if [ -z "$started" ] ; then
echo "USER bdbot 0 bdbot :I iz a bot" > botfile
echo "NICK bdbot" >> botfile
@xem
xem / codegolf.md
Last active January 2, 2025 16:05
JS code golfing

codegolf JS

Mini projects by Maxime Euzière (xem), subzey, Martin Kleppe (aemkei), Mathieu Henri (p01), Litterallylara, Tommy Hodgins (innovati), Veu(beke), Anders Kaare, Keith Clark, Addy Osmani, bburky, rlauck, cmoreau, maettig, thiemowmde, ilesinge, adlq, solinca, xen_the,...

(For more info and other projects, visit http://xem.github.io)

(Official Slack room: http://jsgolf.club / join us on http://register.jsgolf.club)

@vivien
vivien / irccat
Created May 2, 2014 00:38
irccat - Using netcat with an IRC channel
#!/bin/sh
# Copyright 2014 Vivien Didelot <[email protected]>
# Licensed under the terms of the GNU GPL v3, or any later version.
NICK=irccat42
SERVER=irc.freenode.net
PORT=6667
CHAN="#irccat"
{