@whym
whym / format.py
Last active January 16, 2016 08:43
extract red-linked pages with the highest numbers of incoming links (for MediaWiki/Wikimedia)
#! /usr/bin/env python
# -*- coding: utf-8 -*-
import fileinput
from datetime import datetime
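# Timestamp in UTC, since the format string ends with a literal 'Z'.
print('<!-- generated: %s -->' % datetime.utcnow().strftime('%Y-%m-%dT%H:%M:%SZ'))
for line in fileinput.input():
    # Each input line is tab-separated: namespace, page title, incoming-link count.
    ns, page, n = line.strip().split('\t')
    print('# [[%s]] ([[特別:Whatlinkshere/%s|%s 個のリンク]])' % (page, page, n))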
whym / download_dumps.rb
Created January 4, 2012 04:10
download Wikimedia rev diff dumps, giving a different limit rate depending on day/night, and output md5sum to stdout
#! /usr/bin/env ruby
# download Wikimedia rev diff dumps, giving a different limit rate depending on day/night, and output md5sum to stdout
require 'open-uri'
require 'optparse'
require 'time'
USAGE= <<'END'
usage: download.rb http://dumps.wikimedia.org/enwiki/20111201/ --day-limit-rate=500k > checksums.txt
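The description above mentions two mechanisms: picking a download rate limit by time of day, and emitting an MD5 checksum per file. A minimal sketch of both follows, in Python rather than the gist's Ruby. The hour boundaries, the night rate, and the function names are assumptions for illustration; only `--day-limit-rate=500k` appears in the original usage line.

```python
import hashlib
from datetime import datetime

# Hypothetical day/night boundaries; the original script's values are not shown.
DAY_START_HOUR = 8
DAY_END_HOUR = 20


def pick_limit_rate(hour, day_rate="500k", night_rate="2m"):
    """Return the rate-limit string to use for the given hour (0-23)."""
    if DAY_START_HOUR <= hour < DAY_END_HOUR:
        return day_rate
    return night_rate


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # Show which limit would apply right now.
    print(pick_limit_rate(datetime.now().hour))
```

Reading in chunks keeps memory use flat, which matters for dump files that can run to many gigabytes.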
whym / split_revision_diffs.py
Created October 17, 2011 07:29
splitting revision diffs
#! /usr/bin/env python
# -*- coding: utf-8 -*-
# splitting revision diffs into files whose file names are the revision IDs
# see http://meta.wikimedia.org/wiki/WSoR_datasets/revision_diff for the input
import csv
import argparse
import sys
import os
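The preview stops at the imports, so the splitting step itself is not shown. One way it could look is sketched below, assuming tab-separated input with the revision ID in the first column and the diff text in the last; that column layout and the name `split_by_revision` are assumptions, not taken from the original script or the dataset page.

```python
import csv
import os


def split_by_revision(rows, out_dir, id_column=0, diff_column=-1):
    """Write each row's diff to out_dir/<revision ID>; return the paths written."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for row in rows:
        if not row:
            continue  # skip blank lines
        path = os.path.join(out_dir, row[id_column])
        with open(path, "w", encoding="utf-8") as f:
            f.write(row[diff_column])
        written.append(path)
    return written


if __name__ == "__main__":
    import sys
    # Hypothetical invocation: python split.py < revision_diffs.tsv
    reader = csv.reader(sys.stdin, delimiter="\t")
    split_by_revision(reader, "diffs")
```

Very long diff fields may exceed the `csv` module's default field size limit; `csv.field_size_limit` can raise it if needed.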
whym / catscan_rewrite.php
Created July 4, 2011 18:19
a category query tool for Wikimedia wikis, forked from CatScan 2.0β: https://fisheye.toolserver.org/browse/Magnus/catscan_rewrite.php?r=117
<?PHP
/*
Rewrite of CatScan for Wikimedia Deutschland
(c) 2009 by Magnus Manske
Released under GPL
*/
error_reporting(E_ERROR|E_CORE_ERROR|E_ALL|E_COMPILE_ERROR);
ini_set('display_errors', 'On');
whym / acl2011-1st-pages.sh
Created June 16, 2011 21:45
download every PDF of ACL-HLT 2011, extract the first page of each, and join them into one PDF. pdfjam is required.
for i in `seq_p 1001 5006`; do wget http://www.aclweb.org/anthology/P/P11/P11-$i.pdf ; done && \
pdfjam P11-*.pdf 1
whym / anpi_geocode.py
Created March 17, 2011 07:11
a script that adds latitude and longitude information to the <location> tags of ANPI NLP
#! /usr/bin/env python
# -*- coding: utf-8 -*-
# assign latitude and longitude for <location> tags
# Usage:
#
# 1. Save this file as anpi_geocode.py.
# 2. Install http://code.google.com/p/geopy/ (with Python's easy_install, or from source).
# 3. Assign the API key obtained from http://code.google.com/intl/ja/apis/maps/signup.ht to the variable apikey.
whym / LocalSettings.JST.php
Created March 16, 2011 06:06
settings to append to MediaWiki's LocalSettings.php to make JST the default time zone
# see also http://www.mediawiki.org/wiki/Manual:Timezone#Primary_Method
#Set Default Timezone
$wgLocaltimezone = "Asia/Tokyo";
$oldtz = getenv("TZ");
putenv("TZ=$wgLocaltimezone");
# Versions before 1.7.0 used $wgLocalTZoffset as hours.
# After 1.7.0 offset as minutes
$wgLocalTZoffset = date("Z") / 60;
putenv("TZ=$oldtz");
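The key arithmetic in the snippet above is `date("Z") / 60`: PHP's `date("Z")` gives the UTC offset in seconds, and dividing by 60 yields the minutes that `$wgLocalTZoffset` expects from 1.7.0 on. The same calculation is sketched below in Python. The fixed UTC+9 `timezone` object is a stand-in for "Asia/Tokyo" (JST has no daylight saving time), and `offset_minutes` is a name introduced here for illustration.

```python
from datetime import datetime, timedelta, timezone

# Stand-in for "Asia/Tokyo": JST is a fixed UTC+9 offset.
JST = timezone(timedelta(hours=9), "JST")


def offset_minutes(tz, when=None):
    """Return the UTC offset of tz in minutes at the given moment."""
    when = when or datetime.now(tz)
    offset = tz.utcoffset(when)
    return int(offset.total_seconds()) // 60


print(offset_minutes(JST))  # 32400 seconds / 60 = 540
```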
whym / bgrep.cc
Created December 27, 2010 06:11
bgrep: a binary grep for fixed-size blocks
#include <iostream>
#include <sstream>
#include <fstream>
#include <cstring>
#include <cstdlib>
#define PREFIX "/tmp/frag."
#define BLOCKSIZE 16384
const char* bytesbytes(const char* a, const char* b, size_t as, size_t bs) {
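The preview ends before the search logic, so the following is only one plausible reading of "a binary grep for fixed-size blocks", written in Python rather than C++: cut the input into `BLOCKSIZE`-byte blocks and report which blocks contain the pattern. The function name `matching_blocks` and the decision to ignore matches that straddle a block boundary are assumptions, not taken from bgrep.cc.

```python
BLOCKSIZE = 16384  # same value as the BLOCKSIZE macro in the snippet above


def matching_blocks(data, pattern, blocksize=BLOCKSIZE):
    """Yield (block_index, offset_in_block) for the first match in each block.

    Matches that span two blocks are deliberately not reported in this sketch.
    """
    for start in range(0, len(data), blocksize):
        block = data[start:start + blocksize]
        pos = block.find(pattern)
        if pos != -1:
            yield start // blocksize, pos


if __name__ == "__main__":
    sample = b"a" * BLOCKSIZE + b"xxNEEDLE"
    for block_index, offset in matching_blocks(sample, b"NEEDLE"):
        print(block_index, offset)
```

`bytes.find` plays the role that a helper like `bytesbytes` (a byte-string search over explicit lengths) would play in the C++ version.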
whym / leechblock-dnsmasq.rb
Created October 28, 2010 05:58
insert dummy DNS entries for the domains specified in LeechBlock for Firefox
#!/usr/bin/env ruby
# insert dummy DNS entries for the domains specified in LeechBlock for Firefox
# usage: use this as a cron job
USAGE = <<"EOD"
usage: #{$0} [-f leechblock_exported_file]
EOD
whym / annotation.rb
Created May 22, 2010 15:19
classes for manipulating stand-off annotation
#! /usr/bin/env ruby
# -*- coding: utf-8; mode: ruby -*-
# TODO: rewrite referring Range
class Annotation
  include Comparable
  # Readable and writable attributes: start offset, end offset, and tag.
  attr_accessor :start, :end, :tag