mat kelcey matpalm

index.search("europe").hits.each { |hit| puts hit.inspect }
#<struct Ferret::Search::Hit doc=6, score=0.446250796318054>
#<struct Ferret::Search::Hit doc=7, score=0.446250796318054>
#<struct Ferret::Search::Hit doc=8, score=0.446250796318054>
puts index[7].load.inspect
{:continent=>"europe", :name=>"London"}
@matpalm
matpalm / gist:923435
Created April 16, 2011 20:01
one_article_per_line.rb
#!/usr/bin/env ruby
article = ''
STDIN.each do |line|
begin
line.chomp!
if line == '---END.OF.DOCUMENT---'
puts "0\t#{article}"
article = ''
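The preview above is cut off, but its visible part reads stdin line by line and emits one tab-prefixed record each time it sees the `---END.OF.DOCUMENT---` marker. A rough Python sketch of that idea follows. The function name `one_article_per_line` is made up, and joining the buffered lines with a single space and dropping empty lines are assumptions, since the accumulating branch is not shown in the preview.

```python
END_MARKER = "---END.OF.DOCUMENT---"

def one_article_per_line(lines):
    """Yield one "0<TAB>article" record per END_MARKER seen in lines."""
    parts = []
    for line in lines:
        line = line.rstrip("\n")
        if line == END_MARKER:
            # Same record shape as the Ruby preview: a literal 0, a tab, the article.
            yield "0\t" + " ".join(parts)
            parts = []
        elif line:
            # Assumption: non-empty lines are buffered and later joined by a space.
            parts.append(line)

sample = ["first line\n", "second line\n", END_MARKER + "\n", "other doc\n", END_MARKER + "\n"]
for record in one_article_per_line(sample):
    print(record)
```

Writing it as a generator means only one article is held in memory at a time, which matters for large dumps.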
matpalm / gist:1057825
Created July 1, 2011 03:40
hadoop 0.18 counter grammar
# "a.b:1" => [["a", "b", 1]]
# "a.b:1,c.d:2" => [["a", "b", 1], ["c", "d", 2]]
grammar Hadoop18Counters
rule counter_records
counter_record ("," counter_record)* {
def to_list
items = elements[0].to_list
if elements[1]
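The two comment lines at the top of the grammar spell out the intended mapping from a counter string to a list of triples. Here is a minimal Python sketch of that same mapping using plain string splitting rather than a grammar. `parse_counters` is a hypothetical name, and the sketch assumes group and counter names never contain `.`, `:` or `,`, which the truncated preview does not confirm.

```python
def parse_counters(text):
    """Turn "a.b:1,c.d:2" into [["a", "b", 1], ["c", "d", 2]]."""
    records = []
    for record in text.split(","):
        # "a.b:1" -> key "a.b", value "1"
        key, _, value = record.partition(":")
        # "a.b" -> group "a", counter name "b"
        group, _, name = key.partition(".")
        records.append([group, name, int(value)])
    return records

print(parse_counters("a.b:1"))
print(parse_counters("a.b:1,c.d:2"))
```

Real counter names with embedded punctuation would likely break the splitting approach, which may be the reason for reaching for a grammar in the first place.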
matpalm / gist:1126892
Created August 5, 2011 03:56
download freebase via fifo
mkfifo articles
hadoop fs -copyFromLocal articles /full/articles-2011-07-08/freebase-wex-2011-07-08-articles.tsv &
curl http://download.freebase.com/wex/2011-07-08/freebase-wex-2011-07-08-articles.tsv.bz2 | bunzip2 > articles &
#win
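The trick in the commands above is that the named pipe lets the decompressed stream go straight from the producer (`curl | bunzip2`) to the consumer (`hadoop fs -copyFromLocal`) without the full file ever sitting on local disk. Below is a small Python sketch of that handoff only: an in-memory bz2 payload stands in for the download and a plain read stands in for the Hadoop copy. `stream_through_fifo` is a hypothetical helper, and `os.mkfifo` is POSIX only.

```python
import bz2
import os
import tempfile
import threading

def stream_through_fifo(compressed):
    """Decompress in a producer thread and read the result through a named pipe."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "articles")
        os.mkfifo(path)  # same role as `mkfifo articles`

        def producer():
            # Stand-in for `curl ... | bunzip2 > articles`.
            with open(path, "wb") as writer:
                writer.write(bz2.decompress(compressed))

        thread = threading.Thread(target=producer)
        thread.start()
        # Stand-in for the consumer; opening blocks until the writer opens too.
        with open(path, "rb") as reader:
            data = reader.read()
        thread.join()
        return data

payload = bz2.compress(b"id\ttitle\n1\tLondon\n")
print(stream_through_fifo(payload))
```

As with the shell version, each side blocks until the other end of the pipe is open, so the order in which the two are started does not matter.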
matpalm / epic fail
Created February 8, 2012 01:01
epic fail, time to go home
print "weighted cost", weighted_cost
print "unweighted cost", weighted_cost
matpalm / url_crawl_freq_freq.tsv
Created February 9, 2012 04:46
url crawl freq freq
items in common crawl with a mime type text/html with at least one byte of visible text
see https://github.com/matpalm/common-crawl/tree/master/analysis
eg there are 880,125,891 urls that were crawled once, 182,752,019 that were crawled twice, etc
times crawled freq
1 880125891
2 182752019
3 44573683
4 9448470
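A frequency-of-frequencies table like this is easy to summarise. The sketch below uses only the four rows visible here (the table may well continue beyond them, so the totals are lower bounds) to count distinct urls, total fetches, and the share of urls crawled exactly once. The variable names are made up for illustration.

```python
# (times crawled, number of urls crawled that many times), rows copied from above
rows = [
    (1, 880125891),
    (2, 182752019),
    (3, 44573683),
    (4, 9448470),
]

# Each url appears in exactly one row, so summing freq counts distinct urls.
distinct_urls = sum(freq for _, freq in rows)
# A url crawled k times contributes k fetches.
total_fetches = sum(times * freq for times, freq in rows)
share_once = rows[0][1] / distinct_urls

print("distinct urls:", distinct_urls)
print("total fetches:", total_fetches)
print("share crawled once: %.1f%%" % (100 * share_once))
```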
mat@matpc:/tmp$ echo "thе" | hexdump -C
00000000 74 68 d0 b5 0a |th...|
00000005
mat@matpc:/tmp$ echo "the" | hexdump -C
00000000 74 68 65 0a |the.|
00000004
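The two hexdumps differ because the first "thе" ends in a Cyrillic letter (bytes `d0 b5` in UTF-8) that merely looks like the Latin "e" (byte `65`). A short Python sketch for spotting such lookalikes follows; `describe_non_ascii` is a hypothetical helper name, and `bytes.hex` with a separator needs Python 3.8 or later.

```python
import unicodedata

def describe_non_ascii(text):
    """List (char, code point, Unicode name) for every non-ASCII character."""
    return [
        (ch, "U+%04X" % ord(ch), unicodedata.name(ch, "UNKNOWN"))
        for ch in text
        if ord(ch) > 127
    ]

lookalike = "th\u0435"  # ends in CYRILLIC SMALL LETTER IE
plain = "the"           # ends in LATIN SMALL LETTER E

print(lookalike.encode("utf-8").hex(" "))  # 74 68 d0 b5
print(plain.encode("utf-8").hex(" "))      # 74 68 65
print(describe_non_ascii(lookalike))
print(lookalike == plain)                  # False
```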
matpalm / videos.js
Created March 8, 2012 05:34 — forked from csabapalfi/videos.js
Download Coursera videos
javascript:(function(){
  $('a.lecture-link').each(function (index){
    var $lectureLink = $(this);
    var downloadLink = $lectureLink.attr('href').replace('view','download.mp4');
    var downloadName = '\"' + (index+1) + '.' + $lectureLink.text().trim() + '.mp4\"';
    var cookieHeader = ' --header \"Cookie:'+ document.cookie + '\" ';
    console.log('curl -L ' + cookieHeader + downloadLink + ' > ' + downloadName);
  });
})();
matpalm / gist:2402945
Created April 17, 2012 02:14
puzzle.py
import sys
# wget http://www.mieliestronk.com/corncob_lowercase.zip
# unzip corncob_lowercase.zip
# echo -e "i\na" >> corncob_lowercase.txt
q = []
words = set()
for word in map(str.strip, sys.stdin):