Dan Brickley danbri

#!/usr/bin/ruby
# Script that consumes Mahout Collocations, sorts them.
# Example input: Key: 00 a.m: Value: 53.017824619466865
#
# Once we have these, we can go back and look for associations between docs and these collocations
# e.g. find . -exec grep -il 'antibiotic sensitivity' {} \;
# several occurrences of 'antibiotic resistance' in paul_ewald_asks_can_we_domesticate_germs.html
#
# See also http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html
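The body of this script is not shown in the preview, so the following is only a minimal sketch of what "consume and sort" might look like, assuming every input line follows the `Key: <collocation>: Value: <score>` shape given in the example input. The names `LINE_PATTERN`, `parse_collocation`, and `sort_collocations` are hypothetical and do not come from the original gist.

```ruby
# Hypothetical pattern for lines such as:
#   Key: 00 a.m: Value: 53.017824619466865
# The greedy (.*) keeps any colons that appear inside the collocation itself.
LINE_PATTERN = /\AKey: (.*): Value: (\S+)\s*\z/

# Returns [collocation, score] for a matching line, or nil for anything else
# (log noise, blank lines, values that are not numeric).
def parse_collocation(line)
  match = LINE_PATTERN.match(line)
  return nil unless match

  begin
    [match[1], Float(match[2])]
  rescue ArgumentError
    nil
  end
end

# Parses every line, drops the ones that do not match, and sorts the rest
# by score, highest first.
def sort_collocations(lines)
  lines.map { |line| parse_collocation(line) }
       .compact
       .sort_by { |_collocation, score| -score }
end
```

One way to use it, under the same assumptions, would be to pass `ARGF.each_line` to `sort_collocations` and print each pair as `score<TAB>collocation`, so the output can be piped into other tools.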
TellyClub:trunk danbri$ examples/bin/build-reuters.sh
Please select a number to choose the corresponding clustering algorithm
1. kmeans clustering
2. lda clustering
Enter your choice : 1
ok. You chose 1 and we'll use kmeans Clustering
11/09/13 10:24:55 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9000. Already tried 0 time(s).
11/09/13 10:30:23 INFO common.AbstractJob: Command line arguments: {--dictionary=mahout-work/reuters-out-seqdir-sparse-kmeans/dictionary.file-0, --dictionaryType=sequencefile, --endPhase=2147483647, --numWords=20, --seqFileDir=mahout-work/reuters-kmeans/clusters-10, --startPhase=0, --substring=100, --tempDir=temp}
:CL-15706{n=519 c=[0:0.014, 0.1:0.038, 0.2:0.013, 0.3:0.024, 0.4:0.013, 0.5:0.012, 0.7:0.031, 0.8:0.0
Top Terms:
vs => 7.6343705245295475
net => 4.940552704136725
mln => 4.394683003884979
shr => 4.391380775870616
cts => 4.295353677231453
loss => 4.157557884392711
oper => 3.606452024051909
#!/bin/bash
#
# The Mahout command script
#
# Environment Variables
#
# MAHOUT_JAVA_HOME The java implementation to use. Overrides JAVA_HOME.
#
# MAHOUT_HEAPSIZE The maximum amount of heap to use, in MB.
# Default is 1000.
#!/usr/bin/ruby
# Read the BBC iPlayer site, and take note of the URLs for potentially playable items
# Currently we ignore the detail of embedded JSON, and just extract pids.
sitemap = `curl -s http://www.bbc.co.uk/iplayer/sitemap.xml.gz | gunzip - | grep '<loc>'`
done = []
topics = []
sitemap.each_line do |sm|
select distinct ?d ?i ?c ?t WHERE {
?d <http://purl.org/dc/terms/subject> ?s .
?d <http://purl.org/dc/terms/title> ?t .
?d <http://purl.org/ontology/bibo/isbn> ?i .
?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#value> ?c .
}
Top Terms:
social_sciences => 6.3393223616576275
power_ => 3.2031174961782436
elite_ => 2.4367908532084943
_united_states => 0.4032046124604249
consensus_ => 0.272245196685248
functionalism_ => 0.23277774145594696
sociology => 0.18494324667173773
philosophy => 0.18243420760402476
_china => 0.17777614661383034
<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
<key attr.name="label" attr.type="string" for="node" id="label"/>
<key attr.name="Edge Label" attr.type="string" for="edge" id="edgelabel"/>
<key attr.name="weight" attr.type="double" for="edge" id="weight"/>
<key attr.name="Edge Id" attr.type="string" for="edge" id="edgeid"/>
<key attr.name="r" attr.type="int" for="node" id="r"/>
<key attr.name="g" attr.type="int" for="node" id="g"/>
<key attr.name="b" attr.type="int" for="node" id="b"/>
<key attr.name="x" attr.type="float" for="node" id="x"/>
<key attr.name="y" attr.type="float" for="node" id="y"/>
MAHOUT_LOCAL is set, so we don't add HADOOP_CONF_DIR to classpath.
MAHOUT_LOCAL is set, running locally
117,/The_Psychopath_Test__A_Journey_Through_the_Madness_Industry_1.txt,/The_Psychopath_Test__A_Journey_Through_the_Madness_Industry_100.txt,/The_Psychopath_Test__A_Journey_Through_the_Madness_Industry_101.txt,/The_Psychopath_Test__A_Journey_Through_the_Madness_Industry_102.txt,/The_Psychopath_Test__A_Journey_Through_the_Madness_Industry_103.txt,/The_Psychopath_Test__A_Journey_Through_the_Madness_Industry_104.txt,/The_Psychopath_Test__A_Journey_Through_the_Madness_Industry_109.txt,/The_Psychopath_Test__A_Journey_Through_the_Madness_Industry_110.txt,/The_Psychopath_Test__A_Journey_Through_the_Madness_Industry_111.txt,/The_Psychopath_Test__A_Journey_Through_the_Madness_Industry_118.txt,/The_Psychopath_Test__A_Journey_Through_the_Madness_Industry_119.txt,/The_Psychopath_Test__A_Journey_Through_the_Madness_Industry_120.txt,/The_Psychopath_Test__A_Journey_Through_the_Madness_Industry_121.txt,/The_Psychopath_Test_
Key: zoo story: Value: 99.5501602332726
Key: zina zina: Value: 24.524853792255954
Key: zina tells: Value: 18.198242925620775
Key: zina sammy: Value: 16.90048778841856
Key: zina joe: Value: 15.925026676875632
Key: zina her: Value: 10.134937961312062
Key: yourself your: Value: 3.7305085407751903
Key: yourself you: Value: 2.3517997572998866
Key: yourself what: Value: 2.5757062135444357
Key: yourself wandering: Value: 20.21486502949483