Dan Brickley danbri

sh build-reuters.sh
Please select a number to choose the corresponding clustering algorithm
1. kmeans clustering
2. lda clustering
Enter your choice : 1
ok. You chose 1 and we'll use kmeans Clustering
Downloading Reuters-21578
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 7959k 100 7959k 0 0 105k 0 0:01:15 0:01:15 --:--:-- 155k
<message from="[email protected]/TellyClub" type="chat" to="[email protected]/danko2" >
<body>{"id":"b008v131","pid":"b008v131", "video":"http://g.bbcredux.com/programme/bbcthree/2011-01-25/23-00-00","title":"HELLO LIBBY", "image":"http://upload.wikimedia.org/wikipedia/commons/6/6d/Rick_Astley_-_Pepsifest_2009.jpg","description":"Some description goes here", "nick":"danko2"}</body>
</message>
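A minimal sketch of how a client might unpack a message like the one above, assuming the programme metadata always arrives as a JSON object inside an un-namespaced `<body>` element. The addresses in the sample are placeholders and `parse_programme_message` is a hypothetical helper, not part of any XMPP library; a real stanza would normally carry the `jabber:client` namespace, which this sketch ignores.

```python
import json
import xml.etree.ElementTree as ET

# Sample stanza modelled on the message above; the addresses are placeholders.
STANZA = """<message from="sender@example.org/TellyClub" type="chat" to="receiver@example.org/danko2">
<body>{"id":"b008v131","pid":"b008v131","title":"HELLO LIBBY","nick":"danko2"}</body>
</message>"""


def parse_programme_message(xml_text):
    """Return (sender, recipient, payload) for a chat message with a JSON body."""
    root = ET.fromstring(xml_text)
    body = root.find("body")  # assumes no XML namespace on the stanza
    if body is None or not body.text:
        raise ValueError("message has no body")
    return root.get("from"), root.get("to"), json.loads(body.text)


sender, recipient, payload = parse_programme_message(STANZA)
print(sender, recipient, payload["title"])
```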
Script started on Fri Sep 2 13:15:22 2011
bash-3.2$ MAHOUT_LOCAL=true sh colloc-reuters.sh
./bin/mahout seqdirectory -i ./examples/bin/work/reuters-out/ -o ./examples/bin/work/reuters-out-seqdir -c UTF-8 -chunk 5
MAHOUT_LOCAL is set, so we don't add HADOOP_CONF_DIR to classpath.
MAHOUT_LOCAL is set, running locally
CLASSPATH: :/Users/danbri/working/android/sdk:/Users/danbri/working/mahout/trunk/src/conf:/System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home/lib/tools.jar:/Users/danbri/working/mahout/trunk/mahout-*.jar:/Users/danbri/working/mahout/trunk/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar:/Users/danbri/working/mahout/trunk/mahout-examples-*-job.jar:/Users/danbri/working/mahout/trunk/lib/*.jar:/Users/danbri/working/mahout/trunk/examples/target/dependency/antlr-2.7.7.jar:/Users/danbri/working/mahout/trunk/examples/target/dependency/antlr-3.2.jar:/Users/danbri/working/mahout/trunk/examples/target/dependency/antlr-runtime-3.2.jar:/Users/danbri/working/mahout/trunk/examples/target/depen
#!/bin/sh
# Running in top level directory, per http://permalink.gmane.org/gmane.comp.apache.mahout.user/5689
# via https://cwiki.apache.org/MAHOUT/collocations.html
# I've tried this from top level dir, both with and without MAHOUT_LOCAL=true set.
# In both cases, I get seemingly nothing.
#
# e.g. running with cluster I got two files, and analysing
# ./bin/mahout seqdumper -s part-r-00001 ...gives
bash-3.2$ cat miglib.pig
-- Mahout Pig integration
-- only proper piglatin can go in an imported macro; file-management, jar registration etc. has
-- to be run via .pig files.
-- We need piggybank.jar for reading Mahout's Hadoop Sequence files, plus other utilities:
--
REGISTER /Users/danbri/working/pig/pig-0.9.0/contrib/piggybank/java/piggybank.jar;
grunt> mydir = seqdirectory('ted/txt/', 'ted/foo', IGNORE);
grunt> dump mydir;
2011-09-04 17:33:58,860 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: NATIVE
2011-09-04 17:33:59,373 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2011-09-04 17:33:59,558 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 3
2011-09-04 17:33:59,558 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 3
2011-09-04 17:33:59,631 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
2011-09-04 17:33:59,654 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buf
_:genid1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> .
_:genid1 <http://xmlns.com/foaf/0.1/nick> "giul"@en .
_:genid1 <http://xmlns.com/foaf/0.1/name> "giul"@en .
_:genid1 <http://www.livejournal.org/rss/lj/1.0/journaltitle> "Eariel - t.A.T.u. Live Journal"@en .
_:genid1 <http://xmlns.com/foaf/0.1/openid> <http://giul.livejournal.com/> .
<http://www.livejournal.com/directory.bml?opt_sort=ut&s_loc=1&loc_cn=IT> <http://purl.org/dc/elements/1.1/title> "IT" .
_:genid1 <http://blogs.yandex.ru/schema/foaf/country> <http://www.livejournal.com/directory.bml?opt_sort=ut&s_loc=1&loc_cn=IT> .
<http://www.livejournal.com/directory.bml?opt_sort=ut&s_loc=1&loc_cn=IT&loc_st=&loc_ci=Rome> <http://purl.org/dc/elements/1.1/title> "Rome" .
_:genid1 <http://blogs.yandex.ru/schema/foaf/city> <http://www.livejournal.com/directory.bml?opt_sort=ut&s_loc=1&loc_cn=IT&loc_st=&loc_ci=Rome> .
_:genid1 <http://xmlns.com/foaf/0.1/img> <http://l-userpic.livejournal.com/94039030/23437353> .
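A rough sketch for reading triples like those above and grouping them by subject, assuming one triple per line with whitespace-separated terms. The regular expression is a simplification and not a conforming N-Triples parser (it does not validate escapes or datatypes), and `group_by_subject` is a hypothetical helper name.

```python
import re
from collections import defaultdict

# Subject and predicate contain no spaces; the object is everything up to the final " ."
TRIPLE = re.compile(r'^(\S+)\s+(\S+)\s+(.+?)\s*\.$')

# Three lines taken from the dump above.
SAMPLE = '''_:genid1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> .
_:genid1 <http://xmlns.com/foaf/0.1/nick> "giul"@en .
_:genid1 <http://www.livejournal.org/rss/lj/1.0/journaltitle> "Eariel - t.A.T.u. Live Journal"@en .'''


def group_by_subject(text):
    """Map each subject to a list of (predicate, object) pairs."""
    grouped = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        match = TRIPLE.match(line)
        if match is None:
            raise ValueError("not a triple: " + line)
        subject, predicate, obj = match.groups()
        grouped[subject].append((predicate, obj))
    return grouped


triples = group_by_subject(SAMPLE)
for predicate, obj in triples["_:genid1"]:
    print(predicate, obj)
```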
TellyClub:trunk danbri$ sh spectral.sh
Running on hadoop, using HADOOP_HOME=/Users/danbri/working/hadoop/hadoop-0.20.2
HADOOP_CONF_DIR=/Users/danbri/working/hadoop/hadoop-0.20.2/conf
MAHOUT-JOB: /Users/danbri/working/mahout/trunk/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
11/09/07 14:22:46 WARN driver.MahoutDriver: No spectralkmeans.props found on classpath, will use command-line arguments only
11/09/07 14:22:46 INFO common.AbstractJob: Command line arguments: {--clusters=2, --convergenceDelta=0.5, --dimensions=37, --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, --endPhase=2147483647, --input=speccy, --maxIter=10, --output=specout, --startPhase=0, --tempDir=temp}
11/09/07 14:22:46 INFO common.HadoopUtil: Deleting specout/calculations/seqfile-248
11/09/07 14:22:47 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
11/09/07 14:22:51 INFO input.FileInputFormat: Total input paths to process : 2
Index: core/src/main/java/org/apache/mahout/clustering/spectral/common/VectorMatrixMultiplicationJob.java
===================================================================
--- core/src/main/java/org/apache/mahout/clustering/spectral/common/VectorMatrixMultiplicationJob.java (revision 1163723)
+++ core/src/main/java/org/apache/mahout/clustering/spectral/common/VectorMatrixMultiplicationJob.java (working copy)
@@ -78,6 +78,9 @@
FileInputFormat.addInputPath(job, markovPath);
FileOutputFormat.setOutputPath(job, outputPath);
+
+ job.setJarByClass(VectorMatrixMultiplicationJob.class);
TellyClub:trunk danbri$ sh spectral.sh
Running on hadoop, using HADOOP_HOME=/Users/danbri/working/hadoop/hadoop-0.20.2
HADOOP_CONF_DIR=/Users/danbri/working/hadoop/hadoop-0.20.2/conf
MAHOUT-JOB: /Users/danbri/working/mahout/trunk/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
11/09/07 14:37:49 WARN driver.MahoutDriver: No spectralkmeans.props found on classpath, will use command-line arguments only
11/09/07 14:37:49 INFO common.AbstractJob: Command line arguments: {--clusters=2, --convergenceDelta=0.5, --dimensions=37, --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, --endPhase=2147483647, --input=speccy, --maxIter=10, --output=specout, --startPhase=0, --tempDir=temp}
11/09/07 14:37:50 INFO common.HadoopUtil: Deleting specout/calculations/seqfile-112
11/09/07 14:37:50 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
11/09/07 14:37:51 INFO input.FileInputFormat: Total input paths to process :