@danbri
Created September 2, 2011 11:18
#!/bin/sh
# Running in top level directory, per http://permalink.gmane.org/gmane.comp.apache.mahout.user/5689
# via https://cwiki.apache.org/MAHOUT/collocations.html
# I've tried this from the top-level dir, both with and without MAHOUT_LOCAL=true set.
# In both cases I get seemingly no output.
#
# e.g. running on a cluster I got two output files, and analysing one with
# ./bin/mahout seqdumper -s part-r-00001 ...gives
# Input Path: part-r-00001
# Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.hadoop.io.DoubleWritable
# Count: 0
M=./bin/mahout
echo $M seqdirectory -i ./examples/bin/work/reuters-out/ -o ./examples/bin/work/reuters-out-seqdir -c UTF-8 -chunk 5
$M seqdirectory -i ./examples/bin/work/reuters-out/ -o ./examples/bin/work/reuters-out-seqdir -c UTF-8 -chunk 5
echo
echo $M seq2sparse -i ./examples/bin/work/reuters-out-seqdir/ -o ./examples/bin/work/reuters-out-seqdir-sparse
$M seq2sparse -i ./examples/bin/work/reuters-out-seqdir/ -o ./examples/bin/work/reuters-out-seqdir-sparse
echo
# Note: -o is given exactly once here; a stray trailing "-o" before the line
# continuation would pass "-o" itself as the output path argument.
echo $M org.apache.mahout.vectorizer.collocations.llr.CollocDriver -i ./examples/bin/work/reuters-out-seqdir-sparse/tokenized-documents \
-o ./examples/bin/work/reuters-colloc -a org.apache.mahout.vectorizer.DefaultAnalyzer -ng 3
$M org.apache.mahout.vectorizer.collocations.llr.CollocDriver -i ./examples/bin/work/reuters-out-seqdir-sparse/tokenized-documents \
-o ./examples/bin/work/reuters-colloc -a org.apache.mahout.vectorizer.DefaultAnalyzer -ng 3
echo
$M seqdumper -s ./examples/bin/work/reuters-colloc/ngrams/part-r-00000
# | less
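
# A hedged sketch, not part of the original run: the notes at the top mention
# two output files but only one part file is dumped above, so it may help to
# dump every reducer output rather than a single hard-coded one.
# list_part_files is a hypothetical helper introduced here for illustration;
# it assumes the output files follow the part-r-NNNNN naming seen above.
list_part_files() {
    # Print each regular file named part-r-* under directory $1, one per line.
    for f in "$1"/part-r-*; do
        [ -f "$f" ] && echo "$f"
    done
    return 0
}

# Dump each part file in turn; an empty one should report "Count: 0".
# ${M:-./bin/mahout} falls back to the launcher path used earlier in this script.
for part in $(list_part_files ./examples/bin/work/reuters-colloc/ngrams); do
    echo "== $part"
    ${M:-./bin/mahout} seqdumper -s "$part"
done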