Created
July 28, 2011 00:10
Commands from Mahout workshop Pt 1 at OSCON2011
# See http://www.oscon.com/oscon2011/public/schedule/detail/18836 for getting Mahout set up
# Get Reuters data
# (-O names the download explicitly, since wget may otherwise save it under the short URL's name)
wget -O reuters21578.tar.gz http://goo.gl/qv6Ad
mkdir reuters-out
mv reuters21578.tar.gz reuters-out
cd reuters-out
tar -xzvf reuters21578.tar.gz
cd ..
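# Optional sanity check (a sketch, not part of the original workshop commands).
# The next step reads the .sgm files unpacked into reuters-out, so it can help
# to confirm that the archive really extracted before going further.
# count_sgm_files is a hypothetical helper name; it prints the number of
# *.sgm files directly inside the given directory.
count_sgm_files() {
  find "$1" -maxdepth 1 -type f -name '*.sgm' | wc -l | tr -d '[:space:]'
}
# Possible usage:
#   [ "$(count_sgm_files reuters-out)" -gt 0 ] || echo "no .sgm files in reuters-out" >&2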
# Mahout steps
# Split out text from the SGM files
bin/mahout org.apache.lucene.benchmark.utils.ExtractReuters reuters-out reuters-text
# Create sequence files
bin/mahout seqdirectory -i reuters-text -o reuters-seqfiles -c UTF-8 -chunk 5
# Look at sequence files
bin/mahout seqdumper -s reuters-seqfiles/chunk-0 | less
# seq2sparse converts the sequence files into sparse TF-IDF vectors
bin/mahout seq2sparse -i reuters-seqfiles/ -o reuters-vectors -wt tfidf
# Perform k-means (-c: initial centers, -cd: convergence delta, -k: number of clusters, -x: max iterations)
bin/mahout kmeans -i reuters-vectors/tfidf-vectors/ -o mahout-clusters -c mahout-initial-centers -cd 0.1 -k 20 -x 10 -ow
# Look at output
bin/mahout clusterdump -s mahout-clusters/clusters-10/part-r-00000 -d reuters-vectors/dictionary.file-0 -dt sequencefile -b 100 -n 20
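# Optional sketch, not part of the original workshop commands.
# The clusterdump path above assumes the run used all 10 iterations; if k-means
# converges before the -x limit, the last directory may have a lower number.
# latest_clusters_dir is a hypothetical helper. It assumes the output lives on
# the local filesystem (not HDFS) and that directories are named clusters-<number>.
latest_clusters_dir() {
  base="$1"
  best=""
  best_n=-1
  for d in "$base"/clusters-*; do
    [ -d "$d" ] || continue
    n="${d##*/clusters-}"
    case "$n" in
      ''|*[!0-9]*) continue ;;   # skip names whose suffix is not purely numeric
    esac
    if [ "$n" -gt "$best_n" ]; then
      best_n="$n"
      best="$d"
    fi
  done
  # Print the highest-numbered directory; return non-zero if none was found
  [ -n "$best" ] && printf '%s\n' "$best"
}
# Possible usage:
#   bin/mahout clusterdump -s "$(latest_clusters_dir mahout-clusters)/part-r-00000" -d reuters-vectors/dictionary.file-0 -dt sequencefile -b 100 -n 20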