# Mahout cheat-sheet
# Dump clusters to a human-readable text file
mahout clusterdump \
-dt sequencefile \ # dictionary file format: {Integer => String}
-d reuters-vectors/dictionary.file-* \ # dictionary: {id => word}
-i reuters-kmeans-clusters/clusters-3-final \ # input: final cluster directory
-o clusters.txt \ # output (local filesystem)
-b 10 \ # max length (chars) of the printed cluster representation
-n 10 \ # number of top terms to print per cluster
--distanceMeasure org.apache.mahout.common.distance.CosineDistanceMeasure # default is Euclidean distance
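# A rough sketch of also dumping which documents fell into each cluster: clusterdump can read
# the clusteredPoints directory written by kmeans -cl via --pointsDir (-p); paths are
# illustrative and reuse the k-means example below
mahout clusterdump \
-dt sequencefile \
-d reuters-vectors/dictionary.file-* \
-i reuters-kmeans-clusters/clusters-3-final \
-p reuters-kmeans-clusters/clusteredPoints \ # vector-to-cluster assignments (requires kmeans -cl)
-n 10 \
-o clusters-with-points.txt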
# K-means clustering
mahout kmeans \
-i ".../tfidf-vectors" \ # input vectors (output of seq2sparse)
-o ".../kmeans-clusters" \ # output dir for clusters
-c ".../kmeans-centroids" \ # initial centroids dir; with --numClusters set, random seeds are written here
--numClusters 500 \ # k: number of clusters
--distanceMeasure org.apache.mahout.common.distance.CosineDistanceMeasure \
--maxIter 100 \
--convergenceDelta 1.0 \ # convergence threshold
-ow \ # overwrite output dir
-cl # assign points to clusters after the final iteration (writes clusteredPoints)
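# One way to seed -c above: run canopy clustering first and use its centroids instead of
# random ones (a rough sketch; the t1/t2 thresholds and paths are illustrative, and the
# centroid directory name may differ between Mahout versions, e.g. clusters-0 vs clusters-0-final)
mahout canopy \
-i ".../tfidf-vectors" \ # same tf-idf vectors as above
-o ".../canopy-centroids" \ # canopy centroids, used to seed k-means
-dm org.apache.mahout.common.distance.CosineDistanceMeasure \
-t1 0.5 \ # loose distance threshold
-t2 0.3 # tight distance threshold (t2 < t1)
# then run kmeans with -c ".../canopy-centroids/clusters-0-final" and without --numClusters,
# so the canopy centroids are used as the initial clusters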
# Sequence file vectorization
mahout seq2sparse \
-i reuters-seqfiles \ # input sequence file
-o reuters-normalized-bigram \ # output dir
-ow \ # overwrite output dir
-a org.apache.lucene.analysis.WhitespaceAnalyzer \ # analyzer (tokenizer)
-chunk 200 \ # chunk size (MB)
-wt tfidf \ # weighting scheme
-s 5 \ # minimum support
-md 3 \ # minimum document frequency
-x 90 \ # maximum document frequency percentage
-ng 2 \ # ngram size
-ml 50 \ # minimum log likelihood ratio
-seq \ # create sequential access sparse vectors (good for kmeans)
-n 2 \ # normalization - use 2-norm (aka Euclidean norm)
-nv # output named vectors, so each document can be identified
    # in the clusteredPoints folder after clustering
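# The reuters-seqfiles input above can be produced from a directory of plain text files with
# seqdirectory (a minimal sketch; the reuters-extracted path is illustrative)
mahout seqdirectory \
-i reuters-extracted \ # directory of plain text files, one document per file
-o reuters-seqfiles \ # output SequenceFile(s) of {docId => document text}
-c UTF-8 # input character encoding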
# Human-readable dump of a sequence file (e.g. the clusteredPoints written by kmeans -cl)
mahout seqdumper \
-i /research/corpora-clustering/kmeans-clusters/clusteredPoints/part-m-00000
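# Typical end-to-end order of the commands above (a rough sketch; directory names are
# illustrative and reuse the examples in this sheet)
mahout seqdirectory -c UTF-8 -i reuters-extracted -o reuters-seqfiles
mahout seq2sparse -i reuters-seqfiles -o reuters-vectors -wt tfidf -nv
mahout kmeans -i reuters-vectors/tfidf-vectors -c reuters-kmeans-centroids \
  -o reuters-kmeans-clusters --numClusters 20 --maxIter 10 -cl -ow
mahout clusterdump -dt sequencefile -d reuters-vectors/dictionary.file-* \
  -i reuters-kmeans-clusters/clusters-*-final -o clusters.txt
mahout seqdumper -i reuters-kmeans-clusters/clusteredPoints/part-m-00000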