@gavinmh
Last active December 24, 2015 04:19
mrlda-hadoop

Setup

bin/hadoop dfs -mkdir /home/hduser/raw_text
bin/hadoop dfs -mkdir /home/hduser/index
bin/hadoop dfs -mkdir /home/hduser/output
bin/hadoop dfs -copyFromLocal /home/gavin/dev/Mr.LDA/data/corpus1.txt /home/hduser/raw_text

Tokenize and Index

bin/hadoop jar /home/gavin/dev/Mr.LDA/bin/Mr.LDA-0.0.1.jar cc.mrlda.ParseCorpus -input /home/hduser/raw_text -output /home/hduser/indexed -mapper 2 -reducer 1

LDA

bin/hadoop jar /home/gavin/dev/Mr.LDA/bin/Mr.LDA-0.0.1.jar cc.mrlda.VariationalInference -input /home/hduser/indexed/document -output /home/hduser/output -term 2000 -topic 3 -iteration 50 -mapper 4 -reducer 1
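
The two steps above share several HDFS paths, so it may help to define each path once. The following is only a sketch: the variable names (JAR, BASE, RAW, INDEXED, OUT, HADOOP_CMD) and the run_pipeline function are hypothetical, and it assumes the corpus sits in the raw_text directory created in Setup and that the index is written to /home/hduser/indexed. By default it only prints the commands; setting HADOOP_CMD=bin/hadoop would run them.

```shell
#!/bin/sh
# Hypothetical variable names; the values mirror the paths used in this gist.
JAR=/home/gavin/dev/Mr.LDA/bin/Mr.LDA-0.0.1.jar
BASE=/home/hduser
RAW="$BASE/raw_text"      # corpus copied here during Setup
INDEXED="$BASE/indexed"   # assumed output of ParseCorpus
OUT="$BASE/output"        # output of VariationalInference

# Dry run by default: the commands are printed, not executed.
# Set HADOOP_CMD=bin/hadoop to run them for real.
HADOOP_CMD="${HADOOP_CMD:-echo bin/hadoop}"

run_pipeline() {
  # Tokenize and index the raw corpus.
  $HADOOP_CMD jar "$JAR" cc.mrlda.ParseCorpus \
    -input "$RAW" -output "$INDEXED" -mapper 2 -reducer 1
  # Run variational inference on the indexed documents.
  $HADOOP_CMD jar "$JAR" cc.mrlda.VariationalInference \
    -input "$INDEXED/document" -output "$OUT" \
    -term 2000 -topic 3 -iteration 50 -mapper 4 -reducer 1
}

run_pipeline
```

Printing the commands first makes it easy to spot a mismatch between the -output of one step and the -input of the next before submitting any job.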

Distributions over terms for topics

bin/hadoop jar /home/gavin/dev/Mr.LDA/bin/Mr.LDA-0.0.1.jar cc.mrlda.DisplayTopic -input /home/hduser/output/beta-30 -index /home/hduser/indexed/term -topdisplay 20

Distributions over topics for documents

bin/hadoop jar /home/gavin/dev/Mr.LDA/bin/Mr.LDA-0.0.1.jar cc.mrlda.DisplayDocument -input /home/hduser/output/gamma-30

Other

Reading a sequence file

bin/hadoop jar /home/gavin/dev/Mr.LDA/bin/Mr.LDA-0.0.1.jar edu.umd.cloud9.io.ReadSequenceFile /home/hduser/indexed/term