rjb - Ruby Java Bridge - Loading multiple .jar files, for example, Boilerpipe and Stanford parts of speech tagger
## rjb - Ruby Java Bridge - Loading multiple .jar files, for example, Boilerpipe and the Stanford part-of-speech tagger
## Works in Ruby 1.9.3 and Nginx with Unicorn. Doesn't work in Phusion Passenger. In Unicorn, preload_app has to be false (the default).
## The 'java' folder is in the Rails root folder.
## http://rjb.rubyforge.org/
## For the Stanford PoS tagger, the ../java/ folder contains: stanford-postagger.jar, left3words-wsj-0-18.tagger
## http://nlp.stanford.edu/software/tagger.shtml
## For Boilerpipe, the ../java/ folder contains: boilerpipe-1.2.0.jar, nekohtml-1.9.13.jar, xerces-2.9.1.jar
## http://code.google.com/p/boilerpipe/
## http://nekohtml.sourceforge.net/
## http://xerces.apache.org/xerces2-j/
## -- Initialize --
## I have this as a ../config/initializers/java.rb file in Rails (not sure if that's the best way).
## The classpath entries are separated by ':' (the Unix separator).
## The '-Xmx1024m' param at the end sets the maximum heap size (here 1024 MB) given to Java.
require 'rjb'
Rjb::load("java/stanford-postagger.jar:java/xerces-2.9.1.jar:java/nekohtml-1.9.13.jar:java/boilerpipe-1.2.0.jar", ['-Xmx1024m'])
MaxentTagger = Rjb::import('edu.stanford.nlp.tagger.maxent.MaxentTagger')
StringReader = Rjb::import('java.io.StringReader')
PosTagger = MaxentTagger.new('java/left3words-wsj-0-18.tagger')
ArticleSentencesExtractor = Rjb::import('de.l3s.boilerpipe.extractors.ArticleSentencesExtractor')
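## A minimal sketch of building the classpath string from a list of jar names instead of
## writing it by hand. The names build_classpath, JAR_NAMES and CLASSPATH are hypothetical
## (they are not part of rjb). File::PATH_SEPARATOR is ':' on Unix and ';' on Windows.

```ruby
# Jar files listed in the notes above.
JAR_NAMES = %w[
  stanford-postagger.jar
  xerces-2.9.1.jar
  nekohtml-1.9.13.jar
  boilerpipe-1.2.0.jar
].freeze

# Hypothetical helper, not part of rjb: joins jar paths into one classpath string.
def build_classpath(jar_names, dir = 'java')
  jar_names.map { |name| File.join(dir, name) }.join(File::PATH_SEPARATOR)
end

CLASSPATH = build_classpath(JAR_NAMES)
# Assumed usage, mirroring the call above:
#   Rjb::load(CLASSPATH, ['-Xmx1024m'])
```

## Adding another jar then only means adding one entry to JAR_NAMES.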
## -- Example Usage --
## Example using Boilerpipe in Ruby
extracted_text = ArticleSentencesExtractor.INSTANCE.getText(raw_html)
## Example using the Stanford part-of-speech tagger:
## Note: reader, tokenizedText, taggedSentences, sentences are rjb Java objects, text_to_be_parsed is a Ruby string
reader = StringReader.new(text_to_be_parsed)
tokenizedText = MaxentTagger.tokenizeText(reader)
taggedSentences = PosTagger.process(tokenizedText)
sentences = taggedSentences.toArray()
sentences.each do |sentence|
  words = sentence.toArray()
  words.each do |word|
    p word.word().to_s   # the token
    p word.tag().to_s    # its part-of-speech tag
  end
end
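## A minimal sketch that collects the word/tag pairs into a plain Ruby array instead of
## printing them. word_tag_pairs, FakeWord and FakeSentence are hypothetical names; the
## stand-in structs only mimic the toArray/word/tag calls used in the loop above, so the
## sketch runs without a JVM. With rjb you would pass the real sentences array instead.

```ruby
# Hypothetical helper: flattens tagged sentences into [word, tag] pairs.
# Assumes `sentences` is enumerable, each sentence responds to toArray,
# and each word responds to word and tag, as in the loop above.
def word_tag_pairs(sentences)
  sentences.flat_map do |sentence|
    sentence.toArray.map { |word| [word.word.to_s, word.tag.to_s] }
  end
end

# Stand-ins for the rjb proxy objects (assumption: same three methods).
FakeWord = Struct.new(:word, :tag)
FakeSentence = Struct.new(:words) do
  def toArray
    words
  end
end

demo = [FakeSentence.new([FakeWord.new('Dogs', 'NNS'), FakeWord.new('bark', 'VBP')])]
pairs = word_tag_pairs(demo)
```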
## -- Rjb setup required, check the rjb docs, my info below (shell settings, not Ruby): --
## OSX .profile file, make sure the java bin is in $PATH
export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Home
## Ubuntu .bashrc file
## after installing openjdk 6
export PATH=/usr/lib/jvm/java-6-openjdk/bin:$PATH
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
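## A minimal Ruby sketch for checking JAVA_HOME before Rjb::load runs, for example at the
## top of the initializer. java_home_problem is a hypothetical helper, not part of rjb, and
## it only checks that the variable is set and points at an existing directory.

```ruby
# Hypothetical helper: returns nil when JAVA_HOME looks usable,
# otherwise a short message describing the problem.
def java_home_problem(env = ENV)
  home = env['JAVA_HOME']
  return 'JAVA_HOME is not set' if home.nil? || home.empty?
  return "JAVA_HOME does not exist: #{home}" unless File.directory?(home)

  nil
end

# Print a warning to stderr instead of failing later inside the JVM startup.
problem = java_home_problem
warn problem if problem
```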