@lacic
Created October 6, 2017 12:59
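import java.io.File;

import org.deeplearning4j.models.embeddings.loader.WordVectorSerializer;
import org.deeplearning4j.models.paragraphvectors.ParagraphVectors;
import org.deeplearning4j.models.word2vec.VocabWord;
import org.deeplearning4j.models.word2vec.wordstore.inmemory.AbstractCache;
import org.deeplearning4j.text.tokenization.tokenizer.preprocessor.CommonPreprocessor;
import org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory;
import org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory;

// DocAwareIterator and myDocData are not shown in this snippet.
DocAwareIterator jaIter = new DocAwareIterator(myDocData);
AbstractCache<VocabWord> cache = new AbstractCache<>();

// Default tokenizer; CommonPreprocessor lowercases tokens and strips digits and common punctuation.
TokenizerFactory t = new DefaultTokenizerFactory();
t.setTokenPreProcessor(new CommonPreprocessor());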

// Configure training: negative sampling instead of hierarchical softmax.
ParagraphVectors.Builder vecBuilder = new ParagraphVectors.Builder()
        .minWordFrequency(1)
        .iterations(5)
        .epochs(5)
        .layerSize(100)
        .learningRate(0.025)
        .labelsSource(jaIter.getLabelsSource())
        .windowSize(5)
        .iterate(jaIter)
        .trainWordVectors(false)
        .vocabCache(cache)
        .tokenizerFactory(t)
        .sampling(0)                  // no subsampling of frequent words
        .negativeSample(5)            // 5 negative samples
        .useHierarchicSoftmax(false);

// Build and train the model.
ParagraphVectors vec = vecBuilder.build();
vec.fit();

// Serialize the trained model to /tmp/myModel.
String modelName = "myModel";
File storedModel = new File("/tmp/", modelName);
WordVectorSerializer.writeParagraphVectors(vec, storedModel);