@hiropppe
Last active May 19, 2016 04:01
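# Install Ruby and wp2txt, which converts Wikipedia dumps to plain text
yum install -y rubygems ruby-devel
gem install wp2txt
# Download the first chunk of the Japanese Wikipedia article dump
curl -O https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles1.xml.bz2
# Extract plain text, then merge the output files into a single corpus
wp2txt --input-file jawiki-latest-pages-articles1.xml.bz2
cat jawiki-latest-pages-articles1.xml-* > corpus.txt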
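The commands above produce corpus.txt, while the training script reads corpus_wakati.txt, so a word-segmentation (wakati-gaki) step is presumably needed in between. The gist does not show it; the following is a minimal sketch that assumes MeCab with the mecab-python3 binding is installed. The helper names `make_mecab_tokenizer` and `write_wakati` are hypothetical.

```python
def make_mecab_tokenizer():
    """Return a function that splits Japanese text into space-separated tokens.

    Assumption: the mecab-python3 package and a MeCab dictionary are installed.
    The import is local so the rest of this sketch works without MeCab.
    """
    import MeCab
    tagger = MeCab.Tagger("-Owakati")  # -Owakati prints tokens separated by spaces
    return lambda text: tagger.parse(text).strip()


def write_wakati(src_path, dst_path, tokenize):
    """Tokenize each non-empty line of src_path and write it to dst_path.

    Returns the number of lines written.
    """
    written = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue  # skip blank lines left over from the text extraction
            dst.write(tokenize(line) + "\n")
            written += 1
    return written


# Usage, assuming MeCab is available:
# write_wakati("corpus.txt", "corpus_wakati.txt", make_mecab_tokenizer())
```

Passing the tokenizer in as an argument keeps the file handling independent of MeCab, so another segmenter could be substituted.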
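from gensim.models import word2vec
# Read the space-separated (wakati) corpus as a stream of sentences
sentences = word2vec.Text8Corpus("corpus_wakati.txt")
# Train 100-dimensional vectors; gensim 4 renamed `size` to `vector_size`
model = word2vec.Word2Vec(sentences, min_count=1, vector_size=100)
# Vector for a single word (gensim 4 exposes lookups through model.wv)
model.wv[u'セール']
# Cosine similarity between pairs of words
model.wv.similarity(u'セール', u'パーティー')
model.wv.similarity(u'セール', u'渋谷')
model.wv.similarity(u'セール', u'開催')
# Nearest neighbours of a word
model.wv.most_similar(positive=[u'セール'])
# Persist the trained model to disk
model.save('sample.model')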
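The saved sample.model can be reloaded in a later session instead of retraining. The following is a rough sketch, assuming gensim 4.x; `load_model` and `neighbours` are hypothetical helper names, and the out-of-vocabulary guard is an addition that the original script does not have.

```python
def load_model(path="sample.model"):
    """Load a model previously written with model.save().

    Assumption: gensim 4.x is installed. The import is local so that
    neighbours() below can be used without importing gensim.
    """
    from gensim.models import Word2Vec
    return Word2Vec.load(path)


def neighbours(model, word, topn=5):
    """Return up to topn (word, similarity) pairs for `word`.

    Returns an empty list when the word is not in the vocabulary,
    rather than letting the lookup raise KeyError.
    """
    if word not in model.wv:
        return []
    return model.wv.most_similar(positive=[word], topn=topn)


# Usage, assuming sample.model exists in the working directory:
# model = load_model()
# print(neighbours(model, u'セール'))
```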