@tokestermw
Last active July 20, 2016 01:09
Script that converts GloVe text vectors to the word2vec format gensim expects, binarizes them, and saves both the normalized and the unnormalized vectors.
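# Vocabulary size and dimensionality written to the word2vec header line.
vocab=400000
# Token count (in billions) used in the GloVe file name.
tokens=42
dim=300
# Base name of the input file, e.g. glove.42B.300d
filename="glove.${tokens}B.${dim}d"
n="_unnorm"
txt=".txt"
bin=".bin"
# The word2vec text format starts with a "<vocab> <dim>" header line, which GloVe files lack.
echo "${vocab} ${dim}" > "$filename$n$txt"
cat "$filename$txt" >> "$filename$n$txt"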
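# The heredoc delimiter is unquoted, so the shell expands the file name variables before Python runs.
python - <<END
# gensim 1.0 and later expose this loader on KeyedVectors;
# older releases used Word2Vec.load_word2vec_format instead.
from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format("$filename$n$txt")
# Save the raw (unnormalized) vectors in binary form.
model.save_word2vec_format("$filename$n$bin", binary=True)
# L2-normalize the vectors in place, then save the normalized copy.
# (init_sims is deprecated in recent gensim releases.)
model.init_sims(replace=True)
model.save_word2vec_format("$filename$bin", binary=True)
END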
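
The script hardcodes the vocabulary size and dimension in the header. If the header should instead describe every row in the input file, both values can be read from the file itself. The following is a minimal sketch of that idea, not part of the original script: the helper name add_word2vec_header is hypothetical, and it assumes one space-separated row per token with no spaces inside the tokens.

```python
import tempfile
from pathlib import Path


def add_word2vec_header(src, dst):
    """Copy a GloVe-style text file to dst, prefixed with a '<vocab> <dim>' line.

    Hypothetical helper: assumes each row is 'token v1 v2 ... vN',
    separated by single spaces, with no spaces inside the token.
    """
    vocab = 0
    dim = None
    # First pass: count rows and check that every row has the same width.
    with open(src, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split(" ")
            row_dim = len(parts) - 1  # the first field is the token itself
            if dim is None:
                dim = row_dim
            elif row_dim != dim:
                raise ValueError(
                    f"row {vocab + 1} has {row_dim} values, expected {dim}"
                )
            vocab += 1
    if dim is None:
        raise ValueError("input file is empty")
    # Second pass: write the header, then stream the original rows unchanged.
    with open(src, encoding="utf-8") as f, open(dst, "w", encoding="utf-8") as out:
        out.write(f"{vocab} {dim}\n")
        for line in f:
            out.write(line)
    return vocab, dim


if __name__ == "__main__":
    # Toy two-word, three-dimensional file standing in for a real GloVe download.
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "toy.txt"
        dst = Path(tmp) / "toy_unnorm.txt"
        src.write_text("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n", encoding="utf-8")
        print(add_word2vec_header(src, dst))
```

Reading the file twice keeps memory use constant, which matters for multi-gigabyte vector files. The width check also catches rows that would otherwise fail later, when gensim loads the file.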