ottokart’s gists

ottokart / word2vec-binary-to-python-dict.py

Last active July 25, 2019 22:41

Python script to convert a binary file containing word2vec pre-trained word embeddings into a pickled python dict.

	# coding: utf-8
	from __future__ import division

	import struct
	import sys

	FILE_NAME = "GoogleNews-vectors-negative300.bin"
	MAX_VECTORS = 200000 # This script takes a lot of RAM (>2GB for 200K vectors), if you want to use the full 3M embeddings then you probably need to insert the vectors into some kind of database
	FLOAT_SIZE = 4 # 32bit float

ottokart / nn.py

Last active August 27, 2021 05:52

3-layer neural network example with dropout in 2nd layer

	# Tiny example of 3-layer nerual network with dropout in 2nd hidden layer
	# Output layer is linear with L2 cost (regression model)
	# Hidden layer activation is tanh

	import numpy as np

	n_epochs = 100
	n_samples = 100
	n_in = 10
	n_hidden = 5

ottokart / momentum_demo.py

Created October 3, 2016 13:11

	# coding: utf-8

	# A little demo illustrating the effect of momentum in neural network training.
	# Try using different values for MOMENTUM constant below (e.g. compare 0.0 with 0.9).
	# This neural network is actually more like logistic regression, but I have used
	# squared error to make the error surface more interesting.

	import numpy as np
	import pylab

ottokart / som.py

Created December 12, 2016 13:58

Simple Self-Organizing Map

	from __future__ import division

	import numpy as np
	import matplotlib.pyplot as plt

	np.random.seed(1)

	# Replace X with your own data if you wish
	##########################################
	N = 300 # num samples

ottokart / word2vec-binary-to-text.py

Last active February 1, 2021 17:25

Python script to convert word2vec pre-trained word embeddings from a binary format into a text format where each line starts with a word followed by corresponding embedding vector entries separated by spaces. E.g., "dog 0.41231234567890 0.355122341578123 ..."

	# coding: utf-8
	from __future__ import division

	import struct
	import sys
	import gzip

	FILE_NAME = "GoogleNews-vectors-negative300.bin.gz" # outputs GoogleNews-vectors-negative300.bin.gz.txt
	MAX_VECTORS = 100000 # Top words to take
	FLOAT_SIZE = 4 # 32bit float

ottokart / punct_server.py

Created May 1, 2019 18:34

Simple script to serve a punctuation API on https://github.com/ottokart/punctuator2 (similar to http://bark.phon.ioc.ee/punctuator). need to change the configuration at the beginning of the file to point to the model file. Also, the script should be in the same dir as the other punctuator scripts. The Europarl model that I use in the demo (http:…

	# coding: utf-8

	from __future__ import division
	from nltk.tokenize import word_tokenize

	import models
	import data

	import theano
	import tornado.ioloop

Ottokar Siirak ottokart