@mino98
Created November 5, 2019 22:28
{
"cells": [
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pickle"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"vector_size = 256\n",
"bucket = 2000000"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following are taken from [here](https://github.com/RaRe-Technologies/gensim/blob/27bbb7015dc6bbe02e00bb1853e7952ac13e7fe0/gensim/models/keyedvectors.py#L2202-L2219):"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(-0.00390625, 0.00390625)"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"lo, hi = -1.0 / vector_size, 1.0 / vector_size\n",
"lo, hi"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Generating the ngram vectors the same way as Gensim:"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 7.77 s, sys: 2.84 s, total: 10.6 s\n",
"Wall time: 10.6 s\n"
]
}
],
"source": [
"%%time\n",
"vectors_ngrams = np.random.uniform(lo, hi, (bucket, vector_size)).astype(np.float32)"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(2000000, 256)"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"vectors_ngrams.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 1. This is how Gensim saves stuff (via pickle, compressed):"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 15.9 s, sys: 14.6 s, total: 30.5 s\n",
"Wall time: 35.4 s\n"
]
}
],
"source": [
"%%time \n",
"\n",
"with open(\"/tmp/vectors_ngrams_gensim.npz\", 'wb') as fp:\n",
" pickle.dump(vectors_ngrams, fp, protocol=2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Check that it matches the original array:"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 54,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"with open(\"/tmp/vectors_ngrams_gensim.npz\", 'rb') as fp:\n",
" test = pickle.load(fp)\n",
"np.array_equal(vectors_ngrams, test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 2. Let's try the same using Numpy's native compressed save:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Numpy seems to simply save the given objects uncompressed and zipping them:"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 1min 45s, sys: 4.21 s, total: 1min 49s\n",
"Wall time: 1min 50s\n"
]
}
],
"source": [
"%%time \n",
"np.savez_compressed(\"/tmp/vectors_ngrams_npcompressed.npz\", a=vectors_ngrams)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Check that it matches the original array:"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"test = np.load(\"/tmp/vectors_ngrams_npcompressed.npz\")\n",
"np.array_equal(vectors_ngrams, test['a'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 3. Let's try Numpy's native uncompressed save:"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 594 µs, sys: 1.57 s, total: 1.57 s\n",
"Wall time: 1.57 s\n"
]
}
],
"source": [
"%%time \n",
"np.save(\"/tmp/vectors_ngrams_npuncompressed.npy\", vectors_ngrams)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Check that it matches the original array:"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"test = np.load(\"/tmp/vectors_ngrams_npuncompressed.npy\")\n",
"np.array_equal(vectors_ngrams, test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Wow, that's much faster!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Check file sizes and conclusions:"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"-rw-r--r-- 1 giacomo giacomo 2.8G Nov 5 22:14 /tmp/vectors_ngrams_gensim.npz\n",
"-rw-r--r-- 1 giacomo giacomo 1.8G Nov 5 22:16 /tmp/vectors_ngrams_npcompressed.npz\n",
"-rw-r--r-- 1 giacomo giacomo 2.0G Nov 5 22:16 /tmp/vectors_ngrams_npuncompressed.npy\n"
]
}
],
"source": [
"%ls -lah /tmp/vectors_ngrams*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Of course, there's no point in trying to compress randomness... so why are we?\n",
"Shall we patch Gensim's save into using `np.save()` and `np.load()` for all its Numpy objects?\n",
"\n",
"Also... \n",
" * is 2M bucket an overkill, since it translates to ~2GB always in RAM?\n",
" * could we simply naively redefine Gensim's `REAL` into `np.float16`?"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
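
The open questions at the end of the notebook can be explored without writing gigabytes to `/tmp`. The cell below is a minimal sketch, not part of the original notebook: it computes the memory footprint of the n-gram matrix for `float32` and `float16`, then repeats the pickle vs. `np.save` comparison on a small stand-in array held in memory. The helper `ngram_matrix_bytes` and the `small_bucket` size are hypothetical names and values chosen for illustration, and the sizes it prints say nothing about how `float16` would affect training quality.

```python
import io
import pickle

import numpy as np


def ngram_matrix_bytes(bucket, vector_size, dtype):
    """Bytes needed for a dense (bucket, vector_size) matrix of `dtype`."""
    return bucket * vector_size * np.dtype(dtype).itemsize


# Footprint for the notebook's parameters, computed without allocating.
full32 = ngram_matrix_bytes(2_000_000, 256, np.float32)
full16 = ngram_matrix_bytes(2_000_000, 256, np.float16)
print(f"float32: {full32 / 1e9:.3f} GB, float16: {full16 / 1e9:.3f} GB")

# Small stand-in array (hypothetical size), initialised like the notebook.
vector_size = 256
small_bucket = 2_000
lo, hi = -1.0 / vector_size, 1.0 / vector_size
rng = np.random.default_rng(0)
vectors = rng.uniform(lo, hi, (small_bucket, vector_size)).astype(np.float32)

# Serialise to memory: NumPy's .npy format vs. pickle protocol 2.
npy_buf = io.BytesIO()
np.save(npy_buf, vectors)
npy_size = npy_buf.getbuffer().nbytes
pkl_bytes = pickle.dumps(vectors, protocol=2)
print(f"raw: {vectors.nbytes}, npy: {npy_size}, pickle(2): {len(pkl_bytes)}")

# Round-trip the .npy buffer to confirm nothing is lost.
npy_buf.seek(0)
restored = np.load(npy_buf)

# Cast to float16 and measure the worst-case rounding error.
half = vectors.astype(np.float16)
max_abs_err = float(np.max(np.abs(half.astype(np.float32) - vectors)))
print(f"float16 max abs error: {max_abs_err:.3e}")
```

On the small array, the `.npy` buffer should be only a short header larger than the raw data, while the protocol 2 pickle comes out noticeably bigger, which matches the 2.8G vs. 2.0G files listed above. Whether the `float16` rounding error is acceptable for Gensim would need to be checked against real training runs.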