mino98 · November 5, 2019 22:28
diff --git a/serialize.ipynb b/serialize.ipynb
 {
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pickle"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "vector_size = 256\n",
    "bucket = 2000000"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following initialization bounds are taken from [Gensim's `keyedvectors.py`](https://github.com/RaRe-Technologies/gensim/blob/27bbb7015dc6bbe02e00bb1853e7952ac13e7fe0/gensim/models/keyedvectors.py#L2202-L2219):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(-0.00390625, 0.00390625)"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "lo, hi = -1.0 / vector_size, 1.0 / vector_size\n",
    "lo, hi"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Generating the ngram vectors the same way as Gensim:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 7.77 s, sys: 2.84 s, total: 10.6 s\n",
      "Wall time: 10.6 s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "vectors_ngrams = np.random.uniform(lo, hi, (bucket, vector_size)).astype(np.float32)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(2000000, 256)"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "vectors_ngrams.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 1. This is how Gensim saves stuff (via pickle with protocol 2; no compression is applied in this cell):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 15.9 s, sys: 14.6 s, total: 30.5 s\n",
      "Wall time: 35.4 s\n"
     ]
    }
   ],
   "source": [
    "%%time \n",
    "\n",
    "with open(\"/tmp/vectors_ngrams_gensim.npz\", 'wb') as fp:\n",
    "    pickle.dump(vectors_ngrams, fp, protocol=2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Check that it matches the original array:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 54,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "with open(\"/tmp/vectors_ngrams_gensim.npz\", 'rb') as fp:\n",
    "    test = pickle.load(fp)\n",
    "np.array_equal(vectors_ngrams, test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2. Let's try the same using NumPy's native compressed save:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "NumPy seems to simply serialize each given array in the plain `.npy` format and then zip the results with compression:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 1min 45s, sys: 4.21 s, total: 1min 49s\n",
      "Wall time: 1min 50s\n"
     ]
    }
   ],
   "source": [
    "%%time \n",
    "np.savez_compressed(\"/tmp/vectors_ngrams_npcompressed.npz\", a=vectors_ngrams)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Check that it matches the original array:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 49,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "test = np.load(\"/tmp/vectors_ngrams_npcompressed.npz\")\n",
    "np.array_equal(vectors_ngrams, test['a'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3. Let's try NumPy's native uncompressed save:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 594 µs, sys: 1.57 s, total: 1.57 s\n",
      "Wall time: 1.57 s\n"
     ]
    }
   ],
   "source": [
    "%%time \n",
    "np.save(\"/tmp/vectors_ngrams_npuncompressed.npy\", vectors_ngrams)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Check that it matches the original array:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "test = np.load(\"/tmp/vectors_ngrams_npuncompressed.npy\")\n",
    "np.array_equal(vectors_ngrams, test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Wow, that's much faster: 1.57 s wall time, versus 35.4 s for pickle and 1 min 50 s for `np.savez_compressed`!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Check file sizes and conclusions:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "-rw-r--r-- 1 giacomo giacomo 2.8G Nov  5 22:14 /tmp/vectors_ngrams_gensim.npz\n",
      "-rw-r--r-- 1 giacomo giacomo 1.8G Nov  5 22:16 /tmp/vectors_ngrams_npcompressed.npz\n",
      "-rw-r--r-- 1 giacomo giacomo 2.0G Nov  5 22:16 /tmp/vectors_ngrams_npuncompressed.npy\n"
     ]
    }
   ],
   "source": [
    "%ls -lah /tmp/vectors_ngrams*"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Of course, there's no point in trying to compress random data... so why are we?\n",
    "Shall we patch Gensim's save to use `np.save()` and `np.load()` for all its NumPy objects?\n",
    "\n",
    "Also...\n",
    " * is a 2M bucket overkill, since it translates to ~2 GB always held in RAM?\n",
    " * could we simply (and naively) redefine Gensim's `REAL` as `np.float16`?"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
 }
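
For anyone who wants to reproduce the size comparison without writing ~7 GB to `/tmp`, here is a minimal sketch of the same three save paths on a scaled-down array. The `bucket` of 20,000 rows, the file names, and the use of a temporary directory are assumptions made for illustration; they are not part of the notebook. It only reports file size relative to the in-memory size, since timings on an array this small would not be comparable to the ones above.

```python
import os
import pickle
import tempfile

import numpy as np

# Scaled-down stand-in for the notebook's (2000000, 256) float32 array.
vector_size = 256
bucket = 20_000  # assumption: 100x fewer rows than in the notebook
lo, hi = -1.0 / vector_size, 1.0 / vector_size

rng = np.random.default_rng(0)
vectors_ngrams = rng.uniform(lo, hi, (bucket, vector_size)).astype(np.float32)

sizes = {}
with tempfile.TemporaryDirectory() as tmp:
    # 1. Plain pickle with protocol 2, as in the notebook's first experiment.
    path = os.path.join(tmp, "pickle_p2.pkl")
    with open(path, "wb") as fp:
        pickle.dump(vectors_ngrams, fp, protocol=2)
    with open(path, "rb") as fp:
        assert np.array_equal(vectors_ngrams, pickle.load(fp))
    sizes["pickle protocol 2"] = os.path.getsize(path)

    # 2. NumPy's zipped and compressed container.
    path = os.path.join(tmp, "compressed.npz")
    np.savez_compressed(path, a=vectors_ngrams)
    with np.load(path) as data:
        assert np.array_equal(vectors_ngrams, data["a"])
    sizes["np.savez_compressed"] = os.path.getsize(path)

    # 3. NumPy's raw .npy format: a small header followed by the array bytes.
    path = os.path.join(tmp, "raw.npy")
    np.save(path, vectors_ngrams)
    assert np.array_equal(vectors_ngrams, np.load(path))
    sizes["np.save"] = os.path.getsize(path)

# Report each file size as a multiple of the array's in-memory size.
raw_bytes = vectors_ngrams.nbytes
for name, size in sizes.items():
    print(f"{name:22s} {size / raw_bytes:.2f}x the in-memory size")
```

As far as I can tell, the pickle file comes out larger than the raw array because protocol 2 on Python 3 has no native opcode for `bytes` and stores the array buffer as a latin-1 string re-encoded as UTF-8, so every byte at or above 0x80 takes two bytes on disk. If that reading is right, it would account for the 2.8G versus 2.0G seen in the listing above, but treat it as an educated guess rather than a verified diagnosis.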