Created October 6, 2020 16:06
notebook for https://stackoverflow.com/questions/53091623/differences-between-minibatchkmeans-fit-and-minibatchkmeans-partial-fit
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Like [BIRCH](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.Birch.html?highlight=birch#sklearn.cluster.Birch) implementation `partial_fit` does online clustering, while `fit` does offline clustering, both will chunk the data passed in mini-batches, when using `partial_fit` and manually chunking your data you must be careful in the initialization phase, as the algorithm will initialize just once.\n", | |
"\n", | |
"### Example\n", | |
"\n", | |
"The dataset is defined for having 2 gaussians generators in the first 1K points, then having 4 gaussians in the next 1K points." | |
]
},
{
"cell_type": "code",
"execution_count": 179,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.datasets import make_blobs\n",
"import numpy as np\n",
"\n",
"def my_blobs(centers):\n",
"    return make_blobs(\n",
"        n_samples=1000,\n",
"        n_features=2,\n",
"        centers=centers,\n",
"        cluster_std=0.001,\n",
"        center_box=(0, 0),\n",
"        random_state=42,\n",
"        shuffle=True\n",
"    )[0]\n",
"\n",
"dataset = np.concatenate((\n",
"    my_blobs([  # 2 initial gaussians\n",
"        [.1, .1],\n",
"        [.2, .1]\n",
"    ]),\n",
"    my_blobs([  # then 4 gaussians\n",
"        [.3, .3],\n",
"        [.4, .4],\n",
"        [.5, .5],\n",
"        [.6, .6],\n",
"    ])\n",
"))"
]
},
{
"cell_type": "code",
"execution_count": 180,
"metadata": {},
"outputs": [],
"source": [
"def BaseModel():\n", | |
" return MiniBatchKMeans(\n", | |
" n_clusters=6,\n", | |
" batch_size=50,\n", | |
" init=\"random\",\n", | |
" init_size=200,\n", | |
" random_state=42,\n", | |
"# verbose=True\n", | |
" )" | |
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, let's use fit in MiniBatchKMeans on the entire dataset and get it's labels:" | |
]
},
{
"cell_type": "code",
"execution_count": 181,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([2, 2, 3, ..., 0, 5, 0], dtype=int32)"
]
},
"execution_count": 181,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.cluster import MiniBatchKMeans\n",
"\n",
"fit_labels = BaseModel().fit_predict(dataset)\n",
"\n",
"fit_labels"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Do the same thing but chunking the dataset in chunks of size 200 and use `partial_fit` in the model:" | |
]
},
{
"cell_type": "code",
"execution_count": 182,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([3, 3, 0, ..., 1, 1, 1], dtype=int32)"
]
},
"execution_count": 182,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model=BaseModel()\n", | |
"\n", | |
"for i in range(len(dataset)//200):\n", | |
" for _ in range(200):\n", | |
" model.partial_fit(dataset[i*200:(i+1)*200])\n", | |
"\n", | |
"partial_fit_labels=model.predict(dataset) #get labels\n", | |
"partial_fit_labels" | |
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, with `fit`" | |
]
},
{
"cell_type": "code",
"execution_count": 183,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([4, 4, 4, ..., 2, 0, 2], dtype=int32)"
]
},
"execution_count": 183,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model = BaseModel()\n",
"\n",
"for i in range(len(dataset) // 200):\n",
"    model.fit(dataset[i*200:(i+1)*200])\n",
"\n",
"fit_chunked_labels = model.predict(dataset)\n",
"fit_chunked_labels"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, with these three labels, we can calculate the [ARI](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.adjusted_rand_score.html) comparing the labels when using the `fit` in the entire dataset, if the result is 1, then the clustering is the same, if 0 the clustering is random." | |
]
},
{
"cell_type": "code",
"execution_count": 184,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Entire fit X partial fit 0.4346850264322783\n",
"Entire fit X fit chunked 0.3823732729187776\n"
]
}
],
"source": [
"from sklearn.metrics import adjusted_rand_score\n",
"\n",
"print(\"Entire fit X partial fit\", adjusted_rand_score(fit_labels, partial_fit_labels))\n",
"print(\"Entire fit X fit chunked\", adjusted_rand_score(fit_labels, fit_chunked_labels))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
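The statefulness difference the notebook demonstrates can also be sketched directly, outside the notebook. This is not part of the original gist; it is a minimal sketch on made-up toy data, assuming scikit-learn's `MiniBatchKMeans`: repeated `fit` calls restart from scratch, while `partial_fit` accumulates state across calls.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.RandomState(42)
chunk_a = rng.normal(0.0, 0.05, size=(200, 2))  # blob near the origin
chunk_b = rng.normal(1.0, 0.05, size=(200, 2))  # blob near (1, 1)

def make_model():
    # fixed integer random_state so every fresh fit is reproducible
    return MiniBatchKMeans(n_clusters=2, random_state=0, n_init=1, batch_size=50)

# fit() is stateless across calls: fitting A then B gives the same
# centers as fitting B alone, because each fit() re-initializes.
m1 = make_model()
m1.fit(chunk_a)
m1.fit(chunk_b)
m2 = make_model()
m2.fit(chunk_b)
print(np.allclose(m1.cluster_centers_, m2.cluster_centers_))  # True

# partial_fit() keeps state: having seen A first leaves a trace,
# since the centers were initialized on A and only nudged toward B.
m3 = make_model()
m3.partial_fit(chunk_a)
m3.partial_fit(chunk_b)
m4 = make_model()
m4.partial_fit(chunk_b)
print(np.allclose(m3.cluster_centers_, m4.cluster_centers_))  # False
```

This is why the notebook's chunked-`fit` loop effectively clusters only the last chunk, whereas the `partial_fit` loop's result depends on the chunk it was initialized on.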
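As a quick sanity check on how to read the ARI values above (again not from the original gist, using toy labelings I made up): the score is 1 for identical partitions even when the label names are permuted, and at or below 0 for unrelated ones.

```python
from sklearn.metrics import adjusted_rand_score

a = [0, 0, 1, 1, 2, 2]
b = [1, 1, 2, 2, 0, 0]  # same partition, different label names
c = [0, 1, 0, 1, 0, 1]  # unrelated partition

print(adjusted_rand_score(a, a))  # 1.0 -- identical partitions
print(adjusted_rand_score(a, b))  # 1.0 -- ARI is permutation-invariant
print(adjusted_rand_score(a, c))  # <= 0 -- no better than chance
```

Permutation invariance is what makes ARI suitable here: the three runs assign arbitrary cluster ids, so raw label comparison would be meaningless.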