@giuliano-macedo
Created October 6, 2020 16:06
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Like the [BIRCH](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.Birch.html?highlight=birch#sklearn.cluster.Birch) implementation, `partial_fit` does online clustering while `fit` does offline clustering. `fit` splits the data it receives into mini-batches of `batch_size`, whereas `partial_fit` treats the data passed in each call as a single mini-batch. When you chunk the data manually and call `partial_fit`, be careful with the initialization phase: the centers are initialized only once, on the first call.\n",
"\n",
"### Example\n",
"\n",
"The dataset draws its first 1,000 points from 2 Gaussian generators and its next 1,000 points from 4 Gaussian generators."
]
},
{
"cell_type": "code",
"execution_count": 179,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.datasets import make_blobs\n",
"import numpy as np\n",
"\n",
"def my_blobs(centers):\n",
"    # 1000 points drawn from tight Gaussians at the given centers; [0] keeps X and drops the labels\n",
"    return make_blobs(\n",
"        n_samples=1000,\n",
"        n_features=2,\n",
"        centers=centers,\n",
"        cluster_std=0.001,\n",
"        center_box=(0,0),\n",
"        random_state=42,\n",
"        shuffle=True\n",
"    )[0]\n",
"\n",
"dataset=np.concatenate((\n",
"    my_blobs([ # 2 initial gaussians\n",
"        [.1,.1],\n",
"        [.2,.1]\n",
"    ]),\n",
"    my_blobs([ # then 4 gaussians\n",
"        [.3,.3],\n",
"        [.4,.4],\n",
"        [.5,.5],\n",
"        [.6,.6],\n",
"    ])\n",
"))"
]
},
{
"cell_type": "code",
"execution_count": 180,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.cluster import MiniBatchKMeans\n",
"\n",
"def BaseModel():\n",
"    # Fresh, identically configured estimator for each experiment\n",
"    return MiniBatchKMeans(\n",
"        n_clusters=6,\n",
"        batch_size=50,\n",
"        init=\"random\",\n",
"        init_size=200,\n",
"        random_state=42,\n",
"        # verbose=True\n",
"    )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, let's call `fit` on the entire dataset with MiniBatchKMeans and get its labels:"
]
},
{
"cell_type": "code",
"execution_count": 181,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([2, 2, 3, ..., 0, 5, 0], dtype=int32)"
]
},
"execution_count": 181,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.cluster import MiniBatchKMeans\n",
"\n",
"fit_labels=BaseModel().fit_predict(dataset)\n",
"\n",
"fit_labels"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Do the same thing, but split the dataset into chunks of size 200 and call `partial_fit` 200 times on each chunk:"
]
},
{
"cell_type": "code",
"execution_count": 182,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([3, 3, 0, ..., 1, 1, 1], dtype=int32)"
]
},
"execution_count": 182,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model=BaseModel()\n",
"\n",
"for i in range(len(dataset)//200):\n",
"    # 200 partial_fit updates on each chunk of 200 points\n",
"    for _ in range(200):\n",
"        model.partial_fit(dataset[i*200:(i+1)*200])\n",
"\n",
"partial_fit_labels=model.predict(dataset) #get labels\n",
"partial_fit_labels"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now do the same with `fit`, which re-initializes the centers and refits the model from scratch on every chunk:"
]
},
{
"cell_type": "code",
"execution_count": 183,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([4, 4, 4, ..., 2, 0, 2], dtype=int32)"
]
},
"execution_count": 183,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model=BaseModel()\n",
"\n",
"for i in range(len(dataset)//200):\n",
"    model.fit(dataset[i*200:(i+1)*200])\n",
"\n",
"fit_chunked_labels=model.predict(dataset)\n",
"fit_chunked_labels"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, with these three sets of labels, we can compute the [ARI](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.adjusted_rand_score.html) of each chunked variant against the labels from `fit` on the entire dataset. A score of 1 means the two clusterings are identical; a score near 0 means they agree no more than random labelings would."
]
},
{
"cell_type": "code",
"execution_count": 184,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Entire fit X partial fit 0.4346850264322783\n",
"Entire fit X fit chunked 0.3823732729187776\n"
]
}
],
"source": [
"from sklearn.metrics import adjusted_rand_score\n",
"\n",
"print(\"Entire fit X partial fit\",adjusted_rand_score(fit_labels,partial_fit_labels))\n",
"print(\"Entire fit X fit chunked\",adjusted_rand_score(fit_labels,fit_chunked_labels))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}