Created October 6, 2020 16:06
notebook for https://stackoverflow.com/questions/53091623/differences-between-minibatchkmeans-fit-and-minibatchkmeans-partial-fit
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Like [BIRCH](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.Birch.html?highlight=birch#sklearn.cluster.Birch) implementation `partial_fit` does online clustering, while `fit` does offline clustering, both will chunk the data passed in mini-batches, when using `partial_fit` and manually chunking your data you must be careful in the initialization phase, as the algorithm will initialize just once.\n", | |
"\n", | |
"### Example\n", | |
"\n", | |
"The dataset is defined for having 2 gaussians generators in the first 1K points, then having 4 gaussians in the next 1K points." | |
]
},
{
"cell_type": "code",
"execution_count": 179,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.datasets import make_blobs\n",
"import numpy as np\n",
"\n",
"def my_blobs(centers):\n",
"    return make_blobs(\n",
"        n_samples=1000,\n",
"        n_features=2,\n",
"        centers=centers,\n",
"        cluster_std=0.001,\n",
"        center_box=(0, 0),\n",
"        random_state=42,\n",
"        shuffle=True\n",
"    )[0]\n",
"\n",
"dataset = np.concatenate((\n",
"    my_blobs([  # 2 initial gaussians\n",
"        [.1, .1],\n",
"        [.2, .1]\n",
"    ]),\n",
"    my_blobs([  # then 4 gaussians\n",
"        [.3, .3],\n",
"        [.4, .4],\n",
"        [.5, .5],\n",
"        [.6, .6],\n",
"    ])\n",
"))"
]
},
{
"cell_type": "code",
"execution_count": 180,
"metadata": {},
"outputs": [],
"source": [
"def BaseModel():\n", | |
" return MiniBatchKMeans(\n", | |
" n_clusters=6,\n", | |
" batch_size=50,\n", | |
" init=\"random\",\n", | |
" init_size=200,\n", | |
" random_state=42,\n", | |
"# verbose=True\n", | |
" )" | |
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, let's use fit in MiniBatchKMeans on the entire dataset and get it's labels:" | |
]
},
{
"cell_type": "code",
"execution_count": 181,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([2, 2, 3, ..., 0, 5, 0], dtype=int32)"
]
},
"execution_count": 181,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.cluster import MiniBatchKMeans\n",
"\n",
"fit_labels = BaseModel().fit_predict(dataset)\n",
"\n",
"fit_labels"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Do the same thing but chunking the dataset in chunks of size 200 and use `partial_fit` in the model:" | |
]
},
{
"cell_type": "code",
"execution_count": 182,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([3, 3, 0, ..., 1, 1, 1], dtype=int32)"
]
},
"execution_count": 182,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model=BaseModel()\n", | |
"\n", | |
"for i in range(len(dataset)//200):\n", | |
" for _ in range(200):\n", | |
" model.partial_fit(dataset[i*200:(i+1)*200])\n", | |
"\n", | |
"partial_fit_labels=model.predict(dataset) #get labels\n", | |
"partial_fit_labels" | |
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, with `fit`" | |
]
},
{
"cell_type": "code",
"execution_count": 183,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([4, 4, 4, ..., 2, 0, 2], dtype=int32)"
]
},
"execution_count": 183,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model = BaseModel()\n",
"\n",
"for i in range(len(dataset) // 200):\n",
"    model.fit(dataset[i*200:(i+1)*200])\n",
"\n",
"fit_chunked_labels = model.predict(dataset)\n",
"fit_chunked_labels"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, with these three labels, we can calculate the [ARI](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.adjusted_rand_score.html) comparing the labels when using the `fit` in the entire dataset, if the result is 1, then the clustering is the same, if 0 the clustering is random." | |
]
},
{
"cell_type": "code",
"execution_count": 184,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Entire fit X partial fit 0.4346850264322783\n",
"Entire fit X fit chunked 0.3823732729187776\n"
]
}
],
"source": [
"from sklearn.metrics import adjusted_rand_score\n",
"\n",
"print(\"Entire fit X partial fit\", adjusted_rand_score(fit_labels, partial_fit_labels))\n",
"print(\"Entire fit X fit chunked\", adjusted_rand_score(fit_labels, fit_chunked_labels))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
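The statefulness difference the notebook demonstrates can also be sketched directly, outside the notebook. This is not part of the original gist; it is a minimal sketch on made-up toy data, assuming scikit-learn's `MiniBatchKMeans`: repeated `fit` calls restart from scratch, while `partial_fit` accumulates state across calls.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.RandomState(42)
chunk_a = rng.normal(0.0, 0.05, size=(200, 2))  # blob near the origin
chunk_b = rng.normal(1.0, 0.05, size=(200, 2))  # blob near (1, 1)

def make_model():
    # fixed integer random_state so every fresh fit is reproducible
    return MiniBatchKMeans(n_clusters=2, random_state=0, n_init=1, batch_size=50)

# fit() is stateless across calls: fitting A then B gives the same
# centers as fitting B alone, because each fit() re-initializes.
m1 = make_model()
m1.fit(chunk_a)
m1.fit(chunk_b)
m2 = make_model()
m2.fit(chunk_b)
print(np.allclose(m1.cluster_centers_, m2.cluster_centers_))  # True

# partial_fit() keeps state: having seen A first leaves a trace,
# since the centers were initialized on A and only nudged toward B.
m3 = make_model()
m3.partial_fit(chunk_a)
m3.partial_fit(chunk_b)
m4 = make_model()
m4.partial_fit(chunk_b)
print(np.allclose(m3.cluster_centers_, m4.cluster_centers_))  # False
```

This is why the notebook's chunked-`fit` loop effectively clusters only the last chunk, whereas the `partial_fit` loop's result depends on the chunk it was initialized on.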
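As a quick sanity check on how to read the ARI values above (again not from the original gist, using toy labelings I made up): the score is 1 for identical partitions even when the label names are permuted, and at or below 0 for unrelated ones.

```python
from sklearn.metrics import adjusted_rand_score

a = [0, 0, 1, 1, 2, 2]
b = [1, 1, 2, 2, 0, 0]  # same partition, different label names
c = [0, 1, 0, 1, 0, 1]  # unrelated partition

print(adjusted_rand_score(a, a))  # 1.0 -- identical partitions
print(adjusted_rand_score(a, b))  # 1.0 -- ARI is permutation-invariant
print(adjusted_rand_score(a, c))  # <= 0 -- no better than chance
```

Permutation invariance is what makes ARI suitable here: the three runs assign arbitrary cluster ids, so raw label comparison would be meaningless.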