First, a quick code example of K-Means in Scikit-learn:
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
n_centers = 5
X, _ = make_blobs(n_samples=10000, centers=n_centers)
k_means = KMeans(n_clusters=n_centers)
k_means.fit(X)
labels = k_means.predict(X)
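Once fitted, the estimator exposes the learned centroids and per-sample labels, which is handy for sanity-checking the result. A minimal sketch using only scikit-learn (the `random_state` and `n_init` arguments are added here for reproducibility; they are not part of the example above):

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

n_centers = 5
X, _ = make_blobs(n_samples=10000, centers=n_centers, random_state=0)

k_means = KMeans(n_clusters=n_centers, n_init=10, random_state=0)
labels = k_means.fit_predict(X)  # fit and predict in one call

# One centroid per requested cluster; make_blobs defaults to 2 features,
# so the centroids form a (5, 2) array.
print(k_means.cluster_centers_.shape)
# Every requested cluster should receive at least some of these
# well-separated blob samples.
print(len(set(labels)))
```

The same `cluster_centers_` and `labels_` attributes are mirrored by cuML's estimator, which is what makes the import swap below work.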
To use cuML's single-GPU API, we just change the imports:
from cuml.cluster import KMeans
from cuml.datasets import make_blobs
n_centers = 5
X, _ = make_blobs(n_samples=10000, centers=n_centers)
k_means = KMeans(n_clusters=n_centers)
k_means.fit(X)
labels = k_means.predict(X)
To use KMeans in Dask-ML, which is CPU-based, we change the imports and create a Dask Client:
from dask_ml.cluster import KMeans
from sklearn.datasets import make_blobs
from dask.distributed import Client
c = Client(<scheduler_address>)
n_centers = 5
X, _ = make_blobs(n_samples=10000, centers=n_centers)
k_means = KMeans(n_clusters=n_centers)
k_means.fit(X)
labels = k_means.predict(X)
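If no standalone scheduler is running at a known address, Dask can also spin up a local cluster in-process, which stands in for the `<scheduler_address>` placeholder above. A minimal sketch assuming only `dask.distributed` is installed (the worker count and `processes=False` setting are illustrative choices, not requirements):

```python
from dask.distributed import Client, LocalCluster

# Start a local cluster instead of connecting to a remote scheduler.
# processes=False keeps the workers as threads in one process, which is
# enough for a demo; dashboard_address=None skips the diagnostics server.
cluster = LocalCluster(n_workers=2, processes=False, dashboard_address=None)
c = Client(cluster)

# The scheduler now reports the two local workers.
print(c.scheduler_info()["workers"].keys())

c.close()
cluster.close()
```

The resulting `c` can be used exactly as in the snippet above; Dask-ML estimators pick up the active client automatically.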
And to use the multi-node multi-GPU API, we just change the imports again:
from cuml.dask.cluster import KMeans
from cuml.dask.datasets import make_blobs
from dask.distributed import Client
c = Client(<scheduler_address>)
n_centers = 5
X, _ = make_blobs(n_samples=10000, centers=n_centers)
k_means = KMeans(n_clusters=n_centers)
k_means.fit(X)
labels = k_means.predict(X)
Note: cuml.dask.datasets.make_blobs is available in cuML as of version 0.10.
Hi,
How well does each method perform? Do you have any recommendations on how to choose between them if I have both CPU and multi-GPU options available? Also, what would be the best approach for both large and small datasets?
Thanks!