@cjnolet
Last active October 11, 2024 10:54
Simple example of cuML's K-Means Single-GPU (SG) and Multi-Node Multi-GPU (MNMG) APIs compared to Scikit-learn and Dask-ML

Comparing cuML K-Means API Against Scikit-learn & Dask-ML

First, a quick code example of K-Means in Scikit-learn

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

n_centers = 5

# make_blobs takes the number of blob centers as `centers`
X, _ = make_blobs(n_samples=10000, centers=n_centers)

k_means = KMeans(n_clusters=n_centers)
k_means.fit(X)

labels = k_means.predict(X)
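
For readers new to the estimator, here is a rough pure-Python sketch of what `fit` and `predict` compute conceptually: plain Lloyd iterations. It is not how Scikit-learn or cuML implement K-Means. The function names are hypothetical, and seeding the centers with the first `n_clusters` points is an assumption made only to keep the example deterministic (library implementations typically use smarter initialization such as k-means++).

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def predict(points, centers):
    """Assign each point the index of its nearest center."""
    return [
        min(range(len(centers)), key=lambda j: squared_distance(p, centers[j]))
        for p in points
    ]


def fit(points, n_clusters, n_iter=10):
    """Plain Lloyd iterations; returns the final list of centers."""
    # Assumption for this sketch: seed the centers with the first points.
    centers = [list(p) for p in points[:n_clusters]]
    for _ in range(n_iter):
        labels = predict(points, centers)
        for j in range(n_clusters):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # keep the previous center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers


# Two well-separated groups of three 2-D points each.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centers = fit(points, n_clusters=2)
labels = predict(points, centers)
```

The `fit`/`predict` split mirrors the estimator API used throughout this gist: `fit` learns the centers, and `predict` only assigns points to the nearest learned center.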

To use cuML's Single-GPU API, we just change the imports

from cuml.cluster import KMeans
from cuml.datasets import make_blobs

n_centers = 5

# As in Scikit-learn, the number of blob centers is passed as `centers`
X, _ = make_blobs(n_samples=10000, centers=n_centers)

k_means = KMeans(n_clusters=n_centers)
k_means.fit(X)

labels = k_means.predict(X)
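
Since the Scikit-learn and cuML single-GPU snippets differ only in their imports, the backend can also be picked at run time. The sketch below is one possible way to do that, not an API of either library: `pick_kmeans_backend` and `load_kmeans` are hypothetical helper names, and the only assumption about the libraries is that `KMeans` lives in `<package>.cluster`, as in the imports above.

```python
import importlib
import importlib.util


def pick_kmeans_backend(candidates=("cuml", "sklearn")):
    """Return the first importable package name in `candidates`, or None."""
    for name in candidates:
        if importlib.util.find_spec(name) is not None:
            return name
    return None


def load_kmeans(backend):
    """Import `<backend>.cluster` and return its KMeans class."""
    module = importlib.import_module(f"{backend}.cluster")
    return module.KMeans


backend = pick_kmeans_backend()  # prefers cuML when it is installed
```

If `backend` is not `None`, `load_kmeans(backend)(n_clusters=5)` builds the estimator. Generate the input data with the matching backend's `make_blobs`, so that GPU estimators receive GPU-generated data.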

To use KMeans in Dask-ML, which is CPU-based, we change the KMeans import and create a Dask Client

from dask_ml.cluster import KMeans
from sklearn.datasets import make_blobs

from dask.distributed import Client
# Replace the placeholder with the address of your Dask scheduler
c = Client("<scheduler_address>")

n_centers = 5

# Scikit-learn's make_blobs takes the number of blob centers as `centers`
X, _ = make_blobs(n_samples=10000, centers=n_centers)

k_means = KMeans(n_clusters=n_centers)
k_means.fit(X)

# The result is a lazy Dask collection; call labels.compute() to materialize it
labels = k_means.predict(X)

And to use the multi-node multi-GPU API, we just change the imports again

from cuml.dask.cluster import KMeans
from cuml.dask.datasets import make_blobs

from dask.distributed import Client
# Replace the placeholder with the address of your Dask scheduler
c = Client("<scheduler_address>")

n_centers = 5

# The number of blob centers is passed as `centers`
X, _ = make_blobs(n_samples=10000, centers=n_centers)

k_means = KMeans(n_clusters=n_centers)
k_means.fit(X)

# The result is a lazy Dask collection; call labels.compute() to materialize it
labels = k_means.predict(X)

Note: cuml.dask.datasets.make_blobs is available in cuML as of version 0.10.

@joansaurina

Hi,

How well does each method perform? Do you have any recommendations on how to choose between them if I have both CPU and multi-GPU options available? Also, what would be the best approach for both large and small datasets?

Thanks!