paulhendricks · November 16, 2020 20:34
diff --git a/mnmg_knn.ipynb b/mnmg_knn.ipynb
 {
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Nearest Neighbors Multi-Node Multi-GPU (MNMG) Demo\n",
    "\n",
    "The nearest neighbors multi-Node multi-GPU implementation leverages Dask to spread data and computations across multiple workers. cuML uses One Process Per GPU (OPG) layout, which maps a single Dask worker to each GPU.\n",
    "\n",
    "The main difference between cuML's MNMG implementation of nearest neighbors and the single-GPU is that the `kneighbors()` query partitions can be broadcast each of the workers in batches and the nearest neighbors search performed in parallel.\n",
    "\n",
    "Unlike the single-GPU implementation, The MNMG nearest neighbors API currently requires a Dask cuDF Dataframe as input. `kneighbors()` also returns a Dask cuDF Dataframe. The Dask cuDF Dataframe API is very similar to the Dask DataFrame API, but underlying Dataframes are cuDF, rather than Pandas.\n",
    "\n",
    "For information on converting your dataset to Dask cuDF format: https://rapidsai.github.io/projects/cudf/en/0.11.0/dask-cudf.html#multi-gpu-with-dask-cudf\n",
    "\n",
    "For additional information on cuML's MNMG nearest neighbors implementation: https://rapidsai.github.io/projects/cuml/en/0.11.0/api.html#cuml.dask.neighbors.NearestNeighbors"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "import pandas as pd\n",
    "import cudf\n",
    "\n",
    "from cuml.dask.common import to_dask_df\n",
    "from cuml.dask.datasets import make_blobs as make_blobs_cuml\n",
    "\n",
    "from dask.distributed import Client, wait\n",
    "from dask_cuda import LocalCUDACluster\n",
    "\n",
    "from sklearn.neighbors import NearestNeighbors as skNeighbors\n",
    "from cuml.dask.neighbors import NearestNeighbors as cumlNeighbors"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Start Dask Cluster\n",
    "\n",
    "We can use the `LocalCUDACluster` to start a Dask cluster on a single machine with one worker mapped to each GPU. This is called one-process-per-GPU (OPG). "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "cluster = LocalCUDACluster(threads_per_worker=1)\n",
    "client = Client(cluster)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Define Parameters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "n_samples = 1_00_000\n",
    "n_features = 256\n",
    "\n",
    "n_total_partitions = len(list(client.has_what().keys()))\n",
    "\n",
    "n_neighbors = 5\n",
    "\n",
    "n_query = 5000"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Generate Data\n",
    "\n",
    "### Device\n",
    "\n",
    "We can generate a dask_cudf.DataFrame of synthetic data for multiple clusters using `cuml.dask.datasets.make_blobs`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_cudf_train, _ = make_blobs_cuml(n_samples, \n",
    "                                  n_features,\n",
    "                                  centers = 5, \n",
    "                                  n_parts = n_total_partitions,\n",
    "                                  cluster_std=0.1, \n",
    "                                  verbose=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_cudf_query, _ = make_blobs_cuml(n_query, \n",
    "                                  n_features,\n",
    "                                  centers = 5, \n",
    "                                  n_parts = n_total_partitions,\n",
    "                                  cluster_std=0.1, \n",
    "                                  verbose=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## cuML Model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 185 ms, sys: 2.03 ms, total: 187 ms\n",
      "Wall time: 3.13 s\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "<cuml.dask.neighbors.nearest_neighbors.NearestNeighbors at 0x7f172a298160>"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%time\n",
    "knn_cuml = cumlNeighbors(algorithm=\"brute\")\n",
    "knn_cuml.fit(X_cudf_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 276 ms, sys: 35.1 ms, total: 311 ms\n",
      "Wall time: 1.91 s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "D_cuml, I_cuml = knn_cuml.kneighbors(X_cudf_query)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Host\n",
    "\n",
    "We use `cuml.dask.common.to_dask_df` to convert a dask_cuml.DataFrame using device memory into a dask.DataFrame containing Pandas dataframe in host memory. Since our baseline is not distributed, we use `compute()` to bring our data to a single process."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.datasets import make_blobs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train, _ = make_blobs(n_samples, \n",
    "                        n_features,\n",
    "                        centers = 5, \n",
    "                        cluster_std=0.1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_query, _ = make_blobs(n_query, \n",
    "                        n_features,\n",
    "                        centers = 5, \n",
    "                        cluster_std=0.1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Scikit-learn model\n",
    "\n",
    "Since there is no distributed Scikit-learn equivalent to cuML's MNMG Nearest Neighbors implementation, we will use the basic brute-force nearest neighbors implementation from Scikit-learn as our baseline. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 20.1 ms, sys: 0 ns, total: 20.1 ms\n",
      "Wall time: 17.9 ms\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "NearestNeighbors(algorithm='brute', n_jobs=-1)"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%time\n",
    "knn_sk = skNeighbors(algorithm=\"brute\", n_jobs=-1)\n",
    "knn_sk.fit(X_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 1min 16s, sys: 1min 30s, total: 2min 46s\n",
      "Wall time: 12.1 s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "D_sk, I_sk = knn_sk.kneighbors(X_query, n_neighbors)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Compare Results\n",
    "\n",
    "cuML currently uses FAISS for exact nearest neighbors search, which limits inputs to single-precision. This results in possible round-off errors when floats of different magnitude are added. As a result, it's very likely that the cuML results will not match Scikit-learn's nearest neighbors exactly. You can read more in the [FAISS wiki](https://github.com/facebookresearch/faiss/wiki/FAQ#why-do-i-get-weird-results-with-brute-force-search-on-vectors-with-large-components).\n",
    "\n",
    "### Distances"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "# passed = np.allclose(D_sk, D_cuml.compute().as_gpu_matrix(), atol=1e-3)\n",
    "# print('compare knn: cuml vs sklearn distances %s'%('equal'if passed else 'NOT equal'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Indices"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "# sk_sorted = np.sort(I_sk, axis=1)\n",
    "# cuml_sorted = np.sort(I_cuml.compute().as_gpu_matrix(), axis=1)\n",
    "\n",
    "# diff = sk_sorted - cuml_sorted\n",
    "\n",
    "# # Pass if differences are less than .1%\n",
    "# passed = (len(diff[diff!=0]) / n_query) < 1e-2\n",
    "# print('compare knn: cuml vs sklearn indexes %s'%('equal'if passed else 'NOT equal'))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
 }
	{
	"cells": [
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"# Nearest Neighbors Multi-Node Multi-GPU (MNMG) Demo\n",
	"\n",
	"The nearest neighbors multi-Node multi-GPU implementation leverages Dask to spread data and computations across multiple workers. cuML uses One Process Per GPU (OPG) layout, which maps a single Dask worker to each GPU.\n",
	"\n",
	"The main difference between cuML's MNMG implementation of nearest neighbors and the single-GPU is that the `kneighbors()` query partitions can be broadcast each of the workers in batches and the nearest neighbors search performed in parallel.\n",
	"\n",
	"Unlike the single-GPU implementation, The MNMG nearest neighbors API currently requires a Dask cuDF Dataframe as input. `kneighbors()` also returns a Dask cuDF Dataframe. The Dask cuDF Dataframe API is very similar to the Dask DataFrame API, but underlying Dataframes are cuDF, rather than Pandas.\n",
	"\n",
	"For information on converting your dataset to Dask cuDF format: https://rapidsai.github.io/projects/cudf/en/0.11.0/dask-cudf.html#multi-gpu-with-dask-cudf\n",
	"\n",
	"For additional information on cuML's MNMG nearest neighbors implementation: https://rapidsai.github.io/projects/cuml/en/0.11.0/api.html#cuml.dask.neighbors.NearestNeighbors"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 1,
	"metadata": {},
	"outputs": [],
	"source": [
	"import numpy as np\n",
	"\n",
	"import pandas as pd\n",
	"import cudf\n",
	"\n",
	"from cuml.dask.common import to_dask_df\n",
	"from cuml.dask.datasets import make_blobs as make_blobs_cuml\n",
	"\n",
	"from dask.distributed import Client, wait\n",
	"from dask_cuda import LocalCUDACluster\n",
	"\n",
	"from sklearn.neighbors import NearestNeighbors as skNeighbors\n",
	"from cuml.dask.neighbors import NearestNeighbors as cumlNeighbors"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## Start Dask Cluster\n",
	"\n",
	"We can use the `LocalCUDACluster` to start a Dask cluster on a single machine with one worker mapped to each GPU. This is called one-process-per-GPU (OPG). "
	]
	},
	{
	"cell_type": "code",
	"execution_count": 2,
	"metadata": {},
	"outputs": [],
	"source": [
	"cluster = LocalCUDACluster(threads_per_worker=1)\n",
	"client = Client(cluster)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## Define Parameters"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 3,
	"metadata": {},
	"outputs": [],
	"source": [
	"n_samples = 1_00_000\n",
	"n_features = 256\n",
	"\n",
	"n_total_partitions = len(list(client.has_what().keys()))\n",
	"\n",
	"n_neighbors = 5\n",
	"\n",
	"n_query = 5000"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## Generate Data\n",
	"\n",
	"### Device\n",
	"\n",
	"We can generate a dask_cudf.DataFrame of synthetic data for multiple clusters using `cuml.dask.datasets.make_blobs`."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 4,
	"metadata": {},
	"outputs": [],
	"source": [
	"X_cudf_train, _ = make_blobs_cuml(n_samples, \n",
	" n_features,\n",
	" centers = 5, \n",
	" n_parts = n_total_partitions,\n",
	" cluster_std=0.1, \n",
	" verbose=True)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 5,
	"metadata": {},
	"outputs": [],
	"source": [
	"X_cudf_query, _ = make_blobs_cuml(n_query, \n",
	" n_features,\n",
	" centers = 5, \n",
	" n_parts = n_total_partitions,\n",
	" cluster_std=0.1, \n",
	" verbose=True)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## cuML Model"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 6,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"CPU times: user 185 ms, sys: 2.03 ms, total: 187 ms\n",
	"Wall time: 3.13 s\n"
	]
	},
	{
	"data": {
	"text/plain": [
	"<cuml.dask.neighbors.nearest_neighbors.NearestNeighbors at 0x7f172a298160>"
	]
	},
	"execution_count": 6,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"%%time\n",
	"knn_cuml = cumlNeighbors(algorithm=\"brute\")\n",
	"knn_cuml.fit(X_cudf_train)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 7,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"CPU times: user 276 ms, sys: 35.1 ms, total: 311 ms\n",
	"Wall time: 1.91 s\n"
	]
	}
	],
	"source": [
	"%%time\n",
	"D_cuml, I_cuml = knn_cuml.kneighbors(X_cudf_query)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"### Host\n",
	"\n",
	"We use `cuml.dask.common.to_dask_df` to convert a dask_cuml.DataFrame using device memory into a dask.DataFrame containing Pandas dataframe in host memory. Since our baseline is not distributed, we use `compute()` to bring our data to a single process."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 8,
	"metadata": {},
	"outputs": [],
	"source": [
	"from sklearn.datasets import make_blobs"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 9,
	"metadata": {},
	"outputs": [],
	"source": [
	"X_train, _ = make_blobs(n_samples, \n",
	" n_features,\n",
	" centers = 5, \n",
	" cluster_std=0.1)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 10,
	"metadata": {},
	"outputs": [],
	"source": [
	"X_query, _ = make_blobs(n_query, \n",
	" n_features,\n",
	" centers = 5, \n",
	" cluster_std=0.1)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## Scikit-learn model\n",
	"\n",
	"Since there is no distributed Scikit-learn equivalent to cuML's MNMG Nearest Neighbors implementation, we will use the basic brute-force nearest neighbors implementation from Scikit-learn as our baseline. "
	]
	},
	{
	"cell_type": "code",
	"execution_count": 11,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"CPU times: user 20.1 ms, sys: 0 ns, total: 20.1 ms\n",
	"Wall time: 17.9 ms\n"
	]
	},
	{
	"data": {
	"text/plain": [
	"NearestNeighbors(algorithm='brute', n_jobs=-1)"
	]
	},
	"execution_count": 11,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"%%time\n",
	"knn_sk = skNeighbors(algorithm=\"brute\", n_jobs=-1)\n",
	"knn_sk.fit(X_train)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 12,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"CPU times: user 1min 16s, sys: 1min 30s, total: 2min 46s\n",
	"Wall time: 12.1 s\n"
	]
	}
	],
	"source": [
	"%%time\n",
	"D_sk, I_sk = knn_sk.kneighbors(X_query, n_neighbors)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## Compare Results\n",
	"\n",
	"cuML currently uses FAISS for exact nearest neighbors search, which limits inputs to single-precision. This results in possible round-off errors when floats of different magnitude are added. As a result, it's very likely that the cuML results will not match Scikit-learn's nearest neighbors exactly. You can read more in the [FAISS wiki](https://github.com/facebookresearch/faiss/wiki/FAQ#why-do-i-get-weird-results-with-brute-force-search-on-vectors-with-large-components).\n",
	"\n",
	"### Distances"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 13,
	"metadata": {},
	"outputs": [],
	"source": [
	"# passed = np.allclose(D_sk, D_cuml.compute().as_gpu_matrix(), atol=1e-3)\n",
	"# print('compare knn: cuml vs sklearn distances %s'%('equal'if passed else 'NOT equal'))"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"### Indices"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 14,
	"metadata": {},
	"outputs": [],
	"source": [
	"# sk_sorted = np.sort(I_sk, axis=1)\n",
	"# cuml_sorted = np.sort(I_cuml.compute().as_gpu_matrix(), axis=1)\n",
	"\n",
	"# diff = sk_sorted - cuml_sorted\n",
	"\n",
	"# # Pass if differences are less than .1%\n",
	"# passed = (len(diff[diff!=0]) / n_query) < 1e-2\n",
	"# print('compare knn: cuml vs sklearn indexes %s'%('equal'if passed else 'NOT equal'))"
	]
	}
	],
	"metadata": {
	"kernelspec": {
	"display_name": "Python 3",
	"language": "python",
	"name": "python3"
	},
	"language_info": {
	"codemirror_mode": {
	"name": "ipython",
	"version": 3
	},
	"file_extension": ".py",
	"mimetype": "text/x-python",
	"name": "python",
	"nbconvert_exporter": "python",
	"pygments_lexer": "ipython3",
	"version": "3.8.6"
	}
	},
	"nbformat": 4,
	"nbformat_minor": 4
	}
No results found