@jacobtomlinson
Created February 11, 2020 16:02
Example of using CuPy and cuDF with Dask CUDA
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Example of using CuPy and cuDF with Dask CUDA\n",
"\n",
"This notebook creates a Dask CUDA cluster using `dask_cuda.LocalCUDACluster` and then does some trivial operations on a large Dask array and dataframe backed by CuPy and cuDF respectively.\n",
"\n",
"This is mostly useful for stressing GPUs using Dask."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# You probably shouldn't do this, it's naughty. I'm just trying to make a tidy notebook.\n",
"import warnings\n",
"warnings.filterwarnings(\"ignore\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Import Dask and create cluster and client. This is also a good time to open up dashboards, etc."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import dask\n",
"from dask.distributed import Client\n",
"from dask_cuda import LocalCUDACluster\n",
"cluster = LocalCUDACluster()\n",
"client = Client(cluster)\n",
"client"
]
},
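{
"cell_type": "markdown",
"metadata": {},
"source": [
"The dashboard address can be read back from the client object. This is a small convenience sketch using `client.dashboard_link` from `dask.distributed`; the exact URL depends on your configuration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Print the dashboard URL so it can be opened in a browser\n",
"client.dashboard_link"
]
},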
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Array Example"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import cupy\n",
"import dask.array as da"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Construct a multi-terrabyte Dask Array based on `cupy.random.RandomState`. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"rs = da.random.RandomState(RandomState=cupy.random.RandomState)\n",
"x = rs.normal(10, 1, size=(500000, 500000), chunks=(10000, 10000))\n",
"x"
]
},
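{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check (a sketch using standard Dask array attributes): `_meta` shows the type backing each chunk, which should be `cupy.ndarray`, and `nbytes` gives the nominal size. 500,000 x 500,000 float64 values is about 2 TB. Neither of these materialises any data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Backing array type for each chunk (should be cupy.ndarray)\n",
"print(type(x._meta))\n",
"# Nominal size of the lazy array in terabytes (no data is computed here)\n",
"x.nbytes / 1e12"
]
},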
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Perform some a trivial task such as taking the mean."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"x.mean().compute()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Dataframe Example"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import cudf\n",
"import dask.dataframe as dd"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Construct the example timeseries dataframe from `cudf` but then wrap it in a Dask dataframe and amplify up the data volume a bit."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cdf = cudf.datasets.timeseries()\n",
"cddf = dd.from_pandas(cdf, npartitions=1)\n",
"cddf = dd.concat([cddf]*40)\n",
"cddf"
]
},
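{
"cell_type": "markdown",
"metadata": {},
"source": [
"For a quick peek at the data, `head()` pulls a handful of rows back to the client. This assumes the default `timeseries()` columns (`id`, `name`, `x`, `y`)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Pull the first few rows back to the client for inspection\n",
"cddf.head()"
]
},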
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This should give around 100m rows."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"len(cddf)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Again perform some trivial operation like a group-by followed by a mean."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"cddf.groupby('name').x.mean().compute()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.7"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
@jacobtomlinson

GIF of this notebook in action

Mean of a 2TB CuPy Dask array in 12 seconds
