@jacobtomlinson
Created February 11, 2020 16:02
Example of using CuPy and cuDF with Dask CUDA
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Example of using CuPy and cuDF with Dask CUDA\n",
"\n",
"This notebook creates a Dask CUDA cluster using `dask_cuda.LocalCUDACluster` and then does some trivial operations on a large Dask array and dataframe backed by CuPy and cuDF respectively.\n",
"\n",
"This is mostly useful for stressing GPUs using Dask."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# You probably shouldn't do this, it's naughty. I'm just trying to make a tidy notebook.\n",
"import warnings\n",
"warnings.filterwarnings(\"ignore\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Import Dask and create cluster and client. This is also a good time to open up dashboards, etc."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import dask\n",
"from dask.distributed import Client\n",
"from dask_cuda import LocalCUDACluster\n",
"cluster = LocalCUDACluster()\n",
"client = Client(cluster)\n",
"client"
]
},
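{
"cell_type": "markdown",
"metadata": {},
"source": [
"The dashboard address can be read back from the client object. This is a small convenience sketch using `client.dashboard_link` from `dask.distributed`; the exact URL depends on your configuration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Print the dashboard URL so it can be opened in a browser\n",
"client.dashboard_link"
]
},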
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Array Example"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import cupy\n",
"import dask.array as da"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Construct a multi-terrabyte Dask Array based on `cupy.random.RandomState`. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"rs = da.random.RandomState(RandomState=cupy.random.RandomState)\n",
"x = rs.normal(10, 1, size=(500000, 500000), chunks=(10000, 10000))\n",
"x"
]
},
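{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check (a sketch using standard Dask array attributes): `_meta` shows the type backing each chunk, which should be `cupy.ndarray`, and `nbytes` gives the nominal size. 500,000 x 500,000 float64 values is about 2 TB. Neither of these materialises any data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Backing array type for each chunk (should be cupy.ndarray)\n",
"print(type(x._meta))\n",
"# Nominal size of the lazy array in terabytes (no data is computed here)\n",
"x.nbytes / 1e12"
]
},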
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Perform some a trivial task such as taking the mean."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"x.mean().compute()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Dataframe Example"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import cudf\n",
"import dask.dataframe as dd"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Construct the example timeseries dataframe from `cudf` but then wrap it in a Dask dataframe and amplify up the data volume a bit."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cdf = cudf.datasets.timeseries()\n",
"cddf = dd.from_pandas(cdf, npartitions=1)\n",
"cddf = dd.concat([cddf]*40)\n",
"cddf"
]
},
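{
"cell_type": "markdown",
"metadata": {},
"source": [
"For a quick peek at the data, `head()` pulls a handful of rows back to the client. This assumes the default `timeseries()` columns (`id`, `name`, `x`, `y`)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Pull the first few rows back to the client for inspection\n",
"cddf.head()"
]
},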
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This should give around 100m rows."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"len(cddf)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Again perform some trivial operation like a group-by followed by a mean."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"cddf.groupby('name').x.mean().compute()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.7"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
@jacobtomlinson

GIF of this notebook in action

Mean of a 2TB CuPy Dask array in 12 seconds
