| { | |
| "cells": [ | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "%matplotlib inline\n", | |
| "\n", | |
| "import xarray as xr\n", | |
| "import numpy as np\n", | |
| "import dask as da\n", | |
| "\n", | |
| "import matplotlib.pyplot as plt" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "# dask\n", | |
| "\n", | |
| "\"Dask is a flexible parallel computing library for analytic computing.\"\n", | |
| "-- [dask homepage](https://dask.pydata.org/en/latest/docs.html)\n", | |
| "\n", | |
| "Dask has three primary data structures similar to those already used by data scientists:\n", | |
| "* `dask.dataframe`: A wrapper for `pandas.DataFrame`\n", | |
| "* `dask.array`: A wrapper for `numpy.array`\n", | |
| "* `dask.bag`: A wrapper for Python `list` and iterators.\n", | |
| "\n", | |
| "Fundamentally, `dask` allows you to specify a pipeline of functions and transforms on your data, and not worry too much about how and when those computations might be accomplished.\n", | |
| "\n", | |
| "For example, adapted from https://github.com/dask/dask-tutorial\n", | |
| "> survey: does everyone know what a @ decorator is?" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "# without dask\n", | |
| "\n", | |
| "def inc(x):\n", | |
| " return x + 1\n", | |
| "\n", | |
| "def add(x, y):\n", | |
| " return x + y\n", | |
| "\n", | |
| "dmax = delayed(np.max)\n", | |
| "\n", | |
| "x = inc(15)\n", | |
| "y = inc(abs(-30))\n", | |
| "total = add(x, y)\n", | |
| "total" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "from dask import delayed\n", | |
| "\n", | |
| "@delayed\n", | |
| "def inc(x):\n", | |
| " return x + 1\n", | |
| "\n", | |
| "@delayed\n", | |
| "def add(x, y):\n", | |
| " return x + y\n", | |
| "\n", | |
| "dabs = delayed(abs)\n", | |
| "\n", | |
| "x = inc(15)\n", | |
| "y = inc(dabs(-30))\n", | |
| "total = add(x, y)\n", | |
| "total" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "total.visualize()" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "total.compute()" | |
| ] | |
| }, | |
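| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "The `delayed` examples above build the task graph by hand. The higher-level collections listed earlier (`dask.bag`, `dask.dataframe`, `dask.array`) build the same kind of graph for you. As a minimal sketch (the numbers below are purely illustrative), `dask.bag` gives lazy `map`/`filter`/reductions over ordinary Python objects:" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "import dask.bag as db\n", | |
| "\n", | |
| "# a lazy bag over a small illustrative list; npartitions controls the chunking\n", | |
| "b = db.from_sequence([1, 2, 3, 4, 5, 6], npartitions=3)\n", | |
| "\n", | |
| "# map/filter/sum only extend the task graph; nothing runs until .compute()\n", | |
| "b.map(lambda n: n + 1).filter(lambda n: n % 2 == 0).sum().compute()" | |
| ] | |
| }, | |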
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "# Why?\n", | |
| "\n", | |
| "Why would you want to do this?\n", | |
| "\n", | |
| "<img src=\"dask-tutorial/images/grid_search_schedule.gif\">" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "* There are times where your data is too large to fit into memory all in one go. Using `dask` allows you to write code that looks the same whether your dataset has $10$ items or $10^{10}$ items.\n", | |
| "* You may have access to several cores, a cluster or supercomputer. `dask` makes it trivial to distribute your computational workload over the available hardware by identifying which parts of your algorithm have to be run serially, and which can be run in parallel." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "!ls -ltrh /scratch/jp492/dask-tutorial/data" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "# Schedulerers and clients\n", | |
| "\n", | |
| "`dask.distributed` provides tools for creating and managing a local or remote cluster of compute workers." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "from dask.distributed import Client" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "c = Client('localhost:8786') # we're running this on the cluster, but could also replace with a remote ip address\n", | |
| "c" | |
| ] | |
| }, | |
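| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "Here a scheduler is already running at `localhost:8786`. If you don't have one, `dask.distributed` can also spin up a throwaway cluster inside this process. A sketch (the worker counts are arbitrary):" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "# Alternative: start a local cluster in this process.\n", | |
| "# Skip this cell if you are already connected to the scheduler above,\n", | |
| "# because the most recently created Client becomes the default one.\n", | |
| "from dask.distributed import LocalCluster\n", | |
| "\n", | |
| "local_cluster = LocalCluster(n_workers=2, threads_per_worker=1)\n", | |
| "local_client = Client(local_cluster)\n", | |
| "local_client" | |
| ] | |
| }, | |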
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "import time\n", | |
| "\n", | |
| "def inc(x):\n", | |
| " time.sleep(5)\n", | |
| " return x + 1\n", | |
| "\n", | |
| "def dec(x):\n", | |
| " time.sleep(3)\n", | |
| " return x - 1\n", | |
| "\n", | |
| "def add(x, y):\n", | |
| " time.sleep(7)\n", | |
| " return x + y" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "x = delayed(inc)(1)\n", | |
| "y = delayed(dec)(2)\n", | |
| "total = delayed(add)(x, y)\n", | |
| "total.compute()" | |
| ] | |
| }, | |
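| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "With at least two workers available, `inc` (5 s) and `dec` (3 s) have no dependency on each other and run in parallel; `add` (7 s) starts once both finish. The whole graph should therefore take roughly $\\max(5, 3) + 7 \\approx 12$ s, rather than the $5 + 3 + 7 = 15$ s a purely serial run would need (plus a little scheduling overhead)." | |
| ] | |
| }, | |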
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "# Load data with h5py\n", | |
| "# this gives the load prescription, but does no real work.\n", | |
| "import h5py\n", | |
| "import os\n", | |
| "f = h5py.File(os.path.join('dask-tutorial', 'data', 'random.hdf5'), mode='r')\n", | |
| "dset = f['/x']\n", | |
| "dset.shape" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "%%time\n", | |
| "# Compute sum of large array, one million numbers at a time\n", | |
| "sums = []\n", | |
| "for i in range(0, 1000000000, 1000000):\n", | |
| " chunk = dset[i: i + 1000000] # pull out numpy array\n", | |
| " sums.append(chunk.sum())\n", | |
| "\n", | |
| "total = sum(sums)\n", | |
| "mean = total / 1000000000\n", | |
| "print(mean)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "import dask.array as da\n", | |
| "x = da.from_array(dset, chunks=(1000000,))" | |
| ] | |
| }, | |
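| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "`x` now wraps the same on-disk dataset as a lazy, chunked array, so the manual loop above collapses into ordinary array expressions. As a sketch, the explicit sum-the-chunks-then-divide becomes:" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "# lazy equivalent of the chunked loop above: sum every chunk, combine,\n", | |
| "# and divide by the total number of elements\n", | |
| "mean_expr = x.sum() / x.shape[0]\n", | |
| "mean_expr  # still a lazy expression; no data has been read yet" | |
| ] | |
| }, | |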
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "%%time\n", | |
| "x.mean().compute()" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "x = da.random.normal(10, 0.1, size=(20000, 20000), # 400 million element array \n", | |
| " chunks=(1000, 1000)) # Cut into 1000x1000 sized chunks\n", | |
| "y = x.mean(axis=0)[::100] # Perform NumPy-style operations" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "x.nbytes / 1e9 # Gigabytes of the input processed lazily" | |
| ] | |
| }, | |
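| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "That is $20000 \\times 20000$ elements $\\times$ 8 bytes per float64 $\\approx 3.2$ GB, far more than we would want to materialise at once. Because the work is done chunk by chunk, the scheduler only needs a few $1000 \\times 1000$ chunks (8 MB each) in memory at any moment." | |
| ] | |
| }, | |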
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "y.visualize()" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "%%time\n", | |
| "y.compute()" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "```python\n", | |
| "%%time \n", | |
| "x = np.random.normal(10, 0.1, size=(20000, 20000)) \n", | |
| "y = x.mean(axis=0)[::100] \n", | |
| "y\n", | |
| "\n", | |
| "CPU times: user 13.4 s, sys: 10.9 s, total: 24.3 s\n", | |
| "Wall time: 36 s```" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [] | |
| } | |
| ], | |
| "metadata": { | |
| "kernelspec": { | |
| "display_name": "Python 3", | |
| "language": "python", | |
| "name": "python3" | |
| }, | |
| "language_info": { | |
| "codemirror_mode": { | |
| "name": "ipython", | |
| "version": 3 | |
| }, | |
| "file_extension": ".py", | |
| "mimetype": "text/x-python", | |
| "name": "python", | |
| "nbconvert_exporter": "python", | |
| "pygments_lexer": "ipython3", | |
| "version": "3.6.2" | |
| } | |
| }, | |
| "nbformat": 4, | |
| "nbformat_minor": 2 | |
| } |