jamesp · May 15, 2018 10:03
diff --git a/dask.ipynb b/dask.ipynb
diff --git a/xarray.ipynb b/xarray.ipynb
 {
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "\n",
    "import xarray as xr\n",
    "import numpy as np\n",
    "\n",
    "import matplotlib.pyplot as plt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Numpy Arrays\n",
    "\n",
    "Numpy arrays are a very efficient way of working with multi-dimensional data in Python.  \n",
    "* They allow us to perform vectorised operations very efficiently.\n",
    "* The code looks the same whether we are working on a single value, or an entire list."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Python lists are slow!!!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "xs = range(1000000)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "square_xs = []\n",
    "for x in xs:\n",
    "    square_xs.append(x**2)\n",
    "\n",
    "None"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%time [x**2 for x in xs]\n",
    "None"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nxs = np.asarray(xs)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "nxs**2\n",
    "None"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "numpy arrays can be multi-dimensional.  For example, we can create a 3-D array (x, y, and time) and fill it with data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# create a dataset that varies in time and space\n",
    "# choose a sample spacing.  Remember the array will have a memory size of nt*nx*ny*nbytes(float64)\n",
    "nt = 100\n",
    "nx = 256\n",
    "ny = 128"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# locations of the samples\n",
    "t = np.arange(1, nt+1)\n",
    "x = np.linspace(0, np.pi*2, nx, endpoint=False)\n",
    "y = np.linspace(-1, 1, ny)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "x.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# create empty dimensions to make coordinate arrays easier to work with\n",
    "tt = t[:, np.newaxis, np.newaxis]\n",
    "yy = y[np.newaxis, :, np.newaxis]\n",
    "xx = x[np.newaxis, np.newaxis, :]\n",
    "xx.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = np.exp(-(yy/.3)**2)*np.cos(3*(xx)) + np.random.normal(size=(nt,ny,nx))\n",
    "print(\"Size of data: %.1f MB\" % (data.nbytes/1e6))\n",
    "print(\"Shape of data: %r\" % (data.shape,))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data[0,0,0:3]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# show the field at time=1\n",
    "plt.imshow(data[0, :, :])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# mean of the whole field\n",
    "data.mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# find the mean in time\n",
    "plt.plot(data.mean(axis=(1,2)))\n",
    "plt.xlabel('time')\n",
    "plt.ylabel('mean val')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Surely there must be a better way?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "d = xr.DataArray(data, coords=(('time', t), ('y', y), ('x', x)))\n",
    "d"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "d.sel(time=1).plot()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "d.mean('time').plot()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Chaining operations\n",
    "\n",
    "Typically most operations will return a new DataArray, so they can be chained together to select and transform your data.\n",
    "\n",
    "The order you perform the operations may or may not matter, but typically it is better to do selection operations earlier, to reduce the amount of calculation you are doing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "dmean = d.mean('time')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "d.sel(x=3.14, method='nearest').mean('time').plot()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "All three of these methods are equivalent.  Arithmetic operations on xarray DataArrays return DataArrays, so operations can be chained.\n",
    "\n",
    "The `pipe` method is also very useful. It expects a function that takes your data object, manipulates it, and returns a new DataArray."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, (ax1, ax2, ax3) = plt.subplots(ncols=3, figsize=(9, 3))\n",
    "\n",
    "np.sqrt(d**2).mean(('time', 'x')).plot(ax=ax1)\n",
    "\n",
    "d.pipe(abs).mean(('time', 'x')).plot(ax=ax2)\n",
    "\n",
    "d.pipe(lambda x: np.sqrt(x**2)).mean(('time', 'x')).plot(ax=ax3)\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "d.mean('time').pipe(abs).mean('x').plot()\n",
    "(np.exp(-(d.y/.3)**2)*2/np.pi).plot(linestyle='--')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# DataSets\n",
    "\n",
    "DataSets are collections of DataArrays with shared coordinates."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ds = xr.open_dataset('/scratch/jp492/gfdl_data/dry_earth/run0034/daily.nc', decode_times=False)\n",
    "ds"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ds.t_surf.mean('time').plot.contourf(levels=22)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ds.lon"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ds.ucomp.mean(('time', 'lon')).plot.contourf(levels=22)\n",
    "plt.ylim(ds.pfull.max(), ds.pfull.min())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Subsetting a dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "atmosphere = ds[['ucomp', 'vcomp', 'temp']]\n",
    "atmosphere\n",
    "\n",
    "# can also use ds.drop(['a', 'b']) to remove specific DataArrays"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As for fields, we can perform reduction operations on the whole dataset at one time."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ds.mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ds.mean([dim for dim in ds.coords if dim != 'time']).temp.plot()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
 }
diff --git a/xarray_dask.ipynb b/xarray_dask.ipynb
	{
	"cells": [
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": true
	},
	"outputs": [],
	"source": [
	"%matplotlib inline\n",
	"\n",
	"import xarray as xr\n",
	"import numpy as np\n",
	"\n",
	"import matplotlib.pyplot as plt"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"# Numpy Arrays\n",
	"\n",
	"Numpy arrays are a very efficient way of working with multi-dimensional data in Python. \n",
	"* They allow us to perform vectorised operations very efficiently.\n",
	"* The code looks the same whether we are working on a single value, or an entire list."
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## Python lists are slow!!!"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"xs = range(1000000)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"%%time\n",
	"square_xs = []\n",
	"for x in xs:\n",
	" square_xs.append(x**2)\n",
	"\n",
	"None"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"%time [x**2 for x in xs]\n",
	"None"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"nxs = np.asarray(xs)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"%%time\n",
	"nxs**2\n",
	"None"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"numpy arrays can be multi-dimensional. For example, we can create a 3-D array (x, y, and time) and fill it with data."
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"collapsed": true
	},
	"outputs": [],
	"source": [
	"# create a dataset that varies in time and space\n",
	"# choose a sample spacing. Remember the array will have a memory size of ntnxny*nbytes(float64)\n",
	"nt = 100\n",
	"nx = 256\n",
	"ny = 128"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"# locations of the samples\n",
	"t = np.arange(1, nt+1)\n",
	"x = np.linspace(0, np.pi*2, nx, endpoint=False)\n",
	"y = np.linspace(-1, 1, ny)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"x.shape"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"# create empty dimensions to make coordinate arrays easier to work with\n",
	"tt = t[:, np.newaxis, np.newaxis]\n",
	"yy = y[np.newaxis, :, np.newaxis]\n",
	"xx = x[np.newaxis, np.newaxis, :]\n",
	"xx.shape"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"data = np.exp(-(yy/.3)*2)np.cos(3*(xx)) + np.random.normal(size=(nt,ny,nx))\n",
	"print(\"Size of data: %.1f MB\" % (data.nbytes/1e6))\n",
	"print(\"Shape of data: %r\" % (data.shape,))"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"data[0,0,0:3]"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"# show the field at time=1\n",
	"plt.imshow(data[0, :, :])"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"# mean of the whole field\n",
	"data.mean()"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"# find the mean in time\n",
	"plt.plot(data.mean(axis=(1,2)))\n",
	"plt.xlabel('time')\n",
	"plt.ylabel('mean val')"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Surely there must be a better way?"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"d = xr.DataArray(data, coords=(('time', t), ('y', y), ('x', x)))\n",
	"d"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"d.sel(time=1).plot()"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"d.mean('time').plot()"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"### Chaining operations\n",
	"\n",
	"Typically most operations will return a new DataArray, so they can be chained together to select and transform your data.\n",
	"\n",
	"The order you perform the operations may or may not matter, but typically it is better to do selection operations earlier, to reduce the amount of calculation you are doing."
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"%%time\n",
	"dmean = d.mean('time')"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"%%time\n",
	"d.sel(x=3.14, method='nearest').mean('time').plot()"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"All three of these methods are equivalent. Arithmetic operations on xarray DataArrays return DataArrays, so operations can be chained.\n",
	"\n",
	"The `pipe` method is also very useful. It expects a function that takes your data object, manipulates it, and returns a new DataArray."
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"fig, (ax1, ax2, ax3) = plt.subplots(ncols=3, figsize=(9, 3))\n",
	"\n",
	"np.sqrt(d**2).mean(('time', 'x')).plot(ax=ax1)\n",
	"\n",
	"d.pipe(abs).mean(('time', 'x')).plot(ax=ax2)\n",
	"\n",
	"d.pipe(lambda x: np.sqrt(x**2)).mean(('time', 'x')).plot(ax=ax3)\n",
	"\n"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"d.mean('time').pipe(abs).mean('x').plot()\n",
	"(np.exp(-(d.y/.3)*2)2/np.pi).plot(linestyle='--')"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"# DataSets\n",
	"\n",
	"DataSets are collections of DataArrays with shared coordinates."
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"ds = xr.open_dataset('/scratch/jp492/gfdl_data/dry_earth/run0034/daily.nc', decode_times=False)\n",
	"ds"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"ds.t_surf.mean('time').plot.contourf(levels=22)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"ds.lon"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"ds.ucomp.mean(('time', 'lon')).plot.contourf(levels=22)\n",
	"plt.ylim(ds.pfull.max(), ds.pfull.min())"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Subsetting a dataset:"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"atmosphere = ds[['ucomp', 'vcomp', 'temp']]\n",
	"atmosphere\n",
	"\n",
	"# can also use ds.drop(['a', 'b']) to remove specific DataArrays"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"As for fields, we can perform reduction operations on the whole dataset at one time."
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"ds.mean()"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"ds.mean([dim for dim in ds.coords if dim != 'time']).temp.plot()"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": []
	}
	],
	"metadata": {
	"kernelspec": {
	"display_name": "Python 3",
	"language": "python",
	"name": "python3"
	},
	"language_info": {
	"codemirror_mode": {
	"name": "ipython",
	"version": 3
	},
	"file_extension": ".py",
	"mimetype": "text/x-python",
	"name": "python",
	"nbconvert_exporter": "python",
	"pygments_lexer": "ipython3",
	"version": "3.6.2"
	}
	},
	"nbformat": 4,
	"nbformat_minor": 2
	}