@shoyer
Last active November 25, 2023 03:59
Xarray tutorial for Rossbypalooza
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# xarray tutorial (with answers!)\n",
"\n",
"[Stephan Hoyer](http://stephanhoyer.com), Rossbypalooza, 2016\n",
"\n",
"-------------\n",
"\n",
"This notebook introduces xarray for new users in the geophysical sciences.\n",
"\n",
"## For more information about xarray\n",
"\n",
"- Read the [online documentation](http://xarray.pydata.org/)\n",
"- Ask questions on [StackOverflow](http://stackoverflow.com/questions/tagged/python-xarray)\n",
"- View the source code and file bug reports on [GitHub](http://github.com/pydata/xarray/)\n",
"\n",
"## For more on data analysis with Python:\n",
"\n",
"- Thomas Wiecki, [A modern guide to getting started with Data Science and Python](http://twiecki.github.io/blog/2014/11/18/python-for-data-science/)\n",
"- Wes McKinney, [Python for Data Analysis](http://shop.oreilly.com/product/0636920023784.do) (book)\n",
"\n",
"## Packages building on xarray for the geophysical sciences\n",
"\n",
"For analyzing GCM output:\n",
"\n",
"- [xgcm](https://github.com/xgcm/xgcm) by Ryan Abernathey\n",
"- [oocgcm](https://github.com/lesommer/oocgcm) by Julien Le Sommer\n",
"- [MPAS xarray](https://github.com/pwolfram/mpas_xarray) by Phil Wolfram\n",
"- [marc_analysis](https://github.com/darothen/marc_analysis) by Daniel Rothenberg\n",
"\n",
"Other tools:\n",
"\n",
"- [windspharm](https://github.com/ajdawson/windspharm): wind spherical harmonics by Andrew Dawson\n",
"- [eofs](https://github.com/ajdawson/eofs): empirical orthogonal functions by Andrew Dawson\n",
"- [infinite-diff](https://github.com/spencerahill/infinite-diff) by Spencer Hill\n",
"- [aospy](https://github.com/spencerahill/aospy) by Spencer Hill and Spencer Clark\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can find these notebooks in `/project/rossby/Stephan-IPython` on Midway"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"-------------------\n",
"\n",
"## xarray basics"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# standard imports\n",
"import numpy as np\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import xarray as xr\n",
"\n",
"%matplotlib inline\n",
"\n",
"np.set_printoptions(precision=3, linewidth=80, edgeitems=1) # make numpy less verbose\n",
"xr.set_options(line_width=70)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### `xarray.Dataset` is like a Python dictionary (of `xarray.DataArray` objects)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We'll use the \"air_temperature\" tutorial dataset:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"ds = xr.tutorial.load_dataset('air_temperature')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"ds"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"ds.air"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"list(ds.keys())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"ds.dims"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"ds.attrs"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"ds['air'].identical(ds.air)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"ds.air.values"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"ds.air.dims"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"ds.air.attrs"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Reading and writing netCDF"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Under the covers, this uses scipy or the [netCDF4-Python](https://github.com/unidata/netcdf4-python) library:"
]
},
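{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also choose the backend explicitly with the `engine` keyword (a sketch; the filename `engine-demo.nc` is hypothetical and assumes netCDF4-Python is installed):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# hypothetical example: force the netCDF4-Python backend explicitly\n",
"ds.to_netcdf('engine-demo.nc', engine='netcdf4')"
]
},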
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"ds.to_netcdf('another-copy-2.nc')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"xr.open_dataset('another-copy-2.nc')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"! ncdump -h another-copy-2.nc"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Indexing with named dimensions"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"ds.air.sel(time='2013-01-01')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"ds.air.sel(lat=slice(60, 50), lon=slice(200, 270))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"ds.air.sel(lat=41.8781, lon=360-87.6298, method='nearest', tolerance=5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Computation\n",
"\n",
"You can do arithmetic directly on `Dataset` and `DataArray` objects. Labels are preserved, although attributes are dropped."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"2 * ds"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also apply NumPy \"universal functions\" like `np.sqrt` to `DataArray` objects:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"np.sqrt(ds.air)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"xarray also implements standard aggregation functions:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"ds.max()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"ds.mean(dim='time')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"ds.median(dim=['lat', 'lon'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exercises\n",
"\n",
"Calculate the maximum air surface temperature over time for San Francisco, CA (latitude=37.7749, longitude=-122.4194)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"ds.sel(lat=37.77, lon=360-122.419, method='nearest', tolerance=2).max()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Convert the dataset from Kelvin to degrees Celsius and save to a new netCDF file.\n",
"\n",
"Don't forget to fix the temperature units! Recall `degC = degK - 273`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"ds_celsius = ds - 273\n",
"ds_celsius.air.attrs['units'] = 'degC'\n",
"ds_celsius.to_netcdf('temperature-in-celsius.nc')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"----------------------"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## xarray comes with built-in plotting, based on matplotlib"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"ds.mean(['lat', 'lon']).air.plot()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"ds.min('time').air.plot()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Time series"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Xarray implements the \"split-apply-combine\" paradigm with `groupby`. This works really well for calculating climatologies:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"ds.groupby('time.season').mean()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"clim = ds.groupby('time.month').mean('time')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"clim"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also do arithmetic with groupby objects, which repeats the arithmetic over each group:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"anomalies = ds.groupby('time.month') - clim"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"anomalies"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Resample adjusts a time series to a new resolution:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"tmin = ds.air.resample('1D', dim='time', how='min')\n",
"tmax = ds.air.resample('1D', dim='time', how='max')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"tmin"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"ds_extremes = xr.Dataset({'tmin': tmin, 'tmax': tmax})"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"ds_extremes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Pandas\n",
"\n",
"[Pandas](http://pandas.pydata.org) is the best way to work with tabular data (e.g., CSV files) in Python. It's also a highly flexible data analysis tool, with way more functionality than xarray."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"scrolled": false
},
"outputs": [],
"source": [
"df = ds.to_dataframe()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Pandas provides very robust tools for reading and writing CSV:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"print(df.head(10).to_csv())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Of course, it's just as easy to convert back from pandas:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"xr.Dataset.from_dataframe(df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you're using pandas 0.18 or newer, you can write `df.to_xarray()`."
]
},
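{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example (assuming pandas 0.18 or newer):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# equivalent to xr.Dataset.from_dataframe(df)\n",
"df.to_xarray()"
]
},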
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Things you can do with pandas"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df.describe()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df.sample(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Statistical visualization with [Seaborn](https://stanford.edu/~mwaskom/software/seaborn/):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import seaborn as sns\n",
"\n",
"data = (ds_extremes\n",
" .sel_points(lat=[41.8781, 37.7749], lon=[360-87.6298, 360-122.4194],\n",
" method='nearest', tolerance=3,\n",
" dim=xr.DataArray(['Chicago', 'San Francisco'],\n",
" name='location', dims='location'))\n",
" .to_dataframe()\n",
" .reset_index()\n",
" .assign(month=lambda x: x.time.dt.month))\n",
"\n",
"plt.figure(figsize=(10, 5))\n",
"sns.violinplot('month', 'tmax', 'location', data=data, split=True, inner=None)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"-------------------"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exercise"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Calculate anomalies for `tmin`. Plot a 2D map of these anomalies for `2014-12-31`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"tmin_clim = tmin.groupby('time.month').mean('time')\n",
"tmin_anom = tmin.groupby('time.month') - tmin_clim\n",
"tmin_anom.sel(time='2014-12-31').plot()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"-------------------"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## xarray also works for data that doesn't fit in memory"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here's a quick demo of [how xarray can leverage dask](http://xarray.pydata.org/en/stable/dask.html) to work with data that doesn't fit in memory. This lets xarray substitute for tools like `cdo` and `nco`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"! ls /project/rossby/datasets/Tiffany"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"! ls /project/rossby/datasets/Tiffany/T925"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"! rsync -a /project/rossby/datasets/Tiffany/T925 /scratch/local/era-interim"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Tell dask we want to use 4 threads (one for each core we have):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import dask\n",
"from multiprocessing.pool import ThreadPool\n",
"\n",
"dask.set_options(pool=ThreadPool(4))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Open a bunch of netCDF files from disk using `xarray.open_mfdataset`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"ds = xr.open_mfdataset('/scratch/local/era-interim/T925/*.nc', engine='scipy',\n",
" chunks={'time': 100, 'latitude': 121, 'longitude': 121})"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"ds"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"ds.nbytes * (2 ** -30)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%time ds_seasonal = ds.groupby('time.season').mean('time')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%time ds_seasonal.load()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"(ds_seasonal['t']\n",
" .sel(season=['DJF', 'MAM', 'JJA', 'SON'])\n",
" .plot(col='season', size=3, cmap='Spectral_r'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For more details, read this blog post: http://continuum.io/blog/xray-dask"
]
}
],
"metadata": {
"celltoolbar": "Raw Cell Format",
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.9"
}
},
"nbformat": 4,
"nbformat_minor": 0
}