@shoyer
Last active November 25, 2023 03:59
Xarray tutorial for Rossbypalooza
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# xarray tutorial (with answers!)\n",
"\n",
"[Stephan Hoyer](http://stephanhoyer.com), Rossbypalooza, 2016\n",
"\n",
"-------------\n",
"\n",
"This notebook introduces xarray for new users in the geophysical sciences.\n",
"\n",
"## For more information about xarray\n",
"\n",
"- Read the [online documentation](http://xarray.pydata.org/)\n",
"- Ask questions on [StackOverflow](http://stackoverflow.com/questions/tagged/python-xarray)\n",
"- View the source code and file bug reports on [GitHub](http://github.com/pydata/xarray/)\n",
"\n",
"## For more on data analysis with Python:\n",
"\n",
"- Thomas Wiecki, [A modern guide to getting started with Data Science and Python](http://twiecki.github.io/blog/2014/11/18/python-for-data-science/)\n",
"- Wes McKinney, [Python for Data Analysis](http://shop.oreilly.com/product/0636920023784.do) (book)\n",
"\n",
"## Packages building on xarray for the geophysical sciences\n",
"\n",
"For analyzing GCM output:\n",
"\n",
"- [xgcm](https://github.com/xgcm/xgcm) by Ryan Abernathey\n",
"- [oocgcm](https://github.com/lesommer/oocgcm) by Julien Le Sommer\n",
"- [MPAS xarray](https://github.com/pwolfram/mpas_xarray) by Phil Wolfram\n",
"- [marc_analysis](https://github.com/darothen/marc_analysis) by Daniel Rothenberg\n",
"\n",
"Other tools:\n",
"\n",
"- [windspharm](https://github.com/ajdawson/windspharm): wind spherical harmonics by Andrew Dawson\n",
"- [eofs](https://github.com/ajdawson/eofs): empirical orthogonal functions by Andrew Dawson\n",
"- [infinite-diff](https://github.com/spencerahill/infinite-diff) by Spencer Hill\n",
"- [aospy](https://github.com/spencerahill/aospy) by Spencer Hill and Spencer Clark\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can find these notebooks in `/project/rossby/Stephan-IPython` on Midway"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"-------------------\n",
"\n",
"## xarray basics"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# standard imports\n",
"import numpy as np\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import xarray as xr\n",
"\n",
"%matplotlib inline\n",
"\n",
"np.set_printoptions(precision=3, linewidth=80, edgeitems=1) # make numpy less verbose\n",
"xr.set_options(line_width=70)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### `xarray.Dataset` is like a Python dictionary (of `xarray.DataArray` objects)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We'll use the \"air_temperature\" tutorial dataset:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"ds = xr.tutorial.load_dataset('air_temperature')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"ds"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"ds.air"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"list(ds.keys())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"ds.dims"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"ds.attrs"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"ds['air'].identical(ds.air)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"ds.air.values"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"ds.air.dims"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"ds.air.attrs"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Reading and writing netCDF"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Under the covers, this uses scipy or the [netCDF4-Python](https://github.com/unidata/netcdf4-python) library:"
]
},
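{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also choose the backend explicitly with the `engine` keyword (a sketch; the filename `engine-demo.nc` is hypothetical and assumes netCDF4-Python is installed):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# hypothetical example: force the netCDF4-Python backend explicitly\n",
"ds.to_netcdf('engine-demo.nc', engine='netcdf4')"
]
},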
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"ds.to_netcdf('another-copy-2.nc')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"xr.open_dataset('another-copy-2.nc')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"! ncdump -h another-copy-2.nc"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Indexing with named dimensions"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"ds.air.sel(time='2013-01-01')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"ds.air.sel(lat=slice(60, 50), lon=slice(200, 270))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"ds.air.sel(lat=41.8781, lon=360-87.6298, method='nearest', tolerance=5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Computation\n",
"\n",
"You can do arithmetic directly on `Dataset` and `DataArray` objects. Labels are preserved, although attributes are dropped."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"2 * ds"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also apply NumPy \"universal functions\" like `np.sqrt` to `DataArray` objects:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"np.sqrt(ds.air)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"xarray also implements standard aggregation functions:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"ds.max()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"ds.mean(dim='time')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"ds.median(dim=['lat', 'lon'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exercises\n",
"\n",
"Calculate the maximum air surface temperature over time for San Francisco, CA (latitude=37.7749, longitude=-122.4194)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"ds.sel(lat=37.77, lon=360-122.419, method='nearest', tolerance=2).max()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Convert the dataset from Kelvin to degrees Celsius and save to a new netCDF file.\n",
"\n",
"Don't forget to fix the temperature units! Recall `degC = degK - 273`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"ds_celsius = ds - 273\n",
"ds_celsius.air.attrs['units'] = 'degC'\n",
"ds_celsius.to_netcdf('temperature-in-celsius.nc')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"----------------------"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## xarray comes with built-in plotting, based on matplotlib"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"ds.mean(['lat', 'lon']).air.plot()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"ds.min('time').air.plot()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Time series"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Xarray implements the \"split-apply-combine\" paradigm with `groupby`. This works really well for calculating climatologies:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"ds.groupby('time.season').mean()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"clim = ds.groupby('time.month').mean('time')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"clim"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also do arithmetic with groupby objects, which repeats the arithmetic over each group:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"anomalies = ds.groupby('time.month') - clim"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"anomalies"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Resample adjusts a time series to a new resolution:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"tmin = ds.air.resample('1D', dim='time', how='min')\n",
"tmax = ds.air.resample('1D', dim='time', how='max')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"tmin"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"ds_extremes = xr.Dataset({'tmin': tmin, 'tmax': tmax})"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"ds_extremes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Pandas\n",
"\n",
"[Pandas](http://pandas.pydata.org) is the best way to work with tabular data (e.g., CSV files) in Python. It's also a highly flexible data analysis tool, with way more functionality than xarray."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"scrolled": false
},
"outputs": [],
"source": [
"df = ds.to_dataframe()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Pandas provides very robust tools for reading and writing CSV:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"print(df.head(10).to_csv())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Of course, it's just as easy to convert back from pandas:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"xr.Dataset.from_dataframe(df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you're using pandas 0.18 or newer, you can write `df.to_xarray()`."
]
},
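{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example (assuming pandas 0.18 or newer):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# equivalent to xr.Dataset.from_dataframe(df)\n",
"df.to_xarray()"
]
},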
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Things you can do with pandas"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df.describe()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df.sample(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Statistical visualization with [Seaborn](https://stanford.edu/~mwaskom/software/seaborn/):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import seaborn as sns\n",
"\n",
"data = (ds_extremes\n",
" .sel_points(lat=[41.8781, 37.7749], lon=[360-87.6298, 360-122.4194],\n",
" method='nearest', tolerance=3,\n",
" dim=xr.DataArray(['Chicago', 'San Francisco'],\n",
" name='location', dims='location'))\n",
" .to_dataframe()\n",
" .reset_index()\n",
" .assign(month=lambda x: x.time.dt.month))\n",
"\n",
"plt.figure(figsize=(10, 5))\n",
"sns.violinplot('month', 'tmax', 'location', data=data, split=True, inner=None)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"-------------------"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exercise"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Calculate anomalies for `tmin`. Plot a 2D map of these anomalies for `2014-12-31`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"tmin_clim = tmin.groupby('time.month').mean('time')\n",
"tmin_anom = tmin.groupby('time.month') - tmin_clim\n",
"tmin_anom.sel(time='2014-12-31').plot()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"-------------------"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## xarray also works for data that doesn't fit in memory"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here's a quick demo of [how xarray can leverage dask](http://xarray.pydata.org/en/stable/dask.html) to work with data that doesn't fit in memory. This lets xarray substitute for tools like `cdo` and `nco`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"! ls /project/rossby/datasets/Tiffany"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"! ls /project/rossby/datasets/Tiffany/T925"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"! rsync -a /project/rossby/datasets/Tiffany/T925 /scratch/local/era-interim"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Tell dask we want to use 4 threads (one for each core we have):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import dask\n",
"from multiprocessing.pool import ThreadPool\n",
"\n",
"dask.set_options(pool=ThreadPool(4))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Open a bunch of netCDF files from disk using `xarray.open_mfdataset`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"ds = xr.open_mfdataset('/scratch/local/era-interim/T925/*.nc', engine='scipy',\n",
" chunks={'time': 100, 'latitude': 121, 'longitude': 121})"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"ds"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"ds.nbytes * (2 ** -30)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%time ds_seasonal = ds.groupby('time.season').mean('time')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%time ds_seasonal.load()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"(ds_seasonal['t']\n",
" .sel(season=['DJF', 'MAM', 'JJA', 'SON'])\n",
" .plot(col='season', size=3, cmap='Spectral_r'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For more details, read this blog post: http://continuum.io/blog/xray-dask"
]
}
],
"metadata": {
"celltoolbar": "Raw Cell Format",
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.9"
}
},
"nbformat": 4,
"nbformat_minor": 0
}