Last active
June 28, 2017 05:22
intro to notebooks with pixiedust - part 1
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Intro to Notebooks with PixieDust \n",
"\n",
"<center>\n",
"<img style=\"max-width:200px; display:inline-block; padding-right:25px;\" src=\"https://libraries.mit.edu/news/files/2016/02/jupyter.png\"/>\n",
"<img style=\"max-width:200px; display:inline-block; padding-left:25px;\" src=\"https://github.com/ibm-watson-data-lab/pixiedust/raw/master/docs/_static/PixieDust%202C%20%28512x512%29.png\"/>\n",
" \n",
"<br/> \n",
"</center> \n",
"\n",
"### PART I \n",
"\n",
"* `display()` API \n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Jupyter Notebooks\n",
"\n",
"[Jupyter Notebooks](https://jupyter.org/) are a powerful tool for fast, flexible data analysis: a notebook can contain live code, equations, visualizations, and explanatory text."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## PixieDust\n",
"\n",
"[PixieDust](https://github.com/ibm-cds-labs/pixiedust) is an open source Python helper library that works as an add-on to Jupyter notebooks, extending their usability."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br/> \n",
"\n",
"#### Install/Update PixieDust \n",
"\n",
"Make sure you have the latest version of PixieDust installed:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Uncomment and run the following line to install or upgrade PixieDust\n",
"# !pip install --upgrade pixiedust"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br/> \n",
"\n",
"#### Import PixieDust\n",
"\n",
"Before you can use the PixieDust library, it must be imported into the notebook:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import pixiedust"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br/> \n",
"\n",
"## One way to create a scatterplot \n",
"\n",
"\n",
"#### Load CSV data into a dataframe \n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Load the CSV file into a Spark DataFrame\n",
"\n",
"path=\"cars.csv\"\n",
"df3 = sqlContext.read.format('com.databricks.spark.csv')\\\n",
"    .options(header='true', mode=\"DROPMALFORMED\", inferschema='true').load(path)\n",
"df3.count()"
]
},
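{
"cell_type": "markdown",
"metadata": {},
"source": [
"*(An added aside, not part of the original walkthrough.)* Before plotting, it can help to confirm that `inferschema` actually detected numeric types for the columns we are about to chart; `printSchema()` prints the inferred type of every column."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Inspect the inferred column types; horsepower and mpg should be numeric, not string\n",
"df3.printSchema()"
]
},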
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"#### Plot the data\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from pyspark.sql.types import DecimalType\n",
"import matplotlib.pyplot as plt\n",
"from matplotlib import cm\n",
"import math\n",
"\n",
"maxRows = 500\n",
"def toPandas(workingDF): \n",
"    decimals = []\n",
"    for f in workingDF.schema.fields:\n",
"        if f.dataType.__class__ == DecimalType:\n",
"            decimals.append(f.name)\n",
"\n",
"    pdf = workingDF.toPandas()\n",
"    for y in pdf.columns:\n",
"        if pdf[y].dtype.name == \"object\" and y in decimals:\n",
"            # Spark converts DecimalType to object during toPandas(); cast it to float\n",
"            pdf[y] = pdf[y].astype(float)\n",
"\n",
"    return pdf\n",
"\n",
"xFields = [\"horsepower\"]\n",
"yFields = [\"mpg\"]\n",
"workingDF = df3.select(xFields + yFields)\n",
"workingDF = workingDF.dropna()\n",
"count = workingDF.count()\n",
"if count > maxRows:\n",
"    workingDF = workingDF.sample(False, (float(maxRows) / float(count)))\n",
"pdf = toPandas(workingDF)\n",
"# Sort by the x-axis fields\n",
"pdf.sort_values(xFields, inplace=True)\n",
"\n",
"fig, ax = plt.subplots(figsize=( int(1000 / 96), int(750 / 96) ))\n",
"\n",
"for i, keyField in enumerate(xFields):\n",
"    pdf.plot(kind='scatter', x=keyField, y=yFields[0], label=keyField, ax=ax, color=cm.jet(1.*i/len(xFields)))\n",
"\n",
"# Configure the legend\n",
"if ax.get_legend() is not None and (ax.title is None or not ax.title.get_visible() or ax.title.get_text() == ''):\n",
"    numLabels = len(ax.get_legend_handles_labels()[1])\n",
"    nCol = int(min(max(math.sqrt(numLabels), 3), 6))\n",
"    nRows = int(numLabels/nCol)\n",
"    bboxPos = max(1.15, 1.0 + ((float(nRows)/2)/10.0))\n",
"    ax.legend(loc='upper center', bbox_to_anchor=(0.5, bboxPos), ncol=nCol, fancybox=True, shadow=True)\n",
"\n",
"# Configure the x-axis tick labels\n",
"labels = [s.get_text() for s in ax.get_xticklabels()]\n",
"totalWidth = sum(len(s) for s in labels) * 5\n",
"if totalWidth > 1000:\n",
"    # Thin the tick labels down to at most 20\n",
"    xl = [(i, a) for i, a in enumerate(labels) if i % int(len(labels)/20) == 0]\n",
"    ax.set_xticks([x[0] for x in xl])\n",
"    ax.set_xticklabels([x[1] for x in xl])\n",
"    plt.xticks(rotation=30)\n",
"\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"<br/> \n",
"\n",
"## PixieDust way to create a scatterplot \n",
"\n",
"\n",
"#### Load remote CSV data into a dataframe \n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"cars = pixiedust.sampleData(\"https://github.com/ibm-cds-labs/open-data/raw/master/cars/cars.csv\") \n",
"cars.count()"
]
},
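{
"cell_type": "markdown",
"metadata": {},
"source": [
"*(An added aside, not part of the original walkthrough.)* `sampleData()` returns an ordinary Spark DataFrame, so the usual DataFrame API applies; for example, you can preview the first few rows before handing it to `display()`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Preview the first five rows of the downloaded CSV\n",
"cars.show(5)"
]
},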
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"#### Call display()\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"pixiedust": {
"displayParams": {
"color": "year",
"handlerId": "scatterPlot",
"keyFields": "horsepower",
"kind": "resid",
"rendererId": "matplotlib",
"rowCount": "500",
"valueFields": "mpg"
}
}
},
"outputs": [],
"source": [
"display(cars)"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"<br/> \n",
"\n",
"## display() controls \n",
"\n",
"#### Renderers \n",
"\n",
"* [Bokeh](http://bokeh.pydata.org/en/0.10.0/index.html)\n",
"* [Matplotlib](http://matplotlib.org/)\n",
"* [Seaborn](http://seaborn.pydata.org/index.html)\n",
"* [Mapbox](https://www.mapbox.com/)\n",
"* [Google GeoChart](https://developers.google.com/chart/interactive/docs/gallery/geochart)\n",
"\n",
"#### Chart options\n",
"\n",
"* **Chart types**\n",
"* **Options**\n",
"\n",
"To learn more: https://ibm-cds-labs.github.io/pixiedust/displayapi.html"
]
},
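{
"cell_type": "markdown",
"metadata": {},
"source": [
"*(An added illustration, not part of the original walkthrough.)* The choices you make in the display() menus are saved as `pixiedust.displayParams` cell metadata, so a chart is reproduced when the cell is re-run. For example, the earlier scatter plot could be re-rendered with Bokeh just by switching renderers; the metadata below sketches what the UI would record for that choice."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"pixiedust": {
"displayParams": {
"handlerId": "scatterPlot",
"keyFields": "horsepower",
"rendererId": "bokeh",
"rowCount": "500",
"valueFields": "mpg"
}
}
},
"outputs": [],
"source": [
"display(cars)"
]
},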
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"pixiedust": {
"displayParams": {
"aggregation": "SUM",
"chartsize": "85",
"handlerId": "mapView",
"keyFields": "state",
"rendererId": "google",
"rowCount": "500",
"valueFields": "unique_customers"
}
}
},
"outputs": [],
"source": [
"# Create another DataFrame, in a new variable\n",
"df2 = sqlContext.createDataFrame(\n",
"[(2010, 'Camping Equipment', 3, 'Texas'),\n",
" (2010, 'Golf Equipment', 1, 'Florida'),\n",
" (2010, 'Mountaineering Equipment', 1, 'Colorado'),\n",
" (2010, 'Outdoor Protection', 2, 'Colorado'),\n",
" (2010, 'Personal Accessories', 2, 'Massachusetts'),\n",
" (2011, 'Camping Equipment', 4, 'Colorado'),\n",
" (2011, 'Golf Equipment', 5, 'California'),\n",
" (2011, 'Mountaineering Equipment', 2, 'California'),\n",
" (2011, 'Outdoor Protection', 4, 'California'),\n",
" (2011, 'Personal Accessories', 2, 'California'),\n",
" (2012, 'Camping Equipment', 5, 'Texas'),\n",
" (2012, 'Golf Equipment', 5, 'Massachusetts'),\n",
" (2012, 'Mountaineering Equipment', 3, 'Washington'),\n",
" (2012, 'Outdoor Protection', 5, 'Maine'),\n",
" (2012, 'Personal Accessories', 3, 'New York'),\n",
" (2013, 'Camping Equipment', 8, 'Maine'),\n",
" (2013, 'Golf Equipment', 5, 'Florida'),\n",
" (2013, 'Mountaineering Equipment', 3, 'Vermont'),\n",
" (2013, 'Outdoor Protection', 8, 'Vermont'),\n",
" (2013, 'Personal Accessories', 4, 'Massachusetts')],\n",
"[\"year\",\"category\",\"unique_customers\", \"state\"])\n",
"\n",
"# This time, the DataFrame creation and the display() call are combined in the same cell\n",
"# Run this cell \n",
"display(df2)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "pySpark (Spark 1.6.0) Python 2",
"language": "python",
"name": "pyspark1.6"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.11"
}
},
"nbformat": 4,
"nbformat_minor": 0
}