{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Using Notebooks with PixieDust for Fast, Flexible, and Easier Data Analysis and Experimentation \n",
"\n",
"> Interactive notebooks are powerful tools for fast, flexible experimentation and data analysis. Notebooks can contain live code, static text, equations, and visualizations. In this lab, you create a notebook in the IBM Data Science Experience to explore and visualize data to gain insight. You will use PixieDust, an open source Python notebook helper library, to visualize the data in different ways (e.g., charts and maps) with one simple call. \n",
"\n",
"<center>\n",
"![pixiedust](https://developer.ibm.com/clouddataservices/wp-content/uploads/sites/85/2017/03/pixiedust200.png)\n",
"<br/> \n",
"</center> \n",
"\n",
"You may access the complete tutorial with step by step instructions here: [http://ibm.biz/pixiedustlab](http://ibm.biz/pixiedustlab) \n",
" \n",
"<br/> \n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Before you can use the PixieDust library, it must be imported into the notebook.\n",
"\n",
"import pixiedust"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# With PixieDust, you can easily load CSV data from a URL into a PySpark DataFrame in the notebook.\n",
"\n",
"inspections = pixiedust.sampleData(\"https://opendata.lasvegasnevada.gov/resource/86jg-3buh.csv\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"pixiedust": {
"displayParams": {
"handlerId": "dataframe"
}
}
},
"outputs": [],
"source": [
"# With PixieDust's display() API, you can easily view and visualize the data.\n",
"\n",
"display(inspections)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Filter to a subset of only Las Vegas restaurants.\n",
"\n",
"inspections.registerTempTable(\"restaurants\")\n",
"lasDF = sqlContext.sql(\"SELECT * FROM restaurants WHERE city='Las Vegas'\")\n",
"lasDF.count()"
]
},
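{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same filter can also be written with the DataFrame API directly, without registering a temp table. The cell below is an illustrative sketch that produces the same rows as the SQL query above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Alternative (sketch): the same filter via the DataFrame API, no temp table required\n",
"\n",
"lasDF_alt = inspections.filter(inspections['city'] == 'Las Vegas')\n",
"lasDF_alt.count()"
]
},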
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br/> \n",
"\n",
"#### Number of Restaurants By Categories \n",
"\n",
"1. Click the Chart dropdown menu and choose **Bar Chart**\n",
"2. From the **Chart Options** dialog\n",
"\t1. Drag the **`category_name`** field and drop it into the **Keys** area\n",
"\t2. Drag the **`count`** field and drop it into the **Values** area\n",
"\t3. Set the **# of Rows to Display** to 1000\n",
"\t4. Click **OK**\n",
"3. Click the **Renderer** dropdown menu and choose **bokeh**\n",
"4. Toggle the **Show Legend** Bar Chart Option to show or hide the legend\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"pixiedust": {
"displayParams": {
"aggregation": "SUM",
"handlerId": "barChart",
"keyFields": "category_name",
"legend": "false",
"rendererId": "bokeh",
"rowCount": "1000",
"valueFields": "count"
}
}
},
"outputs": [],
"source": [
"# Number of restaurants by categories\n",
"\n",
"bycat = lasDF.groupBy(\"category_name\").count()\n",
"display(bycat)"
]
},
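{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see which categories dominate before charting, you can sort the grouped counts. This is an optional, illustrative step:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Optional (sketch): show the five most common categories\n",
"\n",
"bycat.orderBy('count', ascending=False).show(5)"
]
},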
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br/> \n",
"\n",
"#### Average number of inspection demerits per category clustered by the inspection grade \n",
"\n",
"1. Click the Chart dropdown menu and choose **Bar Chart**\n",
"2. From the **Chart Options** dialog\n",
"\t1. Drag the **`category_name`** field and drop it into the **Keys** area\n",
"\t2. Drag the **`inspection_demerits`** field and drop it into the **Values** area\n",
"\t3. Set the **Aggregation** to AVG\n",
"\t4. Set the **# of Rows to Display** to 1000 \n",
"\t5. Click **OK**\n",
"3. Click the **Renderer** dropdown menu and choose **bokeh**\n",
"4. Click the **Cluster By** dropdown menu and choose **inspection_grade**\n",
"5. Click the **Type** dropdown menu and choose the desired bar type (e.g., **stacked**)\n",
"\n",
"#### Current demerits vs inspection demerits \n",
"\n",
"1. From the **Chart Options** dialog\n",
"\t1. Set the **Keys** to **`inspection_demerits`**\n",
"\t2. Set the **Values** to **`current_demerits`**\n",
"\t3. Set the **# of Rows to Display** to 1000\n",
"\t4. Click **OK**\n",
"2. Click the Chart dropdown menu and choose **Scatter Plot**\n",
"3. Select **bokeh** from the **Renderer** dropdown menu\n",
"4. Select **inspection_grade** from the **Color** dropdown menu\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"pixiedust": {
"displayParams": {
"aggregation": "AVG",
"charttype": "stacked",
"clusterby": "inspection_grade",
"color": "inspection_grade",
"handlerId": "scatterPlot",
"keyFields": "inspection_demerits",
"rendererId": "bokeh",
"rowCount": "1000",
"valueFields": "current_demerits"
}
}
},
"outputs": [],
"source": [
"display(lasDF)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br/> \n",
"\n",
"#### Map the data \n",
"\n",
"The **Map** renderers require an access token to display properly. PixieDust currently has two map renderers (Google and MapBox). For this section of the tutorial, you will be using the **MapBox** renderer, so a [MapBox API Access Token](https://www.mapbox.com/help/create-api-access-token/) will need to be created if you choose to continue.\n",
"\n",
"The current data includes the longitude/latitude in the **`location_1`** field as a string such as `POINT (-114.923505 36.114434)`.\n",
"\n",
"However, the **Map** renderers in PixieDust expect the longitude and latitude as separate numeric fields, so the first thing you will need to do is parse the **`location_1`** field into separate longitude and latitude columns."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Parse the location_1 field into separate longitude and latitude number fields\n",
"\n",
"from pyspark.sql.functions import udf\n",
"from pyspark.sql.types import *\n",
"\n",
"def valueToLon(value):\n",
" # Grab the longitude from the 'POINT (lon lat)' string; treat 0 as missing\n",
" # and normalize any positive values to negative (western hemisphere)\n",
" lon = float(value.split('POINT (')[1].strip(')').split(' ')[0])\n",
" return None if lon == 0 else lon if lon < 0 else (lon * -1)\n",
"\n",
"def valueToLat(value):\n",
" # Grab the latitude from the 'POINT (lon lat)' string; treat 0 as missing\n",
" lat = float(value.split('POINT (')[1].strip(')').split(' ')[1])\n",
" return None if lat == 0 else lat\n",
"\n",
"udfValueToLon = udf(valueToLon, DoubleType())\n",
"udfValueToLat = udf(valueToLat, DoubleType())\n",
"\n",
"lonDF = lasDF.withColumn(\"lon\", udfValueToLon(\"location_1\"))\n",
"lonlatDF = lonDF.withColumn(\"lat\", udfValueToLat(\"location_1\"))\n",
"\n",
"lonlatDF.printSchema()"
]
},
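{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an alternative to Python UDFs, Spark's built-in `regexp_extract` function can pull the coordinates out of the `POINT (lon lat)` string and cast them to doubles. The cell below is an illustrative sketch; note that it does not apply the zero/sign handling of the UDFs above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Alternative (sketch): extract lon/lat with built-in functions instead of UDFs\n",
"\n",
"from pyspark.sql.functions import regexp_extract\n",
"\n",
"lonAltDF = lasDF.withColumn('lon', regexp_extract('location_1', 'POINT \\\\(([-0-9.]+) ', 1).cast('double'))\n",
"lonlatAltDF = lonAltDF.withColumn('lat', regexp_extract('location_1', ' ([-0-9.]+)\\\\)', 1).cast('double'))\n",
"lonlatAltDF.select('lon', 'lat').show(5)"
]
},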
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"<br/> \n",
"\n",
"#### View the map data \n",
"\n",
"1. Click the Chart dropdown menu and choose **Map**\n",
"2. From the **Chart Options** dialog\n",
"\t1. Drag the **`lon`** and **`lat`** fields and drop them into the **Keys** area\n",
"\t2. Drag the **`current_demerits`** field and drop it into the **Values** area\n",
"\t3. Set the **# of Rows to Display** to 1000 \n",
"\t4. Enter your access token from MapBox into the **MapBox Access Token** field\n",
"\t5. Click **OK**\n",
"3. Click the **kind** dropdown menu and choose **choropleth**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"pixiedust": {
"displayParams": {
"aggregation": "SUM",
"handlerId": "mapView",
"keyFields": "lon,lat",
"mapboxtoken": "pk.eyJ1IjoidmFiYXJib3NhIiwiYSI6ImNqMDE4a2lrZzA2NjkzMm94bXZqdTk4amYifQ.rfxH46T8UoWwxRegb3_X6g",
"rendererId": "mapbox",
"rowCount": "1000",
"valueFields": "current_demerits"
}
}
},
"outputs": [],
"source": [
"display(lonlatDF)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "pySpark (Spark 1.6.0) Python 2",
"language": "python",
"name": "pyspark1.6"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.11"
}
},
"nbformat": 4,
"nbformat_minor": 0
}